The incident I think of when someone says "data contract" started with a column. An upstream service team renamed user_type to account_type and changed two of its enum values, shipped it on a Tuesday, and had no idea anyone downstream cared. By Wednesday the marketing attribution model was silently mis-bucketing a third of signups, the finance dashboard had a category that no longer existed, and three teams spent two days bisecting dbt runs to find a change that took the producer thirty seconds to make. Nobody did anything wrong, exactly. There was just no agreement about what that table promised, so there was no promise to break and therefore no check to catch the change.
That gap is what a data contract closes. Not a wiki page nobody reads, not a Confluence "data dictionary" that drifts from reality the week after it's written — an executable agreement that fails a build when it's violated. I've now put contracts on the interfaces that hurt most across a few platforms, and the pattern that works is narrower and less glamorous than the conference talks suggest. This is what a data contract actually is, the open standard worth using, how to enforce it in dbt and on streaming sources, how it differs from the legal Data Use Agreement it gets confused with, and the ways it quietly fails.
What is a data contract?
A data contract is an explicit, versioned agreement between the producer of a dataset and its consumers that specifies the data's schema, semantics, quality expectations, and service levels — and is enforced automatically, so a violation is caught at the source instead of discovered downstream. That's the whole idea in one sentence: move the promise to the producer, write it in a machine-readable file, and check it in CI.
The word "contract" is doing real work. A schema is a description of shape. A contract is a description of shape plus an obligation: the producer agrees not to break it without versioning and notice, and the consumer agrees to depend only on what's written down. The mental model I keep coming back to is the API contract. We long ago stopped letting backend teams change a JSON response field whenever they felt like it — there's an OpenAPI spec, a version, a deprecation path. Data contracts are that same discipline applied to the tables and topics that have, for years, been treated as an internal implementation detail you could change at will. The whole movement is sometimes called "shift left": push accountability for data quality upstream to the people who produce the data, instead of leaving it to the analysts who are furthest from the change and least able to prevent it.
A data contract is a producer-owned, versioned, enforceable agreement about a data interface — schema, semantics, quality, and SLAs. Schema says what columns exist. The contract says what they mean, how good they'll be, how fresh, and what happens when the producer wants to change them.
The open standard: datacontract.com and ODCS
You don't have to invent the format. There's an open specification, and as of 2025 the ecosystem is converging on a single one. The Data Contract Specification (datacontract.com) popularized a clean YAML format, and it's now being folded into the Open Data Contract Standard (ODCS), governed by the Bitol project under the Linux Foundation. The datacontract.com authors have said they'll deprecate their own spec in favor of ODCS to avoid the industry maintaining two competing standards. So if you're starting now, learn the concepts from datacontract.com's excellent tooling and target ODCS as the canonical format. Both describe roughly the same handful of sections, under slightly different names.
| Section | What it holds |
|---|---|
| info | Title, version, owner, status, contact: who's responsible and how to reach them. |
| servers | Where the data physically lives (S3, BigQuery, Snowflake, a Kafka topic) and how to connect. |
| models / schema | The logical structure: tables/objects, fields, types, descriptions, constraints. |
| definitions | Reusable field definitions so the same concept (an email, a country code) is identical everywhere. |
| quality | Validation rules: SQL checks, metrics, thresholds, or natural-language expectations. |
| servicelevels / SLA | Freshness, availability, retention, latency, frequency, support: the operational promise. |
| terms | Usage rights, limitations, billing, notice period before breaking changes. |
A trimmed contract, written here in the datacontract.com format, reads like something a human can review in a pull request, which is the point:
```yaml
dataContractSpecification: 1.1.0
id: urn:datacontract:checkout:orders
info:
  title: Orders
  version: 2.1.0
  owner: checkout-team
  contact: { name: Checkout Data, email: checkout-data@acme.example }
servers:
  production:
    type: snowflake
    database: PROD
    schema: CHECKOUT
models:
  orders:
    type: table
    fields:
      order_id: { type: string, primaryKey: true, required: true }
      account_type: { type: string, enum: [consumer, business], required: true }
      amount_usd: { type: decimal, required: true, minimum: 0 }
      placed_at: { type: timestamp, required: true }
    # Quality rules attach to the model they check (spec 1.1.0 layout).
    quality:
      - type: sql
        description: "No negative order amounts"
        query: "SELECT count(*) FROM orders WHERE amount_usd < 0"
        mustBe: 0
servicelevels:
  freshness: { threshold: 30m, description: "Loaded within 30 min of order" }
  retention: { period: P3Y }
```
The tooling is the reason to care. The datacontract CLI (and the equivalent ODCS tools) will lint the file, test a live dataset against it — actually run those quality queries and the schema checks against Snowflake or BigQuery and tell you if reality matches the promise — and export the contract into other artifacts: a dbt model schema, an Avro schema, SQL DDL, a JSON Schema. That export ability is what stops the contract from becoming yet another document that drifts. The contract becomes the source of truth, and the schema definitions you actually deploy are generated from it.
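To make the export idea concrete, here is a minimal sketch of generating SQL DDL from a contract. It is not the datacontract CLI: the `CONTRACT` dict is a hand-written, simplified stand-in for the parsed YAML, and the `SQL_TYPES` mapping and `to_ddl` helper are assumptions made for illustration.

```python
# Hypothetical, simplified stand-in for a parsed contract file.
CONTRACT = {
    "model": "orders",
    "fields": {
        "order_id": {"type": "string", "required": True, "primaryKey": True},
        "account_type": {"type": "string", "required": True},
        "amount_usd": {"type": "decimal", "required": True},
        "placed_at": {"type": "timestamp", "required": True},
    },
}

# Assumed mapping from contract types to generic SQL types.
SQL_TYPES = {"string": "VARCHAR", "decimal": "NUMERIC", "timestamp": "TIMESTAMP"}


def to_ddl(contract: dict) -> str:
    """Render a CREATE TABLE statement from the contract's field list."""
    columns = []
    for name, spec in contract["fields"].items():
        parts = [name, SQL_TYPES[spec["type"]]]
        if spec.get("required"):
            parts.append("NOT NULL")
        if spec.get("primaryKey"):
            parts.append("PRIMARY KEY")
        columns.append("  " + " ".join(parts))
    return f"CREATE TABLE {contract['model']} (\n" + ",\n".join(columns) + "\n);"


ddl = to_ddl(CONTRACT)
print(ddl)
```

The design point is the direction of the arrow: the DDL is derived from the contract, so a change to the table has to start as a change to the contract file, where it can be reviewed.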
```mermaid
graph TD
    PROD["Producer team<br/>owns the contract.yaml"]
    CONTRACT["Data contract<br/>(schema + quality + SLA, versioned)"]
    CI["CI gate<br/>(lint + test vs live data)"]
    GEN["Generated artifacts<br/>(dbt schema, Avro, SQL DDL)"]
    C1["Consumer: BI / dbt marts"]
    C2["Consumer: ML features"]
    C3["Consumer: reverse-ETL"]
    PROD --> CONTRACT
    CONTRACT --> CI
    CONTRACT --> GEN
    CI -->|"blocks merge on breach"| PROD
    GEN --> C1
    GEN --> C2
    GEN --> C3
```
The contract sits at the producer, not between teams as a document. CI tests it against live data and blocks a breaking change before it ships; downstream schemas are generated from the same file, so consumers depend on a promise that's actually checked rather than on whatever the table happens to contain today.
Enforcing it in dbt: the contract is in the build
For the warehouse-shaped half of the world, the enforcement point is dbt, and dbt has had first-class model contracts since v1.5. You declare contract: enforced on a model and list the columns with their types and constraints; dbt then refuses to build the model if the produced SQL doesn't match the declared shape. It catches the rename, the type change, the dropped column — at build time, in the producer's own PR, which is exactly where you want the failure.
```yaml
models:
  - name: dim_orders
    config:
      contract: { enforced: true }
    columns:
      - name: order_id
        data_type: varchar
        constraints: [{ type: not_null }, { type: primary_key }]
      - name: account_type
        data_type: varchar
        tests:
          - accepted_values: { values: ['consumer', 'business'] }
      - name: amount_usd
        data_type: numeric
```
The split worth understanding: dbt's contract enforces structure (column presence and type) and fails the build; dbt tests (not_null, accepted_values, relationships, plus packages like dbt-expectations) enforce data quality and fail the run. One caveat: whether a declared constraint such as primary_key is actually enforced depends on the warehouse. Snowflake, for example, enforces only NOT NULL, so keep a uniqueness test next to the constraint. Together, contract and tests cover most of what the ODCS models and quality sections describe. The clean pattern is to keep the ODCS contract as the human-readable, cross-team source of truth and generate (or reconcile) the dbt YAML from it, so the analytics-engineering layer and the formal contract can't silently disagree. If you only adopt one thing from this whole article, make it this: turn on contract: enforced for the handful of models other teams depend on. It's an afternoon of work and it ends an entire category of 3 a.m. pages.
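The structural half of that split is easy to picture in code. The sketch below is not dbt's implementation; it only illustrates the kind of comparison a build-time contract check performs. The `contract_violations` helper and the `acct_type` column name are hypothetical.

```python
# Declared shape, mirroring the dim_orders columns above.
DECLARED = {"order_id": "varchar", "account_type": "varchar", "amount_usd": "numeric"}


def contract_violations(declared: dict, produced: dict) -> list:
    """Compare declared column names and types with what the model's SQL produced."""
    problems = []
    for name, data_type in declared.items():
        if name not in produced:
            problems.append(f"missing column: {name}")
        elif produced[name] != data_type:
            problems.append(
                f"type mismatch on {name}: declared {data_type}, got {produced[name]}"
            )
    for name in produced:
        if name not in declared:
            problems.append(f"undeclared column: {name}")
    return problems


# A rename slipped into the model's SQL: account_type now comes out as acct_type.
renamed = {"order_id": "varchar", "acct_type": "varchar", "amount_usd": "numeric"}
violations = contract_violations(DECLARED, renamed)
print(violations)
```

A non-empty list is what should fail the build. Value-level rules such as the accepted enum values are deliberately absent here, because those belong to the tests, not the contract.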
Streaming contracts: the schema registry was the first data contract
The streaming world solved a version of this years before "data contract" was a phrase, and it's worth seeing the lineage. In a Kafka system, producers and consumers are decoupled by design — they don't know about each other — which is exactly why an unmanaged schema change is so dangerous: there's no compile step that spans them. The Schema Registry (with Avro, Protobuf, or JSON Schema) is the contract enforcement point. As Confluent frames it in their data contract pattern, the registry lets applications "share events and understand how to process them without the sending and receiving application knowing any details about each other."
The mechanism that makes it a contract rather than just a schema is compatibility checking. When a producer registers a new schema version, the registry validates it against a configured rule and rejects an incompatible change:
| Compatibility mode | Allowed change | Who upgrades first |
|---|---|---|
| BACKWARD (default) | Delete fields, add optional fields | Consumers first |
| FORWARD | Add fields, delete optional fields | Producers first |
| FULL | Add/remove only optional fields | Either order |
| NONE | Anything (no checking) | n/a |
Modern Confluent data contracts go further than schema-plus-compatibility: they attach metadata (ownership, tags, sensitivity), domain rules (validation expressions like CEL that run on each message), and migration rules that transform messages between incompatible major versions so a producer can make a genuinely breaking change behind a version bump without orphaning consumers. The shape is identical to the warehouse case — schema, semantics, quality rules, evolution policy, ownership — just enforced at write/produce time against a stream instead of at build time against a table. Same handshake, different clock speed.
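The compatibility modes in the table above reduce to a question about who reads whose data. Here is a rough sketch of that logic under heavy simplification: schemas are plain dicts of field name to spec, a field counts as optional when it has a `default`, and type promotion is ignored. None of this is the registry's actual code, and the function names are mine.

```python
def backward_compatible(old: dict, new: dict) -> bool:
    """True if a reader on the new schema can read data written with the old one."""
    for name, spec in new.items():
        if name in old:
            if old[name]["type"] != spec["type"]:
                return False  # simplified: any type change is rejected
        elif "default" not in spec:
            return False  # old records lack this field and there is no fallback
    return True


def forward_compatible(old: dict, new: dict) -> bool:
    """True if a reader on the old schema can read data written with the new one."""
    return backward_compatible(new, old)


def full_compatible(old: dict, new: dict) -> bool:
    return backward_compatible(old, new) and forward_compatible(old, new)


v1 = {"order_id": {"type": "string"}, "amount_usd": {"type": "double"}}
add_optional = {**v1, "coupon": {"type": "string", "default": ""}}
add_required = {**v1, "coupon": {"type": "string"}}
drop_required = {"order_id": {"type": "string"}}

print(backward_compatible(v1, add_required))   # False
print(forward_compatible(v1, add_required))    # True
print(backward_compatible(v1, drop_required))  # True
print(full_compatible(v1, add_optional))       # True
```

Forward compatibility is just backward compatibility with the roles swapped, which is why the table's "who upgrades first" column flips between the two modes.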
"Data delivery by SQL": the contract as the served interface
Here's a pattern I lean on that ties contracts to how consumers actually read data. The contract shouldn't just validate a physical table — it should define the interface, and the cleanest interface in a warehouse is a SQL view. You expose a stable, contracted view (analytics.orders_v2) whose columns, types, and semantics match the contract exactly; the messy physical tables and the refactors live behind it. Consumers query the view. The contract's quality checks are literally SQL that runs against it.
This buys two things. First, the producer can refactor storage — repartition, rename internal columns, change the load mechanism — without consumers noticing, because the contract is the view, not the table. Second, the version lives in the name: orders_v2 coexists with orders_v1 during a migration, so a breaking change becomes an additive one plus a deprecation window, never a rug-pull. "Data delivery by SQL" is just the recognition that for analytical consumers, the SQL surface is the contract, and a view is the most natural place to make the promise enforceable and the change non-breaking.
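A small sketch of the versioned-view pattern, using SQLite from the Python standard library as a stand-in for the warehouse. The `raw_orders` table and its internal column names are invented for the example; only the view names follow the text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Physical table with internal names the producer is free to change.
    CREATE TABLE raw_orders (id TEXT, acct_kind TEXT, amt_cents INTEGER, ts TEXT);
    INSERT INTO raw_orders VALUES ('o-1', 'consumer', 1999, '2025-01-01T10:00:00Z');

    -- v1 of the served interface keeps the old column name alive.
    CREATE VIEW orders_v1 AS
        SELECT id AS order_id, acct_kind AS user_type, amt_cents / 100.0 AS amount_usd
        FROM raw_orders;

    -- v2 carries the breaking change, published next to v1 rather than on top of it.
    CREATE VIEW orders_v2 AS
        SELECT id AS order_id, acct_kind AS account_type,
               amt_cents / 100.0 AS amount_usd, ts AS placed_at
        FROM raw_orders;
""")


def columns(view: str) -> list:
    """List the column names a view exposes to consumers."""
    return [row[1] for row in conn.execute(f"PRAGMA table_info({view})")]


print(columns("orders_v1"))
print(columns("orders_v2"))
v2_row = conn.execute("SELECT order_id, account_type, amount_usd FROM orders_v2").fetchone()
```

Consumers on `orders_v1` keep working while others migrate to `orders_v2`, and neither group ever sees `acct_kind` or `amt_cents`.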
Where it lives on AWS: DataZone and the schema registry
On AWS the contract concept shows up in two layers. For streaming and event data, the AWS Glue Schema Registry plays the same role as Confluent's — registered Avro/Protobuf/JSON schemas with compatibility enforcement on the producer. For the discovery-and-access layer, Amazon DataZone (now folded into SageMaker as the next-gen catalog) implements the governance half of a contract: data is published as assets with schemas and metadata into a business catalog, and consumers subscribe — a request the producing domain explicitly approves, which grants access and records the agreement.
That subscription-and-approval flow is the contract's terms-of-use and ownership made operational: there's a named owner, an explicit grant, and an auditable record of who agreed to consume what. It pairs naturally with the schema/quality contract — DataZone answers "who may use this and under what terms," while the ODCS file and the dbt/registry enforcement answer "what is its shape and quality." This is the same division you see in any mature platform, including the ones built on Unity Catalog; contracts are most powerful as the producer-owned interface in a data mesh, where domains publish data as a product and the contract is the product's published surface.
Data contracts vs Data Use Agreements: don't conflate them
This trips people up, especially in healthcare and research, so it's worth a clean separation. A data contract is technical: schema, quality, SLA — a thing CI checks. A Data Use Agreement (DUA) is legal: a binding agreement between organizations governing how shared data may be used — for what purposes, by whom, with what privacy and security obligations, for how long. A DUA is signed by lawyers; a data contract is merged by engineers. They answer different questions and you generally need both.
They do meet at one interesting point: making the legal terms computable. In genomics and health data, the GA4GH Data Use Ontology (DUO) encodes permitted-use terms as machine-readable tags, so a DUA's restrictions ("research use only," "no commercial use," "disease-specific") can be matched automatically against a researcher's authorization instead of read off a PDF. That's the same instinct as a data contract — take an agreement humans used to enforce by hand and make a machine enforce it — applied to the legal layer. Keep them distinct in your head: the data contract governs the data's shape and quality; the DUA governs your right to use it at all.
| | Data contract | Data Use Agreement (DUA) |
|---|---|---|
| Governs | Shape, semantics, quality, SLA | Permitted use, purpose, privacy obligations |
| Enforced by | CI / schema registry / dbt build | Law, contracts, access controls, audit |
| Changed by | A pull request | Legal/compliance, re-signature |
| Owner | Producing engineering team | Legal + data governance + DPO |
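To make the "computable legal terms" idea concrete, here is a minimal sketch of matching a dataset's permitted-use tags against an access request. The tag names, the `AccessRequest` shape, and the `RULES` table are all hypothetical; a real system would use GA4GH DUO term identifiers and its own authorization model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AccessRequest:
    purpose: str       # e.g. "research" or "marketing" (illustrative values)
    commercial: bool   # whether the requester is a commercial entity


# Hypothetical tags, each mapped to the condition a request must satisfy.
RULES = {
    "research_use_only": lambda req: req.purpose == "research",
    "no_commercial_use": lambda req: not req.commercial,
}


def violated_terms(dataset_terms: set, request: AccessRequest) -> list:
    """Return the dataset's use restrictions that this request does not meet."""
    return sorted(term for term in dataset_terms if not RULES[term](request))


cohort_terms = {"research_use_only", "no_commercial_use"}
academic = AccessRequest(purpose="research", commercial=False)
vendor = AccessRequest(purpose="research", commercial=True)

print(violated_terms(cohort_terms, academic))  # []
print(violated_terms(cohort_terms, vendor))    # ['no_commercial_use']
```

An empty list means the request is consistent with the tagged terms. It does not replace the signed agreement; it only automates the part of it that can be checked mechanically.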
The challenges, and how the contract answers them
Contracts earn their keep against specific, recurring failures — not as a generic "quality" gesture. The honest mapping:
- Silent schema drift. The rename that broke my Tuesday. The contract's enforcement turns it into a failed build in the producer's PR, before it ships.
- Ambiguous ownership. "Who owns this table?" with no answer is why nobody fixes the breakage. The `owner`/`contact` fields make ownership a required, visible part of the artifact.
- Semantic ambiguity. Two teams reading `status` differently. Field descriptions and shared `definitions` pin the meaning, not just the type.
- Unmanaged evolution. Breaking changes shipped with no warning. Versioning plus compatibility rules (registry) or coexisting `_v2` views (warehouse) make change additive and give consumers a deprecation window.
- Quality discovered downstream. The null rate found in a dashboard a week late. The `quality` checks run at the source and fail loudly there.
The failure mode is contract theatre: a folder of beautiful YAML that nothing enforces. A data contract that isn't wired into a CI gate, a registry compatibility check, or a build that actually fails is just documentation with extra syntax — and it will drift from reality within a sprint, then mislead people who trust it. The other failure is over-reach: putting a heavyweight contract on every table, including throwaway staging models nobody depends on, which buries the team in ceremony and trains everyone to ignore the process. Both come from the same mistake — treating the contract as a document to produce rather than a check that runs. If a violated contract doesn't turn something red, you don't have a contract. You have a wish.
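What "turning something red" means in practice is a non-zero exit code. The sketch below runs a contract's SQL quality checks and returns one. It uses SQLite as a stand-in for the warehouse, and the `run_gate` helper and the check dict layout are assumptions that loosely mirror the `quality` entries shown earlier.

```python
import sqlite3

QUALITY_CHECKS = [
    {
        "description": "No negative order amounts",
        "query": "SELECT count(*) FROM orders WHERE amount_usd < 0",
        "mustBe": 0,
    },
]


def run_gate(conn: sqlite3.Connection, checks: list) -> int:
    """Return a process exit code: 0 if every check passes, 1 otherwise."""
    failed = 0
    for check in checks:
        (actual,) = conn.execute(check["query"]).fetchone()
        if actual != check["mustBe"]:
            failed += 1
            print(f"FAIL {check['description']}: expected {check['mustBe']}, got {actual}")
    return 1 if failed else 0


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id TEXT, amount_usd REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [("o-1", 19.99), ("o-2", -5.0)])

exit_code = run_gate(conn, QUALITY_CHECKS)
print(exit_code)  # 1, because of the negative amount
# In a CI job the last line would be sys.exit(exit_code).
```

Everything else about a contract program is optional. This part is not: if the exit code never reaches a merge check or a deploy step, the YAML is decoration.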
Lessons learned and what actually works
What's held up across the rollouts I've been part of:
- Start at the highest-pain interface, not everywhere. Find the one or two tables/topics that, when they break, page multiple teams. Contract those first. The ROI is concentrated; chasing coverage is how the program stalls.
- Generate the first contract from reality. Don't hand-author it. Point the tooling at the existing dataset, import the schema, then add the quality rules and SLAs a human actually cares about. A contract that already matches production is one people trust.
- Enforce in the producer's pipeline. The check has to fail the producer's build or block their schema registration. A check that only runs downstream just relocates the discovery of the breakage; it doesn't prevent it.
- Version, and make breaking changes additive. A new required field, a removed column, a changed enum: each is a major version, which means a new `_v2` view or a registry version bump, plus a deprecation window. Never mutate in place.
- Put ownership in the contract and mean it. The `owner` field is only useful if that team is actually on the hook when the check goes red. Ownership without accountability is just a name.
- Keep one source of truth. Pick ODCS as the canonical file and generate dbt schemas / Avro / DDL from it. The moment you maintain the contract in three places by hand, two of them are already wrong.
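"Generate the first contract from reality" can be sketched as reading the catalog and emitting a draft. This example again uses SQLite as a stand-in. The `draft_contract` helper and its output shape are hypothetical, and real tooling would read the warehouse's own information schema instead.

```python
import sqlite3


def draft_contract(conn: sqlite3.Connection, table: str) -> dict:
    """Build a first-draft field list from the table's existing definition."""
    fields = {}
    rows = conn.execute(f"PRAGMA table_info({table})")
    for _cid, name, col_type, notnull, _default, pk in rows:
        fields[name] = {"type": col_type.lower(), "required": bool(notnull) or bool(pk)}
        if pk:
            fields[name]["primaryKey"] = True
    return {"model": table, "fields": fields}


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders ("
    "order_id TEXT PRIMARY KEY, account_type TEXT NOT NULL, "
    "amount_usd NUMERIC NOT NULL, note TEXT)"
)

draft = draft_contract(conn, "orders")
for field_name, spec in draft["fields"].items():
    print(field_name, spec)
```

The draft captures only what the catalog knows. Enum values, freshness, quality thresholds, and the owner are exactly the parts a human still has to add.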
What to carry away
A data contract is the API-contract discipline applied to data: a producer-owned, versioned, enforceable agreement about a data interface's schema, semantics, quality, and SLA. Use the open standard — datacontract.com's concepts, converging on ODCS under Bitol — so you're not inventing YAML. Enforce it where the data lives: contract: enforced and tests in dbt for the warehouse, the schema registry with compatibility and migration rules for streaming, a versioned SQL view as the served interface, and DataZone-style subscription for governed access. Keep it distinct from the legal Data Use Agreement, which governs whether you may use the data at all.
The single load-bearing idea: a contract is a check that runs, not a document that's written. Wire it into a gate that turns red, start with the interface that hurts most, generate it from reality, and make every breaking change additive. Do that and the entire genre of silent, expensive, multi-team data breakage mostly stops happening. For where contracts fit into the bigger picture, see data mesh, designing a data pipeline, and the producer-accountability theme in data strategy.