# Data Contracts in Practice: ODCS, dbt, Streaming, and the Producer Handshake

The incident I think of when someone says "data contract" started with a column. An upstream service team renamed `user_type` to `account_type` and changed two of its enum values, shipped it on a Tuesday, and had no idea anyone downstream cared. By Wednesday the marketing attribution model was silently mis-bucketing a third of signups, the finance dashboard had a category that no longer existed, and three teams spent two days bisecting dbt runs to find a change that took the producer thirty seconds to make. Nobody did anything wrong, exactly. There was just no agreement about what that table promised, so there was nothing to break — and therefore nothing to catch it.

That gap is what a data contract closes. Not a wiki page nobody reads, not a Confluence "data dictionary" that drifts from reality the week after it's written — an executable agreement that fails a build when it's violated. I've now put contracts on the interfaces that hurt most across a few platforms, and the pattern that works is narrower and less glamorous than the conference talks suggest. This is what a data contract actually is, the open standard worth using, how to enforce it in [dbt](dbt-internals) and on streaming sources, how it differs from the legal Data Use Agreement it gets confused with, and the ways it quietly fails.

## What is a data contract?

A data contract is an explicit, versioned agreement between the producer of a dataset and its consumers that specifies the data's schema, semantics, quality expectations, and service levels — and is enforced automatically, so a violation is caught at the source instead of discovered downstream. That's the whole idea in one sentence: move the promise to the producer, write it in a machine-readable file, and check it in CI.

The word "contract" is doing real work. A schema is a description of shape. A contract is a description of shape *plus an obligation*: the producer agrees not to break it without versioning and notice, and the consumer agrees to depend only on what's written down. The mental model I keep coming back to is the API contract. We long ago stopped letting backend teams change a JSON response field whenever they felt like it — there's an OpenAPI spec, a version, a deprecation path. Data contracts are that same discipline applied to the tables and topics that have, for years, been treated as an internal implementation detail you could change at will. The whole movement is sometimes called "shift left": push accountability for data quality upstream to the people who produce the data, instead of leaving it to the analysts who are furthest from the change and least able to prevent it.

**A data contract is a producer-owned, versioned, enforceable agreement about a data interface — schema, semantics, quality, and SLAs.** Schema says what columns exist. The contract says what they *mean*, how good they'll be, how fresh, and what happens when the producer wants to change them.

## The open standard: datacontract.com and ODCS

You don't have to invent the format. There's an open specification, and as of 2025 the ecosystem is converging on a single one. The [Data Contract Specification](https://datacontract.com/) (datacontract.com) popularized a clean YAML format, and it's now being folded into the **Open Data Contract Standard (ODCS)**, governed by the **Bitol** project under the Linux Foundation. The datacontract.com authors have said they'll deprecate their own spec in favor of ODCS so the industry doesn't end up maintaining two competing standards. If you're starting now, learn the concepts from datacontract.com's excellent tooling and target ODCS as the canonical format. Both describe the same handful of sections, though the key names differ in places; the table and example below use the datacontract.com names.

| Section | What it holds |
| --- | --- |
| `info` | Title, version, owner, status, contact — who's responsible and how to reach them. |
| `servers` | Where the data physically lives (S3, BigQuery, Snowflake, a Kafka topic) and how to connect. |
| `models` / schema | The logical structure: tables/objects, fields, types, descriptions, constraints. |
| `definitions` | Reusable field definitions so the same concept (an email, a country code) is identical everywhere. |
| `quality` | Validation rules — SQL checks, metrics, thresholds, or natural-language expectations. |
| `servicelevels` / SLA | Freshness, availability, retention, latency, frequency, support — the operational promise. |
| `terms` | Usage rights, limitations, billing, notice period before breaking changes. |

A trimmed contract reads like something a human can review in a pull request, which is the point:

```yaml
dataContractSpecification: 1.1.0
id: urn:datacontract:checkout:orders
info:
  title: Orders
  version: 2.1.0
  owner: checkout-team
  contact: { name: Checkout Data, email: checkout-data@acme.example }
servers:
  production:
    type: snowflake
    database: PROD
    schema: CHECKOUT
models:
  orders:
    type: table
    fields:
      order_id:     { type: string, primaryKey: true, required: true }
      account_type: { type: string, enum: [consumer, business], required: true }
      amount_usd:   { type: decimal, required: true, minimum: 0 }
      placed_at:    { type: timestamp, required: true }
    # Spec 1.1.0 attaches quality checks to a model (or a field), not the top level
    quality:
      - type: sql
        description: "No negative order amounts"
        query: "SELECT count(*) FROM orders WHERE amount_usd < 0"
        mustBe: 0
servicelevels:
  freshness: { threshold: 30m, description: "Loaded within 30 min of order" }
  retention: { period: P3Y }
```

The tooling is the reason to care. The `datacontract` CLI (and the equivalent ODCS tools) will **lint** the file, **test** a live dataset against it — actually run those quality queries and the schema checks against Snowflake or BigQuery and tell you if reality matches the promise — and **export** the contract into other artifacts: a dbt model schema, an Avro schema, SQL DDL, a JSON Schema. That export ability is what stops the contract from becoming yet another document that drifts. The contract becomes the source of truth, and the schema definitions you actually deploy are generated from it.
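
To make the export idea concrete, here is a minimal sketch of generating DDL from the `models` section. It is not how the `datacontract` CLI implements its exporters: `TYPE_MAP`, `MODELS`, and `to_ddl` are hypothetical names, the type mapping is an assumption, and the contract is inlined as a Python dict rather than parsed from YAML so the example has no dependencies.

```python
# Hypothetical mapping from contract types to warehouse types (an assumption,
# not the CLI's own mapping).
TYPE_MAP = {"string": "VARCHAR", "decimal": "NUMERIC", "timestamp": "TIMESTAMP"}

# The `models` section of the example contract, inlined as a dict.
MODELS = {
    "orders": {
        "fields": {
            "order_id": {"type": "string", "primaryKey": True, "required": True},
            "account_type": {"type": "string", "required": True},
            "amount_usd": {"type": "decimal", "required": True},
            "placed_at": {"type": "timestamp", "required": True},
        }
    }
}


def to_ddl(table: str, model: dict) -> str:
    """Render one contract model as a CREATE TABLE statement."""
    columns = []
    for name, spec in model["fields"].items():
        parts = [name, TYPE_MAP[spec["type"]]]
        if spec.get("required"):
            parts.append("NOT NULL")
        if spec.get("primaryKey"):
            parts.append("PRIMARY KEY")
        columns.append("  " + " ".join(parts))
    return f"CREATE TABLE {table} (\n" + ",\n".join(columns) + "\n);"


print(to_ddl("orders", MODELS["orders"]))
```

The design point is the direction of the arrow: the DDL is a build output of the contract, so nobody edits it by hand and it cannot drift.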

```mermaid
graph TD
    PROD["Producer team<br/>owns the contract.yaml"]
    CONTRACT["Data contract<br/>(schema + quality + SLA, versioned)"]
    CI["CI gate<br/>(lint + test vs live data)"]
    GEN["Generated artifacts<br/>(dbt schema, Avro, SQL DDL)"]
    C1["Consumer: BI / dbt marts"]
    C2["Consumer: ML features"]
    C3["Consumer: reverse-ETL"]
    PROD --> CONTRACT
    CONTRACT --> CI
    CONTRACT --> GEN
    CI -->|"blocks merge on breach"| PROD
    GEN --> C1
    GEN --> C2
    GEN --> C3
```

The contract sits at the producer, not between teams as a document. CI tests it against live data and blocks a breaking change before it ships; downstream schemas are generated from the same file, so consumers depend on a promise that's actually checked rather than on whatever the table happens to contain today.
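
The CI gate in that diagram can be sketched in a few lines. This is an illustration under stated assumptions, not the behavior of any particular tool: `FIELDS`, `violations`, and `gate` are hypothetical names, the rules are copied by hand from the example contract, and the rows would in practice be sampled from the live table.

```python
# Field rules copied from the example contract; the dict layout is an assumption.
FIELDS = {
    "order_id": {"required": True},
    "account_type": {"required": True, "enum": ["consumer", "business"]},
    "amount_usd": {"required": True, "minimum": 0},
}


def violations(rows: list[dict]) -> list[str]:
    """Return one message per broken promise; an empty list means the contract holds."""
    problems = []
    for i, row in enumerate(rows):
        for name, rule in FIELDS.items():
            value = row.get(name)
            if value is None:
                if rule.get("required"):
                    problems.append(f"row {i}: {name} is missing")
                continue
            if "enum" in rule and value not in rule["enum"]:
                problems.append(f"row {i}: {name}={value!r} not in {rule['enum']}")
            if "minimum" in rule and value < rule["minimum"]:
                problems.append(f"row {i}: {name}={value} below {rule['minimum']}")
    return problems


def gate(rows: list[dict]) -> int:
    """Exit code for CI: 0 passes, 1 blocks the merge."""
    found = violations(rows)
    for message in found:
        print("CONTRACT BREACH:", message)
    return 1 if found else 0
```

A CI step would end with something like `sys.exit(gate(rows))`. A row that still carries `user_type` instead of `account_type` fails as "account_type is missing", which is the Tuesday rename caught in the producer's own pipeline.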

## Enforcing it in dbt: the contract is in the build

For the warehouse-shaped half of the world, the enforcement point is [dbt](dbt-internals), and dbt has had first-class **model contracts** since v1.5. You set `contract: {enforced: true}` in a model's config and list the columns with their types and constraints; dbt then refuses to build the model if the columns its SQL returns don't match the declared names and types. It catches the rename, the type change, and the dropped column at build time, in the producer's own PR, which is exactly where you want the failure.

```yaml
models:
  - name: dim_orders
    config:
      contract: { enforced: true }
    columns:
      - name: order_id
        data_type: varchar
        constraints: [{ type: not_null }, { type: primary_key }]
      - name: account_type
        data_type: varchar
        tests:
          - accepted_values: { values: ['consumer', 'business'] }
      - name: amount_usd
        data_type: numeric
```

The split worth understanding: dbt's *contract* enforces structure (column names and data types) and fails before the model is materialized; dbt *tests* (`not_null`, `accepted_values`, `relationships`, plus packages like dbt-expectations) enforce data quality and fail after the model is built, once there are rows to check. Together they cover most of what the ODCS `models` and `quality` sections describe. The clean pattern is to keep the ODCS contract as the human-readable, cross-team source of truth and generate (or reconcile) the dbt YAML from it, so the analytics-engineering layer and the formal contract can't silently disagree. If you only adopt one thing from this whole article, make it this: turn on `contract: enforced` for the handful of models other teams depend on. It's an afternoon of work and it ends an entire category of 3am pages.
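
If you reconcile rather than generate, the check is a small diff between the two files. The sketch below assumes both have already been parsed into plain dicts and lists; `EXPECTED_DBT_TYPE` and `reconcile` are hypothetical names, and the type mapping is an assumption for a Snowflake-style warehouse.

```python
# Contract field type -> the dbt data_type it should be declared as (assumed mapping).
EXPECTED_DBT_TYPE = {"string": "varchar", "decimal": "numeric", "timestamp": "timestamp"}


def reconcile(contract_fields: dict, dbt_columns: list[dict]) -> list[str]:
    """List every disagreement between the contract and a dbt model's columns."""
    dbt_types = {col["name"]: col["data_type"] for col in dbt_columns}
    issues = []
    for name, spec in contract_fields.items():
        expected = EXPECTED_DBT_TYPE[spec["type"]]
        if name not in dbt_types:
            issues.append(f"{name}: in contract, missing from dbt model")
        elif dbt_types[name] != expected:
            issues.append(f"{name}: contract wants {expected}, dbt declares {dbt_types[name]}")
    # Columns dbt declares that the contract never promised.
    for name in sorted(dbt_types.keys() - contract_fields.keys()):
        issues.append(f"{name}: in dbt model, not in contract")
    return issues
```

Run in CI, a non-empty result fails the build, which keeps the two definitions honest without forcing full code generation on day one.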

## Streaming contracts: the schema registry was the first data contract

The streaming world solved a version of this years before "data contract" was a phrase, and it's worth seeing the lineage. In a [Kafka](kafka-internals) system, producers and consumers are decoupled by design — they don't know about each other — which is exactly why an unmanaged schema change is so dangerous: there's no compile step that spans them. The [Schema Registry](schema-registry-avro-protobuf) (with Avro, Protobuf, or JSON Schema) is the contract enforcement point. As Confluent frames it in their [data contract pattern](https://developer.confluent.io/patterns/event/data-contract/), the registry lets applications "share events and understand how to process them without the sending and receiving application knowing any details about each other."

The mechanism that makes it a contract rather than just a schema is **compatibility checking**. When a producer registers a new schema version, the registry validates it against a configured rule and rejects an incompatible change:

| Compatibility mode | Allowed change | Who upgrades first |
| --- | --- | --- |
| `BACKWARD` (default) | Delete fields, add optional fields | Consumers first |
| `FORWARD` | Add fields, delete optional fields | Producers first |
| `FULL` | Add/remove only optional fields | Either order |
| `NONE` | Anything (no checking) | — |
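
The `BACKWARD` row is the one most teams live with, so here is a deliberately simplified sketch of what that check decides. It is not the registry's implementation: schemas are reduced to flat dicts of field name to spec, and it ignores Avro type promotion, aliases, unions, and nested records. `backward_violations` is a hypothetical name.

```python
def backward_violations(old: dict, new: dict) -> list[str]:
    """Reasons a reader on `new` could not decode records written with `old`."""
    problems = []
    for name, spec in new.items():
        if name not in old:
            # A field the old writer never sent needs a default to fall back on.
            if "default" not in spec:
                problems.append(f"added field {name!r} has no default")
        elif spec["type"] != old[name]["type"]:
            problems.append(
                f"field {name!r} changed type {old[name]['type']} -> {spec['type']}"
            )
    # Fields dropped from `new` are fine: the reader simply ignores them.
    return problems


v1 = {"order_id": {"type": "string"}, "user_type": {"type": "string"}}
renamed = {"order_id": {"type": "string"}, "account_type": {"type": "string"}}
additive = {**v1, "channel": {"type": "string", "default": "web"}}

print(backward_violations(v1, renamed))
print(backward_violations(v1, additive))
```

Under these rules a rename looks like a delete plus a required add, so the `user_type` to `account_type` change is rejected at registration, while adding an optional `channel` field passes.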

Modern Confluent data contracts go further than schema-plus-compatibility: they attach **metadata** (ownership, tags, sensitivity), **domain rules** (validation expressions like CEL that run on each message), and **migration rules** that transform messages between incompatible major versions so a producer can make a genuinely breaking change behind a version bump without orphaning consumers. The shape is identical to the warehouse case — schema, semantics, quality rules, evolution policy, ownership — just enforced at write/produce time against a stream instead of at build time against a table. Same handshake, different clock speed.
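
A migration rule is easier to picture as a pair of functions. In Confluent's implementation the rules are declared alongside the schema rather than written as application code, so treat the Python below purely as an illustration of the idea. The function names are hypothetical, and the old enum values (`individual`, `company`) are invented, since the opening incident doesn't say what they were.

```python
# Hypothetical v1 -> v2 enum mapping; the old values are invented for illustration.
ENUM_UPGRADE = {"individual": "consumer", "company": "business"}


def upgrade_v1_to_v2(message: dict) -> dict:
    """Rewrite a v1 message so a v2 consumer can read it."""
    upgraded = {k: v for k, v in message.items() if k != "user_type"}
    upgraded["account_type"] = ENUM_UPGRADE[message["user_type"]]
    return upgraded


def downgrade_v2_to_v1(message: dict) -> dict:
    """Inverse rule, so v1 consumers keep working after the producer moves to v2."""
    enum_downgrade = {new: old for old, new in ENUM_UPGRADE.items()}
    downgraded = {k: v for k, v in message.items() if k != "account_type"}
    downgraded["user_type"] = enum_downgrade[message["account_type"]]
    return downgraded
```

Having both directions is what lets producers and consumers upgrade on their own schedules: each side reads messages in the version it understands, whichever version was written.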

## "Data delivery by SQL": the contract as the served interface

Here's a pattern I lean on that ties contracts to how consumers actually read data. The contract shouldn't just validate a physical table — it should define the *interface*, and the cleanest interface in a warehouse is a SQL view. You expose a stable, contracted view (`analytics.orders_v2`) whose columns, types, and semantics match the contract exactly; the messy physical tables and the refactors live behind it. Consumers query the view. The contract's quality checks are literally SQL that runs against it.

This buys two things. First, the producer can refactor storage — repartition, rename internal columns, change the load mechanism — without consumers noticing, because the contract is the view, not the table. Second, the version lives in the name: `orders_v2` coexists with `orders_v1` during a migration, so a breaking change becomes an additive one plus a deprecation window, never a rug-pull. "Data delivery by SQL" is just the recognition that for analytical consumers, the SQL surface *is* the contract, and a view is the most natural place to make the promise enforceable and the change non-breaking.
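
A small runnable sketch of the pattern, with SQLite standing in for the warehouse. The physical table `raw_orders` and its column names are hypothetical; only the view names and contracted columns follow the text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Physical table: free to be refactored, never queried by consumers.
    CREATE TABLE raw_orders (id TEXT, acct_kind TEXT, amount_cents INTEGER);
    INSERT INTO raw_orders VALUES ('o1', 'business', 1250);

    -- v1 keeps the old column name alive during the deprecation window.
    CREATE VIEW orders_v1 AS
        SELECT id AS order_id, acct_kind AS user_type FROM raw_orders;

    -- v2 is the contracted interface: names and units match the contract.
    CREATE VIEW orders_v2 AS
        SELECT id AS order_id,
               acct_kind AS account_type,
               amount_cents / 100.0 AS amount_usd
        FROM raw_orders;
    """
)

v1_row = conn.execute("SELECT order_id, user_type FROM orders_v1").fetchone()
v2_row = conn.execute(
    "SELECT order_id, account_type, amount_usd FROM orders_v2"
).fetchone()

# The contract's quality check runs against the view, not the physical table.
negative = conn.execute(
    "SELECT count(*) FROM orders_v2 WHERE amount_usd < 0"
).fetchone()[0]
```

Both views read the same physical rows, so the producer can later rename `acct_kind` or change the storage units and only the view definitions need to change.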

## Where it lives on AWS: DataZone and the schema registry

On AWS the contract concept shows up in two layers. For streaming and event data, the **AWS Glue Schema Registry** plays the same role as Confluent's — registered Avro/Protobuf/JSON schemas with compatibility enforcement on the producer. For the discovery-and-access layer, **Amazon DataZone** (now folded into SageMaker as the next-gen catalog) implements the *governance* half of a contract: data is published as assets with schemas and metadata into a business catalog, and consumers **subscribe** — a request the producing domain explicitly approves, which grants access and records the agreement.

That subscription-and-approval flow is the contract's terms-of-use and ownership made operational: there's a named owner, an explicit grant, and an auditable record of who agreed to consume what. It pairs naturally with the schema/quality contract — DataZone answers "who may use this and under what terms," while the ODCS file and the dbt/registry enforcement answer "what is its shape and quality." This is the same division you see in any mature platform, including the ones built on [Unity Catalog](unity-catalog); contracts are most powerful as the producer-owned interface in a [data mesh](data-mesh), where domains publish data as a product and the contract is the product's published surface.

## Data contracts vs Data Use Agreements: don't conflate them

This trips people up, especially in healthcare and research, so it's worth a clean separation. A **data contract** is technical: schema, quality, SLA — a thing CI checks. A **Data Use Agreement (DUA)** is legal: a binding agreement between organizations governing *how shared data may be used* — for what purposes, by whom, with what privacy and security obligations, for how long. A DUA is signed by lawyers; a data contract is merged by engineers. They answer different questions and you generally need both.

They do meet at one interesting point: making the legal terms *computable*. In genomics and health data, the [GA4GH Data Use Ontology (DUO)](ga4gh-genomic-data-sharing) encodes permitted-use terms as machine-readable tags, so a DUA's restrictions ("research use only," "no commercial use," "disease-specific") can be matched automatically against a researcher's authorization instead of read off a PDF. That's the same instinct as a data contract — take an agreement humans used to enforce by hand and make a machine enforce it — applied to the legal layer. Keep them distinct in your head: the data contract governs the data's shape and quality; the DUA governs your right to use it at all.
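
The matching step can be sketched as follows. This is a toy model, not DUO itself: it uses plain booleans and strings where DUO uses ontology terms with a hierarchy, and every class, field, and function name here is hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DatasetTerms:
    """Restrictions a DUA places on a dataset (simplified, not real DUO terms)."""
    research_only: bool = False
    no_commercial_use: bool = False
    disease: Optional[str] = None  # set when use is limited to one disease area


@dataclass(frozen=True)
class AccessRequest:
    """What the researcher is authorized to do."""
    purpose: str
    commercial: bool
    disease: Optional[str] = None


def permitted(terms: DatasetTerms, request: AccessRequest) -> bool:
    """True only if the request violates none of the dataset's restrictions."""
    if terms.research_only and request.purpose != "research":
        return False
    if terms.no_commercial_use and request.commercial:
        return False
    if terms.disease is not None and request.disease != terms.disease:
        return False
    return True
```

The point of the sketch is the shape of the decision: once restrictions are data rather than prose, granting access becomes a function call instead of someone reading a PDF.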

|  | Data contract | Data Use Agreement (DUA) |
| --- | --- | --- |
| Governs | Shape, semantics, quality, SLA | Permitted use, purpose, privacy obligations |
| Enforced by | CI / schema registry / dbt build | Law, contracts, access controls, audit |
| Changed by | A pull request | Legal/compliance, re-signature |
| Owner | Producing engineering team | Legal + data governance + DPO |

## The challenges, and how the contract answers them

Contracts earn their keep against specific, recurring failures — not as a generic "quality" gesture. The honest mapping:

- **Silent schema drift.** The Tuesday rename that broke three teams' Wednesday. The contract's enforcement turns it into a failed build in the producer's PR, before it ships.

- **Ambiguous ownership.** "Who owns this table?" with no answer is why nobody fixes the breakage. The `owner`/`contact` fields make ownership a required, visible part of the artifact.

- **Semantic ambiguity.** Two teams reading `status` differently. Field descriptions and shared `definitions` pin the meaning, not just the type.

- **Unmanaged evolution.** Breaking changes shipped with no warning. Versioning plus compatibility rules (registry) or coexisting `_v2` views (warehouse) make change additive and give consumers a deprecation window.

- **Quality discovered downstream.** The null rate found in a dashboard a week late. The `quality` checks run at the source and fail loudly there.

**The failure mode is contract theatre: a folder of beautiful YAML that nothing enforces.** A data contract that isn't wired into a CI gate, a registry compatibility check, or a build that actually fails is just documentation with extra syntax — and it will drift from reality within a sprint, then mislead people who trust it. The other failure is over-reach: putting a heavyweight contract on every table, including throwaway staging models nobody depends on, which buries the team in ceremony and trains everyone to ignore the process. Both come from the same mistake — treating the contract as a document to produce rather than a check that runs. If a violated contract doesn't turn something red, you don't have a contract. You have a wish.

## Lessons learned and what actually works

What's held up across the rollouts I've been part of:

- **Start at the highest-pain interface, not everywhere.** Find the one or two tables/topics that, when they break, page multiple teams. Contract those first. The ROI is concentrated; chasing coverage is how the program stalls.

- **Generate the first contract from reality.** Don't hand-author it — point the tooling at the existing dataset, import the schema, then add the quality rules and SLAs a human actually cares about. A contract that already matches production is one people trust.

- **Enforce in the producer's pipeline.** The check has to fail the producer's build or block their schema registration. A check that only runs downstream just relocates the discovery of the breakage; it doesn't prevent it.

- **Version, and make breaking changes additive.** New required field, removed column, changed enum — that's a major version, a new `_v2` view or a registry version bump, and a deprecation window. Never mutate in place.

- **Put ownership in the contract and mean it.** The `owner` field is only useful if that team is actually on the hook when the check goes red. Ownership without accountability is just a name.

- **Keep one source of truth.** Pick ODCS as the canonical file and generate dbt schemas / Avro / DDL from it. The moment you maintain the contract in three places by hand, two of them are already wrong.
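
"Generate the first contract from reality" can be as simple as the sketch below. Real tooling would import from the warehouse's own metadata; this version infers a draft `fields` section from sample rows so it stays self-contained. `CONTRACT_TYPE` and `infer_fields` are hypothetical names, and the type mapping is an assumption.

```python
from datetime import datetime
from decimal import Decimal

# Python type -> contract type; an assumed mapping for illustration.
CONTRACT_TYPE = {
    str: "string",
    bool: "boolean",
    int: "integer",
    float: "double",
    Decimal: "decimal",
    datetime: "timestamp",
}


def infer_fields(rows: list[dict]) -> dict:
    """Draft the `fields` section from sample rows; a human still adds meaning."""
    fields: dict = {}
    for row in rows:
        for name, value in row.items():
            spec = fields.setdefault(name, {"type": None, "required": True})
            if value is None:
                spec["required"] = False  # a null was observed
            elif spec["type"] is None:
                spec["type"] = CONTRACT_TYPE[type(value)]
    # A field absent from some rows cannot be promised as required either.
    for name, spec in fields.items():
        if any(name not in row for row in rows):
            spec["required"] = False
    return fields
```

The output is a starting point, not a contract: it describes what the data looks like today, and the producer still has to decide which of those observations they are willing to promise.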

## What to carry away

A data contract is the API-contract discipline applied to data: a producer-owned, versioned, *enforceable* agreement about a data interface's schema, semantics, quality, and SLA. Use the open standard — datacontract.com's concepts, converging on **ODCS** under Bitol — so you're not inventing YAML. Enforce it where the data lives: `contract: enforced` and tests in [dbt](dbt-internals) for the warehouse, the [schema registry](schema-registry-avro-protobuf) with compatibility and migration rules for [streaming](kafka-internals), a versioned SQL view as the served interface, and DataZone-style subscription for governed access. Keep it distinct from the legal **Data Use Agreement**, which governs whether you may use the data at all.

The single load-bearing idea: a contract is a check that runs, not a document that's written. Wire it into a gate that turns red, start with the interface that hurts most, generate it from reality, and make every breaking change additive. Do that and the entire genre of silent, expensive, multi-team data breakage mostly stops happening. For where contracts fit the bigger picture, see [data mesh](data-mesh), [designing a data pipeline](designing-a-data-pipeline), and the producer-accountability theme in [data strategy](data-strategy).
