Testing Data Pipelines: Unit, Integration, and Contract Tests for ETL

"How do you test your pipelines?" gets the same answer from most data teams I've worked with: "we have alerts." That's not testing — that's finding out in production, after the bad data already shipped to a dashboard or a model. Software engineering settled this decades ago with the testing pyramid: cheap, fast unit tests at the base, fewer integration tests above that, and a thin layer of end-to-end checks at the top. Data engineering is finally catching up, and the tooling — dbt's own test framework, Great Expectations, Soda — has matured enough that there's no longer a good excuse for "we'll catch it in production" being the actual testing strategy.

The confusion I run into most is teams conflating two genuinely different things under the word "test": validating that your transformation logic is correct, and validating that your data is correct. They need different tools, run at different times, and catch different bugs. Mixing them up is why a lot of "we have dbt tests" setups still ship broken models.

What's the difference between a unit test and a data test in dbt?

A unit test validates your SQL transformation logic against static, predefined inputs — you hand-craft a few rows of fake input, run the model's logic against them, and assert the output matches what you expect. It runs in CI, before deploy, against no real data at all. A data test (what dbt originally just called "tests," before unit tests were added as a distinct first-class concept) validates the actual data in your warehouse after a model has run — uniqueness, not-null, accepted values, referential integrity, or a custom SQL assertion. Same word, two different jobs: unit tests catch "I wrote the join wrong," data tests catch "the upstream system sent me garbage today."

Unit testData test
ValidatesTransformation logic (the SQL/code itself)Actual data after a model runs
InputsStatic, hand-crafted fixturesReal production or staging data
RunsIn CI, on every PR, before deployOn every scheduled pipeline run, in production
CatchesLogic bugs, regressions from a refactorUpstream data quality drift, contract violations
SpeedSeconds — no warehouse computeDepends on data volume and warehouse

The mistake I see most often is teams writing data tests and believing they've covered logic correctness, because data tests run constantly and feel like coverage. They're not the same thing. A data test that checks "order_total is never null" will happily pass on a model where the discount calculation is silently wrong, as long as the column isn't null. You need a unit test with a known input (a $100 order, 10% discount, expect $90) to catch that — and it has to run in CI, before the bad logic ever touches a real warehouse.

# dbt unit test: static fixtures in, asserted output out — no warehouse data involved
unit_tests:
  - name: test_discount_applied_correctly
    model: fct_orders
    given:
      - input: ref('stg_orders')
        rows:
          - {order_id: 1, subtotal: 100.00, discount_pct: 0.10}
    expect:
      rows:
        - {order_id: 1, subtotal: 100.00, order_total: 90.00}

Where do Great Expectations and Soda fit if dbt already has tests?

dbt's built-in tests are good for assertions tied tightly to a model's schema — uniqueness, referential integrity, accepted values defined in the same YAML as the model. They get awkward once you need richer statistical checks (distribution shifts, anomaly thresholds, cross-source consistency) or you need quality checks on data that doesn't flow through dbt at all — a raw ingestion landing zone, an API response, a file drop. That's the gap Great Expectations and Soda fill, and they take different approaches worth knowing before you pick one.

  • Great Expectations is validation-as-code: you write "expectations" in Python (or YAML for some workflows) — expect_column_values_to_be_between, expect_column_mean_to_be_between — store them in version control alongside the rest of your code, and run them as part of CI or a pipeline step. It's the heavier, more programmable option, and it shines when checks need real logic or you want validation results to flow into a structured "data docs" artifact.
  • Soda uses SodaCL, a YAML-based check language designed to read more like a checklist than code, which lowers the bar for analysts and less Python-fluent team members to write and own checks. soda-dbt and dbt-expectations (a package that ports Great Expectations-style checks into dbt's own test syntax) mean you don't have to pick exactly one — plenty of teams run dbt tests for schema-level checks and Soda or Great Expectations for the richer statistical and cross-system checks.
graph TD
    PR["Pull request"] --> CI["CI pipeline"]
    CI --> UNIT["Unit tests
(static fixtures, seconds)"] UNIT --> INT["Integration tests
(ephemeral warehouse)"] INT --> MERGE["Merge + deploy"] MERGE --> RUN["Scheduled pipeline run
(production data)"] RUN --> DATA["Data tests + GE/Soda checks
(actual data quality)"] DATA --> OBS["Observability
(freshness, volume, anomaly)"]

The testing pyramid for data pipelines, left to right in time rather than stacked by volume. Unit tests run first and fastest, against fixtures, catching logic bugs before they ever touch a warehouse. Integration tests run against a real but disposable warehouse. Only after deploy do data tests and quality checks run against production data — and observability is the last line, watching for the drift no test anticipated.

How do you run integration tests without polluting production?

The honest answer most teams arrive at after trying a few approaches: don't test against a shared dev schema that drifts out of sync with production, and don't test against production itself. Two patterns actually work in practice. Ephemeral warehouse compute — Snowflake zero-copy clones are the cleanest version of this: clone the production schema (instant, no storage cost until data diverges), run your pipeline and assertions against the clone, then drop it. You get real data shapes and volumes without any risk to production, at the cost of needing warehouse credits for the clone's compute. Containerized test databases — spin up a DuckDB or Postgres instance in a test container, seed it with representative fixture data, run the pipeline against it in CI. This is faster and free, but only as good as your fixtures; it won't catch issues that only show up at production data volume or with production data's actual messiness.

# Snowflake: integration-test a pipeline against a same-second clone of prod,
# then discard it — no risk to the real schema, real data shapes and skew
snowsql -q "CREATE OR REPLACE DATABASE ci_test_clone CLONE prod_db;"
dbt run --target ci_clone --select tag:nightly_pipeline
dbt test --target ci_clone
snowsql -q "DROP DATABASE ci_test_clone;"

I lean toward zero-copy clones for anything where data skew or volume genuinely matters to the logic being tested (a window function, a dedup strategy, a join that fans out unpredictably), and containerized test databases for everything else, because the CI feedback loop is faster and it doesn't depend on warehouse availability.

What should you leave to production observability instead of testing?

This is the line teams get wrong most often, in both directions. Data observability tools catch things that are fundamentally unknowable at deploy time — an upstream API silently changing its response shape next Tuesday, a source system having an outage that delays a feed by six hours, a slow distributional drift that only becomes visible over weeks. You cannot write a pre-deploy test for "the vendor will change their API in three months." That's what continuous monitoring for freshness, volume, and schema drift is for, running against live data, all the time, with no human writing a new assertion for each possible failure.

What you can and should test before deploy: anything determined by your own code. Logic bugs, regressions from a refactor, edge cases in your transformation (nulls, empty strings, duplicate keys, boundary dates) — these are deterministic, knowable in advance, and cheap to catch with a unit test instead of expensive to catch after they've corrupted a day of production data. The rule of thumb I use: if the bug originates in code you control, it's a testing problem and belongs in CI; if it originates in data you don't control, it's an observability problem and belongs in production monitoring. Trying to write enough data tests to substitute for observability produces a wall of brittle, constantly-failing checks that people learn to ignore — which is worse than not testing at all.

A test suite that never fails is a test suite nobody trusts, and a test suite that fails on every PR for things that don't matter gets routed around just as fast. I've seen teams write hundreds of dbt data tests, get a sea of red on every run because of expected seasonal variance or known upstream noise, and watch engineers start merging with failing tests because "that one always fails." Calibrate thresholds to real tolerances, not zero-tolerance defaults, and delete or fix a flaky test the first time someone overrides it rather than the fifth.

What does this look like in CI, end to end?

A pipeline repo with a real testing discipline runs roughly this sequence on every pull request: unit tests first (seconds, no infrastructure), then a build against an ephemeral warehouse clone or container, then data tests and any Great Expectations/Soda checks against that ephemeral environment — never against production. Only after all three pass does the change merge and deploy. DORA metrics for data platforms — lead time for changes, change failure rate — are directly a function of how much of this is automated versus manually verified before a deploy; teams with thin or absent pre-deploy testing compensate with slow, careful manual review, which shows up as a worse lead-time number even when the eventual code quality is fine.

What to carry away

Unit tests and data tests answer different questions and need different tools: unit tests validate transformation logic against static fixtures in CI, before any real data is touched; data tests (plus Great Expectations or Soda for richer statistical and cross-system checks) validate actual data after a model runs. Don't let one substitute for the other — a data test that never goes null doesn't prove your logic is correct, and a unit test can't catch an upstream API that changes shape next month.

Run integration tests against ephemeral infrastructure — Snowflake zero-copy clones for real data shape and skew, containerized test databases for speed — never against a shared dev schema or production. And draw the line clearly: anything determined by your own code is a pre-deploy testing problem; anything determined by systems you don't control is a production observability problem. Teams that blur that line end up either with brittle test suites nobody trusts or with no tests at all and "we have alerts" as the whole strategy — and alerts firing after bad data has already shipped is not testing, it's a faster postmortem.