Data Observability: Freshness, Volume, and Catching Silent Breakage

I once spent a Friday afternoon explaining to a VP why the revenue dashboard had been quietly understating a region for eleven days. No job had failed. No alert had fired. An upstream team had changed how they stamped a currency field, our pipeline kept running green, and the number drifted just slowly enough that nobody eyeballed it. That eleven-day gap between when the data broke and when someone noticed has a name now: data downtime. Reducing it is the entire point of data observability, the discipline of knowing your data is wrong before your stakeholders do.

If you've read my piece on data pipeline on-call, this is the systematic version of the lesson buried in it: a pipeline that "succeeded" tells you nothing about whether the data is fresh, complete, and correct. Observability is how you close that gap deliberately, at scale, across hundreds of tables — not by manually writing one check at a time.

The five pillars

Data observability is usually framed around five signals — the dimensions along which data silently goes wrong. The value of the framework is that it's a checklist for the ways trust erodes, most of which never trip a job-failure alert.

graph TD
    TABLE[("A production table")]
    F["Freshness<br/>Is it up to date?"]
    V["Volume<br/>Did row count change abnormally?"]
    S["Schema<br/>Did columns/types change?"]
    D["Distribution<br/>Are the values still sane?"]
    L["Lineage<br/>What's upstream/downstream?"]
    TABLE --> F
    TABLE --> V
    TABLE --> S
    TABLE --> D
    TABLE --> L
    L -.->|"when something breaks,<br/>lineage shows blast radius"| IMPACT["What else is affected?"]

The five pillars. Freshness, volume, schema, and distribution are the detection signals — the four ways a table silently goes wrong without any job failing. Lineage is the diagnosis signal: once something is off, it tells you what upstream caused it and what downstream is now contaminated (the blast radius). The first four answer "is something wrong?"; lineage answers "why, and what else do I need to fix?"

  • Freshness — is the data as recent as it should be? The most common and most user-visible failure: a table that stopped updating while everything looked fine.
  • Volume — did the row count move in an abnormal way? Half the expected rows (a partial load) or triple (a duplicate load) both signal breakage no schema check sees.
  • Schema — did columns get added, dropped, renamed, or retyped? The classic silent corruptor, usually an upstream change nobody announced.
  • Distribution — are the values still in their normal range? Null rate jumped from 1% to 40%, a category disappeared, amounts went negative. The data is "there" and structurally valid but semantically wrong — the hardest failure to catch and often the most damaging.
  • Lineage — the map of what feeds what. Not a detector but the thing that turns a single alert into a diagnosis and a blast-radius assessment.
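To make the schema and distribution signals concrete, here is a minimal sketch in Python. It assumes you can snapshot a table's columns as a {column: type} mapping and pull a column's values into a list. The function names and the sample snapshots are hypothetical illustrations, not any platform's API.

```python
from typing import Dict, List, Optional


def schema_diff(previous: Dict[str, str], current: Dict[str, str]) -> Dict[str, List[str]]:
    """Compare two {column: type} snapshots of the same table."""
    shared = set(previous) & set(current)
    return {
        "added": sorted(set(current) - set(previous)),
        "dropped": sorted(set(previous) - set(current)),
        "retyped": sorted(c for c in shared if previous[c] != current[c]),
    }


def null_rate(values: List[Optional[object]]) -> float:
    """Fraction of values that are None; 0.0 for an empty column."""
    if not values:
        return 0.0
    return sum(v is None for v in values) / len(values)


# Hypothetical snapshots taken on consecutive days.
yesterday = {"order_id": "bigint", "amount": "numeric", "currency": "char(3)"}
today = {"order_id": "bigint", "amount": "numeric", "currency": "text", "region": "text"}

diff = schema_diff(yesterday, today)
# diff == {"added": ["region"], "dropped": [], "retyped": ["currency"]}
```

A rename shows up here as one dropped column plus one added column, which is usually enough to prompt a look. The null rate only becomes a distribution signal once you compare it to the same column's history, not to a fixed number.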

Tests vs. anomaly detection: the key distinction

Here's the insight that separates real observability from "we have some dbt tests." There are two fundamentally different ways to catch bad data, and you need both because they catch different things.

Tests are explicit assertions you write: not_null, unique, accepted ranges, referential checks (the kind dbt made routine). They're precise, they document intent, and they're the right tool for known invariants and business rules. But they share one fatal limitation: a test only catches a failure you thought to predict. You can't write a test for the breakage you didn't imagine, and those are the eleven-day ones.
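As a rough illustration of what an explicit assertion amounts to, here is a sketch in plain Python over rows held as dictionaries. This is not how dbt implements its tests. The helper names, the columns, and the 0 to 1,000,000 amount range are assumptions made up for the example.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def check_not_null(rows: List[Row], column: str) -> List[str]:
    bad = sum(1 for r in rows if r.get(column) is None)
    return [f"{column}: {bad} null value(s)"] if bad else []


def check_unique(rows: List[Row], column: str) -> List[str]:
    # Nulls are skipped here; they are the not-null check's job.
    seen, dupes = set(), set()
    for r in rows:
        value = r.get(column)
        if value is None:
            continue
        if value in seen:
            dupes.add(value)
        seen.add(value)
    return [f"{column}: {len(dupes)} duplicated value(s)"] if dupes else []


def check_range(rows: List[Row], column: str, low: float, high: float) -> List[str]:
    bad = sum(
        1 for r in rows
        if r.get(column) is not None and not (low <= r[column] <= high)
    )
    return [f"{column}: {bad} value(s) outside [{low}, {high}]"] if bad else []


def run_checks(rows: List[Row]) -> List[str]:
    """Run every known invariant and return the list of failure messages."""
    failures: List[str] = []
    failures += check_not_null(rows, "order_id")
    failures += check_unique(rows, "order_id")
    failures += check_range(rows, "amount", 0, 1_000_000)
    return failures


rows = [
    {"order_id": 1, "amount": 10.0},
    {"order_id": 1, "amount": -5.0},     # duplicate id, negative amount
    {"order_id": None, "amount": 20.0},  # missing id
]
failures = run_checks(rows)  # three messages, one per broken rule
```

Every rule in run_checks is something a person decided to write down. A changed currency stamp passes all three.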

Anomaly detection learns what "normal" looks like from a table's history and alerts when a metric deviates — freshness later than usual, volume outside its typical band for this hour-of-week, null rates spiking beyond their historical variance. It catches the unknown unknowns: the failures nobody anticipated, automatically, across thousands of tables you'd never hand-write checks for. This is the core of what platforms like Monte Carlo (and the broader category) do — profile metadata and metrics over time, model the expected pattern, and surface statistically significant deviations.

|                 | Tests / assertions                        | Anomaly detection                        |
|-----------------|-------------------------------------------|------------------------------------------|
| Catches         | Failures you predicted (known unknowns)   | Failures you didn't (unknown unknowns)   |
| You define      | The exact rule                            | The metric to watch; it learns "normal"  |
| Best for        | Business rules, hard invariants           | Broad coverage across many tables, drift |
| Failure mode    | Blind to the unexpected                   | False positives → alert fatigue          |
| Effort to scale | Linear: one check at a time               | Automatic once pointed at the warehouse  |

You write tests for the rules you know matter, and you let anomaly detection blanket everything for the breakage you can't enumerate. Neither alone is enough — tests miss the unimaginable, and anomaly detection without explicit business rules misses violations that look statistically normal.

The detectors themselves are not exotic; a freshness/volume check is a few lines of SQL over metadata, which is what the platforms automate at scale:

-- the primitives behind freshness + volume monitoring, per table
SELECT
  max(updated_at)                          AS last_update,
  now() - max(updated_at)                  AS staleness,
  count(*)                                 AS row_count
FROM analytics.orders;
-- anomaly detection compares staleness & row_count to this table's
-- own history (same hour-of-week) instead of a hard-coded threshold
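The comparison described in those last two comments can be sketched in a few lines of Python. This assumes you have already stored past observations of a metric (row_count, or staleness in seconds) as timestamped values. The z-score rule, the 3.0 threshold, and the minimum of four samples are stand-ins chosen for illustration; commercial platforms use their own models, which this does not claim to reproduce.

```python
from datetime import datetime, timedelta
from statistics import mean, stdev
from typing import List, Tuple

Observation = Tuple[datetime, float]  # (when measured, metric value)


def hour_of_week(ts: datetime) -> int:
    """0..167, where Monday 00:00 is 0 and Sunday 23:00 is 167."""
    return ts.weekday() * 24 + ts.hour


def is_anomalous(history: List[Observation], now: datetime, value: float,
                 z_threshold: float = 3.0, min_samples: int = 4) -> bool:
    """Compare value to past observations from the same hour-of-week."""
    slot = hour_of_week(now)
    baseline = [v for ts, v in history if hour_of_week(ts) == slot]
    if len(baseline) < min_samples:
        return False  # too little history to judge, so stay quiet
    mu, sigma = mean(baseline), stdev(baseline)
    if sigma == 0:
        return value != mu
    return abs(value - mu) / sigma > z_threshold


# Five past Mondays at 09:00 (2024-01-01 was a Monday), plus one Tuesday
# reading that falls outside the Monday 09:00 baseline.
monday_9am = datetime(2024, 1, 1, 9)
history = [
    (monday_9am + timedelta(weeks=i), count)
    for i, count in enumerate([1000.0, 1020.0, 980.0, 1010.0, 990.0])
]
history.append((datetime(2024, 1, 2, 9), 5000.0))

now = datetime(2024, 2, 5, 9)  # the following Monday, 09:00
partial_load = is_anomalous(history, now, 500.0)   # True
normal_load = is_anomalous(history, now, 1005.0)   # False
```

Bucketing by hour-of-week is what stops Monday from looking like an anomaly just because it differs from Sunday.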

Alert fatigue will kill your observability program faster than any missing feature. The instant anomaly detection floods the channel with noisy, low-value alerts — a volume "anomaly" every Monday because Monday is always different, a freshness page for a table nobody depends on — people mute the channel, and a muted channel detects nothing. A real but ignored alert is worse than no alert, because now you have false confidence and the outage. So treat alert quality as the product: route by severity, suppress known seasonal patterns, tie alerts to ownership so they reach someone who can act, and ruthlessly tune or delete monitors that cry wolf. Coverage is worthless without signal-to-noise; the goal is fewer, truer alerts, not more.
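One way to picture "alert quality as the product" is a small routing function that decides whether an alert pages someone, posts to a channel, lands in a digest, or is dropped. The sketch below is an assumption-heavy illustration: the Alert fields, the OWNERS and SUPPRESSED tables, and the destination strings are all hypothetical, and a real setup would load them from wherever you keep ownership metadata.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Set, Tuple


@dataclass(frozen=True)
class Alert:
    table: str
    monitor: str   # e.g. "freshness" or "volume"
    severity: str  # "high" or "low"


# Hypothetical ownership map: table -> channel of the owning team.
OWNERS: Dict[str, str] = {"analytics.orders": "#revenue-data"}

# Hypothetical (table, monitor) pairs with a known seasonal pattern.
SUPPRESSED: Set[Tuple[str, str]] = {("analytics.web_events", "volume")}


def route(alert: Alert) -> Optional[str]:
    """Return a destination for the alert, or None to drop it."""
    if (alert.table, alert.monitor) in SUPPRESSED:
        return None  # known pattern, not worth anyone's attention
    owner = OWNERS.get(alert.table)
    if owner is None:
        return "digest:unowned"  # nobody can act on it, so never page
    if alert.severity == "high":
        return f"page:{owner}"
    return f"channel:{owner}"


paged = route(Alert("analytics.orders", "freshness", "high"))
# paged == "page:#revenue-data"
```

Sending unowned tables to a digest instead of dropping them is a judgment call. It keeps them visible for periodic cleanup without training people to ignore pages.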

Why lineage is the multiplier

Detection tells you a table is wrong; lineage tells you why and what else. When an alert fires, column- and table-level lineage answers the two questions that consume incident time: which upstream source caused this (so you fix the root, not the symptom), and which downstream dashboards, models, and reverse-ETL syncs are now serving bad data (so you can warn consumers before they make decisions on it). Without lineage, every incident is a manual archaeology dig through SQL to reconstruct dependencies. With it, the blast radius is a click. That's why the mature observability platforms invest so heavily in automatic lineage — it's what converts detection into fast, confident response.
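Both questions reduce to graph traversal once lineage exists. Here is a minimal sketch assuming table-level lineage is available as a mapping from each asset to the assets it feeds. The asset names and the DOWNSTREAM map are invented for the example; real platforms derive this from query logs or model definitions.

```python
from collections import deque
from typing import Dict, List, Set

Graph = Dict[str, List[str]]

# Hypothetical lineage: each key feeds the assets in its list.
DOWNSTREAM: Graph = {
    "raw.payments": ["staging.payments"],
    "staging.payments": ["analytics.orders"],
    "analytics.orders": ["dash.revenue", "sync.crm"],
    "raw.customers": ["analytics.customers"],
}


def reachable(graph: Graph, start: str) -> Set[str]:
    """Breadth-first search: every node reachable from start, excluding start."""
    seen: Set[str] = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, []):
            if nxt != start and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def invert(graph: Graph) -> Graph:
    """Flip the edges so the same search walks upstream instead."""
    upstream: Graph = {}
    for source, targets in graph.items():
        for target in targets:
            upstream.setdefault(target, []).append(source)
    return upstream


# What is now serving bad data if staging.payments is wrong?
blast_radius = sorted(reachable(DOWNSTREAM, "staging.payments"))
# ['analytics.orders', 'dash.revenue', 'sync.crm']

# Where could a problem seen in analytics.orders have come from?
root_candidates = sorted(reachable(invert(DOWNSTREAM), "analytics.orders"))
# ['raw.payments', 'staging.payments']
```

The traversal itself is trivial. The hard part, and what the platforms actually sell, is keeping the edge list complete and current without anyone maintaining it by hand.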

Rolling it out without drowning

Start with freshness and volume on your most-trusted tables, then expand. Don't try to observe everything on day one — you'll generate noise faster than trust. Begin with automatic freshness and volume monitoring on the handful of tables that feed your most important dashboards and decisions (the ones where being wrong is most expensive), get the alerting clean and owned there, and only then widen coverage to distribution and to the long tail. Pair the automatic anomaly detection with a small set of hand-written tests for the business rules you know are sacred. Observability earns its keep by shrinking data downtime on what matters most — prove that on a few critical tables before scaling, or you'll spend your credibility on alerts about tables nobody reads.

What to carry away

Data observability exists to shrink data downtime — the gap between when data breaks and when someone notices — which is dangerous precisely because the worst data failures are silent: no job fails, no alert fires, the number just quietly drifts. The five pillars name the ways trust erodes: freshness, volume, schema, and distribution detect the problem; lineage diagnoses it and maps the blast radius.

The distinction that makes a program real is tests and anomaly detection together — tests catch the failures you predicted, anomaly detection catches the ones you didn't, and only the combination covers both. But coverage is nothing without signal: alert fatigue is the failure mode that quietly kills observability, so treat alert quality, ownership, and routing as the actual product. Start narrow on your most critical tables, get the signal-to-noise right, and expand from there. Done well, the eleven-day silent-drift incident becomes an eleven-minute alert — which is the whole difference between a data platform people trust and one they quietly route around.