Data Lineage at Scale: OpenLineage, Column-Level Tracking, and Impact Analysis

A Friday evening page: a finance dashboard is showing zeros. Nobody touched the dashboard. Three hours later, after grepping through Airflow logs and asking around in Slack, the cause turns out to be a column rename two hops upstream, in a table owned by a team that didn't know finance consumed it. Every minute of that three hours was spent doing, by hand, what a working lineage graph would have answered in one query: what feeds this table, and what does this table feed? That's the entire pitch for lineage. It's not a compliance artifact you build for an audit — it's the map you wish you had at 9pm on a Friday.

Data lineage is the record of how a piece of data moved and transformed from its source to where it's consumed — which tables, columns, and jobs sit between a raw event and the number on a dashboard. The hard part was never the definition. It's that lineage has to be captured, continuously, from every tool that touches data, or the graph silently goes stale the moment someone adds a pipeline outside the system you instrumented.

Why does most hand-drawn lineage documentation fail?

Because it's a snapshot of a moving target, maintained by people who have a dozen higher priorities. I've seen wikis with beautiful lineage diagrams that were accurate the week they were drawn and wrong for the two years after — a new join got added, a script got rewritten in a different tool, an analyst built a "temporary" transformation that became load-bearing. Manually maintained lineage rots at exactly the rate your pipelines change, which in a healthy data org is constantly.

The fix isn't "document better." It's capturing lineage as a byproduct of execution — every time a job runs, it emits what it read and what it wrote, and the graph builds itself. This is the idea behind OpenLineage, an open specification for lineage metadata collection that defines a standard event format (Job, Run, Dataset, and the facets attached to each) so that orchestrators, query engines, and transformation tools can all emit lineage in a shape that downstream consumers understand without bespoke parsers per tool.

How does OpenLineage actually capture a lineage event?

At job start and job completion, an instrumented system emits a structured event describing the run, the inputs it read, and the outputs it wrote, to whatever backend is listening. The instrumentation lives in integrations for the tools that already exist in most stacks — Airflow, Spark, Flink, and dbt all have OpenLineage providers that hook into the framework's own execution lifecycle, so you're not asking engineers to add manual lineage-emitting code to every job. Marquez, an LF AI & Data project, is the reference implementation: it's the API server, storage, and UI that ingests OpenLineage events and serves the query/graph API most teams actually browse. You can swap Marquez for a different OpenLineage-compatible backend (several commercial catalogs now consume the same event format), which is the actual value of standardizing the wire format rather than the storage.

graph LR
    AF["Airflow DAG
(OpenLineage provider)"] SP["Spark job
(OpenLineage listener)"] DBT["dbt run
(dbt-ol wrapper)"] EVT["OpenLineage events
(Job + Run + Dataset facets)"] MQ["Marquez
(API + storage + graph UI)"] Q["Impact analysis query:
'what breaks if this changes?'"] AF --> EVT SP --> EVT DBT --> EVT EVT --> MQ --> Q

Lineage as a byproduct of execution, not a document someone maintains. Each instrumented tool emits a standard OpenLineage event at run start and completion; Marquez (or any OpenLineage-compatible backend) assembles those events into a graph. The graph is only as complete as the set of tools that are actually instrumented — an uninstrumented notebook or a hand-run script is an invisible edge.

Why does column-level lineage matter more than table-level?

Because "table A feeds table B" tells you almost nothing about blast radius when B has sixty columns and the change only touches one of them. Column-level lineage tracks which specific input columns were used to produce each output column — so when someone proposes renaming customer.email, you can answer "which of the 40 downstream tables actually reference this column" instead of "which of the 40 downstream tables read from this table at all," which is a far less useful and far noisier answer. OpenLineage's column-level lineage support comes from two different mechanisms depending on the tool, and the distinction matters for how much you should trust the result:

  • SQL parsing (static analysis): dbt's OpenLineage integration parses the compiled SQL of each model to infer column-level dependencies. This works without running anything, but it's only as good as the parser — dynamic SQL, macros that generate column references at runtime, and engine-specific syntax can all defeat a static parser silently. You won't get an error; you'll get an incomplete graph.
  • Runtime instrumentation (execution-time capture): Spark's OpenLineage listener observes the actual logical plan the engine builds while executing, which is closer to ground truth because it reflects what the engine genuinely did with each column, including cases a static parser would miss. The cost is you only learn the lineage after the job has run at least once.

Neither is complete on its own, and the honest takeaway is that column-level lineage coverage is a spectrum, not a binary "we have it." I've found it most useful to treat column-level lineage as a recall problem: assume the graph under-reports edges, especially around any tool you haven't instrumented, and validate it against a known change before trusting it for a high-stakes migration.

Lineage for batch versus streaming pipelines

Batch lineage is the easier case — a dbt run or an Airflow task has a clean start and end, and the OpenLineage event maps neatly onto that lifecycle. Streaming lineage is messier because a Kafka consumer doesn't have a "run" in the same sense; it's a long-lived process continuously reading and writing. The practical pattern is to capture lineage at the level of the streaming job definition (a Flink job, a Kafka Streams topology, a Spark Structured Streaming query) rather than per-message, and to treat topic-to-topic or topic-to-table edges as the unit of lineage rather than trying to trace an individual event's journey. That's coarser than batch column-level lineage, and it's worth being explicit with stakeholders that streaming lineage in most current tooling answers "which topics feed this sink" reliably and "which specific field in which message" much less reliably.

What is impact analysis, and why is it lineage's best incident-response use case?

Impact analysis is the query "if I change or break this dataset, what downstream consumers are affected" — and it's the single highest-value thing a working lineage graph buys you, well ahead of the audit-trail use case most people pitch lineage on. Before a schema change, before deprecating a table, before debugging an incident like the Friday-evening one above, impact analysis turns "let's find out the hard way" into a graph traversal. The query itself is simple — walk downstream edges from the node in question — but its value is entirely a function of graph completeness, which loops back to the coverage problem: an impact analysis that misses an edge isn't just incomplete, it's actively dangerous, because it gives false confidence that a change is safe.

The gap between "we have lineage" and "our lineage graph is trustworthy for a deletion decision" is large, and most teams don't realize how large until they act on a wrong answer. Manual scripts that bypass the orchestrator, ungoverned notebooks pulling straight from production, ad hoc BI extracts, and any tool without an OpenLineage integration are all invisible edges. I've watched a team deprecate a table because lineage showed zero downstream consumers, only to discover a finance analyst had a scheduled notebook reading it directly — outside Airflow, outside dbt, invisible to the graph. Before trusting lineage for anything irreversible, audit your actual data-access paths (warehouse query logs are a good cross-check) against what the lineage graph claims, and treat any discrepancy as a coverage gap to close, not a one-off exception to ignore.

Lineage versus contracts versus the catalog: what's each one for?

Data contracts define the expected shape of an interface between producer and consumer — schema, semantics, SLAs — agreed in advance. Lineage is descriptive, not prescriptive: it records what actually happened, regardless of whether a contract existed. The two are complementary in a specific way I rely on operationally — a contract breach is detected at the boundary where it's declared, but lineage tells you the consequence: which dashboards, models, and downstream tables actually depend on the dataset whose contract just broke, which is exactly the blast-radius question contracts alone can't answer. Data observability tools, in turn, often use lineage internally to scope an anomaly alert to "these are the downstream tables likely affected," which is impact analysis wearing a different hat. If you're evaluating tools, it's worth knowing that lineage, contracts, and observability increasingly overlap in commercial products, but they answer three different questions: what's the agreement (contracts), what actually happened (lineage), and is the data currently healthy (observability).

Where lineage also pays for itself is wiring into your transformation layer. If you're running dbt in production, the dbt-ol wrapper emits OpenLineage events for every model run with effectively no code change — it's one of the lowest-effort lineage integrations available, which is part of why dbt shops are often the first to get real column-level coverage.

# Run dbt wrapped with the OpenLineage integration; events are emitted
# to whatever OPENLINEAGE_URL points at (Marquez, or another OL-compatible backend)
export OPENLINEAGE_URL=http://marquez:5000
export OPENLINEAGE_NAMESPACE=analytics_prod
dbt-ol run --select finance_daily_summary+

What to carry away

Lineage stops being a diagram and starts being useful the moment it's captured automatically, as a byproduct of execution, rather than hand-maintained. OpenLineage gives you a standard event format that orchestrators, Spark, and dbt can all emit without bespoke integrations, and Marquez (or another OpenLineage-compatible backend) turns those events into a queryable graph. Column-level lineage is the version worth having — table-level tells you too little about real blast radius — but it comes from a mix of static SQL parsing and runtime instrumentation, neither of which is complete alone, so treat coverage as a spectrum and validate before trusting it for anything irreversible.

The single best use of a working lineage graph is impact analysis before you change or delete something — and the single biggest risk is trusting a graph that has invisible edges from uninstrumented tools, manual scripts, or notebooks reading production directly. Build lineage in from the orchestrator and transformation layer outward, cross-check it against real query logs periodically, and treat any gap you find as something to close rather than an acceptable footnote.