# DORA Metrics Through a Data & AI Lens: Measuring Pipeline Delivery

An engineering director asked me a fair question: "Our app teams report DORA metrics. What's the data team's equivalent — how healthy is your delivery?" I started to answer and stopped, because the honest reply is that DORA was built to measure *software* delivery, and data pipelines break the assumptions underneath it. You can absolutely adapt the four DORA metrics to data and ML work — and you should, because they're the common language leadership already speaks — but a naive copy-paste produces numbers that look healthy while the data is quietly wrong. The adaptation is the interesting part, and it turns on one fact: data systems have a second axis of failure that software doesn't.

For grounding: **DORA** (DevOps Research and Assessment, the research behind *Accelerate*) found four metrics that distinguish high-performing software teams — **deployment frequency**, **lead time for changes**, **change failure rate**, and **time to restore service**. The first two measure speed (throughput); the last two measure stability. The research's enduring finding is that good teams aren't forced to trade one for the other — they get both. The question is what those four mean when what you ship is a dbt model or a retrained classifier, not a web service.

## Translating the four metrics to data and AI

Each metric maps, but the unit of "a change" and the definition of "a failure" both shift. Here's the translation I use.

| DORA metric | Software meaning | Data & AI lens |
| --- | --- | --- |
| **Deployment frequency** | How often you ship code to production | How often you ship *data model / pipeline / model-version* changes to production |
| **Lead time for changes** | Commit → running in production | A data model change (transform, schema, feature) committed → serving downstream |
| **Change failure rate** | % of deploys causing a failure | % of pipeline/model deploys that break downstream *or produce bad data* |
| **Time to restore** | Time to recover from a failed deploy | Time to restore correct, fresh data after a pipeline/model incident |

### Lead time for data model changes

This is the one leadership asks about most, and it's genuinely useful. If a change to a [dbt model](analytics-engineering-dbt) takes three weeks to reach production, that lead time is a diagnosis: it's dominated by something — a slow manual review, a fragile deploy, a CI suite that takes hours, a hand-off queue. Measuring lead time (commit timestamp to production-serving timestamp) doesn't just give you a number for a slide; it points at the bottleneck. Long lead times in data teams are usually *process*, not engineering — a deploy that needs three approvals and a manual run.
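To make the "lead time is a diagnosis" idea concrete, here is a minimal sketch that splits lead time into a review stage (commit to merge) and a release stage (merge to serving). The records, field names, and the two-stage split are assumptions for illustration; your CI/CD and version control will expose different fields.

```python
from datetime import datetime
from statistics import median

# Hypothetical change records; the field names are illustrative only.
changes = [
    {"model": "orders", "committed": datetime(2024, 3, 1, 9),
     "merged": datetime(2024, 3, 11, 9), "serving": datetime(2024, 3, 13, 9)},
    {"model": "customers", "committed": datetime(2024, 3, 4, 9),
     "merged": datetime(2024, 3, 5, 9), "serving": datetime(2024, 3, 6, 9)},
    {"model": "revenue", "committed": datetime(2024, 3, 6, 9),
     "merged": datetime(2024, 3, 12, 9), "serving": datetime(2024, 3, 13, 9)},
]


def hours(delta):
    """Convert a timedelta to hours."""
    return delta.total_seconds() / 3600


def stage_hours(change):
    """Split one change's lead time into review and release stages."""
    review = hours(change["merged"] - change["committed"])
    release = hours(change["serving"] - change["merged"])
    return {"review": review, "release": release, "total": review + release}


stages = [stage_hours(c) for c in changes]

# Median per stage; the median is less sensitive to one stuck change than the mean.
median_by_stage = {
    name: median(s[name] for s in stages)
    for name in ("review", "release", "total")
}

# The stage with the larger median is where to look first.
bottleneck = max(("review", "release"), key=median_by_stage.get)
print(median_by_stage, "bottleneck:", bottleneck)
```

With this sample data the review stage dominates, which is the "process, not engineering" pattern described above. Add more stages (CI duration, approval wait) if your tooling records them.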

### Change failure rate and time to restore

These two come almost for free if you already run the on-call and [RCA process](data-pipeline-on-call-operations) I'd argue every data team needs. Change failure rate is the fraction of deploys that triggered an incident; time to restore is the timeline field from your RCAs (detected → resolved). The catch — and it's the whole point of this article — is the definition of "failure," which in data is broader than "the deploy errored."
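A small sketch shows how much the definition of "failure" moves the number. It assumes each deploy record carries two hypothetical flags, `errored` (the deploy itself failed) and `bad_data` (the deploy ran cleanly but an incident later traced bad output back to it); neither name comes from a real tool.

```python
# Hypothetical deploy records; both flags are illustrative assumptions.
deploys = [
    {"id": "d1", "errored": False, "bad_data": False},
    {"id": "d2", "errored": True, "bad_data": False},
    {"id": "d3", "errored": False, "bad_data": True},  # green deploy, wrong output
    {"id": "d4", "errored": False, "bad_data": False},
]


def failure_rate(records, is_failure):
    """Fraction of deploys for which the given failure predicate holds."""
    if not records:
        return 0.0
    return sum(1 for r in records if is_failure(r)) / len(records)


# Narrow, software-style definition: only deploys that errored count.
narrow = failure_rate(deploys, lambda d: d["errored"])

# Broad, data-aware definition: errored OR produced bad data downstream.
broad = failure_rate(deploys, lambda d: d["errored"] or d["bad_data"])

print(f"narrow: {narrow:.0%}, broad: {broad:.0%}")
```

In this made-up sample the broad definition doubles the rate, because `d3` deployed green and still shipped bad data. That gap is the part a software-style count would miss.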

## The axis software doesn't have

Here's where data and AI diverge from the original DORA world. A software deploy is the thing that changes; if it deploys cleanly and the service responds, you're largely done. A data pipeline has **two** things that change: the *code* (the transformation, the model) *and the data flowing through it*. A pipeline can deploy flawlessly — green checkmark, zero errors — and still produce wrong output the next day because the input data drifted, an upstream schema shifted, or the model degraded against a changing world. **DORA's delivery metrics measure the code-change axis and are blind to the data axis.**

```mermaid
graph TD
    subgraph DELIVERY["DORA delivery axis (the CODE change)"]
        D1["Deployment frequency"]
        D2["Lead time for changes"]
        D3["Change failure rate"]
        D4["Time to restore"]
    end
    subgraph DATA["Data-health axis (the DATA itself) — DORA is blind here"]
        Q1["Freshness SLO adherence"]
        Q2["Data-quality / null & volume checks"]
        Q3["Model performance / drift"]
    end
    HEALTH["True data-platform health = BOTH axes together"]
    DELIVERY --> HEALTH
    DATA --> HEALTH
```

**Why DORA alone misleads for data teams.** The four DORA metrics measure how well you *deliver changes to the code*, but a pipeline can deliver perfectly and still emit wrong data because the data drifted, and that is an axis DORA never sees. Real data-platform health is both axes: fast, stable delivery *and* fresh, correct data. Report DORA without the data-health axis and you'll proudly show a green delivery scorecard while finance finds the numbers were wrong all week.

So the move isn't "don't use DORA" — it's "use DORA for the delivery axis, and pair it with data-health SLOs (freshness, quality, drift) for the axis DORA can't see." Reliability shows up here too: the DORA program later emphasized operational performance / reliability as a fifth dimension, and for data teams the natural proxy is **freshness- and quality-SLO adherence** — what fraction of the time your datasets met their promises. That's your pipeline-reliability number, and it's the bridge between the two axes.
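As a rough sketch of that reliability number, SLO adherence can be computed as the share of scheduled checks at which a dataset was both fresh and passing its quality checks. The six-hour freshness promise, the check records, and the field names below are all assumptions for illustration.

```python
from datetime import datetime, timedelta

# Hypothetical promise: the dataset is never more than 6 hours stale.
FRESHNESS_SLO = timedelta(hours=6)

# Hypothetical results of a scheduled check; field names are illustrative.
checks = [
    {"checked_at": datetime(2024, 3, 4, 6), "last_loaded_at": datetime(2024, 3, 4, 2),
     "quality_passed": True},
    {"checked_at": datetime(2024, 3, 4, 12), "last_loaded_at": datetime(2024, 3, 4, 2),
     "quality_passed": True},   # stale: 10 hours since the last load
    {"checked_at": datetime(2024, 3, 4, 18), "last_loaded_at": datetime(2024, 3, 4, 15),
     "quality_passed": False},  # fresh, but failed a quality check
    {"checked_at": datetime(2024, 3, 5, 0), "last_loaded_at": datetime(2024, 3, 4, 21),
     "quality_passed": True},
]


def met_slo(check):
    """A check meets the SLO only if the data is fresh AND quality passed."""
    age = check["checked_at"] - check["last_loaded_at"]
    return age <= FRESHNESS_SLO and check["quality_passed"]


# Adherence = fraction of checks at which the dataset kept its promises.
adherence = sum(met_slo(c) for c in checks) / len(checks)
print(f"SLO adherence: {adherence:.0%}")
```

Requiring both conditions at once is a design choice. You could also report freshness and quality adherence separately if leadership wants to see which promise is being broken.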

## Instrumenting it without a six-month project

You don't need a platform to start; you need timestamps. Deployment frequency and lead time come from your CI/CD and version control: tag each production deploy, and lead time is the delta from the change's commit timestamp to that deploy. (Start the clock at the commit rather than the merge, or the review wait that often dominates lead time drops out of the number.) Change failure rate and time to restore come from your incident log (the RCAs). The one piece of discipline that makes this real is **tagging deploys and linking incidents to them**, so you can compute "of the deploys this month, how many caused an incident, and how long to restore."

```sql
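-- Weekly deployment frequency, lead time, and change failure rate
-- from a deploys table fed by CI/CD (one row per production deploy).
-- Columns used: committed_at, deployed_at (timestamps), caused_incident (boolean).
SELECT
  date_trunc('week', deployed_at)                          AS wk,
  count(*)                                                 AS deploys,              -- deployment frequency
  avg(deployed_at - committed_at)                          AS avg_lead_time,        -- interval, commit to deploy
  sum(CASE WHEN caused_incident THEN 1 ELSE 0 END)::float
    / count(*)                                             AS change_failure_rate   -- share of deploys linked to an incident
FROM deploys
GROUP BY 1
ORDER BY 1;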
```
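The query above covers the delivery side; time to restore needs the incident log joined back to deploys. Here is a minimal sketch of that linkage, assuming each incident record carries a hypothetical `deploy_id` field that is `None` when the incident was not caused by a deploy (for example, upstream drift). The IDs and field names are made up for illustration.

```python
from datetime import datetime

# Hypothetical deploy tags and incident log; all names are illustrative.
deploys = [{"deploy_id": f"dep-{n}"} for n in (101, 102, 103, 104, 105)]

incidents = [
    {"id": "inc-1", "deploy_id": "dep-102",
     "detected": datetime(2024, 3, 5, 10), "resolved": datetime(2024, 3, 5, 12)},
    {"id": "inc-2", "deploy_id": "dep-104",
     "detected": datetime(2024, 3, 9, 8), "resolved": datetime(2024, 3, 9, 12)},
    # No deploy link: the code did not change, the data did.
    {"id": "inc-3", "deploy_id": None,
     "detected": datetime(2024, 3, 12, 7), "resolved": datetime(2024, 3, 12, 16)},
]

known_ids = {d["deploy_id"] for d in deploys}

# Incidents that trace back to a tagged deploy.
deploy_incidents = [i for i in incidents if i["deploy_id"] in known_ids]

# Change failure rate: distinct failed deploys over all deploys.
failed_ids = {i["deploy_id"] for i in deploy_incidents}
change_failure_rate = len(failed_ids) / len(deploys)

# Time to restore: detected -> resolved, in hours, for deploy-linked incidents.
restore_hours = [
    (i["resolved"] - i["detected"]).total_seconds() / 3600
    for i in deploy_incidents
]
mean_restore_hours = sum(restore_hours) / len(restore_hours) if restore_hours else 0.0

# Incidents with no deploy link belong to the data-health axis.
data_axis_incidents = [i for i in incidents if i["deploy_id"] is None]

print(change_failure_rate, mean_restore_hours, len(data_axis_incidents))
```

Counting the unlinked incidents separately keeps them visible. They never show up in change failure rate, so a scorecard built only from deploys would hide them.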

## How to use them — and how they get abused

**The instant you target a metric, someone games it (Goodhart's law).** Tell a team "raise deployment frequency" and you'll get more, smaller, emptier deploys that move the number without moving value. Make change failure rate a performance target and incidents quietly stop getting reported. DORA metrics are *diagnostics for a team to improve its own system*, not a leaderboard to rank individuals — that's the single most important thing the DORA researchers themselves stress. Use them to find your bottleneck (why is lead time three weeks?) and to see whether a change helped, never as a stick. And never report the four delivery metrics without the data-health axis beside them, or you incentivize shipping fast over shipping correct — the worst possible trade for a data team.

**Start with lead time and reliability — they're the most actionable pair.** Lead time exposes your delivery bottleneck (almost always a process step, not a tooling gap), and freshness/quality-SLO adherence is the reliability number leadership actually cares about because it maps to "can we trust the data." Those two, tracked as a trend over months, tell a truer story about a data team's health than all four DORA metrics reported once in isolation. Trend matters more than absolute value — you're looking for improvement, not a grade.

## What to carry away

DORA's four metrics — deployment frequency, lead time for changes, change failure rate, and time to restore — adapt cleanly enough to data and AI work: the unit of change becomes a data-model or model-version deploy, and they give a data team the delivery language leadership already understands. Lead time and the two stability metrics fall out of a version-control and RCA discipline you should have anyway.

But adapt them with eyes open to the axis software doesn't have: data systems fail not only when the code deploy breaks but when the *data* drifts, and DORA is blind to that second axis. Pair the delivery metrics with data-health SLOs — freshness, quality, drift — and treat the combined picture as the real measure of platform health. Use the numbers as a team's own diagnostic to find and fix bottlenecks, watch the trend rather than the absolute, and never let a green delivery scorecard stand in for "the data is correct." That pairing is DORA done honestly for data — the delivery axis and the data axis, together.
