The first time I carried the pager for our ingestion pipelines, I expected it to feel like app on-call: a service falls over, an alert fires, you restart it, you go back to bed. It's not like that. Data on-call is its own discipline, because data pipelines have a special talent for failing silently — the job reports success, the dashboard keeps rendering, and three days later someone in finance notices the numbers have been quietly wrong since Tuesday. By then it's not an incident, it's an archaeology project. Owning the on-call rotation for Kafka and AWS Glue ingestion taught me a triage framework I still use, organized around the three ways these pipelines actually break, and a way of writing RCAs that makes the next rotation quieter than the last.
The core insight up front: a pipeline that "ran successfully" can still be a production incident. Success means the code didn't throw — it says nothing about whether the data is fresh, complete, or correct. That gap is where data on-call lives, and most of what follows is about closing it.
The three failure classes
Almost every page I got fell into one of three buckets, and knowing which bucket you're in tells you where to look first. Pattern-matching the failure class is the whole first move of triage.
```mermaid
graph TD
    PAGE["Page fires / data looks wrong"]
    Q{"What kind of failure?"}
    F1["1. Pipeline failure<br/>(the job died / errored)"]
    F2["2. Schema mismatch<br/>(upstream changed shape)"]
    F3["3. Throughput degradation<br/>(running, but falling behind)"]
    A1["Check logs, retry idempotently,<br/>reprocess from checkpoint / DLQ"]
    A2["Diff schema vs contract,<br/>quarantine bad records,<br/>call the upstream owner"]
    A3["Check consumer lag / DPU saturation,<br/>scale out, find the skew or hot partition"]
    PAGE --> Q
    Q --> F1 --> A1
    Q --> F2 --> A2
    Q --> F3 --> A3
```
The triage tree. The first question on any data page is which of three classes you're in, because each has a different first action. A dead job is loud and usually quick; a schema mismatch is the one that silently corrupts data and needs an upstream conversation; throughput degradation is the slow-motion failure where everything "works" while freshness quietly rots. Classify first, then act.
1. Pipeline failures (the job died)
The loud, honest failure: a Glue job errored, a Spark stage threw, a Kafka consumer crashed. These are the easy ones because they announce themselves. Triage is: read the actual error (not the wrapper exception), decide whether it's transient (a throttled API or a flaky network call, where retrying with backoff works) or deterministic (a bad record or an OOM, where retrying won't help), and recover. The thing that makes recovery safe is the same property that makes Kafka consumers safe: idempotent reprocessing. If reprocessing a batch can't double-count or duplicate, you can confidently replay from the last checkpoint or drain the dead-letter queue. If it can, your "fix" risks being worse than the failure, which is why idempotency is an operational property, not just a design nicety.
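To make the transient-versus-deterministic split concrete, here is a minimal Python sketch, not the code we actually ran. `TransientError`, `run_with_backoff`, and `apply_batch` are hypothetical names, and the in-memory dict stands in for whatever keyed sink you really write to.

```python
import time


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (throttling, network blips)."""


def run_with_backoff(task, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Retry `task` on TransientError with exponential backoff.

    Any other exception is treated as deterministic and propagates at once,
    because retrying a bad record or an OOM only repeats the failure.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except TransientError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the error to the on-call
            sleep(base_delay * 2 ** (attempt - 1))  # 1s, 2s, 4s, ...


def apply_batch(store, batch):
    """Idempotent write: upsert by record id, so replaying a batch changes nothing."""
    for record in batch:
        store[record["id"]] = record
    return store
```

The upsert-by-key write is what makes a replay from the checkpoint safe: running `apply_batch` twice on the same batch leaves the store exactly as it was after the first run. An append-only write would not have that property.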
2. Schema mismatches (upstream changed the shape)
This is the dangerous one, because it's the failure most likely to be silent. An upstream team renames a column, changes a type, or drops a field — usually without telling you — and your pipeline either crashes (annoying but honest) or, far worse, keeps running and quietly drops or null-fills the affected data. A Glue job whose target schema no longer matches the source can write garbage with a green checkmark. Triage: diff the incoming schema against the expected one, quarantine the affected records rather than letting them pollute downstream, and — the real fix — get on a call with the upstream owner. The durable prevention is a schema registry with compatibility enforcement and, organizationally, data contracts so a breaking change is caught at the producer, not discovered at 2am in your consumer.
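The diff-and-quarantine step can be sketched in a few lines of Python. This is an illustration under simplifying assumptions: schemas are plain `{column: type_name}` dicts, the type names and the `TYPE_MAP` lookup are made up for the example, and a real pipeline would lean on its schema registry rather than hand-rolled checks.

```python
# Illustrative mapping from contract type names to Python types (an assumption).
TYPE_MAP = {"string": str, "int": int, "double": float}


def diff_schema(expected, actual):
    """Compare two {column: type_name} mappings and report what changed."""
    shared = set(expected) & set(actual)
    return {
        "missing": sorted(set(expected) - set(actual)),
        "unexpected": sorted(set(actual) - set(expected)),
        "type_changed": sorted(c for c in shared if expected[c] != actual[c]),
    }


def split_records(records, expected):
    """Pass records that match the contract; quarantine everything else."""
    good, quarantined = [], []
    for rec in records:
        matches = set(rec) == set(expected) and all(
            isinstance(rec[col], TYPE_MAP[type_name])
            for col, type_name in expected.items()
        )
        (good if matches else quarantined).append(rec)
    return good, quarantined


expected = {"order_id": "string", "amount": "double"}
actual = {"order_id": "string", "amount": "string", "currency": "string"}
print(diff_schema(expected, actual))
```

The output of `diff_schema` is also what you bring to the call with the upstream owner: a precise list of what went missing, what appeared, and what changed type.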
3. Throughput degradation (running, but falling behind)
The slow-motion failure. Nothing errored — the pipeline is just falling behind, and freshness is rotting. On Kafka that's consumer lag trending upward; on Glue it's jobs taking longer each run until they overrun their window or saturate their DPUs. Causes I've chased: a volume spike upstream, a skewed key creating a hot partition that drowns one consumer while others idle, the small-files problem ballooning a Glue job's planning time, or a downstream sink that slowed and backed everything up. Triage: confirm it's lag (not a stall), scale out if the work is genuinely larger, and hunt for the skew or hot partition if throughput dropped without a volume change.
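Two of those triage questions reduce to simple arithmetic on offsets. The sketch below assumes you have already pulled committed and log-end offsets from your monitoring; the function names, the `ratio` and `min_lag` thresholds, and the sample shapes are all hypothetical.

```python
from statistics import median


def classify_consumer(samples):
    """Classify a consumer from (committed_offset, log_end_offset) samples, oldest first."""
    first_committed, first_end = samples[0]
    last_committed, last_end = samples[-1]
    if last_committed == first_committed and last_end > first_end:
        return "stalled"  # new data is arriving but nothing is being consumed
    lag_grew = (last_end - last_committed) > (first_end - first_committed)
    return "lagging" if lag_grew else "keeping_up"


def find_hot_partitions(lag_by_partition, ratio=5.0, min_lag=1000):
    """Return partitions whose lag is far above the median partition's lag."""
    if not lag_by_partition:
        return []
    typical = median(lag_by_partition.values())
    threshold = max(min_lag, ratio * typical)
    return sorted(p for p, lag in lag_by_partition.items() if lag > threshold)
```

If `find_hot_partitions` returns nothing while total lag is high, the lag is spread evenly, which points at a volume spike and scaling out. If it returns one or two partitions, adding consumers will not help and the skewed key is the thing to chase.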
Alert on symptoms, not just job status
The deadliest monitoring mistake is alerting only on "did the job fail?" That catches failure class 1 and completely misses classes 2 and 3, the silent ones that cause the worst incidents. A job that succeeds while dropping half its rows and a pipeline quietly 6 hours behind both pass a job-status check with flying colors. You have to alert on the symptoms a data consumer would feel: freshness (is the latest data older than its SLA?), volume (did row count drop off a cliff vs. the same hour last week?), lag (is consumer lag trending up?), and quality (did null rates spike, or did the schema change?). If your only signal is the green checkmark, you will keep finding out about incidents from finance instead of from a pager, and finance is the worst possible detector.
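The volume and quality symptoms are cheap to compute once you have row counts. A rough Python sketch follows; the 50% drop threshold and the function names are placeholders I picked for illustration, not recommended values.

```python
def volume_drop(current_rows, baseline_rows, max_drop=0.5):
    """True if the row count fell by more than `max_drop` versus the baseline.

    The baseline is assumed to be the same hour last week.
    """
    if baseline_rows <= 0:
        return False  # no usable baseline, so this check stays silent
    return (baseline_rows - current_rows) / baseline_rows > max_drop


def null_rate(rows, column):
    """Fraction of rows where `column` is missing or None."""
    if not rows:
        return 0.0
    return sum(1 for r in rows if r.get(column) is None) / len(rows)
```

Both checks compare against an expectation rather than against job status, which is why they can catch a run that succeeded while writing half the usual rows.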
Concretely, the alerts that actually saved me were freshness and lag SLAs, expressed as "this dataset must be no more than N minutes old" and "consumer lag must not exceed M for more than T minutes." Here's the shape of a freshness-style check the on-call actually trusts:
```sql
-- freshness SLA: page if the latest event is older than the promise
SELECT
    max(event_ts) AS latest,
    now() - max(event_ts) AS staleness,
    (now() - max(event_ts)) > interval '30 minutes' AS breaching_sla
FROM curated.orders;
```
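The lag SLA ("must not exceed M for more than T minutes") needs a little state, because a single spike should not page anyone. Here is one way to sketch it in Python, assuming lag samples arrive as `(minute, lag)` pairs sorted by time; the function name and sample shape are hypothetical.

```python
def lag_sla_breached(samples, max_lag, max_minutes):
    """True if lag stayed above `max_lag` for more than `max_minutes` in a row.

    `samples` is a list of (minute, lag) pairs, sorted by minute.
    """
    breach_start = None
    for minute, lag in samples:
        if lag > max_lag:
            if breach_start is None:
                breach_start = minute  # start of a new continuous breach
            if minute - breach_start > max_minutes:
                return True
        else:
            breach_start = None  # lag recovered, so reset the window
    return False
```

The reset on recovery is the design choice that matters: it keeps a brief burst from paging while still catching the slow, sustained climb.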
| Failure class | The signal that catches it | First action on-call |
|---|---|---|
| Pipeline failure | Job status / error alert | Read error, classify transient vs deterministic, retry idempotently or fix |
| Schema mismatch | Schema-change / null-rate / volume alert | Diff vs contract, quarantine bad records, contact upstream owner |
| Throughput degradation | Freshness & consumer-lag SLA | Confirm lag, scale out, hunt skew / hot partition / small files |
Runbooks: the difference between a 5-minute and a 2-hour page
The single highest-leverage on-call investment is a runbook per pipeline — a short doc that says: what this pipeline does, who owns the upstream source, where the logs and dashboards are, how to safely reprocess, and the known failure modes with their fixes. The test of a good runbook is brutal and simple: can someone who didn't build the pipeline resolve a common failure at 3am using only the runbook? If the knowledge lives only in one engineer's head, every page that engineer doesn't answer is an outage. Runbooks are how you make on-call survivable for the whole team instead of a hostage situation for the person who wrote the code.
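One cheap way to keep runbooks honest is to treat that list of contents as a checklist you can lint. The sketch below assumes a runbook is represented as a plain dict; the section keys are my own hypothetical naming of the items above, not a standard.

```python
# Hypothetical section keys mirroring the runbook contents described above.
REQUIRED_SECTIONS = (
    "purpose",
    "upstream_owner",
    "logs_and_dashboards",
    "safe_reprocessing",
    "known_failure_modes",
)


def missing_sections(runbook):
    """Return the required sections that are absent or empty in a runbook dict."""
    return [name for name in REQUIRED_SECTIONS if not runbook.get(name)]


draft = {"purpose": "Ingest orders into curated.orders", "upstream_owner": ""}
print(missing_sections(draft))
```

A check like this does not prove the runbook is good, since only the 3am test does that, but it does flag the pipelines that have no reprocessing instructions at all.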
The RCA: turning a page into a fix that sticks
Resolving the incident stops the bleeding; the root cause analysis is what stops it from recurring. A good RCA isn't a blame document or a formality — it's the mechanism by which on-call gets quieter over time instead of staying equally painful forever. The structure I use:
- Impact: what was wrong, for whom, for how long — in business terms, not just technical ones.
- Timeline: when it started, when it was detected, when resolved — and the gap between started and detected is itself a finding (a big gap means your alerting missed it).
- Root cause: the actual why, found by asking "why" until you hit something systemic — not "the job failed" but "an upstream schema change wasn't gated, because we have no contract."
- Action items: specific, owned, dated changes that prevent recurrence — a new freshness alert, a schema check, a runbook entry.
Keep RCAs blameless, and treat detection gaps as first-class findings. The moment an RCA becomes about who messed up, people stop being honest and you stop learning — write it about the system that allowed the failure, not the person who tripped it. And pay special attention to the time between failure started and failure detected: if data was wrong for three days before anyone noticed, the most important action item isn't fixing that one bug, it's adding the freshness/volume alert that would have caught it in minutes. The best RCAs convert one painful 3am page into a detector that makes the next ten failures boring.
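The detection gap is just timestamp arithmetic, and it is worth computing the same way in every RCA. A minimal Python sketch, with a hypothetical `incident_gaps` helper and made-up timestamps for illustration:

```python
from datetime import datetime, timedelta


def incident_gaps(started, detected, resolved, detection_target):
    """Summarize an incident timeline and flag a detection gap worth an action item."""
    time_to_detect = detected - started
    return {
        "time_to_detect": time_to_detect,
        "time_to_resolve": resolved - detected,
        # A gap beyond the target means the alerting missed it.
        "detection_gap_is_finding": time_to_detect > detection_target,
    }


# Made-up timeline: wrong since Tuesday morning, noticed three days later.
summary = incident_gaps(
    started=datetime(2024, 1, 2, 9, 0),
    detected=datetime(2024, 1, 5, 9, 0),
    resolved=datetime(2024, 1, 5, 11, 30),
    detection_target=timedelta(minutes=30),
)
print(summary["time_to_detect"], summary["detection_gap_is_finding"])
```

Setting `detection_target` to the dataset's freshness SLA is one reasonable choice: anything detected later than the SLA window is, by definition, something a freshness alert should have caught.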
What to carry away
Data on-call is different from app on-call because pipelines fail silently — "the job succeeded" tells you nothing about whether the data is fresh, complete, and correct. Triage starts by classifying the failure: a dead job (loud, recover with idempotent reprocessing), a schema mismatch (silent and dangerous — quarantine, diff against the contract, call upstream), or throughput degradation (the slow rot — chase lag, skew, and small files). Each class has a different first move, so naming it is half the battle.
The two practices that make the rotation survivable: alert on the symptoms a data consumer would actually feel — freshness, volume, lag, quality — not just on job status, because job status misses exactly the silent failures that hurt most; and write blameless RCAs that treat the detection gap as a finding and turn each page into a detector. Owning on-call well isn't heroics at 3am — it's the unglamorous work of observability, runbooks, and follow-through that is, in the end, the DataOps undercurrent made real.