Designing a Data Pipeline: Batch vs Streaming, Idempotency, and Backfills

The data pipeline is the workhorse of data engineering, and it's also where the gap between "works in the demo" and "survives production" is widest. A pipeline that moves data from A to B on a sunny day is a weekend project. A pipeline that does it correctly when a source is late, a job dies halfway, a record is malformed, the same event arrives twice, and someone needs last month reprocessed after a bug fix: that's engineering. Designing one well is less about the transform logic and more about how it behaves when things go wrong, which they will.

I'll design a pipeline with the system-design framework, and spend most of the time on the parts that separate robust pipelines from fragile ones: the batch-vs-streaming choice, idempotency, backfills, orchestration, and failure handling. The happy path is the easy part.

Requirements: freshness picks the paradigm

The first question decides the whole shape: how fresh does the data need to be? The answer maps almost directly onto batch vs streaming, and it's the highest-leverage decision in the design:

|                   | Batch                                         | Streaming                                    |
|-------------------|-----------------------------------------------|----------------------------------------------|
| Freshness         | Minutes to daily                              | Seconds or less                              |
| Model             | Process bounded chunks on a schedule          | Process events continuously as they arrive   |
| Complexity / cost | Lower: simpler to build, test, reason about   | Higher: state, watermarks, always-on infra   |
| Reprocessing      | Straightforward: just rerun the window        | Harder: replay from the log                  |

Default to batch; reach for streaming only when the freshness requirement truly demands it. Streaming is genuinely harder to build, operate, and debug, and a huge fraction of "real-time" requirements evaporate under the question "what decision is made on this data, and how often?" If the answer is a dashboard people check each morning, hourly batch is fine and a tenth the trouble. Build the real-time path when a real-time decision depends on it (fraud blocking, live ops), not because real-time sounds better. The honest freshness answer is the cheapest optimization in the whole design.

Architecture and the layered transform

Whatever the paradigm, the pipeline has the same skeleton (ingest, transform in stages, load, all driven by orchestration), and the transform is best done in layers (raw → cleaned → curated), the medallion pattern, so each stage is debuggable and the raw landing zone lets you reprocess from source:

graph TD
    SRC["Source<br/>(DB via CDC, events, API, files)"]
    RAW["Raw / landing<br/>(immutable, append-only)"]
    CLEAN["Cleaned<br/>(validated, typed, deduped)"]
    CUR["Curated<br/>(business logic, joins, aggregates)"]
    SINK["Sink<br/>(warehouse, serving store)"]
    DLQ["Dead-letter queue<br/>(bad records)"]
    ORCH["Orchestrator<br/>(schedule, dependencies, retries, SLA)"]
    SRC --> RAW --> CLEAN --> CUR --> SINK
    CLEAN -.->|"fails validation"| DLQ
    ORCH -. drives .-> RAW
    ORCH -. drives .-> CLEAN
    ORCH -. drives .-> CUR

A pipeline's skeleton. Data lands raw and immutable, then flows through validated and curated stages to the sink, with an orchestrator driving each step (and its retries) and a dead-letter queue catching records that fail validation. The raw layer is the reprocessing insurance; the DLQ is what keeps one bad record from halting the whole flow.
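To make the layers concrete, here is a minimal in-memory sketch of the raw → cleaned → curated flow with a dead-letter list. The record fields (`order_id`, `customer`, `amount`) and the function names `clean` and `curate` are hypothetical, chosen only for illustration; a real pipeline would read and write durable storage rather than Python lists.

```python
from collections import defaultdict


def clean(raw_records):
    """Validate, type, and dedupe raw records; return (cleaned, dead_letters)."""
    cleaned, dead_letters, seen = [], [], set()
    for rec in raw_records:
        try:
            row = {
                "order_id": str(rec["order_id"]),
                "customer": str(rec["customer"]),
                "amount": float(rec["amount"]),
            }
        except (KeyError, TypeError, ValueError) as exc:
            # A bad record is quarantined instead of halting the whole run.
            dead_letters.append({"record": rec, "error": repr(exc)})
            continue
        if row["order_id"] in seen:
            continue  # duplicate of a record already cleaned
        seen.add(row["order_id"])
        cleaned.append(row)
    return cleaned, dead_letters


def curate(cleaned):
    """Apply business logic: here, total amount per customer."""
    totals = defaultdict(float)
    for row in cleaned:
        totals[row["customer"]] += row["amount"]
    return dict(totals)


# The raw layer is never modified; later stages only read from it.
RAW = (
    {"order_id": 1, "customer": "acme", "amount": "10.50"},
    {"order_id": 1, "customer": "acme", "amount": "10.50"},  # duplicate delivery
    {"order_id": 2, "customer": "acme", "amount": "4.50"},
    {"order_id": 3, "customer": "zen", "amount": "oops"},    # malformed amount
)

cleaned, dead_letters = clean(RAW)
curated = curate(cleaned)
```

Because `RAW` is left untouched, fixing a bug in `clean` or `curate` only requires rerunning the two functions over the same input.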

Idempotency: the most important property

If a pipeline is going to be production-grade, it must be idempotent: running the same step on the same input twice produces the same result, not duplicates. This isn't optional polish: failures and retries are normal, so any stage will eventually run more than once, and a non-idempotent pipeline double-counts revenue or duplicates rows the first time it's retried. The standard techniques:

  • Upsert / merge on a key instead of blind insert, so a reprocessed record overwrites rather than duplicates.
  • Deterministic partitions: overwrite the whole partition for a date rather than appending, so rerunning a day replaces it cleanly.
  • Deduplication keys: track processed event IDs so at-least-once delivery doesn't become at-least-once counting.
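The first two techniques can be sketched with Python's built-in `sqlite3` module. This is an illustration under assumptions, not a recommended warehouse setup: the `orders` table, its columns, and the helper names are hypothetical, and `INSERT OR REPLACE` stands in for whatever upsert or merge statement your sink provides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id TEXT PRIMARY KEY, day TEXT, amount REAL)"
)


def load_upsert(conn, rows):
    """Upsert on the primary key: a retried row overwrites itself."""
    with conn:  # one transaction; rolled back if anything fails
        conn.executemany(
            "INSERT OR REPLACE INTO orders (order_id, day, amount) VALUES (?, ?, ?)",
            rows,
        )


def load_partition(conn, day, rows):
    """Overwrite the whole partition for `day` instead of appending to it."""
    with conn:  # delete and insert commit together or not at all
        conn.execute("DELETE FROM orders WHERE day = ?", (day,))
        conn.executemany(
            "INSERT INTO orders (order_id, day, amount) VALUES (?, ?, ?)", rows
        )


def revenue(conn):
    return conn.execute("SELECT COALESCE(SUM(amount), 0) FROM orders").fetchone()[0]


batch = [("o1", "2024-03-01", 10.0), ("o2", "2024-03-01", 5.0)]

load_upsert(conn, batch)
load_upsert(conn, batch)                     # simulated retry of the same step
load_partition(conn, "2024-03-01", batch)
load_partition(conn, "2024-03-01", batch)    # simulated rerun of the same day

row_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
total = revenue(conn)
```

After four loads of the same batch the table still holds two rows, and the revenue total is the same as after the first load, which is the property the retries depend on.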

This connects to delivery guarantees: most messaging is at-least-once (a message may be redelivered), so the consumer must be idempotent to achieve effectively-once results. Even Kafka's exactly-once holds within Kafka; the moment you write to an external sink, idempotent writes are what make the end-to-end result correct. Design every stage to be safe to rerun, and most failure scenarios become non-events.
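A minimal sketch of a deduplicating consumer under at-least-once delivery follows. The `IdempotentConsumer` class and the `event_id` and `amount` fields are hypothetical. The sketch keeps processed IDs in memory; a real consumer would need to persist them, ideally in the same transaction as the sink write, so that a crash between the two cannot lose or double-apply an event.

```python
class IdempotentConsumer:
    """Applies each event at most once, keyed by its event ID."""

    def __init__(self):
        self.processed_ids = set()
        self.total = 0.0

    def handle(self, event):
        """Return True if the event was applied, False if it was a redelivery."""
        if event["event_id"] in self.processed_ids:
            return False
        self.total += event["amount"]
        self.processed_ids.add(event["event_id"])
        return True


# At-least-once delivery: event e1 arrives twice.
deliveries = [
    {"event_id": "e1", "amount": 10.0},
    {"event_id": "e2", "amount": 5.0},
    {"event_id": "e1", "amount": 10.0},  # redelivered after a missed ack
]

consumer = IdempotentConsumer()
applied = [consumer.handle(event) for event in deliveries]
```

The redelivered `e1` is recognized and skipped, so the running total counts each event once even though one was delivered twice.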

Backfills: reprocessing is a feature, not an emergency

You will need to reprocess history: a transformation bug shipped, a new column must be computed for past data, a source corrected its records. If the pipeline wasn't designed for it, a backfill becomes a terrifying bespoke operation; if it was, it's routine. Designing for backfills means: keep the raw layer so history is replayable, make stages parameterized by time window so you can run "reprocess 2024-03" exactly like a normal run, and lean on the idempotency above so the backfill cleanly overwrites instead of duplicating. A pipeline you can't safely rerun for an arbitrary past window isn't finished.
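As a rough sketch of what "parameterized by time window" can look like, the example below runs the same `run_day` function for both scheduled runs and backfills. The names `run_day`, `backfill`, and `daterange`, the dictionaries standing in for the raw and curated layers, and the deliberately buggy transform are all hypothetical.

```python
from datetime import date, timedelta

# Raw layer: immutable history keyed by day (amounts in cents).
RAW = {
    date(2024, 3, 1): [1000, 2000],
    date(2024, 3, 2): [500],
}


def run_day(day, raw, curated, transform):
    """Process one day's partition and overwrite its curated output."""
    curated[day] = sum(transform(value) for value in raw.get(day, []))


def daterange(start, end):
    """Yield each date from start to end, inclusive."""
    day = start
    while day <= end:
        yield day
        day += timedelta(days=1)


def backfill(start, end, raw, curated, transform):
    """A backfill is the normal run, repeated over a past window."""
    for day in daterange(start, end):
        run_day(day, raw, curated, transform)


def buggy_transform(cents):
    return cents * 10  # the bug that shipped


def fixed_transform(cents):
    return cents


curated = {}
backfill(date(2024, 3, 1), date(2024, 3, 2), RAW, curated, buggy_transform)
wrong = dict(curated)

# After the fix, rerun the same window; each day is overwritten in place.
backfill(date(2024, 3, 1), date(2024, 3, 2), RAW, curated, fixed_transform)
```

The backfill needs no special code path: it reads the untouched raw layer and replaces each day's output, so the wrong values disappear without leaving duplicates behind.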

Orchestration: dependencies, retries, SLAs

The pipeline's steps need a conductor, and that's the orchestrator (Airflow and kin). It models the pipeline as a DAG of tasks with dependencies (don't build marts before the core loads), handles retries with backoff for transient failures, enforces SLAs and alerts when a job is late or fails, and provides the scheduling and visibility to operate the whole thing. The orchestrator is also where backfills are triggered (rerun a DAG for a past date range) and where idempotency pays off, because the orchestrator will retry a failed task, and that retry must be safe.
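To show the two core mechanics, dependency ordering and retry with backoff, here is a toy runner built on the standard library's `graphlib` (Python 3.9+). It is not Airflow's API and leaves out scheduling, SLAs, and alerting; `run_with_retries`, `run_dag`, and the task names are hypothetical.

```python
import time
from graphlib import TopologicalSorter


def run_with_retries(task, max_attempts=3, base_delay=0.01, sleep=time.sleep):
    """Run a task, retrying failures with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the failure
            sleep(base_delay * 2 ** (attempt - 1))


def run_dag(tasks, deps, **retry_kwargs):
    """Run tasks in dependency order; deps maps each task to its prerequisites."""
    order = []
    for name in TopologicalSorter(deps).static_order():
        run_with_retries(tasks[name], **retry_kwargs)
        order.append(name)
    return order


attempts = {"core": 0}


def load_core():
    """Fails on the first attempt to simulate a transient error."""
    attempts["core"] += 1
    if attempts["core"] == 1:
        raise ConnectionError("transient failure")


tasks = {"raw": lambda: None, "core": load_core, "marts": lambda: None}
deps = {"raw": set(), "core": {"raw"}, "marts": {"core"}}

# Pass a no-op sleep so the example does not actually wait.
order = run_dag(tasks, deps, sleep=lambda seconds: None)
```

`load_core` runs twice here, which is exactly why the task body has to be idempotent: the runner cannot know how much of the failed attempt took effect.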

Quality gates and failure handling

Two more disciplines separate robust pipelines from fragile ones. Data-quality checks belong in the pipeline, not in a dashboard discovered after the fact: validate schema, null rates, row-count deltas, and business rules between stages, and fail (or quarantine) loudly when data is wrong; a pipeline that cheerfully loads garbage is worse than one that stops. (Data contracts push this upstream to the producer.) And failure handling needs a plan for the records that can't be processed: a malformed event shouldn't crash the run or loop forever; route it to a dead-letter queue for inspection and let the good data flow. Late and out-of-order events, in streaming, are handled by the watermark and windowing model rather than ignored.
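A quality gate between stages can be as small as the sketch below, which checks null rates and the row-count delta against the previous run and raises instead of loading. The thresholds, the column names, and the `QualityGateError` and `check_batch` names are assumptions for illustration; sensible limits depend on the dataset.

```python
class QualityGateError(Exception):
    """Raised when a batch fails a check and must not be loaded."""


def null_rate(rows, column):
    """Fraction of rows where the column is missing or None."""
    return sum(1 for row in rows if row.get(column) is None) / len(rows)


def check_batch(rows, previous_count, required=("order_id", "amount"),
                max_null_rate=0.05, max_count_delta=0.5):
    """Return the row count if the batch passes every check, else raise."""
    if not rows:
        raise QualityGateError("batch is empty")
    for column in required:
        rate = null_rate(rows, column)
        if rate > max_null_rate:
            raise QualityGateError(
                f"{column}: null rate {rate:.1%} exceeds {max_null_rate:.1%}"
            )
    if previous_count > 0:
        delta = abs(len(rows) - previous_count) / previous_count
        if delta > max_count_delta:
            raise QualityGateError(
                f"row count changed by {delta:.1%} versus the previous run"
            )
    return len(rows)


good = [{"order_id": i, "amount": 1.0} for i in range(100)]
bad = [{"order_id": i, "amount": None if i < 10 else 1.0} for i in range(100)]

loaded = check_batch(good, previous_count=90)

try:
    check_batch(bad, previous_count=90)
    rejected = False
except QualityGateError:
    rejected = True  # 10% nulls in `amount` is over the 5% limit
```

Raising is the "fail loudly" option; a quarantine variant would catch the error and move the batch aside for inspection rather than loading it.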

The fragile pipeline assumes the happy path; the robust one assumes failure. The design that breaks in production is the one that presumes each job runs exactly once, in order, on perfectly-shaped, on-time data, and has no answer for retries, malformed records, late arrivals, or reprocessing. Every one of those will happen. The senior instinct is to ask, for each stage, "what happens when this runs twice, gets bad input, or needs to be rerun for last month?" and to design idempotency, a DLQ, and time-windowed reprocessing in from the start, not bolt them on after the first incident.

Observability: know it's healthy before users do

Finally, a production pipeline is observable: freshness/latency metrics (is data arriving on time?), volume metrics (did row counts move suspiciously?), quality metrics (null/error rates), and lineage (when something's wrong downstream, trace it to the source). The goal is to detect a problem from your monitoring before a consumer detects it from a wrong number; a silent pipeline that's quietly broken for a week is the outcome observability exists to prevent.
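The freshness and volume checks can be sketched as a single function that returns alert messages. The two-hour lag limit, the 50% volume tolerance, and the `health_alerts` name are hypothetical; in practice these values would be emitted as metrics to a monitoring system rather than returned as strings.

```python
from datetime import datetime, timedelta, timezone


def health_alerts(latest_event_time, now, today_count, recent_counts,
                  max_lag=timedelta(hours=2), tolerance=0.5):
    """Return a list of alert messages; an empty list means healthy."""
    alerts = []

    # Freshness: how old is the newest record that reached the sink?
    lag = now - latest_event_time
    if lag > max_lag:
        alerts.append(f"freshness: newest record is {lag} old")

    # Volume: compare today's row count to the recent average.
    baseline = sum(recent_counts) / len(recent_counts)
    if abs(today_count - baseline) > tolerance * baseline:
        alerts.append(f"volume: {today_count} rows vs baseline {baseline:.0f}")

    return alerts


now = datetime(2024, 3, 2, 12, 0, tzinfo=timezone.utc)
history = [1000, 980, 1020]

healthy = health_alerts(now - timedelta(minutes=30), now, 1050, history)
stale = health_alerts(now - timedelta(hours=5), now, 200, history)
```

Here `healthy` is empty, while `stale` carries both a freshness and a volume alert, the kind of signal that should reach the on-call engineer before a consumer notices a wrong number.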

What to carry away

Designing a data pipeline is mostly about behavior under failure. Let the freshness requirement pick batch (default) vs streaming (only when truly needed). Build the same skeleton — ingest, layered transform (raw→cleaned→curated), load — driven by an orchestrator with retries and SLAs. Make every stage idempotent so retries and reruns don't duplicate, design backfills as routine (raw layer + time-windowed, idempotent stages), gate on data quality, route bad records to a dead-letter queue, and make the whole thing observable with lineage.

The through-line: the happy path is the easy 20%; the design lives in the 80% that handles retries, bad data, late arrivals, and reprocessing. For the framework see system design for data engineers; for the streaming time-handling, the Dataflow model; and for where the pipeline lands, designing a data warehouse.