System Design for Data Engineers: A Framework for Designing Data Systems

Put a data engineer in a system-design interview — or in front of a blank design doc for a real platform — and the failure mode is almost always the same: they jump straight to drawing boxes (Kafka here, Spark there, Snowflake at the end) before anyone has established what the system is for. The boxes might even be reasonable, but a design that starts from components instead of requirements is a guess. The senior move is to slow down at the start and let the requirements and the access patterns dictate the architecture — because in data systems, they really do.

This is the framework piece; the companion articles design specific systems — a data warehouse, a data pipeline, a data API — using exactly this approach. Here I'll lay out the method: clarify requirements, derive access patterns, place the reference components, and reason about the trade-offs that decide everything. It works whether you're whiteboarding for 45 minutes or architecting for real.

Step 1: clarify requirements (don't skip this)

Before any box goes on the board, pin down two kinds of requirements. Functional: what does the system need to do? Who produces the data, who consumes it, what questions must it answer, what products does it feed? Non-functional: the numbers that actually shape the architecture. The non-functional ones are where designs are won or lost:

Scale — data volume and growth, event/query rate. "1 GB/day" and "10 TB/day" are different systems.
Latency & freshness — how current must the data be? Sub-second, minutes, hourly, daily? This single answer decides batch vs streaming more than anything else.
Query latency — how fast must reads be, and at what concurrency? An internal daily report and a user-facing dashboard at 10k QPS are opposite problems.
Consistency — is eventual consistency acceptable, or must reads be exact and transactional?
Reliability — what's the SLA, and what happens on failure or reprocessing?
Cost — the constraint that quietly vetoes the "ideal" design.

In an interview this is also the highest-scoring phase, and most candidates rush it. Asking "what's the data volume? how fresh does it need to be? what's the read pattern and QPS? is eventual consistency okay?" signals seniority instantly — and the answers genuinely change the design. The same is true in real life: half the over-engineered data platforms I've seen exist because nobody asked "does this actually need to be real-time?" — and the honest answer was no.

Step 2: access patterns drive everything

If there's one principle that separates data system design from generic system design, it's this: the access pattern — how data is written and read — dictates the storage and processing choices, not the other way around. You don't pick Cassandra because it's cool; you pick it because your access pattern is high-volume writes and known-key reads. You don't pick a columnar warehouse because everyone has one; you pick it because your reads are large scans and aggregations over few columns.

So after requirements, ask: how is this data written (append-only events? updates in place? bulk loads?) and how is it read (point lookups by key? big analytical scans? full-text search? graph traversals?). Those two answers map almost mechanically onto the storage engine and the data model — and getting them backwards (a transactional row store for analytics, a columnar warehouse for single-row lookups) is the most common architectural mistake there is. The B-tree vs LSM trade-off is this principle at the storage-engine level.

Step 3: the reference components

Almost every data system is assembled from the same building blocks. Knowing them as a kit lets you compose a design quickly and defend each piece:

graph TD
    SRC["Sources
(apps, DBs, events, APIs)"]
    ING["Ingestion
(CDC, events, batch loads)"]
    LAKE["Storage
(lake / warehouse / lakehouse)"]
    PROC["Processing
(batch + stream)"]
    SERVE["Serving
(warehouse queries, data API, cache)"]
    CONS["Consumers
(BI, apps, ML, agents)"]
    ORCH["Orchestration (schedules, dependencies, retries)"]
    GOV["Governance & observability (catalog, lineage, quality, cost)"]
    SRC --> ING --> LAKE --> PROC --> SERVE --> CONS
    ORCH -. drives .-> ING
    ORCH -. drives .-> PROC
    GOV -. spans .-> LAKE
    GOV -. spans .-> SERVE

The reference data architecture. Data flows sources → ingestion → storage → processing → serving → consumers, with orchestration driving the moving parts and governance/observability spanning the whole thing. Almost any data system design is a specific instantiation of this skeleton — the design work is choosing each box and justifying it from the requirements.

Ingestion — getting data in: log-based CDC from operational DBs, event streams via Kafka, or batch loads.
Storage — a data lake (raw, cheap, flexible), a warehouse (structured, fast analytics), or a lakehouse that blends both over open table formats.
Processing — batch (Spark, dbt) and/or stream (Flink), usually in a medallion-style raw→refined→curated layering.
Serving — warehouse queries for BI, a data API for applications, a low-latency store (Redis) for online reads.
Orchestration — Airflow-style scheduling, dependencies, retries, backfills.
Governance & observability — catalog, lineage, data quality, and cost monitoring across all of it.

Step 4: the trade-offs that decide everything

Good designs aren't about knowing the components — they're about defending the choices, and every defense is a trade-off. The ones that come up in nearly every data design:

Trade-off	One side	Other side
Freshness	Batch — simpler, cheaper, higher latency	Streaming — fresh, complex, costlier
Data model	Normalized — less duplication, more joins	Denormalized — fast reads, more storage/maintenance
Consistency	Strong — correct, costlier/slower	Eventual — fast/available, temporarily stale
Schema	Schema-on-write — safe, rigid	Schema-on-read — flexible, risky
Build vs buy	Managed — fast, ongoing cost	Self-host — control, operational burden

The mark of a strong design isn't picking the "right" side — it's stating the trade-off and tying the choice to a requirement: "freshness requirement is hourly, so batch; if it tightens to seconds, we'd add a streaming path." That sentence is what separates a senior design from a junior one. "It depends" is the correct instinct; the skill is saying what it depends on.

Step 5: scale and reliability

Once the happy path is designed, pressure-test it. Scale in data systems is mostly about partitioning — splitting data (by time, by key) so work parallelizes and queries prune to what they need; the partition/shard key is the highest-leverage decision, and a bad one (hotspots, skew) doesn't scale no matter how many machines you add. Reliability centers on a few disciplines that distinguish toy pipelines from production:

Idempotency — reruns must not double-count. Design so processing the same input twice yields the same result.
Backfills — you will need to reprocess history (bug fix, new column); design for it from day one, not as an emergency.
Retries & dead-lettering — transient failures retry; poison records go to a dead-letter queue, not into an infinite loop.
Late & out-of-order data — for streaming, the watermark/windowing model is how you handle it correctly.

Idempotency and backfills are where "works in the demo" meets "wakes you at 3 a.m." The pipeline that assumes it runs exactly once, in order, on perfect data is the one that double-bills customers after a retry and can't be safely rerun after a fix. Senior data system design treats reprocessing as a first-class requirement: every stage should be safe to run again, and recomputing last month should be a routine operation, not a heroic one. If a design can't answer "what happens when this runs twice?" it isn't done.

What to carry away

Data system design is a repeatable method, not a memorized stack. Clarify requirements first — especially the non-functional numbers (scale, freshness, query latency, consistency, cost) — because they shape everything. Derive the access patterns (how data is written and read) and let them dictate storage and model, never the reverse. Compose from the reference components (ingestion, storage, processing, serving, orchestration, governance). Defend each choice as a trade-off tied to a requirement. Then pressure-test for scale (partitioning) and reliability (idempotency, backfills, retries, late data).

Do that and a blank whiteboard stops being intimidating — you're not inventing an architecture, you're deriving one. The companion pieces apply this end to end to a data warehouse, a data pipeline, and a data API; the internals deep-dives elsewhere on this blog are where you go to defend the component choices in detail.