# Designing a Data Warehouse: Layers, Modeling, Storage, and Serving

"Design a data warehouse for an e-commerce company" is one of the most common data system-design prompts, and it's a good one because a warehouse touches every part of the discipline: ingestion, modeling, storage, transformation, and serving. It's also where people most often produce a tangle — one giant layer where raw data, business logic, and reporting tables all blur together, impossible to debug or trust. A good warehouse design is mostly about *separation*: clear layers, each with one job. I'll design one using the [system-design framework](system-design-for-data-engineers), requirements first.

The shape we're building toward: data lands raw, gets cleaned and integrated into a conformed core, and is reshaped into purpose-built marts for consumption — with columnar storage, incremental loads, and a semantic layer on top. Let's derive it.

## Requirements: it's an analytical system

The defining requirement of a warehouse is the access pattern: **large analytical reads** — scans and aggregations over many rows and few columns — not single-row transactional lookups. That one fact drives most decisions: columnar storage, denormalized marts, and a compute engine built for scans. The non-functional questions that refine it:

- **Freshness** — do analysts need yesterday's data each morning (batch is fine) or near-real-time (add a streaming path)? Usually the former; don't build real-time you don't need.

- **Concurrency** — a handful of analysts, or thousands of dashboard users? This sizes the compute and may push you toward separating workloads.

- **Volume & growth** — gigabytes or petabytes — decides whether one node suffices or you need a distributed/cloud warehouse.

- **Sources & history** — how many source systems, and how much history must be kept and made queryable.

## The layered architecture

The single most important design decision is to **separate the warehouse into layers, each with one responsibility**. The widely-used three-layer (medallion-style) shape:

```mermaid
graph LR
    SRC["Sources(OLTP DBs, events, SaaS APIs)"]
    STG["Staging / Rawland as-is, append-only,1:1 with source"]
    CORE["Core / Integratedcleaned, conformed,business keys & history"]
    MART["Martsdenormalized star schemasper domain / use case"]
    BI["Serving: BI, semantic layer,data API, ML"]
    SRC --> STG --> CORE --> MART --> BI
          
```

The layered warehouse. Raw data lands untouched in staging (so you can always reprocess from source), gets cleaned and conformed in the core (one version of each entity, with history), and is reshaped into denormalized marts for fast, intuitive consumption. Each layer has one job, so failures are localized and logic is debuggable — the opposite of one blurred mega-layer.

- **Staging / raw** — land source data as-is, append-only. This layer is your insurance: because the raw history is preserved, you can always rebuild everything downstream after a bug or a logic change.

- **Core / integrated** — clean, deduplicate, conform (one definition of "customer" across sources), apply business keys and track history. This is the trusted, source-agnostic foundation.

- **Marts** — reshape the core into denormalized, query-friendly tables for specific domains or use cases — typically star schemas.

## ELT, not ETL

A modern cloud warehouse flips the old order: **ELT** (extract, load, *then* transform) rather than ETL. You load raw data into the warehouse first and transform it *inside* the warehouse using its own scalable compute (SQL, dbt). This wins because the warehouse's engine is more powerful and elastic than a separate transform tier, raw data is preserved for reprocessing, and transformations become version-controlled SQL rather than opaque pipeline code. The transform step maps directly onto the layers: staging→core→marts is a DAG of ELT models. (The exception remains genuine streaming or heavy pre-load cleansing, where some transform happens in flight.)

## Modeling: normalized core, dimensional marts

The two layers want opposite modeling. The **core** leans more normalized — less duplication, a clean single source of truth, easier to maintain as the integrated foundation. The **marts** are deliberately **denormalized** into [star schemas](dimensional-modeling-kimball) — facts surrounded by dimensions — because that's what makes analytical queries fast and intuitive for BI tools and humans. This is the framework's normalize-vs-denormalize trade-off resolved per layer: normalize where you maintain truth, denormalize where you serve reads.

And because warehouses must answer "what was true then," the marts carry **slowly changing dimensions** (usually Type 2) so a fact recorded last year still joins to last year's version of the customer. Grain discipline matters most here: declare what one fact row represents and never mix grains, or every aggregate silently double-counts.

## Storage: columnar, partitioned, clustered

The storage choices follow mechanically from "large scans over few columns." **Columnar storage** (the warehouse's native format, or [Parquet/ORC](parquet-orc-internals) on a lakehouse) reads only the referenced columns and compresses them hard — the foundation of analytical speed. On top of that, two layout levers cut what's read:

- **Partitioning** — physically split tables (usually by date) so time-filtered queries prune to the relevant partitions. The most impactful single optimization on a large fact table.

- **Clustering / sort keys** — order data within storage by frequently-filtered columns so the engine skips blocks that can't match (zone maps / min-max pruning).

This is exactly the machinery the cloud warehouses implement internally — [Snowflake's micro-partitions](snowflake-internals) and [BigQuery's Dremel/Capacitor](bigquery-internals-dremel) — and the reason they separate storage from compute, so you scale the scan power independently of the data sitting cheaply in object storage.

## Incremental loads and cost

Reprocessing the entire warehouse every night doesn't scale, so transformations should be **incremental** — process only new or changed data since the last run (by a watermark column or via [CDC](debezium-cdc)), merging into the target. This is where idempotency matters: the merge must be safe to rerun. And because cloud-warehouse cost scales with compute consumed, **cost is a design constraint**, not an afterthought — incremental processing, partition pruning, right-sized compute, and killing runaway queries are the [FinOps](finops-data-platforms) levers that keep the bill sane.

**The anti-pattern that kills warehouses: collapsing the layers.** Under deadline pressure, teams write reporting queries directly against raw source tables, or bury business logic inside BI tools, or build marts that read from other marts in a tangled web. The result is a warehouse nobody can debug, where one definition of "revenue" exists in five slightly different forms and no one trusts the numbers. Keep the layers honest — raw is sacred and untouched, business logic lives in the core, marts serve reads — even when it feels like overhead. The discipline is what makes the warehouse trustworthy and maintainable at year three.

## Serving: the semantic layer

Finally, the marts are consumed — by BI tools, a [data API](designing-a-data-api), or ML. The piece that prevents chaos here is a **semantic layer**: centrally-defined metrics and dimensions ("revenue," "active customer") so every tool computes them the same way, instead of each dashboard re-deriving them and disagreeing. It's the serving-side equivalent of the conformed core — one definition, consumed many ways — and it's what turns a pile of tables into a warehouse the business actually trusts.

## What to carry away

Designing a warehouse is mostly about **separation and the analytical access pattern**. Land data raw (your reprocessing insurance), conform it in a trusted core, and reshape it into denormalized **dimensional marts** for consumption — transforming in-warehouse with **ELT**. Store it **columnar**, **partitioned** by time and **clustered** by common filters so queries read as little as possible; load **incrementally** and idempotently to control cost; and serve through a **semantic layer** so metrics mean one thing everywhere.

The recurring lesson is that the layers aren't bureaucracy — they're what make the warehouse debuggable, reprocessable, and trusted. For the framework behind this walkthrough see [system design for data engineers](system-design-for-data-engineers); for how the warehouses implement the storage internally, the [Snowflake](snowflake-internals) and [BigQuery](bigquery-internals-dremel) deep-dives.
