# Dimensional Modeling: Kimball, Star Schemas, and Slowly Changing Dimensions

Hand a fresh analyst a fully normalized transactional database and ask for "revenue by region by month, year over year," and watch what happens: a six-table join, a wrong number, and an afternoon gone. The schema that's perfect for an application — every fact stored exactly once, no redundancy — is hostile to analysis. Dimensional modeling exists because the question "how do humans want to slice and aggregate data?" has a different answer than "how do we safely record a transaction?" It's the oldest idea in this blog, and it quietly underpins every warehouse, BI tool, and semantic model that came after.

The framework is Ralph Kimball's, and its core is almost embarrassingly simple: split the world into **facts** (the measurements — what happened) and **dimensions** (the context — who, what, where, when), and arrange them as a **star**. The subtlety is all in the details that bite you later: grain, surrogate keys, and how history changes. I'll take them in order.

## Facts and dimensions: the two kinds of data

Dimensional modeling starts by sorting every column into one of two buckets. **Facts** are the numeric measurements of a business process — a sale's amount and quantity, a shipment's weight, a call's duration. They're what you sum, average, and count. **Dimensions** are the descriptive context you filter and group by — the product, the customer, the store, the date. The test is the question itself: in "revenue *by region by month*," revenue is the fact, region and month are dimensions.

A **fact table** holds the measurements plus foreign keys to the dimensions; it's long and narrow and grows forever (one row per event). A **dimension table** holds the descriptive attributes; it's wide and relatively short (one row per product, per customer). This split is the whole game — once you've decided what's a fact and what's a dimension, the schema almost designs itself.

## The star schema: why denormalization wins here

Arrange one fact table in the center with dimension tables radiating out, each joined by a single key, and you have a **star schema**. Crucially, the dimensions are *denormalized* — a product dimension carries category, subcategory, brand, and supplier all in one flat table, with the redundancy that a normalized design would forbid. That redundancy is the point.

```mermaid
graph TD
    F["FACT_SALES(grain: one row per order line)date_key, product_key, store_key,customer_key, quantity, amount"]
    D1["DIM_DATEdate, month, quarter, year, weekday"]
    D2["DIM_PRODUCTproduct, category, brand, supplier"]
    D3["DIM_STOREstore, city, region, country"]
    D4["DIM_CUSTOMERcustomer, segment, since"]
    D1 --> F
    D2 --> F
    D3 --> F
    D4 --> F
          
```

A star schema. One central fact table holds the measurements and foreign keys; denormalized dimension tables surround it. A query like "revenue by region by quarter" is a single fact-to-two-dimensions join — simple for the engine and simple for a human to reason about. The shape is deliberately flat so analytical queries stay shallow.

Why denormalize? Two reasons. First, **query simplicity**: any analytical question becomes a fact table joined to a few dimensions — never a deep chain of joins through bridge tables. Second, **performance**: fewer, simpler joins are far cheaper, and the predictable shape is something query engines (and later, columnar warehouses) optimize for aggressively. You trade storage and some update complexity for fast, legible analytics — exactly the right trade for a system that's read far more than written.

Normalizing the dimensions instead — splitting product into product → subcategory → category tables — gives you a **snowflake schema**. It saves a little space and pleases the part of your brain trained on OLTP, but it adds joins and rarely pays off. Default to the star; snowflake only a dimension when it's genuinely enormous and shared.

## Grain: the decision everything depends on

Before you add a single column to a fact table, you must declare its **grain** — what exactly one row represents. "One row per order line." "One row per daily account balance." "One row per call." This is the most important sentence in the whole model, and getting it wrong is the most common, most expensive mistake I see.

Why it matters so much: the grain fixes which dimensions can attach (you can only join dimensions that make sense at that grain) and, fatally, which facts are additive. If you accidentally mix grains — storing both order-line rows and order-header totals in one table — every `SUM` double-counts and no one notices until the numbers are challenged in a board meeting. Declare the grain first, make it as atomic (fine) as you can afford, and never let two grains share a table.

**Beware non-additive and semi-additive facts.** Most facts add up across every dimension (sales amount sums across product, store, and time — great). But some don't. A *ratio* or *percentage* is non-additive — you can't sum margins and get a meaningful number; you must sum the components and recompute. An *account balance* or *inventory level* is semi-additive — it sums across product and store but *not* across time (summing Monday's and Tuesday's balance is nonsense; you average or take period-end). Storing these as if they were fully additive is how dashboards end up confidently wrong.

## Surrogate keys: don't trust the source's IDs

Each dimension row gets a **surrogate key** — a meaningless integer the warehouse generates — as its primary key, and the fact table references that, not the source system's natural key (the SKU, the customer number). This feels like pointless indirection until you hit the reasons it's non-negotiable:

- **Source keys change and collide.** Systems get migrated, merged, and reissue IDs; a surrogate key insulates the warehouse from all of it.

- **You need to track history.** A natural key can only point to one version of a customer; a surrogate key lets you keep many versions of the same customer over time — which is exactly what slowly changing dimensions need.

- **Performance.** A narrow integer join key beats a wide composite or string natural key, especially as the fact table grows into the billions.

## Slowly changing dimensions: handling history

Here's the question that separates a toy model from a real one: when a customer moves from the West region to the East, what happens to last year's sales? Were they "East" all along, or should history stay "West"? There's no universally right answer — it's a business decision — and Kimball's **slowly changing dimension (SCD)** types are the menu of ways to handle it.

| Type | Behavior | Effect on history |
| --- | --- | --- |
| **Type 1** | Overwrite the old value | History is lost — past sales now show "East." Simple; use when the old value was just wrong. |
| **Type 2** | Add a new dimension row (new surrogate key) with effective-from/to dates and a "current" flag | History is preserved — old facts keep pointing at the "West" row, new facts at "East." The workhorse. |
| **Type 3** | Add a "previous value" column | Keeps only the prior value — rare, for when you need limited before/after analysis. |

**Type 2 is the one you'll use most**, and it's why surrogate keys matter: each version of the customer is a separate row with its own key, so a fact recorded in 2017 forever joins to the 2017 version. The "current flag" and date range let you ask either "what was true then" or "what's true now." It's more machinery, but it's the only honest way to keep history when the world changes underneath you.

```sql
-- A Type 2 dimension row carries versioning columns
-- DIM_CUSTOMER
-- customer_key (surrogate) | customer_id (natural) | name | region
--                          | valid_from | valid_to | is_current

-- The same customer, two versions:
-- 5001 | C-42 | Acme | West | 2015-01-01 | 2017-06-30 | false
-- 7012 | C-42 | Acme | East | 2017-07-01 | 9999-12-31 | true
-- A 2016 sale joins on customer_key = 5001 → correctly "West"
```

## What to carry away

Dimensional modeling answers a different question than transactional design: not "how do we record this safely?" but "how do humans want to slice and aggregate it?" Split data into **facts** (measurements) and **dimensions** (context), arrange them as a denormalized **star** so analytical queries stay shallow and fast, and make three decisions deliberately — declare the **grain** first and never mix it, use **surrogate keys** so the warehouse owns its identities, and pick an **SCD strategy** (usually Type 2) so history survives change.

None of this has aged out. The star schema is what a BI semantic model expects, what columnar warehouses are tuned for, and the shape the marts at the top of a [Data Vault](data-vault-snowflake-dbt) are built into. Whether you're laying out tables in [Redshift](redshift-schema-design) or designing a model behind [Power BI](power-bi-semantic-models), you're doing dimensional modeling — so it's worth doing on purpose.
