# ClickHouse Internals: How a Columnar OLAP Engine Actually Hits Billions of Rows a Second

🟡 This is Part 1 of a 3-part series: ClickHouse Deep Dive

1. Internals: How a Columnar OLAP Engine Actually Works (you are here)

1. [Schema Design, ORDER BY, and Query Optimization](clickhouse-schema-query-optimization)

1. [Insert Performance & Real-Time Streaming](clickhouse-ingestion-streaming)

The first time you run a `GROUP BY` over a billion-row table in ClickHouse and it comes back in under a second, the honest reaction is suspicion. Surely it's lying — caching, sampling, something. It isn't. ClickHouse really did scan the columns it needed, decompress them, and aggregate, all in the time a row-store would still be planning. The speed isn't a trick; it's a set of design decisions that compound. This article is about those decisions, because once you can see them, ClickHouse stops being magic and starts being predictable — which is exactly what you need when a query that should be fast isn't.

I'm going to open four boxes in order: **columnar storage** (why reading a column is cheap), the **MergeTree part-and-merge model** (how rows land on disk and stay sorted), the **sparse primary index and granules** (how ClickHouse decides what *not* to read), and **vectorized execution** (how it crunches what's left). Then we trace one query through all of them. [Part 2](clickhouse-schema-query-optimization) turns this into schema and query tuning; [Part 3](clickhouse-ingestion-streaming) turns it into an ingestion and streaming playbook.

## Columnar storage: read the column, not the row

A row store (Postgres, MySQL) keeps all of a row's fields together on a page. Great for "fetch this one order and everything about it." Terrible for "average `amount` across 800 million orders," because to read one column you drag every other column through memory with it.

ClickHouse stores each column in its own file. The `amount` column is a contiguous stream of `amount` values; `customer_id` is a separate stream; and so on. An analytical query touches a handful of columns out of dozens, so it reads a handful of files and ignores the rest entirely. You pay I/O only for the data the query actually references — the single biggest reason OLAP wants columns.

Storing values of one type together has a second payoff that matters as much as the first: **compression gets dramatically better**. A column of timestamps, or of a few hundred distinct status strings, is far more regular than a row's worth of mixed types. Better compression means fewer bytes off disk, and disk (or object storage) bandwidth is usually the real ceiling. Columns turn one logical read into less physical I/O twice over — fewer columns, and each column smaller.

## MergeTree: parts, sorting, and background merges

**MergeTree** is the storage engine behind essentially every serious ClickHouse table, and the entire engine family (`ReplacingMergeTree`, `SummingMergeTree`, `AggregatingMergeTree`, and the rest, all covered in [Part 2](clickhouse-schema-query-optimization)) is built on its mechanics. Understand MergeTree and you understand 90% of ClickHouse's on-disk behavior.

The model is this. Every `INSERT` writes a new **part** — an immutable, self-contained directory on disk holding the inserted rows, already **sorted by the table's `ORDER BY` key**. A part contains the per-column data files (`.bin`), the marks files that index into them (`.mrk2`), and the primary index (`primary.idx`). Parts are never modified in place. To change data you write new parts; to delete you write tombstones that take effect on the next merge.

Left alone, frequent inserts would produce thousands of small parts, and a query would have to open all of them. So a background process continuously **merges** small parts into larger ones — reading several sorted parts and producing one bigger sorted part, then dropping the originals. This is conceptually like LSM-tree compaction, with one important difference: ClickHouse doesn't buffer rows in an in-memory memtable first. Each insert is flushed directly as a sorted part on disk, and merging happens entirely in the background afterward.

```mermaid
graph TD
    I1["INSERT #1part: rows sorted by ORDER BY"]
    I2["INSERT #2part"]
    I3["INSERT #3part"]
    I4["INSERT #4part"]
    M1["Merged part(1 + 2)"]
    M2["Merged part(3 + 4)"]
    BIG["Larger merged part(still sorted by ORDER BY)"]
    I1 --> M1
    I2 --> M1
    I3 --> M2
    I4 --> M2
    M1 --> BIG
    M2 --> BIG
          
```

Each insert lands as its own sorted, immutable part. A background merge process combines parts into progressively larger ones, keeping the part count low and the data sorted. Two facts fall out of this and drive everything in Part 3: inserts must be batched (one part per insert — tiny inserts make tiny parts and trigger the "too many parts" error), and a great deal of work, including deduplication and aggregation in the specialized engines, happens *at merge time*, not at insert time.

**The number-one operational mistake** with ClickHouse is inserting row-by-row or in tiny batches. Each insert is a part; thousands of tiny parts overwhelm the merge process and you hit `Too many parts`. ClickHouse wants few, large inserts — thousands to hundreds of thousands of rows at a time. [Part 3](clickhouse-ingestion-streaming) is largely about how to guarantee that, including async inserts and the Kafka engine.

## The sparse primary index and granules: how ClickHouse skips data

Here is the part that surprises people coming from row-store databases. ClickHouse's primary index is **not** a B-tree, and it does **not** point at individual rows. It's a **sparse** index, and that design is central to the speed.

Within a part, rows are divided into **granules** — blocks of `index_granularity` rows, **8192 by default**. The granule is the smallest unit ClickHouse reads. The primary index stores just one entry per granule: the value of the sorting key at the *first* row of that granule. So a part with 8 million rows has roughly 1,000 index entries, not 8 million. The whole index is tiny and lives in memory.

Because the data is physically sorted by the `ORDER BY` key, those sparse index entries are monotonic, and ClickHouse can binary-search them. For a query with `WHERE event_date = '2024-09-01'`, it finds the range of granules whose key range could contain that date and **reads only those granules** — every other granule is skipped without being touched. The **marks file** (`.mrk2`) is the bridge: it maps each granule to the exact byte offset of its compressed block in each column's `.bin` file, so ClickHouse can seek straight to the granules it wants and decompress only those.

```mermaid
graph LR
    Q["Query:WHERE date = '2024-09-01'"]
    PI["Sparse primary index(1 entry per 8192-row granule,in memory)"]
    SEL["Binary search →select matching granules"]
    MRK["Marks file (.mrk2)granule → byte offset"]
    BIN["Read + decompressONLY selected granulesfrom column .bin files"]
    Q --> PI --> SEL --> MRK --> BIN
          
```

The query path before any data is crunched. The sparse index narrows billions of rows to a handful of granules; the marks file turns those granules into precise byte offsets; only those compressed blocks are read and decompressed. This is why the `ORDER BY` key is the single most important schema decision in ClickHouse — it decides what can be skipped. Part 2 is built on this sentence.

This is the crux: **the `ORDER BY` key is the primary index.** It defines the physical sort order, and therefore which queries get to skip most of the table and which are forced into a full scan. A query whose filter aligns with the leading `ORDER BY` columns reads a sliver; a query that filters on something the data isn't sorted by reads everything. Every schema-design lever in [Part 2](clickhouse-schema-query-optimization) is downstream of this fact.

### Skip indexes: a second layer of pruning

The primary index only helps for the leading sort columns. For secondary columns you frequently filter on, ClickHouse offers **data-skipping indexes** — `minmax`, `set`, and bloom-filter variants — that store a small summary (min/max, a value set, a probabilistic membership filter) per block of granules. Before reading, ClickHouse consults the skip index and discards granule-blocks that provably can't match. They don't point at rows; like the primary index, they only ever let ClickHouse read *less*. We tune these in Part 2.

## Compression and codecs: fewer bytes off disk

Every column's data is stored compressed, in blocks. The default codec is **LZ4** — extremely fast to decompress, which is the right default when you're I/O-bound. **ZSTD** trades CPU for a higher compression ratio, worthwhile for cold or rarely read columns where storage and bandwidth dominate.

Beyond general-purpose compression, ClickHouse has **specialized codecs** that exploit the structure of a column before the general codec runs — and they're a genuine superpower for the right data:

| Codec | What it does | Ideal column |
| --- | --- | --- |
| `Delta` | Stores differences between consecutive values | Slowly increasing IDs, counters |
| `DoubleDelta` | Stores the difference of differences | Regular timestamps (near-constant intervals) |
| `Gorilla` | XOR-based encoding of floats that change slightly | Metrics / sensor gauges |
| `T64` | Transposes and bit-packs integers to drop unused high bits | Integers with limited real range |

You stack them: `CODEC(DoubleDelta, ZSTD)` on a timestamp column will often compress an order of magnitude better than the default, because `DoubleDelta` turns regular timestamps into a stream of near-zeros that ZSTD then crushes. This is the kind of per-column tuning that quietly halves a cluster's storage bill — and we make it concrete in [Part 2](clickhouse-schema-query-optimization).

## Vectorized execution: crunching what survives

Skipping decides how little ClickHouse reads. **Vectorized execution** decides how fast it processes what's left. Classic databases interpret a query row by row, paying per-row function-call and branching overhead millions of times. ClickHouse processes data in **blocks** — chunks of many thousands of rows from each column — running each operation across a whole block in a tight loop.

That tight, type-specialized loop over a contiguous array of one type is exactly what a modern CPU is good at: it stays in cache, the branch predictor wins, and the compiler emits SIMD instructions that apply one operation to multiple values at once. Filtering, arithmetic, and aggregation all run column-at-a-time over blocks rather than row-at-a-time. Columnar storage and vectorized execution are two halves of the same idea — store data in columns so you can process it in vectors.

## Tracing one query end to end

Put the boxes together. You run, against a billion-row events table with `ORDER BY (event_date, customer_id)`:

```sql
SELECT customer_id, count() AS events, sum(amount) AS total
FROM events
WHERE event_date = '2024-09-01'
GROUP BY customer_id
ORDER BY total DESC
LIMIT 100;
```

The journey:

1. **Analyze & plan.** ClickHouse parses the query and sees the filter on `event_date`, the leading `ORDER BY` column — the ideal case for primary-index pruning.

1. **Select parts and granules.** Across all active parts, it binary-searches the sparse primary index and selects only the granules whose key range covers `2024-09-01`. A billion rows collapses to the few thousand granules of that day. Any applicable skip indexes prune further.

1. **Resolve byte offsets via marks.** The marks files turn the selected granules into exact offsets in the `.bin` files of the three columns the query touches — `event_date`, `customer_id`, `amount`. No other column is read.

1. **Read & decompress.** Only those compressed blocks are pulled off disk and decompressed (LZ4 by default — fast).

1. **Vectorized aggregate.** ClickHouse streams the blocks through a vectorized `GROUP BY`, building per-`customer_id` count and sum, in parallel across CPU cores and across parts.

1. **Merge, sort, limit.** Per-core partial aggregates merge, the top 100 by `total` are selected, and the result returns.

Almost all of the runtime was decided at step 2 — how many granules survived pruning. That's the recurring lesson: in ClickHouse, the fastest work is the work you never do, and the `ORDER BY` key governs how much you get to skip.

## What to take into Part 2

Three sentences carry the whole article. **Columns plus compression mean you read only the data the query references, and less of it.** **The sparse primary index over the `ORDER BY` key decides which granules you read at all** — pruning is most of performance, and it's a schema decision. **Vectorized block execution makes the surviving work fast**, in cache-friendly SIMD loops.

Hold those and ClickHouse tuning stops being guesswork. In [Part 2](clickhouse-schema-query-optimization) we turn them into concrete schema decisions — choosing the right engine from the MergeTree family, designing the `ORDER BY` key and partitioning, picking data types and codecs that shrink and skip, adding skip indexes, and navigating ClickHouse's genuinely unusual JOIN behavior.

🟡 Continue the series

1. Internals: How a Columnar OLAP Engine Actually Works (this article)

1. [Schema Design, ORDER BY, and Query Optimization →](clickhouse-schema-query-optimization)

1. [Insert Performance & Real-Time Streaming →](clickhouse-ingestion-streaming)

## Frequently asked questions

### How does ClickHouse skip data it doesn't need to read?

Its primary index is sparse — one mark per granule of about 8,192 rows rather than per row. For a query filtered on the sorting key, ClickHouse searches the marks to find only the granules that can contain matches, so most of the table is never read.

### What is the MergeTree parts-and-merges model?

Every insert writes a new immutable, sorted part, and a background process continually merges smaller parts into larger ones while keeping data ordered by the ORDER BY key. This is why many tiny inserts hurt — they create too many parts — and why you batch inserts.

### Why is ClickHouse so fast on aggregations?

Columnar storage reads only the referenced columns and compresses them with type-aware codecs, the sparse index skips irrelevant granules, and a vectorized engine processes data in cache-friendly blocks using SIMD. Most of the runtime is decided before any data is actually read.
