ClickHouse Internals: How a Columnar OLAP Engine Actually Hits Billions of Rows a Second

🟡 This is Part 1 of a 3-part series: ClickHouse Deep Dive

Internals: How a Columnar OLAP Engine Actually Works (you are here)
Schema Design, ORDER BY, and Query Optimization
Insert Performance & Real-Time Streaming

The first time you run a GROUP BY over a billion-row table in ClickHouse and it comes back in under a second, the honest reaction is suspicion. Surely it's lying — caching, sampling, something. It isn't. ClickHouse really did scan the columns it needed, decompress them, and aggregate, all in the time a row-store would still be planning. The speed isn't a trick; it's a set of design decisions that compound. This article is about those decisions, because once you can see them, ClickHouse stops being magic and starts being predictable — which is exactly what you need when a query that should be fast isn't.

I'm going to open four boxes in order: columnar storage (why reading a column is cheap), the MergeTree part-and-merge model (how rows land on disk and stay sorted), the sparse primary index and granules (how ClickHouse decides what not to read), and vectorized execution (how it crunches what's left). Then we trace one query through all of them. Part 2 turns this into schema and query tuning; Part 3 turns it into an ingestion and streaming playbook.

Columnar storage: read the column, not the row

A row store (Postgres, MySQL) keeps all of a row's fields together on a page. Great for "fetch this one order and everything about it." Terrible for "average amount across 800 million orders," because to read one column you drag every other column through memory with it.

ClickHouse stores each column in its own file. The amount column is a contiguous stream of amount values; customer_id is a separate stream; and so on. An analytical query touches a handful of columns out of dozens, so it reads a handful of files and ignores the rest entirely. You pay I/O only for the data the query actually references — the single biggest reason OLAP wants columns.

Storing values of one type together has a second payoff that matters as much as the first: compression gets dramatically better. A column of timestamps, or of a few hundred distinct status strings, is far more regular than a row's worth of mixed types. Better compression means fewer bytes off disk, and disk (or object storage) bandwidth is usually the real ceiling. Columns turn one logical read into less physical I/O twice over — fewer columns, and each column smaller.

MergeTree: parts, sorting, and background merges

MergeTree is the storage engine behind essentially every serious ClickHouse table, and the entire engine family (ReplacingMergeTree, SummingMergeTree, AggregatingMergeTree, and the rest, all covered in Part 2) is built on its mechanics. Understand MergeTree and you understand 90% of ClickHouse's on-disk behavior.

The model is this. Every INSERT writes a new part — an immutable, self-contained directory on disk holding the inserted rows, already sorted by the table's ORDER BY key. A part contains the per-column data files (.bin), the marks files that index into them (.mrk2), and the primary index (primary.idx). Parts are never modified in place. To change data you write new parts; to delete you write tombstones that take effect on the next merge.

Left alone, frequent inserts would produce thousands of small parts, and a query would have to open all of them. So a background process continuously merges small parts into larger ones — reading several sorted parts and producing one bigger sorted part, then dropping the originals. This is conceptually like LSM-tree compaction, with one important difference: ClickHouse doesn't buffer rows in an in-memory memtable first. Each insert is flushed directly as a sorted part on disk, and merging happens entirely in the background afterward.

graph TD
    I1["INSERT #1
part: rows sorted by ORDER BY"]
    I2["INSERT #2
part"]
    I3["INSERT #3
part"]
    I4["INSERT #4
part"]
    M1["Merged part
(1 + 2)"]
    M2["Merged part
(3 + 4)"]
    BIG["Larger merged part
(still sorted by ORDER BY)"]
    I1 --> M1
    I2 --> M1
    I3 --> M2
    I4 --> M2
    M1 --> BIG
    M2 --> BIG

Each insert lands as its own sorted, immutable part. A background merge process combines parts into progressively larger ones, keeping the part count low and the data sorted. Two facts fall out of this and drive everything in Part 3: inserts must be batched (one part per insert — tiny inserts make tiny parts and trigger the "too many parts" error), and a great deal of work, including deduplication and aggregation in the specialized engines, happens at merge time, not at insert time.

The number-one operational mistake with ClickHouse is inserting row-by-row or in tiny batches. Each insert is a part; thousands of tiny parts overwhelm the merge process and you hit Too many parts. ClickHouse wants few, large inserts — thousands to hundreds of thousands of rows at a time. Part 3 is largely about how to guarantee that, including async inserts and the Kafka engine.

The sparse primary index and granules: how ClickHouse skips data

Here is the part that surprises people coming from row-store databases. ClickHouse's primary index is not a B-tree, and it does not point at individual rows. It's a sparse index, and that design is central to the speed.

Within a part, rows are divided into granules — blocks of index_granularity rows, 8192 by default. The granule is the smallest unit ClickHouse reads. The primary index stores just one entry per granule: the value of the sorting key at the first row of that granule. So a part with 8 million rows has roughly 1,000 index entries, not 8 million. The whole index is tiny and lives in memory.

Because the data is physically sorted by the ORDER BY key, those sparse index entries are monotonic, and ClickHouse can binary-search them. For a query with WHERE event_date = '2024-09-01', it finds the range of granules whose key range could contain that date and reads only those granules — every other granule is skipped without being touched. The marks file (.mrk2) is the bridge: it maps each granule to the exact byte offset of its compressed block in each column's .bin file, so ClickHouse can seek straight to the granules it wants and decompress only those.

graph LR
    Q["Query:
WHERE date = '2024-09-01'"]
    PI["Sparse primary index
(1 entry per 8192-row granule,
in memory)"]
    SEL["Binary search →
select matching granules"]
    MRK["Marks file (.mrk2)
granule → byte offset"]
    BIN["Read + decompress
ONLY selected granules
from column .bin files"]
    Q --> PI --> SEL --> MRK --> BIN

The query path before any data is crunched. The sparse index narrows billions of rows to a handful of granules; the marks file turns those granules into precise byte offsets; only those compressed blocks are read and decompressed. This is why the ORDER BY key is the single most important schema decision in ClickHouse — it decides what can be skipped. Part 2 is built on this sentence.

This is the crux: the ORDER BY key is the primary index. It defines the physical sort order, and therefore which queries get to skip most of the table and which are forced into a full scan. A query whose filter aligns with the leading ORDER BY columns reads a sliver; a query that filters on something the data isn't sorted by reads everything. Every schema-design lever in Part 2 is downstream of this fact.

Skip indexes: a second layer of pruning

The primary index only helps for the leading sort columns. For secondary columns you frequently filter on, ClickHouse offers data-skipping indexes — minmax, set, and bloom-filter variants — that store a small summary (min/max, a value set, a probabilistic membership filter) per block of granules. Before reading, ClickHouse consults the skip index and discards granule-blocks that provably can't match. They don't point at rows; like the primary index, they only ever let ClickHouse read less. We tune these in Part 2.

Compression and codecs: fewer bytes off disk

Every column's data is stored compressed, in blocks. The default codec is LZ4 — extremely fast to decompress, which is the right default when you're I/O-bound. ZSTD trades CPU for a higher compression ratio, worthwhile for cold or rarely read columns where storage and bandwidth dominate.

Beyond general-purpose compression, ClickHouse has specialized codecs that exploit the structure of a column before the general codec runs — and they're a genuine superpower for the right data:

Codec	What it does	Ideal column
`Delta`	Stores differences between consecutive values	Slowly increasing IDs, counters
`DoubleDelta`	Stores the difference of differences	Regular timestamps (near-constant intervals)
`Gorilla`	XOR-based encoding of floats that change slightly	Metrics / sensor gauges
`T64`	Transposes and bit-packs integers to drop unused high bits	Integers with limited real range

You stack them: CODEC(DoubleDelta, ZSTD) on a timestamp column will often compress an order of magnitude better than the default, because DoubleDelta turns regular timestamps into a stream of near-zeros that ZSTD then crushes. This is the kind of per-column tuning that quietly halves a cluster's storage bill — and we make it concrete in Part 2.

Vectorized execution: crunching what survives

Skipping decides how little ClickHouse reads. Vectorized execution decides how fast it processes what's left. Classic databases interpret a query row by row, paying per-row function-call and branching overhead millions of times. ClickHouse processes data in blocks — chunks of many thousands of rows from each column — running each operation across a whole block in a tight loop.

That tight, type-specialized loop over a contiguous array of one type is exactly what a modern CPU is good at: it stays in cache, the branch predictor wins, and the compiler emits SIMD instructions that apply one operation to multiple values at once. Filtering, arithmetic, and aggregation all run column-at-a-time over blocks rather than row-at-a-time. Columnar storage and vectorized execution are two halves of the same idea — store data in columns so you can process it in vectors.

Tracing one query end to end

Put the boxes together. You run, against a billion-row events table with ORDER BY (event_date, customer_id):

SELECT customer_id, count() AS events, sum(amount) AS total
FROM events
WHERE event_date = '2024-09-01'
GROUP BY customer_id
ORDER BY total DESC
LIMIT 100;

The journey:

Analyze & plan. ClickHouse parses the query and sees the filter on event_date, the leading ORDER BY column — the ideal case for primary-index pruning.
Select parts and granules. Across all active parts, it binary-searches the sparse primary index and selects only the granules whose key range covers 2024-09-01. A billion rows collapses to the few thousand granules of that day. Any applicable skip indexes prune further.
Resolve byte offsets via marks. The marks files turn the selected granules into exact offsets in the .bin files of the three columns the query touches — event_date, customer_id, amount. No other column is read.
Read & decompress. Only those compressed blocks are pulled off disk and decompressed (LZ4 by default — fast).
Vectorized aggregate. ClickHouse streams the blocks through a vectorized GROUP BY, building per-customer_id count and sum, in parallel across CPU cores and across parts.
Merge, sort, limit. Per-core partial aggregates merge, the top 100 by total are selected, and the result returns.

Almost all of the runtime was decided at step 2 — how many granules survived pruning. That's the recurring lesson: in ClickHouse, the fastest work is the work you never do, and the ORDER BY key governs how much you get to skip.

What to take into Part 2

Three sentences carry the whole article. Columns plus compression mean you read only the data the query references, and less of it. The sparse primary index over the ORDER BY key decides which granules you read at all — pruning is most of performance, and it's a schema decision. Vectorized block execution makes the surviving work fast, in cache-friendly SIMD loops.

Hold those and ClickHouse tuning stops being guesswork. In Part 2 we turn them into concrete schema decisions — choosing the right engine from the MergeTree family, designing the ORDER BY key and partitioning, picking data types and codecs that shrink and skip, adding skip indexes, and navigating ClickHouse's genuinely unusual JOIN behavior.

🟡 Continue the series

Internals: How a Columnar OLAP Engine Actually Works (this article)
Schema Design, ORDER BY, and Query Optimization →
Insert Performance & Real-Time Streaming →

Frequently asked questions

How does ClickHouse skip data it doesn't need to read?

Its primary index is sparse — one mark per granule of about 8,192 rows rather than per row. For a query filtered on the sorting key, ClickHouse searches the marks to find only the granules that can contain matches, so most of the table is never read.

What is the MergeTree parts-and-merges model?

Every insert writes a new immutable, sorted part, and a background process continually merges smaller parts into larger ones while keeping data ordered by the ORDER BY key. This is why many tiny inserts hurt — they create too many parts — and why you batch inserts.

Why is ClickHouse so fast on aggregations?

Columnar storage reads only the referenced columns and compresses them with type-aware codecs, the sparse index skips irrelevant granules, and a vectorized engine processes data in cache-friendly blocks using SIMD. Most of the runtime is decided before any data is actually read.