ClickHouse Optimization: Engines, ORDER BY, Data Types, and the JOIN Trap

🟡 This is Part 2 of a 3-part series: ClickHouse Deep Dive

Internals: How a Columnar OLAP Engine Actually Works
Schema Design, ORDER BY, and Query Optimization (you are here)
Insert Performance & Real-Time Streaming

Most ClickHouse performance problems I'm asked to look at aren't query problems. They're schema problems wearing a query costume. The ORDER BY key doesn't match how anyone queries the table, the data types are twice as wide as they need to be, someone wrote a JOIN with the big table on the right, and no amount of query rewriting will fully fix any of that. ClickHouse rewards getting the table right up front and punishes fixing it later — changing a sort key means rebuilding the table.

Part 1 established the mechanics: columnar storage, the MergeTree part-and-merge model, and the sparse primary index over the ORDER BY key that decides how many granules a query reads. This part turns those mechanics into decisions — engine, sort key, partitioning, types, indexes, and joins — roughly in the order you'd make them when designing a table. Part 3 then covers getting data in.

Choosing an engine from the MergeTree family

Every variant inherits MergeTree's parts, sorting, and merges — they differ in what extra work happens at merge time. Picking the right one means the engine maintains a useful invariant for you in the background instead of you doing it in every query.

Engine	What it does at merge	Use for
`MergeTree`	Nothing extra — just sorts and merges	Immutable event/log data (the default)
`ReplacingMergeTree`	Keeps the last row per sorting key (by optional version column)	Mutable rows / upserts, CDC mirrors
`SummingMergeTree`	Sums numeric columns for rows sharing the sorting key	Pre-aggregated counters by dimension
`AggregatingMergeTree`	Merges aggregate states (with `AggregateFunction` columns)	Materialized-view rollups (avg, uniq, quantiles)
`CollapsingMergeTree`	Cancels paired rows via a `sign` column (+1/−1)	Frequently mutating state where you track deltas

The merge-time gotcha that bites everyone: "keeps the last row" and "sums the rows" happen during background merges, on an unknown schedule — not at insert and not at query time. Until parts merge, a ReplacingMergeTree can still return both the old and new version of a row. So you must query as if deduplication hasn't happened yet: use FINAL (correct but forces a merge-on-read and is slower), or aggregate around it with argMax(col, version) and a GROUP BY on the key. Treat these engines as "eventually consolidated," never "consistent on read."

The ORDER BY key: the most important decision you'll make

From Part 1: the ORDER BY key is the primary index, it defines the physical sort order, and it determines which granules a query can skip. Get it right and queries read slivers; get it wrong and they full-scan. Three rules carry most of the value.

1. Lead with the columns you filter on most. Pruning only works on a prefix of the sort key. If 90% of queries filter by event_date and then tenant_id, that prefix order is your sort key. A column you never filter on doesn't belong near the front.

2. Order from lower to higher cardinality. Putting low-cardinality columns first keeps long runs of equal values together, which both compresses better and keeps granule key-ranges tight (so skipping is sharper). A common, effective shape is ORDER BY (tenant_id, event_date, user_id) — coarse filter, then time, then the high-cardinality identifier.

3. The primary key can be a prefix of ORDER BY. ClickHouse lets PRIMARY KEY be shorter than ORDER BY — the sort uses all columns, but the in-memory index only stores the prefix. Useful when you want fine sorting (good compression, dedup keys) without bloating the index with a high-cardinality trailing column.

CREATE TABLE events
(
    tenant_id     UInt32,
    event_date    Date,
    user_id       UInt64,
    event_type    LowCardinality(String),
    amount        Decimal(18, 2),
    ts            DateTime CODEC(DoubleDelta, ZSTD(1))
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(event_date)         -- coarse: one part-group per month
ORDER BY (tenant_id, event_date, user_id) -- the index: filter prefix, low→high cardinality
SETTINGS index_granularity = 8192;

PARTITION BY is not for query speed

This trips up people arriving from other warehouses. In ClickHouse, PARTITION BY is primarily a data-management tool, not a query-acceleration tool — it controls how data is grouped for TTL expiry, efficient DROP PARTITION, and merge locality. Granule-level skipping via the ORDER BY index is what makes queries fast. Partition coarsely — by month is the standard — and resist the urge to partition by day or by a high-cardinality column. Over-partitioning creates an explosion of tiny parts that can't merge across partition boundaries, and you're back to the "too many parts" pain from Part 1.

Data types: narrower is faster, and Nullable isn't free

Because everything is columnar and compressed, type choices directly move both storage and scan cost. A few rules earn their keep on every table:

Use the narrowest integer that fits. UInt32 over UInt64, Int16 over Int32 — half the type is roughly half the bytes scanned. Don't reach for Int64 reflexively.
LowCardinality(String) for columns with few distinct values. Status, country, event_type, device. It dictionary-encodes the column — storing small integer codes plus a dictionary — which slashes storage and speeds up GROUP BY and filtering on that column. This single change is one of the highest-leverage edits in ClickHouse, often a multiple-x win on a wide string column.
Avoid Nullable unless you truly need to distinguish null from a default. Nullable adds a separate bitmap column and blocks some optimizations. A sentinel value (0, empty string, an epoch date) is frequently cheaper and just as correct.
Reach for codecs on the obvious columns. DoubleDelta for regular timestamps, Delta for monotonic IDs, Gorilla for slowly drifting gauges — stacked with ZSTD, as covered in Part 1.

Skip indexes: pruning beyond the sort prefix

The primary index only prunes on leading ORDER BY columns. For a column you filter on that isn't in that prefix, a data-skipping index gives ClickHouse a second chance to read less. You attach one to a column with a type and a GRANULARITY (how many granules each index block summarizes):

ALTER TABLE events
  ADD INDEX idx_amount amount TYPE minmax GRANULARITY 4;

ALTER TABLE events
  ADD INDEX idx_evtype event_type TYPE set(100) GRANULARITY 4;

-- Probabilistic membership for high-cardinality equality lookups
ALTER TABLE events
  ADD INDEX idx_uid user_id TYPE bloom_filter(0.01) GRANULARITY 4;

Skip index type	Stores per block	Best for
`minmax`	Min and max of the column	Range filters on correlated columns (e.g. a second timestamp)
`set(N)`	Up to N distinct values	Low-cardinality columns filtered by equality
`bloom_filter`	Probabilistic membership filter	High-cardinality equality / `IN` lookups
`tokenbf_v1 / ngrambf_v1`	Bloom filter over tokens / n-grams	`LIKE` / token search in text columns

Skip indexes only help when the column's values are correlated with the table's physical order — if matching values are smeared evenly across every granule, the index discards nothing and you've added overhead for no gain. Add them in response to a measured slow filter, then confirm with EXPLAIN indexes = 1 that granules are actually being dropped. Don't sprinkle them speculatively.

The JOIN trap: which table goes on the right

This is the single most surprising thing for engineers arriving from Postgres or Spark, and it causes more ClickHouse out-of-memory crashes than anything else. ClickHouse's default join is a hash join, and it builds the hash table from the right-hand table — loading that entire side into memory.

So the rule is the inverse of the SQL habit many people have: put the small table on the right. big_facts JOIN small_dim is good; small_dim JOIN big_facts can try to load the fact table into RAM and fall over. ClickHouse does not silently reorder this for you the way a cost-based optimizer might — the physical side matters.

-- GOOD: large fact table on the left, small dimension on the right
SELECT f.event_date, d.name, sum(f.amount)
FROM events AS f
INNER JOIN dim_product AS d ON f.product_id = d.product_id
GROUP BY f.event_date, d.name;

-- The right-hand table is what gets loaded into the hash table in memory.

When the right side is genuinely large, you have levers via the join_algorithm setting. partial_merge and full_sorting_merge trade speed for a much smaller memory footprint; grace_hash spills to disk in partitions; parallel_hash speeds up the build when memory is ample. Setting join_algorithm = 'auto' lets ClickHouse pick and fall back under memory pressure. But reaching for these is often a sign the data model should change — which brings us to the better answer.

Dictionaries: the join you should usually avoid having

For the classic star-schema pattern — enriching fact rows with attributes from a small, slowly changing dimension — the idiomatic ClickHouse move is frequently not a join at all. Load the dimension as a dictionary (an in-memory key-value structure ClickHouse keeps resident and refreshes on a schedule) and look values up inline with dictGet:

SELECT
    event_date,
    dictGet('dim_product_dict', 'name', product_id) AS product_name,
    sum(amount) AS revenue
FROM events
GROUP BY event_date, product_name;

dictGet is a fast in-memory hash lookup per row, with no join planning, no right-table build, and the dictionary shared across all queries. For dimension enrichment it's typically faster and lighter than a join, and it sidesteps the memory trap entirely. In a distributed cluster, dictionaries also dodge the GLOBAL JOIN data-shipping cost that a naive distributed join incurs.

Projections and materialized views: precompute the expensive part

When one table must serve queries with conflicting access patterns — some filter by date, some by user — a single ORDER BY can't be optimal for all of them. Two features resolve that.

Projections store an alternative sort order (or a pre-aggregation) inside the same table's parts. The optimizer transparently picks the projection when it helps and falls back to the base data otherwise. You get a second physical ordering without a second table to keep in sync:

ALTER TABLE events ADD PROJECTION proj_by_user
(
    SELECT * ORDER BY (user_id, event_date)
);
ALTER TABLE events MATERIALIZE PROJECTION proj_by_user;

Materialized views in ClickHouse are insert-time triggers, not stored query snapshots: each insert into the source table runs the view's SELECT and writes the result into a target table — usually a SummingMergeTree or AggregatingMergeTree that keeps a rollup continuously fresh. This is the standard way to make dashboard queries instant: query the small pre-aggregated rollup, not the billion-row base table. Because materialized views are the backbone of real-time ingestion pipelines too, they get a full treatment in Part 3.

A diagnostic checklist

When a ClickHouse query is slow, I walk this list in order before touching anything:

Symptom	Likely cause	First move
Scans far more rows than the filter should match	Filter doesn't align with the `ORDER BY` prefix	`EXPLAIN indexes=1`; redesign sort key or add a skip index
Query OOMs on a join	Large table on the right of the join	Swap join sides; or use a dictionary; or change `join_algorithm`
Wide string column dominates storage / `GROUP BY`	Plain `String` where values repeat	`LowCardinality(String)`
Duplicate rows in results	Reading a `ReplacingMergeTree` before merge	`FINAL` or `argMax(..., version)` + `GROUP BY`
Huge part count, slow everything	Over-partitioning or tiny inserts	Partition by month; batch inserts (Part 3)
Dashboard aggregation always slow	Aggregating the raw table every time	Materialized view into a rollup table

What to take into Part 3

The schema decisions all trace back to one Part 1 idea — read less. The ORDER BY key decides what you skip; narrow types and LowCardinality shrink what you read; skip indexes prune secondary filters; dictionaries and the right join side keep memory sane; projections and materialized views precompute the expensive part. Design the table for how it's queried and most "tuning" never becomes necessary.

But none of it matters if data can't get in cleanly. Part 3 tackles the other half of production ClickHouse: insert performance, the "too many parts" failure mode and how async inserts and buffering defeat it, and the Kafka-engine-plus-materialized-view pattern that turns ClickHouse into a real-time analytics database.

🟡 Continue the series

Internals: How a Columnar OLAP Engine Actually Works
Schema Design, ORDER BY, and Query Optimization (this article)
Insert Performance & Real-Time Streaming →