Compaction Strategies in LSM Trees: Size-Tiered vs Leveled vs Time-Window

The default compaction strategy is rarely the wrong choice by accident. It's the wrong choice because nobody revisited it once the workload stopped looking like whatever it was tuned for on day one. I once inherited a Cassandra cluster running size-tiered compaction on a table that had grown into a heavy point-lookup workload. Reads were checking a dozen SSTables per query because nothing about the write pattern had ever forced a rethink, and the fix wasn't more hardware; it was switching the compaction strategy to one built for that access pattern. LSM trees need compaction to survive at all; that's the baseline. What actually determines whether your write path or your read path absorbs the pain is which compaction strategy you run, and that choice tends to get made once, quietly, and then forgotten.

This assumes you already know why LSM trees compact in the first place — merging SSTables to bound read amplification and reclaim space from overwrites and deletes, covered in the storage-engines piece linked above. What follows is the part that gets skipped: the three named strategies, what each one actually optimizes for, and the amplification math that makes the trade-offs concrete instead of vibes.

What is size-tiered compaction, and why is it usually the default?

Size-tiered compaction (STCS) groups SSTables of similar size and merges them together once enough similarly-sized files accumulate, producing progressively larger SSTables over time. It's write-optimized almost by construction — new data flushes from the memtable into small SSTables, and compaction only merges files when there are enough peers of roughly the same size, which keeps the total rewrite volume relative to incoming writes comparatively low. That's exactly why it was Cassandra's original default: it's the strategy that costs the write path the least.
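The "merge once enough similarly-sized files accumulate" rule can be sketched in a few lines. This is a minimal illustration, not Cassandra's actual implementation: the function names are hypothetical, and the defaults (`bucket_low`, `bucket_high`, `min_threshold`) are chosen to resemble Cassandra's STCS options of the same names, so treat the numbers as illustrative.

```python
from typing import List


def bucket_by_size(sizes_mb: List[float], bucket_low: float = 0.5,
                   bucket_high: float = 1.5) -> List[List[float]]:
    """Group SSTable sizes into buckets of 'similar' size.

    A file joins the first bucket whose current average size puts it
    inside [bucket_low * avg, bucket_high * avg]; otherwise it starts
    a new bucket.
    """
    buckets: List[List[float]] = []
    for size in sorted(sizes_mb):
        for bucket in buckets:
            avg = sum(bucket) / len(bucket)
            if bucket_low * avg <= size <= bucket_high * avg:
                bucket.append(size)
                break
        else:
            # No existing bucket is close enough in size.
            buckets.append([size])
    return buckets


def pick_compactions(sizes_mb: List[float],
                     min_threshold: int = 4) -> List[List[float]]:
    """Return only the buckets with enough peers to be worth merging."""
    return [b for b in bucket_by_size(sizes_mb) if len(b) >= min_threshold]
```

With sizes of 10, 11, 9, 10, 100 and 400 MB, only the four small files form a bucket large enough to merge; the 100 MB and 400 MB files sit untouched until peers of their own size show up, which is where the low write cost comes from.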

The bill comes due on the read side. STCS guarantees neither ordering nor non-overlap between SSTables, so the same partition key can legitimately exist in many SSTables at once, scattered across tiers. In the worst case, a point read has to check every SSTable that might contain the key. A bloom filter per SSTable (see the earlier piece on probabilistic data structures for how that works) can rule files out, but a positive answer is only a "maybe", so it never pins down which file actually holds the key. STCS also carries a space-amplification cost. Because it merges by size rather than guaranteeing bounded overlap, obsolete versions of a row can linger in several SSTables, and because a merge writes its output before its inputs can be deleted, a table can temporarily hold multiple full-size copies of overlapping data mid-compaction. On disk-constrained clusters that is a practical operational concern.

How does leveled compaction fix the read problem, and what does it cost instead?

Leveled compaction (LCS), the strategy pioneered by Google's LevelDB, adopted by RocksDB, and available in Cassandra since 1.0, organizes SSTables into levels with a key property STCS doesn't have: within any level except the very first (L0), SSTables cover non-overlapping key ranges. That single guarantee is what fixes the read-amplification problem. Below L0, a point lookup for a given key touches at most one SSTable per level, instead of potentially every SSTable in the table, because the non-overlapping ranges mean the key can only live in one place per level. Only L0, where freshly flushed files may still overlap, can require checking more than one file.
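To make the "one place per level" claim concrete, here is a minimal sketch that models each SSTable as nothing more than the key range it covers. The types and function names are hypothetical, and real engines also consult bloom filters and indexes; the point is only the difference in how many files remain candidates.

```python
from bisect import bisect_right
from typing import List, Tuple

# Each SSTable is summarised by the (first_key, last_key) range it covers.
KeyRange = Tuple[str, str]


def candidates_in_level(level: List[KeyRange], key: str) -> List[KeyRange]:
    """Candidates in one level whose SSTables are sorted and non-overlapping."""
    starts = [first for first, _ in level]
    i = bisect_right(starts, key) - 1  # last SSTable starting at or before key
    if i >= 0 and level[i][0] <= key <= level[i][1]:
        return [level[i]]
    return []


def candidates_unordered(sstables: List[KeyRange], key: str) -> List[KeyRange]:
    """Without a non-overlap guarantee, every covering SSTable is a candidate."""
    return [r for r in sstables if r[0] <= key <= r[1]]


def leveled_candidates(levels: List[List[KeyRange]], key: str) -> List[KeyRange]:
    """Collect at most one candidate per sorted level (overlapping L0 not modelled)."""
    found: List[KeyRange] = []
    for level in levels:
        found.extend(candidates_in_level(level, key))
    return found
```

For a level holding the ranges a to f, g to m and n to z, the key "h" has exactly one candidate file. The same key against four overlapping size-tiered files (a to z, c to k, h to p, q to z) has three.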

That guarantee isn't free. Maintaining non-overlapping ranges as new data arrives means compaction has to reorganize data into the correct level far more aggressively than STCS's "merge similar sizes when convenient" approach. That shows up as meaningfully higher write amplification: the same logical write is physically rewritten more times over its lifetime as it migrates down through the levels to keep the non-overlap invariant intact. RocksDB's real-world implementation is itself a hybrid worth knowing. L0 (the level closest to the memtable flush) uses tiered-style compaction, and only the levels below L0 use strict leveled compaction, specifically to reduce write amplification and memory pressure during high write load rather than paying leveled compaction's full cost at every layer.

| Strategy | Write amplification | Read amplification | Space amplification | Best for |
| --- | --- | --- | --- | --- |
| Size-tiered | Low | High (many SSTables may hold a key) | High (temporary duplicate copies mid-merge) | Write-heavy workloads tolerant of slower reads |
| Leveled | High (data rewritten repeatedly across levels) | Low (at most one SSTable per level) | Low | Read-heavy or point-lookup-heavy workloads |
| Time-window | Low within a window, near-zero at expiry | Low for recent-data queries | Low (whole expired SSTables just get dropped) | Time-series and TTL'd data |
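The "Low" and "High" labels for size-tiered and leveled compaction can be given rough numbers with a back-of-envelope model. Everything here is an assumption made for illustration: each byte is rewritten about once per size tier under size-tiered compaction, pushing a byte down one level under leveled compaction rewrites about `fanout` bytes, and the function names and default values are hypothetical. Real write amplification depends on overwrite rates, key distribution and engine tuning, so read the output as orders of magnitude, not predictions.

```python
def growth_steps(total_mb: float, start_mb: float, factor: int) -> int:
    """Count how many times start_mb must grow by `factor` to reach total_mb."""
    steps, size = 0, start_mb
    while size < total_mb:
        size *= factor
        steps += 1
    return max(1, steps)


def size_tiered_estimate(total_mb: float, flush_mb: float,
                         min_threshold: int = 4) -> dict:
    """Rough model: each byte is rewritten about once per tier it passes through."""
    tiers = growth_steps(total_mb, flush_mb, min_threshold)
    return {
        "write_amp": tiers,
        # Up to (min_threshold - 1) files can sit in a tier waiting for a peer.
        "worst_case_sstables_per_read": tiers * (min_threshold - 1),
    }


def leveled_estimate(total_mb: float, level1_mb: float, fanout: int = 10) -> dict:
    """Rough model: pushing a byte down one level rewrites about `fanout` bytes."""
    promotions = growth_steps(total_mb, level1_mb, fanout)
    return {
        "write_amp": promotions * fanout,
        # One candidate file per sorted level; overlapping L0 files are ignored.
        "worst_case_sstables_per_read": promotions + 1,
    }
```

Under these assumptions, a hypothetical 100,000 MB table with 100 MB flushes comes out at a write amplification of about 5 and up to 15 SSTables per read for size-tiered, versus about 20 and 3 for leveled with a 1,000 MB first level. The absolute values are soft, but the direction of the trade matches the table.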

Why does time-series or TTL'd data want a completely different strategy?

Time-window compaction (TWCS) is the recommended Cassandra strategy specifically for time-series and expiring-TTL workloads, and the insight behind it is almost embarrassingly simple once you see it: if data is written roughly in timestamp order and expires wholesale after a fixed retention period, why ever compact data that's about to be deleted anyway? TWCS groups SSTables into time-bucketed windows — during the active window, it compacts newly-flushed SSTables together using STCS-style merging within that bucket; once a time window closes, its SSTables get compacted into one final SSTable for that window and then, critically, left alone. When a whole window's TTL expires, the entire SSTable for that window can simply be dropped — no read, no rewrite, no partial deletion scan, just an atomic file removal.
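The bucketing and whole-file expiry described above can be sketched as follows. This is a simplified model under stated assumptions: the `SSTable` class and function names are hypothetical, each file is bucketed by its newest timestamp, and the sketch ignores the tombstone grace periods and overlap checks a real engine applies before dropping a file.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class SSTable:
    name: str
    max_timestamp: int  # newest write in the file, seconds since epoch


def window_start(ts: int, window_seconds: int) -> int:
    """Floor a timestamp to the start of its time window."""
    return ts - (ts % window_seconds)


def group_by_window(sstables: List[SSTable],
                    window_seconds: int) -> Dict[int, List[SSTable]]:
    """Bucket SSTables by the window their newest write falls into."""
    windows: Dict[int, List[SSTable]] = defaultdict(list)
    for table in sstables:
        windows[window_start(table.max_timestamp, window_seconds)].append(table)
    return dict(windows)


def fully_expired(sstables: List[SSTable], ttl_seconds: int,
                  now: int) -> List[SSTable]:
    """Files whose newest row has outlived the TTL can be dropped whole."""
    return [t for t in sstables if t.max_timestamp + ttl_seconds <= now]
```

Because expiry is decided per file from a single timestamp, dropping a window never requires reading the rows inside it. That is the entire saving.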

graph LR
    W1["Time window 1 (oldest, TTL expired)"] -->|"drop entire SSTable, no compaction needed"| GONE["Reclaimed"]
    W2["Time window 2 (closed, compacted once)"] -->|"left alone until TTL"| W2
    W3["Time window 3 (active, STCS-style merging)"] -->|"new writes land here"| W3

Time-window compaction's core trick: once a window closes, its SSTable is left alone rather than repeatedly re-merged with newer data — and when the whole window expires, dropping it is a single file deletion instead of a scan-and-rewrite. This is why TWCS avoids the read and write amplification of both STCS and LCS for genuinely time-bucketed, TTL'd workloads.

The trade-off that makes TWCS a poor fit outside its lane: it assumes writes arrive in roughly timestamp order and that a whole time bucket expires together. Feed it out-of-order writes (late-arriving data landing in an already-closed window) or a workload where individual rows need to be deleted independently of a uniform TTL, and the strategy's core assumption breaks — you lose the "drop the whole file" win and end up paying compaction costs TWCS was specifically designed to avoid.

How do you actually decide which strategy fits a given table?

Start from the access pattern, not the write volume. A table dominated by point lookups on primary key — the classic OLTP-adjacent pattern Cassandra and HBase both serve well — wants leveled compaction's bounded read amplification, and it's worth paying the extra write cost to get it, because read latency is usually the metric users actually feel. A table under heavy, bursty write load where reads are comparatively rare or tolerant of scanning a few extra SSTables wants size-tiered's lower write cost — this is the workload STCS was actually built for, not a fallback default that happens to exist. And any table with a genuine, uniform TTL and roughly time-ordered writes — metrics, logs, event streams, anything that "expires" as a unit — is exactly what time-window compaction exists for, and running STCS or LCS on that workload instead is paying real compaction cost to solve a problem TWCS makes almost free.
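That decision order can be written down as a small heuristic. It is a sketch of the reasoning in this section, not a tuning tool: the `TableProfile` fields and the 0.5 threshold are hypothetical, and the returned strings are the Cassandra compaction class names.

```python
from dataclasses import dataclass


@dataclass
class TableProfile:
    uniform_ttl: bool            # every row expires after the same retention period
    time_ordered_writes: bool    # writes arrive roughly in timestamp order
    point_read_fraction: float   # share of operations that are point lookups (0..1)


def suggest_strategy(profile: TableProfile,
                     read_heavy_threshold: float = 0.5) -> str:
    """Pick a compaction strategy from the access pattern, not the write volume."""
    # TWCS only pays off when both of its assumptions hold.
    if profile.uniform_ttl and profile.time_ordered_writes:
        return "TimeWindowCompactionStrategy"
    # Point-lookup-heavy tables want bounded read amplification.
    if profile.point_read_fraction >= read_heavy_threshold:
        return "LeveledCompactionStrategy"
    # Otherwise favour the cheaper write path.
    return "SizeTieredCompactionStrategy"
```

The TTL check comes first on purpose: a time-series table can also be read-heavy, and for that workload the whole-file expiry matters more than the per-level lookup bound.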

Compaction strategy is not something you set once at table creation and forget — a table's access pattern changes as the product changes, and the strategy that was right at launch quietly becomes wrong. The Cassandra table I mentioned at the top started as an append-mostly event log (STCS made sense) and organically turned into a point-lookup-heavy service backend over eighteen months, with nobody revisiting the compaction setting as that shift happened. Changing compaction strategy on an existing table is itself an expensive, disk-and-CPU-intensive operation — which is exactly why it's worth reviewing periodically against current access patterns, on a schedule, rather than only after read latency has already become a visible problem.
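In Cassandra, the switch itself is a single schema change. The helper below only renders the CQL text and is a sketch under assumptions: the function name, keyspace and table names are hypothetical, option values are emitted as quoted strings, and there is no identifier validation or escaping, so it is not safe for untrusted input. The class name and the `compaction_window_unit` and `compaction_window_size` options are Cassandra's own.

```python
from typing import Dict


def alter_compaction_cql(keyspace: str, table: str,
                         options: Dict[str, str]) -> str:
    """Render an ALTER TABLE statement that sets the compaction options map."""
    rendered = ", ".join(f"'{key}': '{value}'" for key, value in options.items())
    return f"ALTER TABLE {keyspace}.{table} WITH compaction = {{{rendered}}};"


# Hypothetical example: move an event table to daily time windows.
statement = alter_compaction_cql(
    "app",
    "events",
    {
        "class": "TimeWindowCompactionStrategy",
        "compaction_window_unit": "DAYS",
        "compaction_window_size": "1",
    },
)
```

The statement is short, but the compaction work it kicks off on existing SSTables is the expensive part described above, so it is worth trying on a test cluster before production.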

What to carry away

All three strategies are solving the same underlying LSM problem — bound how many SSTables a read has to check while controlling how much data compaction rewrites — but they make genuinely different bets. Size-tiered keeps write amplification low and lets read amplification and space amplification absorb the cost, which made it the sensible original default for write-heavy workloads. Leveled compaction inverts that trade: it guarantees at most one SSTable per level for a given key, at the cost of meaningfully higher write amplification from the constant reorganization needed to maintain non-overlapping ranges — RocksDB's own hybrid L0-tiered-plus-leveled design exists specifically to soften that cost. Time-window compaction is the specialist tool: for genuinely time-bucketed, TTL'd data, it turns expiry into a free file drop instead of a compaction problem, but it only works when the workload's assumptions (roughly ordered writes, uniform TTL per bucket) actually hold.

Pick the strategy from the access pattern the table actually serves, not from whatever the database's default happened to be — and revisit that choice as the table's real-world usage evolves, because compaction strategy tends to be set once at creation and never looked at again until read latency forces the question.