Storage Engines: B-Trees vs LSM-Trees, and the Read/Write Trade-off

Ask why Postgres and Cassandra feel so different to operate and you'll get a dozen surface answers — SQL vs NoSQL, single-node vs distributed, relational vs wide-column. But the deepest answer is one level down, in a component most people never look at: the storage engine, the part of the database that decides how bytes actually land on disk. There are essentially two families of design, they make opposite bets about reads versus writes, and once you understand the bet, a huge amount of database behavior stops being arbitrary and starts being predictable.

The two families are B-trees (update in place; the engine behind Postgres, MySQL's InnoDB, and most traditional databases) and LSM-trees (append and merge; behind Cassandra, RocksDB, LevelDB, HBase, and the write paths of many modern stores). This isn't database trivia — it's the single decision that propagates up into write throughput, read latency, disk usage, and the operational surprises that bite you. I'll explain both, the three "amplifications" that frame the trade, and how to choose.

The B-tree: update in place

The B-tree has been the default for decades, and the model is intuitive: data lives in fixed-size pages arranged as a balanced tree, sorted by key. To find a row, you walk from the root down to a leaf — a handful of page reads even for an enormous table. To change a row, you find its page and modify it in place, writing the page back where it was.

That in-place update is the defining property, and it cuts both ways. Reads are excellent and predictable: any key is a short, bounded traversal, and there's exactly one place each row lives. But writes mean random I/O — seek to the right page, rewrite it — and a few hard problems come along: when a page fills, it splits (more random writes); and because a crash mid-page-write could corrupt data, B-tree databases pay for a write-ahead log, writing every change twice (once to the log, once to the page). The structure that makes reads cheap makes writes comparatively expensive. (This is the engine under Postgres, whose MVCC and WAL sit right on top of it.)

The LSM-tree: never update in place

The log-structured merge-tree starts from a different premise: sequential writes are vastly faster than random writes, so never seek to update — only ever append. A write goes into an in-memory sorted structure (the memtable); when that fills, it's flushed to disk as an immutable sorted file (an SSTable). New writes go to a new memtable, then a new SSTable. Nothing on disk is ever modified — files only accumulate.

Writes are therefore blazing fast: an in-memory insert plus, eventually, one big sequential flush — no seeks, no page splits, no double-write. But two costs appear immediately. First, a single key's history can be spread across the memtable and several SSTables, so a read may have to check multiple places and reconcile by timestamp. Second, those immutable files pile up forever unless something merges them — and that something is compaction, a background process that merges SSTables, drops superseded values and delete markers (tombstones), and keeps the file count bounded. This is the engine under Cassandra and HBase; the same memtable/SSTable/compaction pattern recurs everywhere write throughput matters.

graph TD
    subgraph BT["B-tree — update in place"]
        BW["Write"] --> BWAL["WAL (durability)"]
        BW --> BPAGE["Find page → modify in place
(random I/O, page splits)"]
        BR["Read"] --> BTRAV["Root → leaf traversal
(one place per key — fast)"]
    end
    subgraph LSM["LSM-tree — append & merge"]
        LW["Write"] --> LMEM["Memtable (in-memory, sorted)"]
        LMEM -->|"flush when full"| LSST["Immutable SSTables
(sequential write)"]
        LSST --> LCMP["Compaction:
merge, drop tombstones"]
        LR["Read"] --> LMERGE["Check memtable + several
SSTables, reconcile
(bloom filters skip misses)"]
    end

The two designs side by side. The B-tree modifies pages in place — great reads, random-write cost, plus WAL double-writes and page splits. The LSM-tree only appends (fast sequential writes) and pays later: reads merge across files (helped by bloom filters), and background compaction keeps the SSTable count and dead data under control.

The framing that makes it click: three amplifications

The clean way to compare storage engines is through three "amplification" factors — how much extra work each design does relative to the logical operation. Every engine trades them off; you can't minimize all three at once.

Amplification	Meaning	B-tree	LSM-tree
Read	Extra reads per logical read	Low — one place per key	Higher — may check several SSTables
Write	Extra bytes written per logical write	Higher — WAL + page rewrite + splits	Trade-off — cheap initial write, but compaction rewrites data repeatedly
Space	Extra disk vs logical data size	Fragmentation, half-empty pages	Stale/duplicate values until compaction; better compression of sorted files

Two subtleties worth internalizing. LSM write amplification is sneaky: the first write is cheap, but compaction may rewrite the same data several times over its life merging it into ever-larger files — so "write-optimized" doesn't mean "writes nothing extra," it means the extra work is sequential and deferred. And LSM read amplification is tamed by bloom filters — a tiny probabilistic structure per SSTable that answers "this key is definitely not here" cheaply, letting reads skip most SSTables instead of scanning all of them. Without bloom filters, LSM reads would be far worse than they are.

How to choose

The decision follows directly from your workload's read/write shape:

Read-heavy, mutation-light, latency-sensitive point lookups and range scans → B-tree. Predictable low-latency reads and strong transactional support are its home turf (most OLTP relational databases).
Write-heavy, high-ingest, append-mostly (time series, event logs, sensor data, metrics) → LSM-tree. It absorbs writes the B-tree would choke on, and it compresses sorted immutable files well.
Deletes and overwrites matter → look hard at the LSM tombstone behavior. On an LSM engine a delete is a marker that lingers until compaction, so delete-heavy or queue-like workloads accumulate tombstones and slow reads — a B-tree, which removes in place, handles churn more gracefully.

The mental shortcut I use: B-trees pay on write to stay cheap on read; LSM-trees pay on read (and on compaction) to stay cheap on write. There's no free lunch and no universally "faster" engine — only an engine matched to a workload. When a database surprises you (writes stalling, reads degrading over time, disk filling faster than data), the explanation is almost always the storage engine paying one of its amplifications, and knowing which one tells you where to look.

Watch the compaction cliff on LSM engines. Compaction is background work, but it's not free — it competes with your live traffic for disk I/O and CPU. Push writes faster than compaction can keep up and you get the classic failure: SSTables (or "parts") pile up, read amplification climbs as every read checks more files, and eventually the engine throttles or errors. This is the same "too many parts" wall that hits ClickHouse on tiny inserts and that stresses Cassandra under write floods. Size for compaction headroom, not just for peak write rate.

What to carry away

Underneath the SQL-vs-NoSQL surface, databases make one foundational choice. B-trees update pages in place: excellent, predictable reads, at the cost of random writes, page splits, and WAL double-writes — the right call for read-heavy transactional workloads. LSM-trees never update in place; they append to a memtable, flush immutable SSTables, and merge them with compaction: superb write throughput and good compression, at the cost of read amplification (mitigated by bloom filters) and the ever-present compaction tax. The three amplifications — read, write, space — are the scorecard, and no engine wins all three.

Learn to ask "which storage engine, and which amplification is it paying?" and the behavior of half the systems on this blog becomes legible — why Postgres reads are steady, why Cassandra swallows writes, why time-series stores are almost always LSM underneath. The engine is the foundation; everything above it inherits its trade-offs.