Postgres Internals for Data Engineers: MVCC, WAL, and the Planner

Almost every data engineer ends up running Postgres whether they planned to or not — it's under the application, under the metadata store of half the tools you use, and it's the first thing reached for when someone says "we just need a database." It's also the system people understand the least past SELECT. The behavior that confuses teams — a table that keeps growing on disk even though the row count is flat, a query that ignores the index you built for it, replication that's somehow tied to a log nobody configured — all of it falls out of three internals. Learn those and Postgres stops surprising you.

The three are MVCC (how Postgres lets many transactions read and write the same table without blocking each other), the write-ahead log (how it survives a crash and feeds replicas), and the cost-based planner (how it decides the actual steps to run your query). I'll take each in turn and connect them to the things that bite you in production.

MVCC: why readers never block writers

PostgreSQL uses multiversion concurrency control: instead of locking a row so only one transaction can touch it, it keeps multiple versions of the row and shows each transaction the version that was current when it started. The payoff is the rule that makes Postgres pleasant under concurrency — readers don't block writers, and writers don't block readers. A long analytical SELECT and a burst of UPDATEs can run over the same table at the same time without fighting for locks.

The mechanism is simpler than it sounds. Every row version (Postgres calls a stored row a tuple) carries two hidden system columns: xmin, the id of the transaction that created it, and xmax, the id of the transaction that deleted or superseded it. A transaction sees a tuple only if xmin is committed-and-visible to it and xmax is not. So:

An INSERT writes a new tuple with xmin = your transaction, xmax empty.
A DELETE doesn't erase anything — it stamps xmax on the existing tuple, marking it dead for transactions that come after you.
An UPDATE is a delete plus an insert: it stamps xmax on the old tuple and writes a brand-new tuple with the changed values. The old version stays on disk so transactions that started earlier still see it.

graph TD
    U["UPDATE users SET email=... WHERE id=42"]
    OLD["Old tuple (id=42)
xmin=100, xmax=205
now dead, still on disk"]
    NEW["New tuple (id=42)
xmin=205, xmax=∅
the live version"]
    R1["Txn 180 (started before 205)
still sees the OLD tuple"]
    R2["Txn 210 (started after 205)
sees the NEW tuple"]
    U --> OLD
    U --> NEW
    OLD --> R1
    NEW --> R2

An UPDATE under MVCC. The old row version isn't overwritten — it's marked dead with an xmax and a new version is appended. Concurrent transactions each see the version consistent with when they began. This is why writers and readers don't block each other — but also why an UPDATE-heavy table accumulates dead tuples that someone has to clean up later.

Here's the consequence that catches people: every update and delete leaves a corpse behind. Those dead tuples still occupy pages in the table (the "heap"). A table you update a million times has a million dead versions sitting in it until something reclaims the space. That something is vacuum.

Autovacuum, dead tuples, and bloat

VACUUM is the garbage collector for MVCC. It scans tables, finds tuples that are dead to every running transaction, and marks that space reusable. autovacuum is the background daemon that runs it for you when a table accumulates enough dead tuples (by default, when roughly 20% of the table has changed). Plain vacuum makes space reusable in place; it doesn't usually shrink the file and hand disk back to the OS — that's VACUUM FULL, which rewrites the whole table and takes an exclusive lock you don't want in production.

Bloat is the number-one Postgres operational surprise. If autovacuum can't keep up — high update churn, or worse, a single very long-running transaction that holds back the "oldest visible" horizon so nothing can be cleaned — dead tuples pile up. The table and its indexes grow, every scan reads more pages, and queries quietly slow down with no schema change to blame. The first thing I check on a mysteriously fat table: n_dead_tup in pg_stat_user_tables and whether some forgotten idle in transaction session is pinning the vacuum horizon.

-- Find tables with the most dead tuples and when they were last (auto)vacuumed
SELECT relname,
       n_live_tup,
       n_dead_tup,
       last_autovacuum
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 10;

-- The query that quietly blocks vacuum across the whole database:
-- a session sitting "idle in transaction" holds the visibility horizon open.
SELECT pid, state, xact_start, query
FROM pg_stat_activity
WHERE state = 'idle in transaction'
ORDER BY xact_start;

One more piece worth knowing: because updates can place the new tuple on a different page, indexes would need updating on every change. Postgres softens this with HOT (heap-only tuple) updates — if the changed columns aren't indexed and the new version fits on the same page, it skips touching the indexes entirely. It's a big reason "update only what you must, and don't over-index hot tables" is real advice, not folklore.

The write-ahead log: durability and replication from one idea

The write-ahead log (WAL) is the rule that Postgres never changes a data page on disk without first recording the change in a sequential log. Commit means "the WAL record is flushed to disk," not "the table file is updated." The actual table and index pages are updated lazily in memory (the shared buffers) and written back later by background processes. That ordering — log first, data later — is what makes crash recovery possible: on restart, Postgres replays the WAL forward from the last checkpoint and reconstructs any changes that hadn't reached the data files.

Why design it this way? Because turning every commit into a small sequential append is dramatically faster than forcing random writes to scattered data pages on every transaction. It's the same insight that shows up across the storage world — Cassandra's commit log and the redo logs of every serious relational engine make the same trade. Sequential writes are cheap; random writes are not.

graph LR
    C["Transaction commits"]
    WAL["WAL record
flushed to disk first
(sequential append)"]
    BUF["Data pages in
shared buffers
(changed in memory)"]
    CKPT["Checkpoint:
dirty pages written
back to data files"]
    REP["Streaming replica
replays WAL records"]
    C --> WAL
    C --> BUF
    WAL --> CKPT
    BUF --> CKPT
    WAL -->|"shipped over network"| REP

The WAL is written before the data files change. A commit only needs the WAL record durably on disk; the heap pages catch up at the next checkpoint. The same stream of WAL records, shipped to another server and replayed, is physical replication — durability and replication are two uses of one log.

Replication is just shipping the WAL

Once you see that the WAL is a complete record of every physical change, replication stops being a separate feature. A streaming replica connects to the primary, receives the WAL as it's generated, and replays it to stay byte-for-byte identical. That's physical (binary) replication, and it's the backbone of high-availability Postgres. Set synchronous_commit and you can even require a replica to confirm receipt before the primary reports commit — trading a little latency for the guarantee that an acknowledged transaction survives the loss of the primary.

The same WAL also underpins point-in-time recovery: take a base backup, keep the archived WAL, and you can restore to any moment by replaying the log up to a chosen timestamp. And logical replication (decoding WAL into row-level changes for selective, cross-version replication) builds on the same foundation — the mechanism that feeds change-data-capture tools downstream.

The planner: how a query actually runs

SQL is declarative — you describe the result, not the steps. The planner (Postgres's cost-based optimizer) turns that description into a concrete execution plan: which access method to use for each table, what order to join them in, which join algorithm to apply. It does this by estimating the cost of alternative plans using statistics about your data and picking the cheapest. Understanding it is how you stop guessing about performance.

Why it sometimes ignores your index

The most common planner complaint — "I built an index and it's doing a sequential scan anyway" — is usually the planner being right. An index scan is random I/O: jump to the index, then jump to the heap for each match. A sequential scan reads the table straight through. If a query will match a large fraction of the rows, reading sequentially is genuinely cheaper than thousands of random jumps. The planner estimates the fraction from column statistics (collected by ANALYZE) and chooses accordingly. When it chooses wrong, the cause is almost always stale or insufficient statistics, not a missing index.

Plan node	What it does	When the planner picks it
Seq Scan	Reads every page of the table	Query matches a large share of rows, or no useful index exists
Index Scan	Walks the index, fetches matching heap rows	Highly selective filter; few rows match
Bitmap Heap Scan	Collects matching row locations, then reads heap pages in order	Medium selectivity — too many for index scan, too few for seq scan
Nested Loop	For each outer row, probe the inner relation	One side is small or has a selective index
Hash Join	Builds a hash table on one side, probes with the other	Large, unsorted inputs joined on equality
Merge Join	Merges two inputs already sorted on the join key	Both sides sorted (or cheaply sortable)

The tool that ends every argument is EXPLAIN ANALYZE. EXPLAIN shows the plan and the planner's estimates; EXPLAIN ANALYZE actually runs it and shows real row counts and timing next to the estimates. The gap between the two is the diagnosis:

EXPLAIN ANALYZE
SELECT o.id, o.total
FROM orders o
JOIN customers c ON c.id = o.customer_id
WHERE c.country = 'DE'
  AND o.created_at >= DATE '2020-10-01';

-- Read it from the most-indented node outward. The thing to hunt for:
-- a node where "rows=1" (estimate) sits next to "actual rows=480000".
-- That estimate gap is why the planner chose a bad join — fix the stats
-- (ANALYZE, or raise the column's statistics target) before adding indexes.

When estimates are wildly off on a column, raise its sampling with ALTER TABLE … ALTER COLUMN … SET STATISTICS 1000; then ANALYZE. And for filters on combinations of columns that are correlated (city and country, say), the planner assumes independence and underestimates badly — extended statistics (CREATE STATISTICS, available since Postgres 10) teach it the correlation. Most "the planner is dumb" cases are really "the planner is uninformed."

What to carry away

Three internals explain most of what Postgres does. MVCC keeps versions of every row so readers and writers never block each other — at the cost of dead tuples that autovacuum must reclaim, which is why update-heavy tables bloat and why a stray long transaction can quietly wreck performance. The WAL records every change sequentially before touching data files, which buys both crash recovery and, by simply shipping that log elsewhere, replication and point-in-time recovery from the same mechanism. And the cost-based planner turns your declarative SQL into real operations by estimating costs from statistics — so when it picks a seq scan over your index, suspect the statistics before you suspect the planner.

None of this is exotic. It's the machinery running under the most-deployed database in the world, and a working data engineer who can read an EXPLAIN ANALYZE, spot bloat in pg_stat_user_tables, and explain why commit is fast will debug Postgres problems that leave SQL-only colleagues stuck. If you want the columnar, scan-everything contrast to Postgres's row-store design, the BigQuery internals piece is the other half of the picture.