Write-Ahead Logs: Durability, fsync, Group Commit, and Checkpointing

Pull the power on a database mid-write and the question that decides whether you have a company tomorrow is simple: did that transaction survive? Every database that answers "yes" reliably answers it the same way — not with a clever recovery algorithm bolted on afterward, but with one small, boring discipline applied before anything else happens: write down what you're about to do, make sure that record survives a crash, and only then touch the real data. That's the entire idea behind the write-ahead log, and once you see it, you start noticing it's the load-bearing wall under nearly every durable system you've used, not just the database sitting behind your application.

This is the WAL across systems — the append-before-apply idea in Postgres and MySQL InnoDB, the same idea wearing a different name inside Kafka, why fsync is the actual bottleneck durability runs into, group commit as the standard fix, checkpointing as the reason the log doesn't grow forever, and WAL shipping as the mechanism behind most database replication.

What problem does a write-ahead log actually solve?

A write-ahead log (WAL) is a sequential, append-only record of every change, written and made durable before that change is applied to the actual data structure — a B-tree page, a row, a table. The guarantee it buys: if the process crashes at any point, replaying the log from the last checkpoint reconstructs exactly the state that should exist, because nothing was ever applied to the real data without first being safely recorded in the log. Without a WAL, a crash mid-write leaves you with an unanswerable question — was that page half-written, fully written, or not written at all — and no principled way to recover. With one, recovery becomes mechanical: read the log forward, redo everything after the last confirmed checkpoint.

Postgres's own WAL is the textbook version of this — every change gets appended to the WAL before the corresponding heap or index page is modified, and that's covered in depth as its own section there. What's less obvious is how far the same idea travels. Kafka's partition log is architecturally a write-ahead log that Kafka simply exposes as its primary interface instead of hiding it as an internal durability mechanism — a producer's message is appended sequentially to the log, and that append-only, sequential-write structure is exactly what a database's internal WAL looks like, just promoted to be the product itself rather than plumbing underneath one. MongoDB keeps its own journal for the identical reason WiredTiger needs it — durability of writes between checkpoints. Redis's AOF (append-only file) persistence mode is the same idea again, applied to an in-memory store: append every write command to a log before (or as) it's applied to memory, so a restart can replay the log and reconstruct state.

graph LR
    W["Write request"] --> LOG["Append to WAL
(sequential write)"]
    LOG --> SYNC["fsync WAL to disk"]
    SYNC --> ACK["Acknowledge commit"]
    SYNC -.->|"later, async"| APPLY["Apply to actual
data structure
(B-tree page, row)"]

The core WAL discipline: the log is synced to durable storage before the transaction is acknowledged as committed — the actual data-structure mutation can happen afterward, even asynchronously, because the log alone is enough to reconstruct it after a crash. Acknowledging before the sync, not after, is the single most common way to accidentally lose the durability guarantee a WAL is supposed to provide.

Why is fsync the actual bottleneck, not the log write itself?

Appending to a sequential file is cheap — that's the whole design win of a WAL over randomly updating pages in place. The expensive part is fsync(), the system call that forces the operating system to actually flush the write from its buffer cache to physical storage rather than trusting the OS to get around to it eventually. Skip fsync and you're fast but lying: the OS page cache holds the write in memory, reports success, and a power loss before that cache is flushed loses data the database already told a client was durably committed. Call fsync on every single commit and you're honest but slow — each one is a synchronous round-trip to physical storage, and on spinning disks especially (and to a lesser but real degree on SSDs) that puts a hard ceiling on transaction throughput no amount of application-level optimization can get around.

What is group commit, and why does every serious database implement it?

Group commit batches multiple pending transactions' WAL records into a single fsync call instead of paying the fsync cost once per transaction. The mechanism, as Postgres implements it: the first transaction ready to commit becomes the "leader" for that batch and waits a configurable, typically very short interval; any other transactions that become ready to commit during that window get folded into the same flush; when the leader's fsync completes, every transaction in the batch is durable at once, for the cost of one flush instead of several. The trade is a small, bounded amount of added latency per individual transaction (you might wait a few extra milliseconds for the batch window) in exchange for dramatically higher aggregate throughput under concurrent load — and the busier the system, the better the trade gets, because more transactions arrive within each batching window and the fsync cost gets amortized across more of them.

-- Postgres exposes the group-commit knob directly:
-- how long the leader waits to let followers join the flush
SET commit_delay = 5000;  -- microseconds
-- only takes effect once at least this many transactions
-- are typically in flight, so it doesn't add latency when idle
SET commit_siblings = 5;

Why can't the WAL just grow forever, and what does a checkpoint actually do?

An unbounded WAL is a recovery-time problem waiting to happen — if every change since the database was created lives in the log, a crash means replaying potentially years of history before the database is usable again, and the log itself eventually exhausts disk. A checkpoint is the fix: at intervals, the database forces all changes recorded in the WAL up to that point to actually be applied to the underlying data structure and written durably to disk. Once that's done, any WAL segment entirely older than the checkpoint is guaranteed unnecessary for recovery — everything it recorded is already reflected in the data files — and can be safely discarded or recycled.

Checkpoint frequency is a real, tunable trade-off, not a fire-and-forget setting. Checkpoint too rarely and recovery after a crash takes longer, because there's more WAL to replay from the last checkpoint forward. Checkpoint too often and you pay a continuous I/O tax flushing pages to disk more frequently than the workload strictly needs, competing with the foreground write traffic the WAL exists to make fast in the first place. Every production database exposes this as a tunable for exactly this reason — there's no universally correct interval, only the correct interval for a given recovery-time objective versus a given write-throughput budget.

How does WAL shipping become database replication?

Once you accept that the WAL fully and losslessly describes every change to the database, the mechanism for keeping a replica in sync falls out almost for free: ship the same log records to a second server and have it replay them. This is physical replication — WAL segments stream from primary to replica essentially unmodified, and the replica applies them at the same block level the primary did, producing a byte-for-byte identical copy with no independent decision-making about what the changes "mean." Logical replication goes a layer higher — the WAL (or an equivalent change stream) is decoded into a higher-level representation of the actual row-level changes (insert/update/delete plus the affected values), which is more flexible (it can replicate selectively, across different schema versions, or even into a different kind of system entirely) at the cost of more decoding overhead than simply replaying raw physical log records.

The most expensive WAL mistake I've seen isn't a checkpoint misconfiguration — it's an application acknowledging a write to its own caller before the WAL fsync it depended on actually completed. I've debugged a "the database lost data during a power event" incident that turned out to be an application-level cache that marked a write as done the moment it issued the database call, without waiting for the database's own commit acknowledgment — which meant the application had already told a downstream system the write was durable before the database's fsync had even started. The WAL did its job correctly. The bug was a layer above it, assuming durability that hadn't happened yet. Audit every "durable" claim in a system back to an actual fsync-backed acknowledgment, not an optimistic assumption sitting one layer above it.

What to carry away

The write-ahead log is one idea — append before you apply, so a crash is always recoverable by replaying the log — implemented under different names across nearly every durable system: Postgres's WAL, InnoDB's redo log, MongoDB's journal, Redis's AOF, and Kafka's partition log turning the same internal mechanism into its primary external interface. The expensive part was never the sequential append; it's the fsync that actually forces data to durable storage, which is why group commit — batching multiple transactions into one flush — is standard practice rather than an exotic optimization.

Checkpointing exists because the WAL can't grow forever, and checkpoint frequency is a genuine, tunable trade-off between recovery time and steady-state write throughput — there's no setting that's correct for every workload. And once you see the WAL as a complete, ordered record of every change, replication stops looking like a separate mechanism and starts looking like the obvious consequence: ship the log, replay the log, and you have a second copy — physically if you replay it raw, logically if you decode it first.