MVCC: How Postgres, MySQL, and Snowflake Give You Consistent Reads Without Locking

Before MVCC was the default assumption, I watched a reporting query hold a shared lock long enough to make an unrelated update queue behind it for eleven seconds — on a table that had nothing to do with the report, blocked purely because both operations touched overlapping rows under simple locking. That's the failure mode multi-version concurrency control (MVCC) exists to eliminate: readers and writers no longer fight over the same physical row, because a reader gets a consistent snapshot of the data as it existed at a fixed point in time, while writers keep working on the live version underneath it. Nobody blocks anybody. The catch — and it's a real one — is that "keep multiple versions of a row around" isn't free, and the three major systems that implement MVCC pay for it in genuinely different currencies.

This is the cross-system comparison: how Postgres's xmin/xmax tuple versioning actually decides what a transaction sees, how MySQL InnoDB reconstructs old versions completely differently via undo logs, how Snowflake's time travel solves an MVCC-adjacent problem at a different granularity entirely, and the write skew anomaly that snapshot isolation — MVCC's default guarantee — doesn't protect you from.

What problem does MVCC actually solve?

Under simple locking, a reader that wants a consistent view of data has to block writers from changing it mid-read, and a writer has to block readers from seeing a half-finished change — reads and writes contend directly for the same physical row. MVCC's core idea breaks that contention: instead of one mutable copy of a row that everyone fights over, keep multiple versions of it, and give every transaction a consistent snapshot — effectively "the data as it looked at the moment my transaction/statement began." A reader never sees a writer's in-progress change (it wasn't part of the snapshot), and a writer never has to wait for a reader to finish (the reader is looking at an older version, not the one being modified). Writers create new versions rather than overwriting in place, which is the one sentence that explains almost everything different about how MVCC databases behave compared to lock-based ones.

How does Postgres actually implement version visibility?

Postgres internals already covers MVCC as one of its core mechanisms — every row (tuple) carries two hidden fields, xmin (the ID of the transaction that created this version) and xmax (the ID of the transaction that deleted or superseded it, if any). A tuple is visible to a given transaction's snapshot if its xmin committed before the snapshot was taken and its xmax is either unset or belongs to a transaction that hadn't committed as of that snapshot. An UPDATE in Postgres doesn't modify a row in place — it writes an entirely new tuple with a fresh xmin and marks the old tuple's xmax, leaving the old version physically present in the table until it's no longer needed by any active snapshot.

That "leaving the old version physically present" detail is the whole reason VACUUM exists. Old tuple versions accumulate as writes and updates happen — they're dead weight the moment no active transaction can still see them, but nothing reclaims that space automatically as part of the write path. VACUUM is the deferred cleanup crew, and it's not optional maintenance you can skip: a table under constant update load without regular vacuuming accumulates dead tuples until scans get measurably slower just from wading through versions nobody can see anymore, and in the extreme case, transaction ID wraparound becomes a real operational emergency. This is the tax Postgres's specific MVCC implementation charges, and it's a direct, traceable consequence of the append-heavy, xmin/xmax design choice — not a general MVCC tax every implementation pays identically.

How does MySQL InnoDB do the same thing differently?

InnoDB reaches the identical goal — a consistent snapshot per transaction — through a structurally different mechanism: rather than keeping old tuple versions inline in the table the way Postgres does, InnoDB keeps one current row in the table and writes the before-image of any change to a separate structure called the undo log. When a transaction needs to see an older version of a row than the one currently in the table, InnoDB reconstructs it on the fly by applying the relevant undo log entries in reverse. This is a genuinely different architectural bet: Postgres pays its versioning cost as accumulated dead tuples in the main table, cleaned up later by VACUUM; InnoDB pays its cost as undo log volume and the CPU work of reconstructing old versions at read time when a long-running transaction needs to look further back. Neither is strictly better — they're different places to put the same underlying cost, and which one hurts more depends heavily on the workload (Postgres tends to feel the pain from long-lived transactions holding back vacuum progress; InnoDB tends to feel it from very long-running read transactions forcing extensive undo-log reconstruction).

PostgresInnoDB (MySQL)
Old versions liveInline, in the table itself (xmin/xmax tuples)Separate undo log, reconstructed on demand
Cleanup mechanismVACUUM (deferred, must be run/tuned)Undo log purge (background, bounded by oldest active transaction)
Pain point under loadTable bloat from dead tuples, vacuum falling behindUndo log growth and reconstruction cost from long-running reads

Is Snowflake's time travel the same thing as row-level MVCC?

Not quite — it's solving the same class of problem (consistent reads without blocking writers) at a fundamentally different granularity. Snowflake's architecture is built on immutable micro-partitions: a write never modifies an existing micro-partition in place, it creates new ones, and table metadata simply repoints to which set of micro-partitions constitutes the "current" table state at any given moment. Time travel — querying a table as it existed at a point in the recent past — works because old micro-partitions aren't deleted immediately; the metadata that would let you address them is retained for a configurable window. This is much closer to a versioned-storage architecture than to row-level MVCC as Postgres or InnoDB implement it — there's no per-row xmin/xmax or undo log reconstructing individual rows, just immutable file-level versioning plus a metadata layer that decides which files are "current" or "as of a past timestamp." It solves the reader-writer contention problem MVCC exists for, using storage immutability instead of row versioning as the mechanism — worth knowing precisely because it means the operational failure modes are different too: there's no VACUUM-style bloat to manage, but retention window and storage cost for retained micro-partitions become the equivalent knob to think about.

graph TD
    subgraph PG["Postgres: row-level"]
        T1["Tuple v1
xmin=100, xmax=150"] T2["Tuple v2
xmin=150, xmax=null"] T1 -.->|"superseded"| T2 end subgraph IB["InnoDB: undo-log"] ROW["Current row
(single copy)"] UNDO["Undo log
(before-images)"] ROW -->|"reconstruct on demand"| UNDO end subgraph SF["Snowflake: file-level"] MP1["Micro-partition set A
(old)"] MP2["Micro-partition set B
(current)"] META["Table metadata
points to current set"] META --> MP2 META -.->|"time travel"| MP1 end

Three genuinely different mechanisms converging on the same guarantee — a consistent read without blocking a writer. Postgres versions individual tuples inline; InnoDB keeps one current row and reconstructs history from a separate undo log; Snowflake versions at the file level and repoints metadata, which is why its "old version" retrieval (time travel) reads more like a storage feature than a concurrency-control mechanism.

What is write skew, and why doesn't snapshot isolation prevent it?

MVCC's default isolation level — snapshot isolation — guarantees each transaction sees a consistent point-in-time view and prevents most of the classic concurrency anomalies (dirty reads, non-repeatable reads). It does not prevent write skew: two transactions each read overlapping data, each makes a decision based on what they read, and each writes to a different row — so neither transaction's write technically conflicts with the other's write, yet the combined result violates an invariant that held before either transaction ran. The canonical example: two on-call doctors, a rule that at least one doctor must always be on call, both check the schedule (both see two doctors on call, so it's "safe" to go off call), and both independently remove themselves — because each was reading a snapshot that didn't yet reflect the other's pending change, the invariant breaks even though neither individual write was invalid in isolation. Only the stricter serializable isolation level, which detects and prevents exactly this class of anomaly, closes the gap — and it's rarely the default, because it costs real throughput to enforce, which is precisely why it's opt-in rather than automatic.

"We use Postgres, so we get transactional consistency" is a claim I've heard justify skipping a real invariant check — and it's true only up to snapshot isolation's actual guarantees, which don't include write skew. I've seen an inventory-reservation system built on the assumption that snapshot isolation alone would prevent two concurrent reservations from double-booking the last unit of stock — it didn't, because each transaction read "1 unit available," each decided independently that reserving was safe, and both committed. The fix wasn't more MVCC, it was either an explicit application-level check with a locking read (SELECT ... FOR UPDATE) or bumping the transaction to serializable isolation for that specific operation. Know which invariants your isolation level actually protects before assuming "we use a transactional database" covers a correctness requirement it doesn't.

What to carry away

MVCC solves the same problem everywhere it appears — readers and writers no longer block each other because readers see a consistent snapshot while writers create new versions rather than overwriting in place — but the three major implementations pay for that guarantee in genuinely different currencies. Postgres keeps old tuple versions inline via xmin/xmax and pays with table bloat that VACUUM has to clean up; InnoDB keeps one current row and reconstructs old versions from a separate undo log on demand; Snowflake versions at the immutable-file level and repoints metadata, which is architecturally closer to versioned storage than to row-level MVCC, even though it solves an equivalent problem for the "give me a consistent read without blocking" case that time travel serves.

None of this makes an isolation-level anomaly like write skew disappear — snapshot isolation, MVCC's usual default, is a real and valuable guarantee, but it's not the same guarantee as full serializability, and an invariant that spans two rows written by two concurrent transactions can still break under it. Know which specific guarantee your database's default isolation level gives you before treating "it's transactional" as a substitute for actually checking.