Apache Iceberg Internals: Metadata Trees, Snapshots, and the Catalog Wars

The thing that broke "just put Parquet on S3 and call it a table" was never the files — it was everything around them. Two jobs writing at once and clobbering each other's output. A reader catching a half-finished commit and returning garbage. Renaming a column meaning a careful rewrite of everything. Figuring out which of the 200,000 files in a prefix are actually current. Object storage gives you durable, cheap bytes and none of the guarantees a table needs. Apache Iceberg is the specification that adds those guarantees back — ACID transactions, schema evolution, time travel — without giving up open files on cheap storage.

Iceberg is a table format: a spec for a layer of metadata that sits over your data files and turns them into a real table with transactional semantics. The whole thing rests on one structural idea — a tree of immutable metadata where a single atomic pointer swap is what makes a change visible. Get that tree and you understand snapshots, time travel, hidden partitioning, and the read/write trade-offs. Then there's the 2024 plot twist: the fight stopped being about the format and moved to the catalog. I'll cover both. (For the wider survey of Iceberg vs Delta vs Hudi, see open table formats; this is Iceberg from the inside.)

The metadata tree: the whole format in one structure

An Iceberg table is a hierarchy of files in object storage. Reading top-down:

Metadata file (vN.metadata.json) — the root. Holds the table schema, partition spec, properties, and the list of snapshots, with a pointer to the current one. There's a new immutable metadata file for every table version.
Manifest list — one per snapshot. A list of all the manifest files that make up that snapshot, each annotated with partition-range summaries so a planner can skip whole manifests.
Manifest files — each lists a set of actual data files, and crucially carries per-file statistics: row counts, null counts, and min/max values for each column.
Data files — the Parquet (or ORC/Avro) files holding the rows. Iceberg tracks them explicitly; it never lists a directory to discover them.

graph TD
    CAT["Catalog
(points to current metadata file)"]
    META["Metadata file (vN.metadata.json)
schema, partition spec, snapshot list"]
    SNAP["Snapshot S2 (current)"]
    ML["Manifest list
(+ partition range summaries)"]
    M1["Manifest A
(data files + min/max stats)"]
    M2["Manifest B
(data files + min/max stats)"]
    D1["Parquet data files"]
    D2["Parquet data files"]
    CAT --> META --> SNAP --> ML
    ML --> M1 --> D1
    ML --> M2 --> D2

The Iceberg metadata tree. The catalog holds one pointer — to the current metadata file. From there it's metadata file → snapshot → manifest list → manifests → data files, with column statistics stored at the manifest level. Because every layer is immutable, a commit just writes new files and swaps the catalog's single pointer atomically. That pointer swap is the transaction.

Two payoffs fall straight out of this. First, planning never lists directories — the engine reads the manifest list, prunes by partition summaries, then reads only relevant manifests and prunes data files by their min/max stats. On a table with hundreds of thousands of files, that metadata-driven planning is the difference between a query that starts instantly and one that spends minutes just enumerating S3. Second, the stats enable file-level skipping the same way Parquet footers enable row-group skipping — but across the entire dataset.

Snapshots, atomic commits, and time travel

Every change to an Iceberg table produces a new snapshot — a complete, immutable picture of which data files constitute the table at that moment. A write doesn't mutate anything: it writes new data files, new manifests, a new manifest list, and a new metadata file referencing a new snapshot, then performs one atomic operation to point the catalog at that new metadata file. Until that swap, readers see the old snapshot, whole and consistent; after it, they see the new one. There is no in-between state a reader can observe.

That single atomic pointer swap is the entire basis of Iceberg's ACID guarantees, and concurrent writers are handled with optimistic concurrency: each prepares its commit against the snapshot it read, and the swap succeeds only if nothing else committed in the meantime — otherwise it retries against the new state. No locking the table, no torn reads.

Because old snapshots remain valid until you expire them, time travel is free — it's just reading an older metadata pointer. Query the table as of a timestamp or snapshot id to reproduce exactly what a report saw last Tuesday, audit what changed, or roll back a bad write by re-pointing at the previous snapshot. Reproducibility, which is brutally hard on a mutable warehouse, is a side effect of the immutable tree here. The cost is housekeeping: expire_snapshots to stop old files accumulating forever.

-- Time travel: read the table exactly as it was at a past snapshot
SELECT * FROM orders FOR SYSTEM_TIME AS OF '2024-07-01 00:00:00';
SELECT * FROM orders FOR SYSTEM_VERSION AS OF 3921025478193210123;

-- Roll back a bad write by re-pointing at a known-good snapshot
CALL catalog.system.rollback_to_snapshot('db.orders', 3921025478193210123);

-- Housekeeping so immutable history doesn't grow without bound
CALL catalog.system.expire_snapshots('db.orders', TIMESTAMP '2024-06-01 00:00:00');

Hidden partitioning: the Hive mistake Iceberg fixed

In the old Hive world, partitioning leaked into both physical layout and queries: data lived in directories like /dt=2024-07-09/, and to get pruning you had to filter on a derived dt column and keep it in sync with the real event_time. Forget, or filter on event_time directly, and you silently scanned everything. Iceberg's hidden partitioning records the partition transform (e.g. day(event_time)) in metadata and applies it for you. You filter on the natural column; Iceberg derives the partition values and prunes correctly. Partitioning is no longer a column users must know about and babysit.

Better still, because partitioning is metadata rather than directory structure, you can evolve it — switch from daily to hourly partitions going forward — without rewriting existing data. Old data keeps its old layout; new data uses the new one; queries span both. The same metadata-not-physical principle is why schema evolution is safe: columns have stable ids, so add, drop, rename, and reorder are pure metadata operations that never require rewriting files and never resurrect deleted-then-readded columns.

Copy-on-write vs merge-on-read

The one design decision you actually have to make per table is how row-level updates and deletes are handled. There are two strategies, and they trade write cost against read cost — the classic tension you also see in LSM storage engines:

	Copy-on-write (COW)	Merge-on-read (MOR)
On update/delete	Rewrite the whole data file with the change applied	Write a small delete file marking affected rows; data file untouched
Write cost	High — rewrites entire files	Low — appends delete files quickly
Read cost	Low — read clean files directly	Higher — merge delete files at read time
Best for	Read-heavy, infrequent updates	Frequent updates / streaming upserts, CDC

COW keeps reads fast by paying up front to produce clean files; MOR keeps writes cheap by deferring the merge to read time, which is what you want for high-frequency updates and CDC ingestion — at the price of slower reads and a dependence on regular compaction to fold the delete files back in. There's no free option: you're choosing which side of the read/write trade your workload can afford, and MOR specifically buys you the ability to mutate rows often without rewriting large files each time.

The catalog wars: where the 2024 fight moved

Here's the strategic shift. Notice that the catalog holds the one mutable thing in the whole design — the pointer to the current metadata file. That makes the catalog the transaction coordinator and the control point for governance: who can read or write, where lineage and access policy live. As Iceberg the format effectively won the open-table-format debate, the contest moved up a layer to who owns the catalog, because the catalog is where lock-in (or openness) now lives.

The unlock was the Iceberg REST Catalog spec — a standard HTTP API for catalog operations, so engines can talk to any compliant catalog instead of each vendor's bespoke one. On top of it, 2024 brought a flurry: Snowflake open-sourced Polaris as a REST-compatible open catalog, Databricks open-sourced Unity Catalog and (having acquired Tabular, the company founded by Iceberg's creators) leaned into interoperability, and the older Hive-metastore and cloud-native catalogs remain in play. The format is settling; the catalog is the new battleground.

The catalog is now your real lock-in decision, not the file format. It's tempting to think "we chose Iceberg, so we're open" — but your data files were always the easy, portable part. The catalog holds the pointers, the commit coordination, and increasingly the governance and access control. Pick a catalog you can't easily migrate off, or one that doesn't implement the REST spec cleanly, and you've recreated the lock-in you adopted an open format to escape. Evaluate the catalog's openness and portability as carefully as you once evaluated the warehouse — that's where the leverage sits in 2024.

What to carry away

Apache Iceberg turns files on object storage into a transactional table through a tree of immutable metadata — metadata file, manifest list, manifests, data files — where every change writes new files and a single atomic pointer swap makes it visible. From that one idea you get ACID commits, snapshots and free time travel, metadata-driven query planning that never lists directories, hidden partitioning and safe schema evolution (layout is metadata, not directories), and a per-table choice of copy-on-write vs merge-on-read to place the update cost where your workload can pay it.

And the 2024 lesson on top: the format question is largely settled, so the decision that now carries lock-in is the catalog — the keeper of that one mutable pointer and your governance layer. The REST Catalog spec, Polaris, and open Unity Catalog are all jockeying for it. Choose Iceberg for the table; choose your catalog as deliberately as you'd choose a warehouse. For the comparison against Delta Lake and Hudi, the open table formats piece is the companion.

Frequently asked questions

How does Apache Iceberg give object storage ACID transactions?

Iceberg keeps an immutable metadata tree: metadata file, manifest list, manifests, then data files. A write creates new files and performs a single atomic swap of the catalog's pointer to the new metadata file, so readers always see a complete, consistent snapshot — that pointer swap is the transaction.

What is the difference between copy-on-write and merge-on-read?

Copy-on-write rewrites whole data files on update or delete, giving cheap reads but expensive writes; merge-on-read writes small delete files and merges them at query time, giving cheap writes but more expensive reads. Choose copy-on-write for read-heavy tables and merge-on-read for frequent updates or streaming upserts.

Why did the catalog become the important decision with Iceberg?

The catalog holds the one mutable thing — the pointer to the current metadata — so it is the transaction coordinator and the governance control point. As the format settled, lock-in moved up to the catalog, which is why the Iceberg REST Catalog spec, Polaris, and open Unity Catalog matter.