# Apache Iceberg Internals: Metadata Trees, Snapshots, and the Catalog Wars

The thing that broke "just put Parquet on S3 and call it a table" was never the files — it was everything around them. Two jobs writing at once and clobbering each other's output. A reader catching a half-finished commit and returning garbage. Renaming a column meaning a careful rewrite of everything. Figuring out which of the 200,000 files in a prefix are actually current. Object storage gives you durable, cheap bytes and none of the guarantees a table needs. **Apache Iceberg** is the specification that adds those guarantees back — ACID transactions, schema evolution, time travel — without giving up open files on cheap storage.

Iceberg is a **table format**: a spec for a layer of metadata that sits over your data files and turns them into a real table with transactional semantics. The whole thing rests on one structural idea — a **tree of immutable metadata** where a single atomic pointer swap is what makes a change visible. Get that tree and you understand snapshots, time travel, hidden partitioning, and the read/write trade-offs. Then there's the 2024 plot twist: the fight stopped being about the format and moved to the *catalog*. I'll cover both. (For the wider survey of Iceberg vs Delta vs Hudi, see [open table formats](open-table-formats); this is Iceberg from the inside.)

## The metadata tree: the whole format in one structure

An Iceberg table is a hierarchy of files in object storage. Reading top-down:

- **Metadata file** (`vN.metadata.json`) — the root. Holds the table schema, partition spec, properties, and the list of **snapshots**, with a pointer to the current one. There's a new immutable metadata file for every table version.

- **Manifest list** — one per snapshot. A list of all the manifest files that make up that snapshot, each annotated with partition-range summaries so a planner can skip whole manifests.

- **Manifest files** — each lists a set of actual data files, and crucially carries **per-file statistics**: row counts, null counts, and min/max values for each column.

- **Data files** — the [Parquet](parquet-orc-internals) (or ORC/Avro) files holding the rows. Iceberg tracks them explicitly; it never lists a directory to discover them.

```mermaid
graph TD
    CAT["Catalog<br/>(points to current metadata file)"]
    META["Metadata file (vN.metadata.json)<br/>schema, partition spec, snapshot list"]
    SNAP["Snapshot S2 (current)"]
    ML["Manifest list<br/>(+ partition range summaries)"]
    M1["Manifest A<br/>(data files + min/max stats)"]
    M2["Manifest B<br/>(data files + min/max stats)"]
    D1["Parquet data files"]
    D2["Parquet data files"]
    CAT --> META --> SNAP --> ML
    ML --> M1 --> D1
    ML --> M2 --> D2
```

The Iceberg metadata tree. The catalog holds one pointer — to the current metadata file. From there it's metadata file → snapshot → manifest list → manifests → data files, with column statistics stored at the manifest level. Because every layer is immutable, a commit just writes new files and swaps the catalog's single pointer atomically. That pointer swap is the transaction.

Two payoffs fall straight out of this. First, **planning never lists directories** — the engine reads the manifest list, prunes by partition summaries, then reads only relevant manifests and prunes data files by their min/max stats. On a table with hundreds of thousands of files, that metadata-driven planning is the difference between a query that starts instantly and one that spends minutes just enumerating S3. Second, the stats enable file-level skipping the same way [Parquet footers](parquet-orc-internals) enable row-group skipping — but across the entire dataset.
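The two-level pruning described above can be sketched in a few lines. This is a toy model, not Iceberg's planner: the class names and `plan_scan` are made up for illustration, and it uses one integer column for both the partition summary and the per-file stats, whereas real manifests carry summaries per partition field and stats per column.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DataFile:
    path: str
    min_val: int  # min of the filtered column in this file
    max_val: int  # max of the filtered column in this file

@dataclass(frozen=True)
class Manifest:
    path: str
    part_min: int  # range summary stored in the manifest list
    part_max: int
    files: tuple   # DataFile entries stored inside the manifest

def overlaps(lo, hi, q_lo, q_hi):
    """True if the closed ranges [lo, hi] and [q_lo, q_hi] intersect."""
    return lo <= q_hi and q_lo <= hi

def plan_scan(manifest_list, q_lo, q_hi):
    """Return paths of files that may hold rows with q_lo <= value <= q_hi."""
    selected = []
    for m in manifest_list:
        if not overlaps(m.part_min, m.part_max, q_lo, q_hi):
            continue  # whole manifest skipped without being opened
        for f in m.files:
            if overlaps(f.min_val, f.max_val, q_lo, q_hi):
                selected.append(f.path)
    return selected

manifest_list = [
    Manifest("manifest-a.avro", 0, 99,
             (DataFile("a1.parquet", 0, 49), DataFile("a2.parquet", 50, 99))),
    Manifest("manifest-b.avro", 100, 199,
             (DataFile("b1.parquet", 100, 149), DataFile("b2.parquet", 150, 199))),
]

print(plan_scan(manifest_list, 120, 140))  # ['b1.parquet']
```

Nothing in `plan_scan` touches storage listings: it walks metadata only, and Manifest A is rejected from its summary alone.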

## Snapshots, atomic commits, and time travel

Every change to an Iceberg table produces a new **snapshot** — a complete, immutable picture of which data files constitute the table at that moment. A write doesn't mutate anything: it writes new data files, new manifests, a new manifest list, and a new metadata file referencing a new snapshot, then performs **one atomic operation** to point the catalog at that new metadata file. Until that swap, readers see the old snapshot, whole and consistent; after it, they see the new one. There is no in-between state a reader can observe.

That single atomic pointer swap is the entire basis of Iceberg's ACID guarantees. Concurrent writers are handled with optimistic concurrency: each prepares its commit against the snapshot it read, and the swap succeeds only if nothing else committed in the meantime. If something did, the writer re-checks its changes against the new state and retries. Writers never lock the table, and readers never see a torn commit.
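Here is a minimal sketch of that commit protocol, assuming an in-memory stand-in for the catalog. `Catalog`, `compare_and_swap`, `commit`, and the `store` dict are all hypothetical names for illustration; a real catalog provides the atomic swap through its own backend, and a real writer also validates that its changes don't conflict before retrying.

```python
import threading
import uuid

class Catalog:
    """Toy catalog: a single mutable pointer updated by compare-and-swap."""

    def __init__(self, location):
        self._current = location
        self._lock = threading.Lock()  # makes check-then-set one atomic step

    def current(self):
        return self._current

    def compare_and_swap(self, expected, new):
        with self._lock:
            if self._current != expected:
                return False  # someone else committed first
            self._current = new
            return True

def commit(catalog, store, new_files, max_retries=5):
    """Append new_files as a new table version; return the attempts used."""
    for attempt in range(1, max_retries + 1):
        base = catalog.current()
        new_location = f"{uuid.uuid4()}.metadata.json"
        # Write a brand-new metadata entry; the base version is never mutated.
        store[new_location] = store[base] + tuple(new_files)
        if catalog.compare_and_swap(base, new_location):
            return attempt
        del store[new_location]  # lost the race: discard and retry on the new base
    raise RuntimeError("commit failed after retries")

store = {"v1.metadata.json": ("f1.parquet",)}
catalog = Catalog("v1.metadata.json")

stale_base = catalog.current()           # writer B reads v1 ...
commit(catalog, store, ["a.parquet"])    # ... but writer A commits first
swapped = catalog.compare_and_swap(stale_base, "b-attempt.metadata.json")
print(swapped)                           # False: B's base is no longer current
commit(catalog, store, ["b.parquet"])    # B retries on top of A's version
print(store[catalog.current()])          # ('f1.parquet', 'a.parquet', 'b.parquet')
```

The old `v1.metadata.json` entry is still in `store` at the end, untouched, which is exactly why a reader that started on v1 keeps seeing a consistent table.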

Because old snapshots remain valid until you expire them, **time travel is free**: it's just reading an older snapshot that the metadata file still lists. Query the table as of a timestamp or snapshot id to reproduce exactly what a report saw last Tuesday, audit what changed, or roll back a bad write by re-pointing the table at the previous snapshot. Reproducibility, which is brutally hard on a mutable warehouse, is a side effect of the immutable tree here. The cost is housekeeping: run `expire_snapshots` regularly so old snapshots and the files only they reference don't accumulate forever, and accept that you can no longer time travel to a snapshot once it has been expired.

```sql
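-- Spark SQL with the Iceberg SQL extensions; other engines spell these differently.
-- `catalog` stands for your catalog's name; the snapshot id is a placeholder.
-- Time travel: read the table exactly as it was at a past timestamp or snapshot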
SELECT * FROM orders FOR SYSTEM_TIME AS OF '2024-07-01 00:00:00';
SELECT * FROM orders FOR SYSTEM_VERSION AS OF 3921025478193210123;

-- Roll back a bad write by re-pointing at a known-good snapshot
CALL catalog.system.rollback_to_snapshot('db.orders', 3921025478193210123);

-- Housekeeping so immutable history doesn't grow without bound
CALL catalog.system.expire_snapshots('db.orders', TIMESTAMP '2024-06-01 00:00:00');
```
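What "as of a timestamp" resolves to is easy to model: pick the latest snapshot committed at or before that time. The sketch below assumes a toy snapshot log of `(commit time in ms, snapshot id)` pairs; `snapshot_as_of` and this simplified `expire_snapshots` are illustrative helpers, not Iceberg's API.

```python
from bisect import bisect_right

# (commit time in ms since epoch, snapshot id), in commit order
snapshot_log = [(1000, 11), (2000, 22), (3000, 33)]

def snapshot_as_of(log, ts_ms):
    """Return the id of the latest snapshot committed at or before ts_ms."""
    times = [t for t, _ in log]
    i = bisect_right(times, ts_ms)
    if i == 0:
        raise ValueError("no snapshot at or before that time")
    return log[i - 1][1]

def expire_snapshots(log, older_than_ms):
    """Drop snapshots older than the cutoff, always keeping the newest one."""
    kept = [entry for entry in log[:-1] if entry[0] >= older_than_ms]
    return kept + log[-1:]

print(snapshot_as_of(snapshot_log, 2500))    # 22
trimmed = expire_snapshots(snapshot_log, 2500)
print(trimmed)                               # [(3000, 33)]
```

After the expiry, `snapshot_as_of(trimmed, 2500)` raises `ValueError`: the retention window you pick is also the limit of how far back you can travel.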

## Hidden partitioning: the Hive mistake Iceberg fixed

In the old Hive world, partitioning leaked into both physical layout and queries: data lived in directories like `/dt=2024-07-09/`, and to get pruning you had to filter on a derived `dt` column and keep it in sync with the real `event_time`. Forget, or filter on `event_time` directly, and you silently scanned everything. Iceberg's **hidden partitioning** records the partition transform (e.g. `day(event_time)`) in metadata and applies it for you. You filter on the natural column; Iceberg derives the partition values and prunes correctly. Partitioning is no longer a column users must know about and babysit.
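To make the idea concrete, here is a rough sketch of how a filter on `event_time` can be turned into a partition filter. It assumes the day transform produces days since 1970-01-01 (which is how the Iceberg spec defines it); `day_transform` and `prune_partitions` are hypothetical helper names, and the upper bound is kept inclusive so the pruning stays conservative.

```python
from datetime import datetime, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def day_transform(ts):
    """Map a UTC timestamp to its partition value: whole days since the epoch."""
    return (ts - EPOCH).days

def prune_partitions(partition_days, start, end):
    """Keep partitions that may hold rows with start <= event_time < end."""
    lo, hi = day_transform(start), day_transform(end)
    # Inclusive on hi: may keep one extra partition, never drops a needed one.
    return [d for d in partition_days if lo <= d <= hi]

def utc(*args):
    return datetime(*args, tzinfo=timezone.utc)

# Partition values that exist in the table: July 8, 9, and 10.
partition_days = [day_transform(utc(2024, 7, d)) for d in (8, 9, 10)]

# The user filters on the natural column only.
kept = prune_partitions(partition_days, utc(2024, 7, 9, 6), utc(2024, 7, 9, 18))
print(len(kept), kept[0] == day_transform(utc(2024, 7, 9)))  # 1 True
```

The query never mentions a `dt` column. The transform recorded in metadata is what connects the predicate to the layout.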

Better still, because partitioning is metadata rather than directory structure, you can **evolve** it (switch from daily to hourly partitions going forward, say) without rewriting existing data. Old data keeps its old layout; new data uses the new one; queries span both. The same metadata-not-physical principle is why **schema evolution** is safe: every column has a stable id, so add, drop, rename, and reorder are pure metadata operations that never require rewriting files. And because a re-added column gets a fresh id, dropping a column and later adding one with the same name never resurrects the old column's data.
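The column-id mechanism is worth seeing in miniature. This is a toy model under simplifying assumptions: a "data file" is a dict keyed by field id, a schema is a name-to-id mapping, and `read_row` is a made-up helper rather than anything from an Iceberg library.

```python
# A data file stores values keyed by field id, not by column name.
data_file_row = {1: 42, 2: "alice@example.com"}

def read_row(row_by_id, schema):
    """Project a stored row through a schema (column name -> field id)."""
    return {name: row_by_id.get(field_id) for name, field_id in schema.items()}

schema_v1 = {"id": 1, "email": 2}

# Rename: same field id, new name. No file is rewritten.
schema_v2 = {"id": 1, "contact_email": 2}

# Drop "email", then add a new column called "email": it gets a fresh id.
schema_v3 = {"id": 1, "email": 3}

print(read_row(data_file_row, schema_v1))  # {'id': 42, 'email': 'alice@example.com'}
print(read_row(data_file_row, schema_v2))  # {'id': 42, 'contact_email': 'alice@example.com'}
print(read_row(data_file_row, schema_v3))  # {'id': 42, 'email': None}
```

Under `schema_v3` the old file reads as null for `email`, because id 3 was never written to it. With name-based resolution the old values would have come back.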

## Copy-on-write vs merge-on-read

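The one design decision you actually have to make per table is how row-level updates and deletes are handled. There are two strategies, and they trade write cost against read cost, the classic tension you also see in [LSM storage engines](cassandra-internals). In Iceberg the choice is a table property, set separately for deletes, updates, and merges (`write.delete.mode`, `write.update.mode`, `write.merge.mode`):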

|  | Copy-on-write (COW) | Merge-on-read (MOR) |
| --- | --- | --- |
| On update/delete | Rewrite the whole data file with the change applied | Write a small delete file marking affected rows; data file untouched |
| Write cost | High — rewrites entire files | Low — appends delete files quickly |
| Read cost | Low — read clean files directly | Higher — merge delete files at read time |
| Best for | Read-heavy, infrequent updates | Frequent updates / streaming upserts, CDC |

COW keeps reads fast by paying up front to produce clean files; MOR keeps writes cheap by deferring the merge to read time, which is what you want for high-frequency updates and CDC ingestion — at the price of slower reads and a dependence on regular compaction to fold the delete files back in. There's no free option: you're choosing which side of the read/write trade your workload can afford, and MOR specifically buys you the ability to mutate rows often without rewriting large files each time.
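The trade is easier to feel with both strategies side by side. The sketch below is a toy: files are lists of rows in a dict, deletes are `(file path, row position)` pairs in the spirit of position deletes, and every function name is invented for illustration.

```python
def cow_delete(files, path, row_positions):
    """Copy-on-write: replace `path` with a rewritten file minus the rows."""
    doomed = set(row_positions)
    rewritten = [row for pos, row in enumerate(files[path]) if pos not in doomed]
    new_files = {p: rows for p, rows in files.items() if p != path}
    new_files[path.replace(".parquet", "-rewritten.parquet")] = rewritten
    return new_files

def mor_delete(delete_entries, path, row_positions):
    """Merge-on-read: record (file, position) pairs; data files untouched."""
    return delete_entries + [(path, pos) for pos in row_positions]

def mor_read(files, delete_entries):
    """Apply delete entries while scanning: the cost paid at read time."""
    deleted = set(delete_entries)
    return [row
            for path, rows in files.items()
            for pos, row in enumerate(rows)
            if (path, pos) not in deleted]

files = {"f1.parquet": ["r0", "r1", "r2"], "f2.parquet": ["r3", "r4"]}

cow_files = cow_delete(files, "f1.parquet", [1])    # rewrote all of f1
deletes = mor_delete([], "f1.parquet", [1])         # wrote one small entry

cow_rows = sorted(row for rows in cow_files.values() for row in rows)
mor_rows = sorted(mor_read(files, deletes))
print(cow_rows == mor_rows, mor_rows)  # True ['r0', 'r2', 'r3', 'r4']
```

Both paths return the same rows. COW paid by rewriting three rows to remove one; MOR paid nothing up front but filters on every read until compaction produces something like `cow_files`.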

## The catalog wars: where the 2024 fight moved

Here's the strategic shift. Notice that the catalog holds the one mutable thing in the whole design — the pointer to the current metadata file. That makes the catalog the transaction coordinator and the control point for governance: who can read or write, where lineage and access policy live. As Iceberg the format effectively won the open-table-format debate, the contest moved up a layer to *who owns the catalog*, because the catalog is where lock-in (or openness) now lives.

The unlock was the **Iceberg REST Catalog** spec: a standard HTTP API for catalog operations, so engines can talk to any compliant catalog instead of each vendor's bespoke one. On top of it, 2024 brought a flurry. Snowflake open-sourced **Polaris** as a REST-compatible open catalog. Databricks open-sourced **Unity Catalog** and, having acquired Tabular (the company founded by Iceberg's creators), leaned into interoperability. The older Hive-metastore and cloud-native catalogs remain in play. The format is settling; the catalog is the new battleground.

**The catalog is now your real lock-in decision, not the file format.** It's tempting to think "we chose Iceberg, so we're open" — but your data files were always the easy, portable part. The catalog holds the pointers, the commit coordination, and increasingly the governance and access control. Pick a catalog you can't easily migrate off, or one that doesn't implement the REST spec cleanly, and you've recreated the lock-in you adopted an open format to escape. Evaluate the catalog's openness and portability as carefully as you once evaluated the warehouse — that's where the leverage sits in 2024.

## What to carry away

Apache Iceberg turns files on object storage into a transactional table through a **tree of immutable metadata** — metadata file, manifest list, manifests, data files — where every change writes new files and a single atomic pointer swap makes it visible. From that one idea you get **ACID commits**, **snapshots and free time travel**, metadata-driven query planning that never lists directories, **hidden partitioning** and safe schema evolution (layout is metadata, not directories), and a per-table choice of **copy-on-write vs merge-on-read** to place the update cost where your workload can pay it.

And the 2024 lesson on top: the format question is largely settled, so the decision that now carries lock-in is the **catalog** — the keeper of that one mutable pointer and your governance layer. The REST Catalog spec, Polaris, and open Unity Catalog are all jockeying for it. Choose Iceberg for the table; choose your catalog as deliberately as you'd choose a warehouse. For the comparison against Delta Lake and Hudi, the [open table formats](open-table-formats) piece is the companion.

## Frequently asked questions

### How does Apache Iceberg give object storage ACID transactions?

Iceberg keeps an immutable metadata tree: metadata file, manifest list, manifests, then data files. A write creates new files and performs a single atomic swap of the catalog's pointer to the new metadata file, so readers always see a complete, consistent snapshot — that pointer swap is the transaction.

### What is the difference between copy-on-write and merge-on-read?

Copy-on-write rewrites whole data files on update or delete, giving cheap reads but expensive writes; merge-on-read writes small delete files and merges them at query time, giving cheap writes but more expensive reads. Choose copy-on-write for read-heavy tables and merge-on-read for frequent updates or streaming upserts.

### Why did the catalog become the important decision with Iceberg?

The catalog holds the one mutable thing — the pointer to the current metadata — so it is the transaction coordinator and the governance control point. As the format settled, lock-in moved up to the catalog, which is why the Iceberg REST Catalog spec, Polaris, and open Unity Catalog matter.
