Apache Hudi Internals: Copy-on-Write, Merge-on-Read, and the Record Index

The three open table formats are usually lumped together — Delta, Iceberg, Hudi, the lakehouse trio that brought ACID to object storage. But they didn't start from the same problem, and Hudi's origin explains everything distinctive about it. Apache Hudi was born at Uber to solve a specific pain: ingesting a relentless stream of updates — trip records changing status, riders updating details — into the lake efficiently, without rewriting whole partitions every time one row changed. So where the others started as "how do we get consistent reads over immutable files," Hudi started as "how do we do fast, frequent upserts and incremental processing on the lake." The name even says it: Hudoop Upserts Deletes and Incrementals. Everything interesting about Hudi's internals follows from being upsert-first.

If you've read my Iceberg internals and lakehouse pieces, this completes the trio — and it's the one with the most opinionated take on the read-versus-write trade-off.

The timeline: Hudi's transaction log

Like the others, Hudi gets ACID from an ordered log of actions, which it calls the timeline. Every operation — a commit, a delta-commit, a compaction, a cleanup — is recorded on the timeline with a state (requested, inflight, completed) and a monotonic instant. Readers see a consistent snapshot by reading up to the latest completed instant; a writer's changes are invisible until its instant completes. This is the same fundamental idea as Delta's transaction log and Iceberg's metadata tree: an atomic, ordered record of "what is the table now," which is what makes concurrent reads and writes safe on object storage. The timeline is also what powers Hudi's signature feature — incremental queries — but I'll come back to that.

Copy-on-Write vs Merge-on-Read: the central choice

Here's where Hudi forces a decision the other formats historically didn't put front-and-center. Hudi has two table types, and they're opposite bets on whether you pay the cost of an update at write time or at read time. Understanding this one trade-off is understanding Hudi.

graph TD
    UP["An upsert arrives
(change rows in a file)"]
    subgraph COW["Copy-on-Write (COW)"]
        C1["Rewrite the whole file
with changes applied"]
        C2["Reads: just read columnar files
(fast, no merge)"]
        C1 --> C2
    end
    subgraph MOR["Merge-on-Read (MOR)"]
        M1["Append change to a
row-based delta log file
(cheap write)"]
        M2["Reads: merge base file
+ delta logs on the fly
(slower until compaction)"]
        M3["Compaction (background):
fold deltas into base files"]
        M1 --> M2
        M1 -.-> M3 -.-> M2
    end
    UP --> COW
    UP --> MOR

The two table types, the same upsert. Copy-on-Write pays at write time — it rewrites the affected columnar file so reads are pure, fast columnar scans. Merge-on-Read pays at read time — it appends the change cheaply to a row-based log and merges base + deltas when you read, with a background compaction periodically folding the deltas back into base files. COW favors read-heavy tables; MOR favors write/update-heavy, low-latency-ingest tables. Choosing wrong is the most common Hudi mistake.

Copy-on-Write (COW)

On an update, Hudi rewrites the entire file (file slice) containing the affected records, with the changes merged in, producing a new version. Writes are expensive — you rewrite a whole columnar file to change a few rows — but reads are as fast as possible because they're plain columnar reads with no merge work. COW is the right choice for tables that are read far more than written, where you can tolerate heavier write amplification for clean read performance. It behaves much like Delta/Iceberg's default copy-on-write update.

Merge-on-Read (MOR)

On an update, Hudi appends the change to a row-based delta log file alongside the columnar base file — a cheap, fast write. The cost moves to read time: a query must merge the base file with its delta logs on the fly to get the current state. A background compaction process periodically folds the accumulated deltas into new base files, restoring read speed. MOR is the choice for write-heavy, update-heavy, low-latency ingestion (streaming, CDC) where you can't afford to rewrite files on every change. It's the table type that most expresses Hudi's upsert-first DNA — and the one the other formats took longer to match.

The record-level index: why Hudi upserts are fast

An upsert has a hidden hard part: before you can update a record, you have to find which file it lives in. Naively, that means scanning the table to locate the row by key — ruinous if you're doing it on every micro-batch. Hudi's answer, and a genuine differentiator, is a record-level index: a mapping from record key to the file that holds it. With the index, an incoming upsert is routed straight to the right file without a table scan — which is exactly what makes high-frequency upserts tractable.

Hudi has offered several index types over the years (bloom-filter-based indexes built into the file footers, and a global record-level index in the metadata table for fast key lookups across partitions). The detail that matters for an architect: the index is how Hudi turns "update this key" into a targeted file write instead of a full scan, and it's why Hudi has historically had the strongest upsert story of the three formats. The trade-off is that maintaining the index is itself work, and a global index (lookups across all partitions) costs more than a partition-scoped one.

# Hudi upsert from Spark — the key + precombine fields drive index routing and merge
(df.write.format("hudi")
   .option("hoodie.table.name", "trips")
   .option("hoodie.datasource.write.table.type", "MERGE_ON_READ")   # or COPY_ON_WRITE
   .option("hoodie.datasource.write.recordkey.field", "trip_id")    # identity for the index
   .option("hoodie.datasource.write.precombine.field", "updated_at")# newest wins on conflict
   .option("hoodie.datasource.write.operation", "upsert")
   .mode("append")
   .save("s3://lake/trips"))

Incremental queries: read only what changed

Because the timeline records every commit with an instant, Hudi can answer a query the other formats made you work harder for: "give me only the records that changed since instant X." This incremental query turns the lake table into a change feed — downstream pipelines pull just the new and updated rows since their last run, instead of rescanning the whole table or bolting on separate CDC. It's the same instinct as log-based CDC, but native to the table itself, and it's why Hudi fits incremental ETL chains so naturally. (Iceberg and Delta have since added their own change-feed capabilities, but incremental processing was Hudi's idea from the start.)

The maintenance you can't ignore

Hudi gives you more tuning power than the others, which means more ways to misconfigure it — table services are not optional. The flip side of Hudi's flexibility is operational weight. MOR tables that never compact accumulate delta logs until reads crawl. Without cleaning, old file versions pile up and storage balloons. Without clustering, data layout degrades and queries scan more than they should. These "table services" (compaction, cleaning, clustering) can run inline with writes or asynchronously, and getting that wrong is the classic Hudi failure: a streaming ingest that's fast for a week and then mysteriously slows because compaction couldn't keep up. Hudi's many knobs are a strength when tuned and a liability when ignored — budget for operating the table, not just writing to it. This operational surface is, fairly, the most common reason teams find Hudi harder to run than Iceberg or Delta.

How Hudi compares to Iceberg and Delta

	Hudi	Iceberg	Delta
Born to solve	Fast upserts & incrementals (Uber)	Consistent reads at huge scale (Netflix)	ACID on the lake (Databricks)
Update model	COW or MOR (your choice)	Copy-on-write + merge-on-read (added later)	Copy-on-write + deletion vectors (added later)
Upsert / key index	Record-level index — strongest	No built-in record index	No built-in record index
Incremental query	Native, original	Change-log scans (added)	Change Data Feed (added)
Operational complexity	Higher (table services to tune)	Lower	Lower
Sweet spot	Streaming / CDC / update-heavy ingest	Large analytical tables, broad engine support	Databricks-centric lakehouses

The honest 2024 picture: the three formats have converged a lot — Iceberg and Delta have added merge-on-read-style updates and change feeds, narrowing Hudi's original lead. But Hudi still has the most mature story for the specific workload it was born for: high-frequency upserts and incremental streams, backed by a record-level index the others don't have. If your problem is "I'm constantly updating records and need that to be cheap and fast," Hudi is still the format that was designed for exactly you. (For the catalog and interop layer above all three, see the open table formats overview.)

What to carry away

Apache Hudi is the upsert-first member of the lakehouse trio, and its internals all trace back to that origin. The timeline gives it ACID and powers incremental queries (read only what changed since an instant) — a native change feed before the others had one. The defining choice is the table type: Copy-on-Write pays at write time (rewrite files) for fast pure-columnar reads, while Merge-on-Read pays at read time (append cheap delta logs, merge on read, compact in the background) for cheap high-frequency writes. And the record-level index is the piece that makes upserts fast by routing each key straight to its file instead of scanning.

The cost of all that power is operational: MOR needs compaction, tables need cleaning and clustering, and neglecting those table services is the standard way a fast Hudi pipeline quietly degrades. The formats have converged — Iceberg and Delta now do merge-on-read and change feeds too — so pick Hudi when your workload genuinely is update- and ingest-heavy and you'll use the record index and incremental queries it was built around. For read-mostly analytical tables, the simpler operational profile of Iceberg or Delta is often the better trade. Match the format to the workload it was born for, and Hudi earns its keep exactly where the others have to stretch.