The Lakehouse: How a Transaction Log Put ACID on Object Storage

For most of the last decade, every serious data platform I built had the same shape, and the same scar. You land raw data cheaply in a data lake (Parquet files on S3, Azure Blob, or GCS) and then you copy a curated slice of it into a data warehouse so the BI tools and analysts have something fast, reliable, and governed to query. Two systems. Two copies of the truth. A nightly ETL job stitching them together, and a standing argument about which number is correct when the dashboard disagrees with the lake. In January 2021 a paper at the CIDR conference, "Lakehouse: A New Generation of Open Platforms that Unify Data Warehousing and Advanced Analytics," by Michael Armbrust, Ali Ghodsi, Reynold Xin, and Matei Zaharia of Databricks, named this pain and proposed killing one of the two systems. The idea had been building for a couple of years; 2021 is when it got a name and an argument, and when I started taking it seriously in real designs.

The lakehouse is a single system that puts warehouse-grade management — ACID transactions, schema enforcement, governance, fast SQL — directly on the cheap, open files in your data lake, so you don't need a separate warehouse at all. The whole bet rests on one unglamorous piece of engineering: a transaction log sitting next to your Parquet files. This is how that works and why it mattered.

Why the two-tier architecture hurt

The lake-plus-warehouse split wasn't stupid — each tier was good at what the other was bad at. The lake gave you cheap, infinite, open storage that any engine could read, perfect for raw and semi-structured data and for ML, which wants files, not table rows. The warehouse gave you transactions, fast queries, and the governance the lake lacked. You used both because neither alone was enough.

But the seam between them leaked, constantly, in ways that compounded:

  • Two copies, double the cost, and staleness. Data lived in the lake and again in the warehouse. The warehouse copy was always a little behind, and you paid to store and move everything twice.
  • The lake had no transactions. A Spark job writing to S3 that failed halfway left a directory of half-written files. A reader running during a write saw a torn, inconsistent picture. There was no atomic "commit" — just files appearing one by one.
  • No schema enforcement, so "data swamp." Anything could write any shape of file anywhere. Six months in, nobody trusted the lake, which is exactly why the warehouse existed.
  • ML and BI fought over data. Data scientists wanted the raw lake files; analysts wanted warehouse tables. Keeping both consistent was a permanent tax.

The lakehouse thesis is that almost every one of those warehouse-only capabilities is achievable on the lake's own files if you add one missing ingredient: a transaction layer.

The missing ingredient: a transaction log over files

Here's the central trick, and it's worth slowing down for. A table format like Delta Lake (the implementation I'll use throughout; Apache Iceberg and Apache Hudi solve the same problem with different designs) does not invent a new file format. Your data is still ordinary Parquet. What Delta adds is a transaction log, a directory called _delta_log sitting beside the data files, that records, as an ordered series of JSON commits, exactly which Parquet files make up the table right now.

The table is no longer "whatever Parquet files are in this directory." The table is "the set of files the log says are currently valid." That indirection is everything. Adding data means writing new Parquet files and then atomically appending a commit to the log that says "these files are now part of the table." Deleting or updating means writing new files and committing a record that says "these old files are out, these new ones are in." The data files are immutable; the log is the source of truth about which of them count.

graph TD
    subgraph TABLE["A Delta table in object storage (S3 / ADLS / GCS)"]
        P1["part-0001.parquet"]
        P2["part-0002.parquet"]
        P3["part-0003.parquet"]
        subgraph LOG["_delta_log/ (the transaction log)"]
            C0["00000.json<br/>ADD part-0001, part-0002"]
            C1["00001.json<br/>ADD part-0003<br/>REMOVE part-0001"]
        end
    end
    READER["Reader: replay the log<br/>to get the current file set"]
    LOG --> READER
    READER -.->|"valid now"| P2
    READER -.->|"valid now"| P3
    READER -.->|"superseded"| P1

A Delta table is plain Parquet plus an ordered JSON log. To read the table you replay the log's add/remove records to compute the current set of valid files — here part-0001 was superseded by part-0003 and is no longer part of the table, even though the file still physically exists. Because the log is the authority, a half-finished write that never committed is simply invisible: readers only ever see files a completed commit blessed.
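To make the replay step concrete, here is a minimal Python sketch of it. It is not Delta's actual log schema: the two-key `add`/`remove` JSON shape and the `replay` helper are simplifications I'm assuming for illustration, modeled on the two commits in the diagram.

```python
import json

# Simplified commits modeled on the diagram. Real log entries carry more
# fields (file sizes, statistics, schema metadata); this shape is illustrative.
COMMITS = [
    '{"add": ["part-0001.parquet", "part-0002.parquet"], "remove": []}',
    '{"add": ["part-0003.parquet"], "remove": ["part-0001.parquet"]}',
]


def replay(commits):
    """Apply every commit in order and return the set of currently valid files."""
    live = set()
    for raw in commits:
        action = json.loads(raw)
        live.update(action["add"])
        live.difference_update(action["remove"])
    return live


current_files = replay(COMMITS)
print(sorted(current_files))
```

Note that part-0001.parquet never has to be deleted from storage for it to drop out of the table; the replay simply stops counting it.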

How the log delivers each warehouse feature

ACID transactions on object storage

A write becomes atomic because it has exactly one commit point: the moment the new log entry lands. Before that, the new Parquet files exist but no reader counts them; after it, they all count at once. A job that dies mid-write leaves orphan Parquet files that no commit ever referenced — invisible garbage, not corruption. Isolation comes from snapshot reads: a reader pins the log version it started at and sees a stable picture even while a writer commits new versions. Concurrent writers use optimistic concurrency — each prepares its commit and the log's atomic "create the next numbered entry" operation lets exactly one win; the loser detects the conflict and retries against the new state.
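The "exactly one writer wins" rule can be sketched with a local directory standing in for the log. This is an assumption-heavy toy: exclusive file creation (`open(..., "x")`) plays the role of the object store's atomic create-if-absent, and the helper names (`try_commit`, `commit_with_retry`) are hypothetical, not any library's API.

```python
import json
import os
import tempfile


def try_commit(log_dir, version, actions):
    """Try to create commit file `version`; return False if it already exists."""
    path = os.path.join(log_dir, f"{version:05d}.json")
    try:
        # Mode "x" fails if the file exists, so only one writer can create it.
        with open(path, "x") as f:
            json.dump(actions, f)
        return True
    except FileExistsError:
        return False


def latest_version(log_dir):
    """Highest committed version number, or -1 for an empty log."""
    versions = [int(n.split(".")[0]) for n in os.listdir(log_dir) if n.endswith(".json")]
    return max(versions, default=-1)


def commit_with_retry(log_dir, actions, max_attempts=5):
    """Optimistic concurrency: target the next version, retry on conflict."""
    for _ in range(max_attempts):
        version = latest_version(log_dir) + 1
        # A real writer would also check whether the winning commit
        # logically conflicts with its own changes before retrying.
        if try_commit(log_dir, version, actions):
            return version
    raise RuntimeError("gave up after repeated commit conflicts")


log_dir = tempfile.mkdtemp()
v_a = commit_with_retry(log_dir, {"add": ["part-0001.parquet"]})
# Simulate a race: writer B targets the version writer A already took.
lost = try_commit(log_dir, v_a, {"add": ["part-0002.parquet"]})
v_b = commit_with_retry(log_dir, {"add": ["part-0002.parquet"]})
print(v_a, lost, v_b)
```

The losing writer's Parquet files are untouched by the conflict; only its claim on the version number failed, which is why a retry is cheap.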

Time travel

Because the log is an append-only history of versions, and old data files aren't deleted the instant they're superseded, you can read the table as of any past version or timestamp — just replay the log up to that point. This is genuinely useful, not a party trick: reproduce yesterday's report exactly, debug "what did this table look like before the bad load," roll back a botched write, or pin an ML training set to an immutable version.

-- the table as of version 50 in the log, or as of a wall-clock time
SELECT * FROM orders VERSION AS OF 50;
SELECT * FROM orders TIMESTAMP AS OF '2021-04-01 00:00:00';
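
Under the hood, a versioned read is just a bounded replay. The sketch below assumes a toy in-memory log where the list index is the version number; `files_as_of` and the file names are hypothetical.

```python
# Each entry is one commit; the list index is the version number (toy data).
LOG = [
    {"add": {"a.parquet", "b.parquet"}, "remove": set()},  # version 0
    {"add": {"c.parquet"}, "remove": {"a.parquet"}},       # version 1
    {"add": {"d.parquet"}, "remove": {"b.parquet"}},       # version 2
]


def files_as_of(log, version):
    """Replay commits 0..version inclusive and return the files valid then."""
    if not 0 <= version < len(log):
        raise ValueError(f"version {version} is not in the log")
    live = set()
    for commit in log[: version + 1]:
        live |= commit["add"]
        live -= commit["remove"]
    return live


print(sorted(files_as_of(LOG, 1)))
```

This only works while the superseded files still exist in storage, which is why old data files are retained for a while instead of being deleted the moment they are replaced.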

Schema enforcement and evolution

The log stores the table's schema. A write whose columns don't match is rejected by default — the wall that stops a lake from rotting into a swamp. When you genuinely need to change shape, schema evolution is an explicit, logged operation (add a column, widen a type) rather than an accident a stray job inflicts on everyone downstream.
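As a rough illustration of reject-by-default with opt-in evolution, here is a small Python check. The `check_write` function, the `allow_new_columns` flag, and the type names are assumptions for this sketch, not Delta's API, and type widening is left out to keep it short.

```python
TABLE_SCHEMA = {"id": "long", "email": "string"}


class SchemaMismatch(Exception):
    """Raised when a write does not fit the table's recorded schema."""


def check_write(table_schema, write_schema, allow_new_columns=False):
    """Return the table schema after the write, or raise if it is rejected."""
    for column, dtype in write_schema.items():
        if column in table_schema and table_schema[column] != dtype:
            raise SchemaMismatch(f"{column}: expected {table_schema[column]}, got {dtype}")
    new_columns = {c: t for c, t in write_schema.items() if c not in table_schema}
    if new_columns and not allow_new_columns:
        raise SchemaMismatch(f"unexpected columns: {sorted(new_columns)}")
    # Evolution is opt-in; the merged schema would be recorded with the commit.
    return {**table_schema, **new_columns}


evolved = check_write(
    TABLE_SCHEMA,
    {"id": "long", "email": "string", "plan": "string"},
    allow_new_columns=True,
)
print(evolved)
```

The same write without `allow_new_columns=True` raises `SchemaMismatch`, which is the point: changing the table's shape has to be a deliberate choice by the writer.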

Performant updates, deletes, and merges

Plain data lakes are append-only in practice — updating one row meant rewriting a whole partition by hand. With a transaction log you get real UPDATE, DELETE, and MERGE (upsert). They work by writing new files for the affected data and committing a swap of old-for-new. That single capability is what makes CDC ingestion, GDPR "delete this user," and slowly-changing dimensions tractable on the lake — the things you used to flee to the warehouse for.

-- upsert from a CDC feed, atomically, directly on lake files
MERGE INTO customers t
USING changes s ON t.id = s.id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
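
The "write new files, then commit a swap" mechanic behind a merge can be sketched in plain Python. This is a toy copy-on-write model under my own assumptions (files as dicts of id to row, a hypothetical `merge_copy_on_write` helper), not how any engine actually plans a MERGE.

```python
def merge_copy_on_write(files, changes, next_file_id):
    """Upsert `changes` (id -> row) into `files` (file name -> {id: row}).

    Returns (new_files, commit). Existing files are never modified in place;
    the commit lists which files are swapped out and which are swapped in.
    """
    remaining = dict(changes)
    new_files, removed, added = dict(files), [], []
    for name, rows in files.items():
        touched = rows.keys() & remaining.keys()
        if not touched:
            continue  # untouched files stay in the table as-is
        rewritten = {**rows, **{k: remaining.pop(k) for k in touched}}
        new_name = f"part-{next_file_id:04d}.parquet"
        next_file_id += 1
        del new_files[name]
        new_files[new_name] = rewritten
        removed.append(name)
        added.append(new_name)
    if remaining:  # ids that matched nothing are inserts, written to a fresh file
        new_name = f"part-{next_file_id:04d}.parquet"
        new_files[new_name] = remaining
        added.append(new_name)
    return new_files, {"remove": removed, "add": added}


files = {
    "part-0001.parquet": {1: "ann", 2: "bob"},
    "part-0002.parquet": {3: "cy"},
}
new_files, commit = merge_copy_on_write(files, {2: "robert", 4: "dee"}, next_file_id=3)
print(commit)
```

Only the file containing id 2 is rewritten; part-0002.parquet is carried over untouched, and readers see either the old file set or the new one, never a mix.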

The medallion pattern: how teams actually organize a lakehouse

Capabilities don't make an architecture; conventions do. The pattern that emerged around the lakehouse is the medallion (bronze/silver/gold) layout — refining data in stages, each a set of transactional tables.

| Layer | Contents | Role |
| --- | --- | --- |
| Bronze | Raw, as-ingested data: appended, schema-on-read-ish but transactional | The immutable landing zone and replay source; never lose the original |
| Silver | Cleaned, conformed, de-duplicated, joined entities | The trustworthy, queryable core most analytics build on |
| Gold | Business-level aggregates and serving tables for BI/ML | Fast, curated, the layer dashboards and models actually read |

It's the same raw-to-refined flow the warehouse world has always used — the difference is that all three layers live as open tables on the same cheap storage, queryable by the same engine, with no copy-into-a-separate-system step between the lake and "the warehouse." The warehouse layer didn't vanish; it became the gold tables.
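The staged refinement can be shown in miniature with plain Python collections standing in for tables. The order data, column names, and the `to_silver`/`to_gold` helpers are all hypothetical; in a real lakehouse each stage would be a transactional table and each function a job writing to it.

```python
from collections import defaultdict

# Bronze: raw events exactly as ingested, duplicates and bad rows included.
bronze = [
    {"order_id": 1, "customer": "ann", "amount": "20.00"},
    {"order_id": 1, "customer": "ann", "amount": "20.00"},  # duplicate delivery
    {"order_id": 2, "customer": "bob", "amount": "n/a"},    # unparseable amount
    {"order_id": 3, "customer": "ann", "amount": "5.50"},
]


def to_silver(rows):
    """Clean and de-duplicate: one typed row per order_id, bad rows dropped."""
    silver = {}
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # the raw row is still kept in bronze for later repair
        silver[row["order_id"]] = {"customer": row["customer"], "amount": amount}
    return silver


def to_gold(silver):
    """Aggregate to a serving table: total spend per customer."""
    totals = defaultdict(float)
    for row in silver.values():
        totals[row["customer"]] += row["amount"]
    return dict(totals)


silver = to_silver(bronze)
gold = to_gold(silver)
print(gold)
```

Because bronze keeps the original rows, a fix to the cleaning logic can be replayed from the landing zone without re-ingesting anything.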

An honest look at the trade-offs

I'm enthusiastic, not evangelical. In 2021 the lakehouse is a strong direction with real, current limitations, and pretending otherwise sets teams up to be disappointed.

It is not yet a free win on raw query speed. A mature warehouse — a well-tuned Snowflake or a columnar MPP engine — still beats a lakehouse on many concurrent low-latency BI queries today, because decades of work went into its storage layout, caching, and vectorized execution. Lakehouse engines are closing this fast (vectorized native execution engines are arriving), but if your only need is sub-second dashboards over modest, highly-structured data, don't rip out a working warehouse on principle. The lakehouse wins biggest when you have large, varied data feeding both BI and ML and are paying the two-tier tax.

Two more realities. Small files are the lake's eternal nemesis: streaming or frequent writes produce many tiny Parquet files that wreck read performance, so you need periodic compaction (Delta's OPTIMIZE) and care about file sizing; the transaction log makes this safe but doesn't make it automatic. And the format landscape is contested: Delta Lake, Apache Iceberg, and Apache Hudi all implement the same core idea (a metadata/transaction layer over open files) with different designs and governance, and in 2021 you have to bet on one. They share a philosophy; they are not interchangeable.
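To give a feel for what compaction has to decide, here is a greedy grouping sketch in Python. It is not the algorithm behind Delta's OPTIMIZE; `plan_compaction`, the file names, and the sizes (in MB) are assumptions made up for the example.

```python
def plan_compaction(file_sizes, target_size):
    """Group small files into bins of at most roughly `target_size`.

    Files already at or above the target are left alone. Returns a list of
    groups, each a list of file names to rewrite together as one larger file.
    """
    small = sorted((size, name) for name, size in file_sizes.items() if size < target_size)
    groups, current, current_size = [], [], 0
    for size, name in small:
        if current and current_size + size > target_size:
            groups.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        groups.append(current)
    # A group of one gains nothing from being rewritten, so drop it.
    return [group for group in groups if len(group) > 1]


sizes_mb = {"a": 10, "b": 20, "c": 200, "d": 30, "e": 90}
plan = plan_compaction(sizes_mb, target_size=128)
print(plan)
```

Each group then becomes one ordinary commit that removes the small files and adds the compacted one, so readers never see a half-compacted table.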

How it compares, in one table

| Capability | Plain data lake | Data warehouse | Lakehouse |
| --- | --- | --- | --- |
| Storage cost | Low (object storage) | High (proprietary) | Low (object storage) |
| Open formats | Yes | No (mostly closed) | Yes (Parquet + open log) |
| ACID transactions | No | Yes | Yes (via the log) |
| Schema enforcement | No | Yes | Yes |
| BI / SQL performance | Poor | Excellent | Good and improving |
| Direct ML / file access | Yes | Awkward | Yes |
| Single copy of data | | | Yes (the point) |

What to carry away

The lakehouse is one architectural move: add a transaction log over the open files you already keep in object storage, and a directory of Parquet stops being a fragile data swamp and starts behaving like a managed table — atomic commits, snapshot isolation, time travel, schema enforcement, and real updates and merges. Delta Lake (along with Iceberg and Hudi) is that log. The payoff is collapsing the two-tier lake-plus-warehouse stack into one system, killing the second copy, the sync job, and the argument over which number is right.

In 2021 it's not a clean replacement for every warehouse: raw BI latency still favors mature warehouses, small files need tending, and the format wars haven't settled. But the direction is unmistakable, and the reason it works is almost humble: not a new engine or a new file format, just an ordered list of which files count, kept honest. For platforms straddling analytics and ML, that's the difference between maintaining two truths and trusting one. It's the foundation a lot of what comes next is built on, including the transformation workflows teams run on top of these tables.