Change Data Capture with Debezium: Log-Based Capture and the Outbox Pattern

Sooner or later every data platform hits the same wall: you need to know, in something close to real time, what changed in a database, whether to keep a search index fresh, invalidate a cache, feed a data lake, or let one microservice react to another's data. The naive answer is to poll: run SELECT * FROM some_table WHERE updated_at > :last_run on a schedule. It works just badly enough to ship, and then it betrays you. Log-based Change Data Capture with Debezium is the better answer, and it's worth understanding why it beats the polling you'll be tempted to write first.

Why log-based beats query-based CDC

Query-based CDC — polling a timestamp or version column — has three structural flaws that no amount of tuning fixes:

  • It misses deletes. A deleted row simply isn't in the next query result. You can't poll for the absence of a row, so deletes silently never propagate (the classic workaround, soft-delete flags, is a leak you now maintain forever).
  • It misses intermediate states. If a row changes three times between polls, you see only the final value — the intermediate transitions are lost. For an event stream, those transitions are often the point.
  • It loads the source and lags. Polling puts query load on the production database, and your freshness is bounded by the poll interval — poll more often to reduce lag, pay more load.

Log-based CDC sidesteps all three by reading the database's own transaction log — the write-ahead log every relational database already maintains for durability and replication. The log records every committed change, in order, including deletes and every intermediate state. Reading it imposes almost no load (it's the same mechanism a read replica uses) and gives near-real-time latency. You're not asking the database what changed; you're reading the authoritative record it already writes.
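To make the contrast concrete, here is a minimal sketch in Python. ToyDatabase, poll, and the integer clock are hypothetical stand-ins invented for illustration, not a real database or any Debezium API. The same five writes are observed once through a timestamp poll and once through the ordered log.

```python
from dataclasses import dataclass, field


@dataclass
class ToyDatabase:
    """In-memory stand-in for a table plus its transaction log (illustrative only)."""
    rows: dict = field(default_factory=dict)  # id -> (value, updated_at)
    log: list = field(default_factory=list)   # ordered (op, id, value) records
    clock: int = 0

    def upsert(self, row_id, value):
        self.clock += 1
        op = "u" if row_id in self.rows else "c"
        self.rows[row_id] = (value, self.clock)
        self.log.append((op, row_id, value))

    def delete(self, row_id):
        self.clock += 1
        del self.rows[row_id]
        self.log.append(("d", row_id, None))


def poll(db, last_run):
    """Query-based capture: rows whose updated_at is newer than the last poll."""
    return sorted((rid, val) for rid, (val, ts) in db.rows.items() if ts > last_run)


db = ToyDatabase()
db.upsert(1, "pending")
db.upsert(1, "paid")
db.upsert(1, "shipped")
db.upsert(2, "pending")
db.delete(2)

polled = poll(db, last_run=0)        # what a timestamp poll sees
logged_ops = [op for op, _, _ in db.log]  # what a log reader sees

print(polled)
print(logged_ops)
```

The poll returns only the final state of row 1 and never learns that row 2 existed, while the log holds all five changes in commit order, including the delete.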

What Debezium is

Debezium is a set of source connectors that read these transaction logs and emit change events. Each connector speaks its database's native log protocol:

Database                      Log mechanism Debezium reads
MySQL / MariaDB               The binary log (binlog)
PostgreSQL                    The write-ahead log (WAL) via logical decoding
MongoDB                       The replication oplog / change streams
SQL Server, Oracle, others    Their respective transaction-log / CDC interfaces

Debezium most commonly runs on Kafka Connect, the framework purpose-built for moving data in and out of Kafka. Connect gives Debezium the things you'd otherwise have to build: distributed workers, offset tracking (so it remembers its position in the log and resumes without re-reading), restarts, and scaling. Debezium reads the log; Connect writes the resulting change events to Kafka topics — typically one topic per table.
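As a rough illustration of what registering a connector involves, the sketch below builds the JSON document you would submit to Kafka Connect for a PostgreSQL source. The hostname, credentials, database name, slot name, and connector name are all hypothetical placeholders. The property names follow the Debezium PostgreSQL connector as I understand it (topic.prefix is the Debezium 2.x name; earlier releases used database.server.name), so check them against the documentation for your version.

```python
import json

# Hypothetical connection details: every value below is a placeholder.
connector = {
    "name": "orders-connector",
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "plugin.name": "pgoutput",            # logical decoding output plugin
        "database.hostname": "db.internal.example",
        "database.port": "5432",
        "database.user": "debezium",
        "database.password": "change-me",
        "database.dbname": "shop",
        "topic.prefix": "shop",               # prefix for the per-table topic names
        "table.include.list": "public.orders",
        "slot.name": "debezium_orders",       # replication slot the connector uses
    },
}

payload = json.dumps(connector, indent=2)
print(payload)
```

A document like this is typically sent with an HTTP POST to the Connect worker's /connectors endpoint. With these assumed settings, changes to public.orders would land in a topic named shop.public.orders.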

graph LR
    APP["Application writes"]
    DB[("Source database<br/>+ transaction log<br/>(binlog / WAL)")]
    DBZ["Debezium connector<br/>on Kafka Connect<br/>(reads the log, tracks offset)"]
    K["Kafka topics<br/>(one per table,<br/>change events)"]
    C1["Sink: search index"]
    C2["Sink: cache invalidation"]
    C3["Sink: data lake / warehouse"]
    APP --> DB
    DB -->|"transaction log"| DBZ
    DBZ --> K
    K --> C1
    K --> C2
    K --> C3

The Debezium pipeline. The application writes to the database as normal; Debezium tails the transaction log (negligible load), and Kafka Connect publishes ordered change events to per-table Kafka topics. Many independent consumers then react to the same change stream — each at its own pace, replayable from any offset. The database stays the single source of truth; everything downstream derives from its log.

The shape of a change event

Each event Debezium emits describes one row change with a consistent envelope: the operation (c create, u update, d delete, r read-during-snapshot), the row state before and after the change, and source metadata (table, log position, transaction, timestamp). The before/after pair is what makes it so much richer than polling: a consumer sees exactly what changed, and deletes arrive as first-class events with the deleted row in before. (How much of the old row appears in before depends on source settings, such as REPLICA IDENTITY on PostgreSQL.)

{
  "op": "u",
  "before": { "id": 42, "status": "pending", "total": 100 },
  "after":  { "id": 42, "status": "paid",    "total": 100 },
  "source": { "table": "orders", "lsn": "0/1A2B3C", "ts_ms": 1660000000000 }
}
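A consumer's handling of this envelope can be sketched in a few lines of Python. The apply_change function and the dict-based replica are hypothetical, and the sketch assumes the simplified event shape shown above with id as the primary key. Real Debezium events carry more fields and put the key in the Kafka message key.

```python
import json


def apply_change(replica, event):
    """Apply one simplified change event to a dict keyed by primary key."""
    op = event["op"]
    if op in ("c", "u", "r"):
        # Creates, updates, and snapshot reads all carry the new row in "after".
        row = event["after"]
        replica[row["id"]] = row
    elif op == "d":
        # Deletes carry the removed row in "before"; "after" is null.
        replica.pop(event["before"]["id"], None)
    else:
        raise ValueError(f"unknown op: {op!r}")
    return replica


update_event = json.loads("""
{
  "op": "u",
  "before": { "id": 42, "status": "pending", "total": 100 },
  "after":  { "id": 42, "status": "paid",    "total": 100 },
  "source": { "table": "orders", "lsn": "0/1A2B3C", "ts_ms": 1660000000000 }
}
""")
delete_event = {
    "op": "d",
    "before": {"id": 42, "status": "paid", "total": 100},
    "after": None,
}

replica = {42: {"id": 42, "status": "pending", "total": 100}}
apply_change(replica, update_event)
status_after_update = replica[42]["status"]
apply_change(replica, delete_event)
print(status_after_update, replica)
```

The delete is handled as an ordinary event rather than inferred from a missing row, which is exactly what polling cannot offer.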

Snapshot, then stream

When a connector starts against an existing database, the log only contains recent changes — it won't reconstruct rows that were written long ago. So Debezium begins with an initial snapshot: it reads the current contents of the tables (emitting them as read events) to establish a complete baseline, then seamlessly switches to streaming from the log position captured at snapshot start. The handoff is the delicate part — done right, you get a complete picture with no gap and no duplication of the boundary. (Modern Debezium also supports incremental snapshots, so you can snapshot without a long blocking pause and even re-snapshot specific tables on demand.)
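The handoff can be modelled as a toy in Python. This is a simplification under stated assumptions, not Debezium's actual implementation: snapshot_then_stream is a hypothetical function, log positions are plain integers, and the table contents are assumed to be a consistent view as of the recorded snapshot position.

```python
def snapshot_then_stream(table_rows, log, snapshot_position):
    """Yield read events for the baseline, then log events after the snapshot position."""
    for row in table_rows:
        yield {"op": "r", "before": None, "after": row}
    for position, event in log:
        # Changes at or before the snapshot position are already reflected in
        # the baseline, so only strictly later ones are streamed.
        if position > snapshot_position:
            yield event


log = [
    (1, {"op": "c", "before": None, "after": {"id": 1, "status": "pending"}}),
    (2, {"op": "c", "before": None, "after": {"id": 2, "status": "pending"}}),
    (3, {"op": "u", "before": {"id": 1, "status": "pending"},
         "after": {"id": 1, "status": "paid"}}),
    (4, {"op": "d", "before": {"id": 2, "status": "pending"}, "after": None}),
]
# Table contents as of log position 2, when the snapshot was taken.
rows_at_snapshot = [{"id": 1, "status": "pending"}, {"id": 2, "status": "pending"}]

events = list(snapshot_then_stream(rows_at_snapshot, log, snapshot_position=2))
ops = [e["op"] for e in events]
print(ops)
```

Filtering on the recorded position is what prevents both failure modes at the boundary: nothing after the snapshot is skipped, and nothing already in the baseline is replayed.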

The outbox pattern: solving the dual-write problem

Here's where CDC stops being merely an ingestion tool and becomes an architectural pattern. In microservices, a service often needs to both update its database and tell the rest of the system about it (publish an event). The obvious approach — write to the DB, then publish to Kafka — has a lethal flaw called the dual-write problem: those are two separate systems with no shared transaction, so a crash between them leaves you with the database updated but no event published (or vice versa). Your systems silently diverge, and it's nearly impossible to debug after the fact.

The outbox pattern fixes this elegantly by using the one transaction you do have. The service writes its business change and an event row into an outbox table in the same local database transaction. Either both commit or neither does — atomicity guaranteed by the database. Debezium then captures inserts into the outbox table from the log and publishes them to Kafka. The dual write becomes a single atomic write, and CDC does the reliable publishing.

graph TD
    subgraph TXN["ONE database transaction (atomic)"]
        BIZ["UPDATE orders SET status='paid'"]
        OUT["INSERT INTO outbox (event: OrderPaid)"]
    end
    LOG[("Transaction log")]
    DBZ["Debezium (outbox connector)"]
    K["Kafka topic: order events"]
    BIZ --> LOG
    OUT --> LOG
    LOG --> DBZ --> K
          

The outbox pattern. The business change and its event are written in a single transaction, so they commit together or not at all, with no dual-write divergence. Debezium streams the committed outbox rows to Kafka, so every committed event is recorded exactly once at the source and published reliably (at least once) without a distributed transaction. Debezium even ships an outbox event router to reshape these rows into clean domain events.
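The write side of the pattern can be sketched with Python's built-in sqlite3 module standing in for the service's database. The table layout, column names, and the mark_paid function are illustrative assumptions, not the schema Debezium's outbox event router expects. The point is only that both statements share one transaction.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT NOT NULL)")
conn.execute(
    "CREATE TABLE outbox ("
    " id INTEGER PRIMARY KEY AUTOINCREMENT,"
    " aggregate_id INTEGER NOT NULL,"
    " event_type TEXT NOT NULL,"
    " payload TEXT NOT NULL)"
)
conn.execute("INSERT INTO orders (id, status) VALUES (42, 'pending')")
conn.commit()


def mark_paid(conn, order_id, fail_before_commit=False):
    """Update the order and record its event in one local transaction."""
    with conn:  # commits on success, rolls back if an exception escapes
        conn.execute("UPDATE orders SET status = 'paid' WHERE id = ?", (order_id,))
        conn.execute(
            "INSERT INTO outbox (aggregate_id, event_type, payload) VALUES (?, ?, ?)",
            (order_id, "OrderPaid", json.dumps({"order_id": order_id})),
        )
        if fail_before_commit:
            raise RuntimeError("simulated crash before commit")


def snapshot(conn):
    """Return (order status, number of outbox rows) for inspection."""
    status = conn.execute("SELECT status FROM orders WHERE id = 42").fetchone()[0]
    count = conn.execute("SELECT COUNT(*) FROM outbox").fetchone()[0]
    return status, count


try:
    mark_paid(conn, 42, fail_before_commit=True)
except RuntimeError:
    pass
after_failure = snapshot(conn)   # neither write survived the rollback

mark_paid(conn, 42)
after_success = snapshot(conn)   # both writes committed together

print(after_failure, after_success)
```

There is no state in which the order is paid but the event row is missing, or the reverse. Publishing is then left entirely to the CDC connector reading the committed outbox inserts from the log.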

Delivery guarantees and the consumer's job

Debezium provides at-least-once delivery: it tracks its log offset, but a crash between emitting an event and committing the offset means that event can be re-delivered on restart. In practice you also get duplicates around connector restarts and rebalances. This isn't a defect — it's the honest guarantee of the underlying transport, the same one any Kafka-based pipeline lives with. The implication is firm: downstream consumers must be idempotent. Because every change event carries the row's primary key and a monotonic log position, that's very achievable — upsert by key, ignore events older than the latest applied position, and duplicates become harmless.
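One way to get that idempotence is sketched below in Python. apply_idempotent and the two dicts are hypothetical, and positions are modelled as plain integers that increase with log order. A real consumer would persist both structures, ideally in the same transaction as the write.

```python
def apply_idempotent(state, positions, key, position, row):
    """Upsert `row` under `key` unless an equal or newer position was already applied."""
    if position <= positions.get(key, -1):
        return False  # duplicate or stale redelivery: ignore it
    # The position is kept even for deletes, so a stale event cannot resurrect a row.
    positions[key] = position
    if row is None:
        state.pop(key, None)
    else:
        state[key] = row
    return True


state, positions = {}, {}
deliveries = [
    (42, 10, {"id": 42, "status": "pending"}),
    (42, 11, {"id": 42, "status": "paid"}),
    (42, 11, {"id": 42, "status": "paid"}),     # redelivered after a restart
    (42, 10, {"id": 42, "status": "pending"}),  # stale replay
]
applied = [apply_idempotent(state, positions, k, p, r) for k, p, r in deliveries]
print(applied, state)
```

The redelivered and stale events are dropped without changing the result, so replaying any suffix of the stream leaves the target in the same state.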

Operational realities to plan for: the source database must be configured for log access (on MySQL, the binlog enabled in row format with full row images, i.e. binlog_format = ROW and binlog_row_image = FULL; on Postgres, wal_level = logical). On Postgres, Debezium also uses a replication slot, and the database retains WAL for that slot until the connector confirms it: if Debezium falls behind or stops, that WAL accumulates and can fill the disk. Monitor slot/connector lag like you'd monitor a replica. Schema changes flow too: Debezium tracks DDL and adjusts event schemas, which is exactly why pairing it with a schema registry (so consumers handle evolution gracefully) is standard practice.
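Slot lag on Postgres is usually expressed as the byte distance between the server's current WAL position and the position the slot has confirmed. The Python sketch below computes that distance from two LSN strings in the X/Y hexadecimal form Postgres prints. The function names, sample values, and the 16 MiB alert threshold are assumptions for illustration. In practice the inputs would come from pg_current_wal_lsn() and the confirmed_flush_lsn column of pg_replication_slots.

```python
def parse_lsn(lsn):
    """Convert a PostgreSQL LSN such as '0/1A2B3C' to an absolute byte position."""
    high, low = lsn.split("/")
    # An LSN is a 64-bit position printed as two 32-bit hexadecimal halves.
    return (int(high, 16) << 32) | int(low, 16)


def slot_lag_bytes(current_wal_lsn, confirmed_flush_lsn):
    """Bytes of WAL the server must retain because the slot has not confirmed them."""
    return parse_lsn(current_wal_lsn) - parse_lsn(confirmed_flush_lsn)


ALERT_THRESHOLD_BYTES = 16 * 1024 * 1024  # hypothetical alerting threshold

lag = slot_lag_bytes("0/3000000", "0/1000000")
should_alert = lag > ALERT_THRESHOLD_BYTES
print(lag, should_alert)
```

Tracking this number over time, rather than checking it once, is what reveals a connector that has stalled while the database keeps writing.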

What to carry away

Log-based CDC reads the database's own transaction log instead of polling it, which is why it captures deletes and every intermediate state, imposes almost no load, and stays near-real-time — the three things query-based CDC can't do. Debezium implements this as Kafka Connect source connectors that snapshot, then stream, emitting rich before/after change events to per-table topics for any number of independent consumers. And the outbox pattern turns CDC into the reliable backbone of event-driven microservices by collapsing the dual-write problem into one atomic database transaction.

Build with the guarantee in mind — make consumers idempotent, watch connector and replication-slot lag, plan for schema evolution — and CDC becomes the quiet, dependable nervous system that keeps caches, indexes, lakes, and services all in sync with the source of truth, without anyone writing another polling loop.