# Change Data Capture with Debezium: Log-Based Capture and the Outbox Pattern

Sooner or later every data platform hits the same wall: you need to know, in something close to real time, what changed in a database, whether to keep a search index fresh, invalidate a cache, feed a data lake, or let one microservice react to another's data. The naive answer is to poll: `SELECT * FROM orders WHERE updated_at > :last_run` on a schedule. It works just badly enough to ship, and then it betrays you. **Change Data Capture** done properly (log-based, with Debezium) is the better answer, and it's worth understanding why it beats the polling you'll be tempted to write first.

## Why log-based beats query-based CDC

Query-based CDC — polling a timestamp or version column — has three structural flaws that no amount of tuning fixes:

- **It misses deletes.** A deleted row simply isn't in the next query result. You can't poll for the absence of a row, so deletes silently never propagate (the classic workaround, soft-delete flags, is a leak you now maintain forever).

- **It misses intermediate states.** If a row changes three times between polls, you see only the final value — the intermediate transitions are lost. For an event stream, those transitions are often the point.

- **It loads the source and lags.** Polling puts query load on the production database, and your freshness is bounded by the poll interval — poll more often to reduce lag, pay more load.
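The first two flaws are easy to reproduce. The following is a minimal sketch, assuming a hypothetical `orders` table with a `version` column and using an in-memory SQLite database as a stand-in for the production source; the `poll` helper is illustrative, not part of any library:

```python
import sqlite3

def poll(conn, last_seen):
    """Query-based CDC: fetch rows whose version moved past the previous poll."""
    rows = conn.execute(
        "SELECT id, status, version FROM orders WHERE version > ? ORDER BY version",
        (last_seen,),
    ).fetchall()
    new_last = max((r[2] for r in rows), default=last_seen)
    return rows, new_last

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT, version INTEGER)")

conn.execute("INSERT INTO orders VALUES (1, 'pending', 1)")
conn.execute("INSERT INTO orders VALUES (2, 'pending', 2)")
first_poll, last_seen = poll(conn, 0)

# Between polls: order 1 passes through 'paid' and 'shipped', order 2 is deleted.
conn.execute("UPDATE orders SET status = 'paid', version = 3 WHERE id = 1")
conn.execute("UPDATE orders SET status = 'shipped', version = 4 WHERE id = 1")
conn.execute("DELETE FROM orders WHERE id = 2")
second_poll, last_seen = poll(conn, last_seen)

print(second_poll)  # [(1, 'shipped', 4)]
```

The second poll never sees the `paid` state, and nothing in the result says that order 2 is gone.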

**Log-based CDC** sidesteps all three by reading the database's own **transaction log** — the write-ahead log every relational database already maintains for durability and replication. The log records *every* committed change, in order, including deletes and every intermediate state. Reading it imposes almost no load (it's the same mechanism a read replica uses) and gives near-real-time latency. You're not asking the database what changed; you're reading the authoritative record it already writes.

## What Debezium is

Debezium is a set of **source connectors** that read these transaction logs and emit change events. Each connector speaks its database's native log protocol:

| Database | Log mechanism Debezium reads |
| --- | --- |
| MySQL / MariaDB | The binary log (binlog) |
| PostgreSQL | The write-ahead log (WAL), via logical decoding |
| MongoDB | The replication oplog / change streams |
| SQL Server, Oracle, others | Their respective transaction-log / CDC interfaces |

Debezium most commonly runs on **Kafka Connect**, the framework purpose-built for moving data in and out of Kafka. Connect gives Debezium the things you'd otherwise have to build: distributed workers, offset tracking (so it remembers its position in the log and resumes without re-reading), restarts, and scaling. Debezium reads the log; Connect writes the resulting change events to Kafka topics — typically one topic per table.

```mermaid
graph LR
    APP["Application writes"]
    DB[("Source database<br/>+ transaction log<br/>(binlog / WAL)")]
    DBZ["Debezium connector<br/>on Kafka Connect<br/>(reads the log, tracks offset)"]
    K["Kafka topics<br/>(one per table,<br/>change events)"]
    C1["Sink: search index"]
    C2["Sink: cache invalidation"]
    C3["Sink: data lake / warehouse"]
    APP --> DB
    DB -->|"transaction log"| DBZ
    DBZ --> K
    K --> C1
    K --> C2
    K --> C3
          
```

The Debezium pipeline. The application writes to the database as normal; Debezium tails the transaction log (negligible load), and Kafka Connect publishes ordered change events to per-table Kafka topics. Many independent consumers then react to the same change stream — each at its own pace, replayable from any offset. The database stays the single source of truth; everything downstream derives from its log.
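To make the Kafka Connect side concrete, here is a hedged sketch of the JSON body you might send to the Connect REST API (`POST /connectors`) to register a PostgreSQL connector. The hostnames, credentials, database, and slot names are all hypothetical placeholders, and the exact property set depends on the Debezium version (`topic.prefix` is the Debezium 2.x name; older releases used `database.server.name`), so treat it as a starting point to check against the connector documentation:

```python
import json

# Hypothetical connection details; every value here is a placeholder.
connector = {
    "name": "orders-connector",
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "plugin.name": "pgoutput",            # logical decoding output plugin
        "database.hostname": "db.example.internal",
        "database.port": "5432",
        "database.user": "debezium",
        "database.password": "change-me",
        "database.dbname": "shop",
        "topic.prefix": "shop",               # namespace for the emitted topics
        "table.include.list": "public.orders",
        "slot.name": "debezium_orders",       # replication slot the connector reads
    },
}

def topic_for(prefix, schema, table):
    """Default per-table topic name for the Postgres connector: prefix.schema.table."""
    return f"{prefix}.{schema}.{table}"

payload = json.dumps(connector, indent=2)  # body for POST /connectors
print(topic_for("shop", "public", "orders"))  # shop.public.orders
```

The sketch only builds the payload; actually sending it (for example with an HTTP client against the Connect worker) is left out on purpose.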

### The shape of a change event

Each event Debezium emits describes one row change with a consistent envelope: the operation (`c` create, `u` update, `d` delete, `r` read-during-snapshot), the row state `before` and `after` the change, and source metadata (table, log position, transaction, timestamp). The `before`/`after` pair is what makes it so much richer than polling — a consumer sees exactly what changed, and deletes arrive as first-class events with the deleted row in `before`.

```json
{
  "op": "u",
  "before": { "id": 42, "status": "pending", "total": 100 },
  "after":  { "id": 42, "status": "paid",    "total": 100 },
  "source": { "table": "orders", "lsn": "0/1A2B3C", "ts_ms": 1660000000000 }
}
```
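A consumer can turn that envelope into a precise description of what changed. The helper below is a small sketch (the name `changed_columns` is made up for this example) that compares `before` and `after` and returns only the columns whose values differ:

```python
def changed_columns(event):
    """Map each changed column to its (old, new) pair for one change event."""
    before = event.get("before") or {}   # null for creates and snapshot reads
    after = event.get("after") or {}     # null for deletes
    columns = list(before) + [c for c in after if c not in before]
    return {
        c: (before.get(c), after.get(c))
        for c in columns
        if before.get(c) != after.get(c)
    }

update = {
    "op": "u",
    "before": {"id": 42, "status": "pending", "total": 100},
    "after": {"id": 42, "status": "paid", "total": 100},
}
delete = {"op": "d", "before": {"id": 42, "status": "paid", "total": 100}, "after": None}

print(changed_columns(update))  # {'status': ('pending', 'paid')}
print(changed_columns(delete))
```

For the delete, every column shows up with `None` as its new value. One caveat worth checking for your source: on PostgreSQL, how much of `before` is populated depends on the table's `REPLICA IDENTITY` setting.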

## Snapshot, then stream

When a connector starts against an existing database, the log only contains *recent* changes, so it cannot reconstruct rows that were written long ago. Debezium therefore begins with an **initial snapshot**: it reads the current contents of the tables (emitting each row as an `r` read event) to establish a complete baseline, then switches to **streaming** from the log position captured at snapshot start. The handoff is the delicate part: done right, you get a complete picture with no gap and no duplicated events at the boundary. (Modern Debezium also supports incremental snapshots, so you can snapshot without a long blocking pause and even re-snapshot specific tables on demand.)
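The handoff logic can be illustrated with a toy model. This is not Debezium's implementation, only a sketch of the idea under simplifying assumptions: `ToyDatabase` is a hypothetical in-memory table plus an ordered log, and the snapshot returns the table contents together with the log position they correspond to.

```python
def apply(state, event):
    """Apply one (op, key, row) event to a dict that mirrors the table."""
    op, key, row = event
    if op == "d":
        state.pop(key, None)
    else:  # "c", "u", and snapshot "r" events all set the row
        state[key] = row

class ToyDatabase:
    def __init__(self):
        self.rows = {}
        self.log = []  # ordered change events, oldest first

    def write(self, op, key, row=None):
        event = (op, key, row)
        apply(self.rows, event)
        self.log.append(event)

    def consistent_snapshot(self):
        """Return the log position and the table contents as of that position."""
        return len(self.log), dict(self.rows)

db = ToyDatabase()
db.write("c", 1, {"status": "pending"})
db.write("c", 2, {"status": "pending"})

position, snapshot_rows = db.consistent_snapshot()

# Writes that commit while the snapshot is still being emitted.
db.write("u", 1, {"status": "paid"})
db.write("d", 2)

downstream = {}
for key, row in snapshot_rows.items():   # phase 1: snapshot as "r" events
    apply(downstream, ("r", key, row))
for event in db.log[position:]:          # phase 2: stream from the saved position
    apply(downstream, event)

print(downstream == db.rows)  # True
```

Because streaming resumes from the position recorded with the snapshot, the update and delete that happened mid-snapshot are neither lost nor applied twice.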

## The outbox pattern: solving the dual-write problem

Here's where CDC stops being merely an ingestion tool and becomes an architectural pattern. In microservices, a service often needs to both update its database *and* tell the rest of the system about it (publish an event). The obvious approach — write to the DB, then publish to Kafka — has a lethal flaw called the **dual-write problem**: those are two separate systems with no shared transaction, so a crash between them leaves you with the database updated but no event published (or vice versa). Your systems silently diverge, and it's nearly impossible to debug after the fact.

The **outbox pattern** fixes this elegantly by using the one transaction you *do* have. The service writes its business change *and* an event row into an `outbox` table **in the same local database transaction**. Either both commit or neither does — atomicity guaranteed by the database. Debezium then captures inserts into the outbox table from the log and publishes them to Kafka. The dual write becomes a single atomic write, and CDC does the reliable publishing.

```mermaid
graph TD
    subgraph TXN["ONE database transaction (atomic)"]
        BIZ["UPDATE orders SET status='paid'"]
        OUT["INSERT INTO outbox (event: OrderPaid)"]
    end
    LOG[("Transaction log")]
    DBZ["Debezium (outbox connector)"]
    K["Kafka topic: order events"]
    BIZ --> LOG
    OUT --> LOG
    LOG --> DBZ --> K
          
```

The outbox pattern. The business change and its event are written in a single transaction, so they commit together or not at all, with no dual-write divergence. Debezium streams the committed outbox rows to Kafka, so every committed event is published reliably without a distributed transaction. Delivery into Kafka is still at-least-once rather than exactly-once, so consumers deduplicate as described in the next section. Debezium even ships an outbox event router to reshape these rows into clean domain events.
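The write side of the pattern is just an ordinary local transaction. Here is a minimal sketch using SQLite as a stand-in database; the `outbox` column names and the `mark_paid` function are illustrative assumptions, not the schema Debezium's outbox event router expects by default:

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT NOT NULL)")
conn.execute(
    "CREATE TABLE outbox (id INTEGER PRIMARY KEY AUTOINCREMENT, "
    "aggregate_id INTEGER NOT NULL, type TEXT NOT NULL, payload TEXT NOT NULL)"
)
conn.execute("INSERT INTO orders VALUES (42, 'pending')")
conn.commit()

def mark_paid(conn, order_id, fail_before_commit=False):
    """Write the business change and its event in one local transaction."""
    with conn:  # commits on success, rolls back if the block raises
        conn.execute("UPDATE orders SET status = 'paid' WHERE id = ?", (order_id,))
        conn.execute(
            "INSERT INTO outbox (aggregate_id, type, payload) VALUES (?, ?, ?)",
            (order_id, "OrderPaid", json.dumps({"order_id": order_id})),
        )
        if fail_before_commit:
            raise RuntimeError("simulated crash before commit")

def state(conn):
    status = conn.execute("SELECT status FROM orders WHERE id = 42").fetchone()[0]
    events = conn.execute("SELECT COUNT(*) FROM outbox").fetchone()[0]
    return status, events

try:
    mark_paid(conn, 42, fail_before_commit=True)
except RuntimeError:
    pass
after_failure = state(conn)   # ('pending', 0): neither write survived

mark_paid(conn, 42)
after_success = state(conn)   # ('paid', 1): both writes committed together
print(after_failure, after_success)
```

The service never talks to Kafka here. Publishing is left entirely to the CDC pipeline reading the committed `outbox` inserts from the log.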

## Delivery guarantees and the consumer's job

Debezium provides **at-least-once** delivery: it tracks its log offset, but a crash between emitting an event and committing the offset means that event can be re-delivered on restart. In practice you also get duplicates around connector restarts and rebalances. This isn't a defect — it's the honest guarantee of the underlying transport, the same one any [Kafka](kafka-internals)-based pipeline lives with. The implication is firm: **downstream consumers must be idempotent**. Because every change event carries the row's primary key and a monotonic log position, that's very achievable — upsert by key, ignore events older than the latest applied position, and duplicates become harmless.
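A sketch of that consumer logic follows. It assumes the primary key column is `id` and that the log position arrives as a PostgreSQL-style `high/low` hex string, as in the example event earlier; real events may encode the position differently, and the `IdempotentSink` class is hypothetical:

```python
def lsn_to_int(lsn):
    """Convert a PostgreSQL-style 'high/low' hex LSN into a comparable integer."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)

class IdempotentSink:
    """Upsert by key and skip any event at or before the last applied position."""

    def __init__(self):
        self.rows = {}
        self.applied = {}  # key -> position of the last event applied for that key

    def handle(self, event):
        row = event["before"] if event["op"] == "d" else event["after"]
        key = row["id"]
        position = lsn_to_int(event["source"]["lsn"])
        if position <= self.applied.get(key, -1):
            return False  # duplicate or stale event: ignore it
        if event["op"] == "d":
            self.rows.pop(key, None)
        else:
            self.rows[key] = event["after"]
        self.applied[key] = position
        return True

paid = {"op": "u", "before": {"id": 42, "status": "pending"},
        "after": {"id": 42, "status": "paid"}, "source": {"lsn": "0/1A2B3C"}}
stale = {"op": "c", "before": None,
         "after": {"id": 42, "status": "pending"}, "source": {"lsn": "0/1A2B00"}}

sink = IdempotentSink()
results = [sink.handle(paid), sink.handle(paid), sink.handle(stale)]
print(results, sink.rows)
```

Only the first delivery changes state; the redelivered event and the older one are dropped. In a real sink the `applied` map would live in the same store as the rows, so that both are updated atomically.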

**Operational realities to plan for:** the source database must be configured for log access (row-based binary logging on MySQL, i.e. `binlog_format = ROW` with full row images; `wal_level = logical` on Postgres). On Postgres, Debezium reads through a replication slot, and the database *retains WAL* until the slot's consumer confirms it; if Debezium falls behind or stops, that WAL accumulates and can fill the disk. Monitor slot and connector lag like you'd monitor a replica. Schema changes flow too: Debezium tracks DDL and adjusts event schemas, which is exactly why pairing it with a schema registry (so consumers handle evolution gracefully) is standard practice.
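One way to watch for WAL retention on Postgres is to query `pg_replication_slots` periodically and alert when a slot holds back too much log. The sketch below keeps the SQL as a string and tests only the alerting logic; the `SLOT_LAG_SQL` constant, the `slots_over_budget` helper, and the 1 GiB budget are assumptions for illustration, and running the query with a database driver is left out:

```python
# Bytes of WAL each replication slot is forcing the server to keep (PostgreSQL 10+).
SLOT_LAG_SQL = """
SELECT slot_name,
       pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS retained_bytes
FROM pg_replication_slots
"""

def slots_over_budget(rows, max_bytes):
    """Return (slot_name, retained_bytes) pairs that exceed the byte budget.

    `rows` is the result of SLOT_LAG_SQL; retained_bytes can be NULL (None)
    for a slot that has not reserved any WAL yet.
    """
    return [
        (name, int(retained))
        for name, retained in rows
        if retained is not None and int(retained) > max_bytes
    ]

# Hypothetical query results for two slots.
sample_rows = [("debezium_orders", 5 * 1024**3), ("debezium_users", 12 * 1024**2)]
alerts = slots_over_budget(sample_rows, max_bytes=1024**3)
print(alerts)  # [('debezium_orders', 5368709120)]
```

A slot that keeps appearing in `alerts` usually means the connector is stopped or cannot keep up, which is the signal to act before the disk fills.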

## What to carry away

Log-based CDC reads the database's own transaction log instead of polling it, which is why it captures deletes and every intermediate state, imposes almost no load, and stays near-real-time — the three things query-based CDC can't do. **Debezium** implements this as Kafka Connect source connectors that snapshot, then stream, emitting rich `before`/`after` change events to per-table topics for any number of independent consumers. And the **outbox pattern** turns CDC into the reliable backbone of event-driven microservices by collapsing the dual-write problem into one atomic database transaction.

Build with the guarantee in mind — make consumers idempotent, watch connector and replication-slot lag, plan for schema evolution — and CDC becomes the quiet, dependable nervous system that keeps caches, indexes, lakes, and services all in sync with the source of truth, without anyone writing another polling loop.
