Running Kafka in Production: Consumer Groups, Lag, DLQs, and Hard Lessons

Getting a Kafka pipeline to work takes an afternoon. The producer sends, the consumer receives, the demo dazzles. Getting one to keep working at 3am when a downstream API starts timing out, a deploy triggers a rebalance, and one malformed message wedges an entire partition — that takes scars. I've collected a few. This is the operational Kafka the getting-started guides skip: the consumer-group behavior that surprises everyone, why consumer lag is the single metric I trust most, how to build dead-letter queues that don't quietly lose data, and what "exactly-once" actually buys you. None of it is about how Kafka works internally; it's about what bites you when you operate it.

Consumer groups: the part everyone gets wrong first

The thing to understand before anything else: a partition is consumed by exactly one consumer within a group at a time. That single rule explains most production behavior. Your unit of parallelism is the partition, full stop. If a topic has 12 partitions, you can usefully run up to 12 consumer instances in a group — each owns a slice of the partitions. Add a 13th and it sits idle, owning nothing, because there's no spare partition to give it.
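A toy model makes the rule concrete. This is a sketch, not Kafka's assignor code: `assign_round_robin` is a hypothetical helper that keeps only the invariant that matters, namely that every partition has exactly one owner within the group.

```python
def assign_round_robin(num_partitions: int, consumers: list[str]) -> dict[str, list[int]]:
    """Deal partitions out to consumers one at a time (toy assignor)."""
    assignment: dict[str, list[int]] = {c: [] for c in consumers}
    for partition in range(num_partitions):
        owner = consumers[partition % len(consumers)]
        assignment[owner].append(partition)
    return assignment


# 12 partitions, 13 consumers: the last consumer ends up owning nothing.
consumers = [f"consumer-{i}" for i in range(13)]
assignment = assign_round_robin(12, consumers)
idle = [c for c, parts in assignment.items() if not parts]
print(idle)  # ['consumer-12']
```

However the real assignor distributes partitions, the arithmetic is the same: with more consumers than partitions, the surplus consumers have nothing to own.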

This trips up every team at least once: they see lag climbing, scale the consumer deployment from 6 pods to 20, and lag doesn't budge — because the topic has 8 partitions and 12 of those pods are doing nothing. Throughput is bounded by partition count, not pod count. So the capacity-planning question isn't "how many consumers do I run," it's "did I create enough partitions up front," because you can add partitions but never remove them, and adding them breaks key-based ordering for keys that move to new partitions. Over-provision partitions deliberately at topic-creation time; it's the cheap decision you can't easily reverse later.
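To see why adding partitions disturbs key-based ordering, here is a rough sketch of hash-based partitioning. Kafka's default partitioner uses a murmur2 hash of the key; the MD5-based `partition_for` below is a stand-in I've assumed purely so the example runs on the standard library, so the exact partition numbers will not match a real cluster.

```python
import hashlib


def partition_for(key: str, num_partitions: int) -> int:
    # Stand-in for "hash(key) mod partition count". Not Kafka's murmur2.
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions


keys = [f"customer-{i}" for i in range(10_000)]

# Grow the topic from 8 to 12 partitions and count keys that now map elsewhere.
moved = sum(1 for k in keys if partition_for(k, 8) != partition_for(k, 12))
print(f"{moved / len(keys):.0%} of keys changed partition going from 8 to 12")
```

With a uniform hash, roughly two thirds of keys should land on a different partition after this particular change. New messages for those keys go to a partition that holds none of their history, so per-key ordering across the change is no longer guaranteed.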

graph TD
    subgraph TOPIC["Topic: orders (2 partitions)"]
        P0["Partition 0"]
        P1["Partition 1"]
    end
    subgraph GROUP["Consumer group: order-processor"]
        C1["Consumer A"]
        C2["Consumer B"]
        C3["Consumer C (idle, no spare partition)"]
    end
    P0 --> C1
    P1 --> C2

Two partitions, three consumers. Kafka assigns each partition to exactly one consumer in the group, so A and B each own one and C sits idle: there is no third partition to hand it. Scaling consumers past the partition count buys nothing. This is also why partition count is the throughput ceiling you must size correctly at creation, since you can add partitions but never take them away.

Rebalancing: necessary, and a foot-gun

When a consumer joins or leaves the group — a deploy, a crash, an autoscale event — Kafka rebalances: it reassigns partitions across the surviving members. Necessary, but historically brutal, because the classic "eager" protocol used a stop-the-world rebalance: every consumer gave up all its partitions, then everyone got new assignments. For the duration, your whole group stops processing. A rolling deploy of 10 pods could trigger 10 rebalances, each a full pause — and if a slow consumer takes too long to rejoin, the others keep rebalancing without it. People call these "rebalance storms," and they look exactly like an outage.

Two fixes that I now treat as defaults. First, the cooperative sticky assignor (available since Kafka 2.4): instead of revoking everything, only the partitions that actually need to move are revoked, and consumers keep processing the ones they retain. Rebalances become incremental, not stop-the-world. Second, static group membership (group.instance.id): give each consumer a stable identity so a quick restart — a rolling deploy, a brief blip — doesn't trigger a rebalance at all, because Kafka recognizes the returning member within the session timeout.

# The two settings that tamed rebalancing for me, plus the timeouts they lean on
partition.assignment.strategy=org.apache.kafka.clients.consumer.CooperativeStickyAssignor
# Stable identity per instance (must be unique within the group)
group.instance.id=order-processor-3
# Long enough to survive brief restarts
session.timeout.ms=45000
# Headroom for slow processing
max.poll.interval.ms=300000
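To put a number on "incremental, not stop-the-world", here is a toy model of one consumer joining a balanced group. It is not the real CooperativeStickyAssignor algorithm: `sticky_reassign` and `balanced_targets` are hypothetical helpers that move the minimum number of partitions needed to rebalance, which is the behavior the cooperative protocol is after.

```python
def balanced_targets(num_partitions: int, members: list[str]) -> dict[str, int]:
    """How many partitions each member should own in a balanced group."""
    base, extra = divmod(num_partitions, len(members))
    return {m: base + (1 if i < extra else 0) for i, m in enumerate(members)}


def sticky_reassign(current: dict[str, list[int]], new_member: str) -> dict[str, list[int]]:
    """Give new_member a fair share while moving as few partitions as possible."""
    members = list(current) + [new_member]
    total = sum(len(parts) for parts in current.values())
    target = balanced_targets(total, members)
    result = {m: list(parts) for m, parts in current.items()}
    result[new_member] = []
    for member in current:
        # Only members above their target give anything up.
        while len(result[member]) > target[member]:
            result[new_member].append(result[member].pop())
    return result


current = {"A": [0, 1, 2, 3, 4, 5], "B": [6, 7, 8, 9, 10, 11]}
after = sticky_reassign(current, "C")

# Eager protocol: every member revokes everything before reassignment.
eager_revoked = sum(len(parts) for parts in current.values())
# Cooperative protocol: only partitions that actually change owner are revoked.
moved = [p for m in current for p in current[m] if p not in after[m]]
print(f"eager revokes {eager_revoked}, cooperative revokes {len(moved)}")
# eager revokes 12, cooperative revokes 4
```

In this model the other eight partitions never stop being processed while C joins.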

The sneakiest cause of "my consumer keeps leaving the group": you took too long between polls. Kafka decides a consumer is dead in two ways. A background heartbeat thread covers crashes (governed by session.timeout.ms). But there's a second, less obvious timeout: max.poll.interval.ms. If your processing loop takes longer than this between calls to poll(), say because a batch hits a slow external API, Kafka assumes the consumer is stuck, kicks it out, and rebalances its partitions to someone else. The classic death spiral: processing is slow → consumer misses the poll deadline → gets evicted → its partitions move → the new owner is now also overloaded → it gets evicted too. Either make processing faster, shrink max.poll.records so each batch is smaller, or raise the interval. Just understand that it's a processing-time problem, not a network one.
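The poll deadline is simple arithmetic, and it's worth doing before an incident does it for you. A back-of-envelope sketch, where `max_safe_poll_records` and its `safety_factor` are my own hypothetical names rather than anything Kafka provides:

```python
def max_safe_poll_records(max_poll_interval_ms: int,
                          worst_case_ms_per_record: float,
                          safety_factor: float = 0.5) -> int:
    """Largest batch that still finishes inside a fraction of the poll interval."""
    budget_ms = max_poll_interval_ms * safety_factor
    return max(1, int(budget_ms // worst_case_ms_per_record))


# Default max.poll.interval.ms is 300000 (5 minutes). If a slow downstream API
# pushes worst-case processing to 2 seconds per record:
print(max_safe_poll_records(300_000, 2_000))  # 75
```

At the default max.poll.records of 500, that same 2-second worst case would need 1,000 seconds per batch, more than three times the default interval, which is exactly how the eviction spiral starts.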

Consumer lag: the one metric I'd keep if I could only keep one

Consumer lag is the number of messages produced to a partition that the consumer group hasn't read yet — the gap between the partition's latest offset (the log end) and the consumer's committed offset. It is the single best health signal for a streaming pipeline, because it directly answers the question that matters: are we keeping up, and how far behind is the data your dashboards and downstream systems are seeing?

What I actually watch is not the raw number but its derivative. Steady lag, even if non-zero, is fine: you're keeping pace. Lag that's trending upward means the production rate now exceeds your consumption rate, and you have a finite amount of time before retention deletes data you never processed. A sudden lag spike after a deploy usually means a consumer is crash-looping or stuck in a rebalance. I alert on the trend and the time-to-drain, not on a fixed threshold, because "10,000 messages behind" means something completely different on a topic doing 100/sec versus 100,000/sec.

Tools like Burrow or the lag metrics your Kafka platform exposes will compute this; the discipline is to make lag-trend a first-class alert, not something you check after someone complains the data is stale.
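For a sense of what a time-to-drain alert computes, here is a minimal sketch. The function names and the sample offsets and rates are made up for illustration; in practice the inputs would come from whatever lag tooling you already run.

```python
import math


def total_lag(log_end_offsets: dict[int, int], committed_offsets: dict[int, int]) -> int:
    """Sum of (log end - committed) across partitions.

    Assumption: a partition with no committed offset is counted from offset 0.
    """
    return sum(max(0, end - committed_offsets.get(partition, 0))
               for partition, end in log_end_offsets.items())


def time_to_drain_seconds(lag: int, consume_rate: float, produce_rate: float) -> float:
    """Seconds until lag reaches zero at current rates; inf if never."""
    if lag == 0:
        return 0.0
    net_rate = consume_rate - produce_rate
    if net_rate <= 0:
        return math.inf  # not keeping up: lag is flat or growing
    return lag / net_rate


print(total_lag({0: 1500, 1: 900}, {0: 1200, 1: 900}))                        # 300
print(time_to_drain_seconds(10_000, consume_rate=150, produce_rate=100))         # 200.0
print(time_to_drain_seconds(10_000, consume_rate=120_000, produce_rate=100_000)) # 0.5
print(time_to_drain_seconds(10_000, consume_rate=90, produce_rate=100))          # inf
```

The same 10,000-message lag is over three minutes of staleness on the slow topic and half a second on the fast one, which is why a fixed threshold is the wrong alert.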

Poison pills and dead-letter queues

Here's a failure mode that feels almost unfair the first time. One message in a partition can't be processed — malformed JSON, a schema your consumer doesn't understand, a value that throws. Your consumer tries it, fails, and… what? If you retry forever, you're stuck on that one message and the entire partition behind it stops moving — the poison pill that blocks everything. If you skip it silently, you've lost data and you'll never know. Neither is acceptable.

The pattern that works is a dead-letter queue: after a bounded number of retries, you move the bad message off the main path to a separate topic, commit past it, and keep the partition flowing. The DLQ is where poison pills go to be inspected, fixed, and replayed — not discarded.

graph LR
    SRC["Source topic"]
    PROC["Consumer
process message"]
    OK["Success:
commit offset, continue"]
    RETRY["Transient error:
retry with backoff"]
    DLQ["Dead-letter topic
(message + error context)"]
    ALERT["Alert + inspect + replay"]
    SRC --> PROC
    PROC --> OK
    PROC -->|"fails"| RETRY
    RETRY -->|"still failing
after N tries"| DLQ
    DLQ --> ALERT

A bad message is retried a bounded number of times for transient faults, then routed to a dead-letter topic with its error context attached, so the consumer can commit past it and the partition keeps flowing. The DLQ turns "one message halts everything" into "one message is quarantined for later" — but it's only safe if someone is actually alerted on and works the DLQ, otherwise it's a silent data-loss bucket.

Two things make a DLQ real rather than theatre. Distinguish transient from permanent failures: a downstream timeout deserves retry-with-backoff (it'll probably succeed in a second); a schema violation will fail identically forever, so retrying it is pure waste — send it straight to the DLQ. And capture context with the dead message: the exception, the stack, the original topic/partition/offset. A DLQ full of payloads with no error attached is a pile of mysteries nobody can fix.
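Both rules fit in a small wrapper around the message handler. This is a sketch under assumptions, not a client-library API: `Record`, `TransientError`, `PermanentError`, `process_with_dlq`, and the envelope field names are all hypothetical, and `send_to_dlq` stands in for a real produce call to the dead-letter topic.

```python
import json
import time
import traceback
from dataclasses import dataclass


class TransientError(Exception):
    """Worth retrying, e.g. a downstream timeout."""


class PermanentError(Exception):
    """Will fail identically forever, e.g. a schema violation."""


@dataclass
class Record:
    topic: str
    partition: int
    offset: int
    value: bytes


def process_with_dlq(record, handler, send_to_dlq,
                     max_attempts=3, base_backoff_s=0.1, sleep=time.sleep):
    """Return True if processed, False if dead-lettered.

    Either way the caller can commit past the record. Exceptions that are
    neither Transient nor Permanent propagate and stop the consumer.
    """
    last_exc = None
    attempts = 0
    for attempt in range(1, max_attempts + 1):
        attempts = attempt
        try:
            handler(record)
            return True
        except PermanentError as exc:
            last_exc = exc
            break  # retrying cannot help: go straight to the DLQ
        except TransientError as exc:
            last_exc = exc
            if attempt < max_attempts:
                sleep(base_backoff_s * 2 ** (attempt - 1))  # exponential backoff
    send_to_dlq({
        "original_topic": record.topic,
        "original_partition": record.partition,
        "original_offset": record.offset,
        "payload": record.value,
        "error_type": type(last_exc).__name__,
        "error_message": str(last_exc),
        "stack": "".join(traceback.format_exception(
            type(last_exc), last_exc, last_exc.__traceback__)),
        "attempts": attempts,
    })
    return False


dlq = []  # stands in for the dead-letter topic


def parse_order(record):
    try:
        json.loads(record.value)
    except json.JSONDecodeError as exc:
        raise PermanentError("malformed JSON") from exc


ok = process_with_dlq(Record("orders", 0, 42, b"{not json"), parse_order, dlq.append)
print(ok, dlq[0]["error_type"], dlq[0]["attempts"])  # False PermanentError 1
```

The malformed message is quarantined after a single attempt with its topic, partition, offset, and stack attached, while a handler that raises `TransientError` would get up to three tries with backoff before landing in the same place.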

Delivery semantics: at-least-once, and the cost of more

Most pipelines run on at-least-once delivery, and you should assume that's what you have unless you've gone out of your way for more. It means: a message is never lost, but it may be delivered more than once — because the moment of risk is the offset commit. Process the message, then crash before committing the offset, and on restart you'll reprocess it. That's not a bug to eliminate; it's a property to design around.
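The commit-versus-crash window is easy to simulate without a broker. In this toy model a list stands in for the partition log, and `run_consumer` is a hypothetical helper that "crashes" between the two steps at a chosen offset, so you can see what each ordering costs.

```python
def run_consumer(log, committed, commit_before_processing, crash_at=None):
    """Consume from the committed offset; return (processed, committed).

    A crash at offset `crash_at` happens between the commit step and the
    processing step, whichever order they run in.
    """
    processed = []
    offset = committed
    while offset < len(log):
        if commit_before_processing:
            committed = offset + 1
            if offset == crash_at:
                return processed, committed  # committed, never processed
            processed.append(log[offset])
        else:
            processed.append(log[offset])
            if offset == crash_at:
                return processed, committed  # processed, never committed
            committed = offset + 1
        offset += 1
    return processed, committed


log = ["a", "b", "c"]

# Process then commit (at-least-once): crash at offset 1, then restart.
first, committed = run_consumer(log, 0, commit_before_processing=False, crash_at=1)
second, _ = run_consumer(log, committed, commit_before_processing=False)
print(first + second)  # ['a', 'b', 'b', 'c']

# Commit then process (at-most-once): same crash, then restart.
first, committed = run_consumer(log, 0, commit_before_processing=True, crash_at=1)
second, _ = run_consumer(log, committed, commit_before_processing=True)
print(first + second)  # ['a', 'c']
```

Process-then-commit sees "b" twice; commit-then-process never sees it at all. The first is recoverable by design, the second is not.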

The design answer is the same one that makes CDC pipelines safe: make your processing idempotent. If reprocessing the same message produces the same result, duplicates are harmless. Upsert by a business key instead of blind-inserting; dedupe on an event ID; make the side effect naturally repeatable. Idempotency is almost always cheaper and more robust than chasing true exactly-once.
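Deduping on an event ID can be sketched with nothing but SQLite. The schema, the `apply_deposit` function, and the event IDs here are invented for the example; the point is that recording the event ID and applying the side effect happen in one database transaction, so a redelivered message becomes a no-op even though the operation itself (an increment) is not naturally repeatable.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE processed_events (event_id TEXT PRIMARY KEY)")
db.execute("CREATE TABLE balances (account TEXT PRIMARY KEY, cents INTEGER NOT NULL)")
db.execute("INSERT INTO balances VALUES ('acct-1', 0)")
db.commit()


def apply_deposit(db, event_id: str, account: str, cents: int) -> bool:
    """Apply a deposit at most once per event_id. Returns False on a duplicate."""
    with db:  # one transaction: commits on success, rolls back on exception
        cur = db.execute(
            "INSERT OR IGNORE INTO processed_events (event_id) VALUES (?)",
            (event_id,),
        )
        if cur.rowcount == 0:
            return False  # already seen: skip the side effect
        db.execute(
            "UPDATE balances SET cents = cents + ? WHERE account = ?",
            (cents, account),
        )
        return True


# The same event delivered twice, as at-least-once allows.
print(apply_deposit(db, "evt-1001", "acct-1", 500))  # True
print(apply_deposit(db, "evt-1001", "acct-1", 500))  # False
print(db.execute("SELECT cents FROM balances WHERE account = 'acct-1'").fetchone()[0])  # 500
```

If the event carries the final state rather than a delta, a plain upsert on the business key gets the same result without the extra table.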

Guarantee	What it means	Cost / how
At-most-once	Commit offset before processing; may lose messages on crash	Cheap, rarely what you want
At-least-once	Process then commit; may reprocess duplicates on crash	The sane default — pair with idempotent consumers
Exactly-once	Each message effected once, even across failures	Kafka transactions + idempotent producer; real overhead, and only within Kafka-to-Kafka

Kafka does offer exactly-once semantics — via the idempotent producer and transactions that atomically write output records and commit consumer offsets together (the read-process-write loop, the foundation of Kafka Streams' exactly-once). It's real and it works. But be honest about its boundary: it guarantees exactly-once within Kafka. The moment your consumer writes to an external system — a database, an HTTP API, a search index — that external write is outside Kafka's transaction, and you're back to needing idempotency on that side. So for most pipelines I reach for at-least-once plus idempotent writes first, and only pay for transactions when the processing is genuinely Kafka-to-Kafka and duplicates are truly unacceptable.

Performance tuning that actually moved the needle

After correctness, throughput. The settings that gave me the biggest real-world wins:

Batch and compress on the producer. Raise linger.ms (wait a few milliseconds to fill a batch) and batch.size, and set compression.type=lz4 (or zstd). Bigger compressed batches mean far fewer, fatter requests — usually the single largest throughput gain, at the cost of a little latency.
Right-size max.poll.records. Fetch enough per poll to amortize overhead, but not so much that processing a batch blows the max.poll.interval.ms deadline and triggers the eviction spiral above. This is a balance you tune to your processing time.
Set acks deliberately. acks=all (with a sane min.insync.replicas) is the durable choice and what I default to; acks=1 trades durability for latency. Know which one you picked and why — it's a data-loss decision.
Partition for parallelism and even keys. Throughput scales with partitions, but skewed keys create hot partitions where one consumer drowns while others idle. Choose a partition key that distributes evenly, and remember ordering is only guaranteed within a partition.

Commit offsets deliberately, not on autopilot. The default enable.auto.commit=true commits on a timer, which means it can commit offsets for messages you haven't finished processing. If the process crashes in that window, those messages are never reprocessed: silent at-most-once you didn't ask for. For anything that matters, turn auto-commit off and commit explicitly after the work is durably done. It's a few lines of code and it's the difference between at-least-once you can reason about and a data-loss window you didn't know you had.

What to carry away

Operating Kafka well comes down to internalizing a handful of truths the tutorials gloss over. Parallelism is bounded by partitions, so over-provision them up front; rebalancing is necessary but historically stop-the-world, so use the cooperative sticky assignor and static membership to tame it. Watch consumer lag — its trend above all — as your primary health signal. Build dead-letter queues that separate transient from permanent failures and carry error context, so one poison pill can't halt a partition and nothing gets silently dropped.

And be clear-eyed about delivery: you almost certainly have at-least-once, so make your processing idempotent and commit offsets only after the work is done. Exactly-once is real but lives inside Kafka's boundary; the instant you write to an external system, idempotency is back to being your job. Get those right and Kafka becomes the dependable backbone everyone assumes it already is — but it earns that reliability from how you operate it, not from the demo that worked on the first try.