Apache Flink Internals: State, Checkpoints, Watermarks, and Exactly-Once

Most "stream processing" isn't. It's batch processing run on small batches, fast enough to feel live. Flink is the system that takes streaming literally: it processes one record at a time, the instant it arrives, and still gives you exactly-once correctness and the ability to reason about when events actually happened rather than when they showed up. That combination — true low-latency streaming plus strong correctness on messy, out-of-order data — is why Flink became the serious choice for stateful stream processing. The price is conceptual: you have to understand state, checkpoints, and time. That's this article.

I'll go in dependency order: the runtime (how a job becomes parallel tasks), state (the thing that makes streaming hard and Flink good), checkpoints (how state survives failure), exactly-once (how output stays correct across restarts), and event time and watermarks (how you handle late and out-of-order data). The stream you process usually comes from Kafka — see Kafka Internals for what's on the other end.

True streaming vs micro-batch

The defining design choice: Flink is a record-at-a-time dataflow engine, not a micro-batch one. Your program is a graph of operators (source → transformations → sink); records flow through it continuously, each operator processing and emitting as data arrives. There's no batch boundary to wait for, so per-record latency can be milliseconds. This is the opposite of the micro-batch model (collect a tiny batch, process it, repeat), and it's the root of Flink's latency advantage and of why its correctness machinery has to be cleverer — you can't just re-run a batch to recover.

The runtime: JobManager, TaskManagers, slots

A Flink cluster has one logical JobManager — the coordinator that schedules work, triggers checkpoints, and handles recovery — and many TaskManagers, the worker processes that run your operators. Each TaskManager offers a number of task slots, the unit of parallelism; the job's operators are split into parallel subtasks and placed into slots across the TaskManagers.

Your logical dataflow is compiled into a parallel physical graph: operators that can run together are chained into a single task (no serialization between them), and where data must be redistributed by key — a keyBy — there's a network shuffle between operators, just like a stage boundary in a batch engine.

graph LR
    SRC["Source
(Kafka)"]
    MAP["map / filter
(chained — no shuffle)"]
    KB(["keyBy
redistribute by key
(network shuffle)"])
    AGG["keyed window / aggregate
(holds STATE per key)"]
    SINK["Sink
(exactly-once)"]
    SRC --> MAP --> KB --> AGG --> SINK

A Flink dataflow. Records stream through continuously; operators that don't need a reshuffle are chained into one task for speed. A keyBy forces a network redistribution so all records for a key land on the same subtask — which is where keyed state lives. State plus a continuous stream is what makes Flink powerful and what its checkpoint mechanism exists to protect.

State: the heart of stateful streaming

State is what separates real stream processing from a stateless filter. To compute a running count per user, detect a pattern across events, deduplicate, or join two streams, an operator must remember things between records. Flink makes this state a first-class, managed concept.

The most important kind is keyed state: state scoped to a key, available on the subtask that processes that key after a keyBy. Because Flink partitions the stream by key, each subtask owns a disjoint slice of the keys and their state — so state access is local, no coordination needed. Where that state physically lives is the state backend:

State backend	Where state lives	Fits
Heap (in-memory)	On the JVM heap of the TaskManager	Small state, lowest latency
RocksDB	An embedded key-value store on local disk (spills beyond memory)	Large state — gigabytes to terabytes per job

The RocksDB backend is the one that makes Flink viable for huge state: your keyed state can exceed memory because RocksDB keeps it on local disk, log-structured, with the hot part cached. Massive deduplication windows, long-lived aggregations, big stream-to-stream joins — all rely on this.

Checkpoints: surviving failure without losing state

Here's the central correctness problem. The stream never stops, state lives in operators spread across the cluster, and machines fail. How do you take a consistent snapshot of all that distributed state — without stopping the stream — so you can recover to a coherent point after a crash? Flink's answer is checkpointing via barrier snapshotting, an application of the Chandy-Lamport distributed-snapshot algorithm.

The JobManager periodically injects a checkpoint barrier into the source streams. Barriers flow downstream with the records. When an operator has received the barrier on all its inputs, it snapshots its current state (asynchronously, to durable storage like S3/HDFS) and forwards the barrier. When the barrier reaches all sinks, the checkpoint is complete — a globally consistent snapshot of every operator's state as of the same logical point in the stream, taken while data kept flowing.

graph LR
    JM["JobManager
injects barrier N"]
    S["Source
(barrier after offset X)"]
    O1["Operator
snapshot state on barrier"]
    O2["Operator
snapshot state on barrier"]
    DS["Durable store
(S3 / HDFS)
checkpoint N"]
    JM --> S
    S -->|"records + barrier N"| O1
    O1 -->|"records + barrier N"| O2
    O1 -.->|"async state snapshot"| DS
    O2 -.->|"async state snapshot"| DS
    S -.->|"source offsets"| DS

Barrier snapshotting. A barrier flows through the dataflow with the records; each operator snapshots its state when the barrier passes and writes it asynchronously to durable storage, alongside the source offsets at that point. The result is a consistent global snapshot taken without pausing the stream. On failure, Flink restores every operator's state from the last completed checkpoint and rewinds the sources to the saved offsets — and processing resumes as if nothing happened.

Recovery is the payoff: on a failure, Flink restarts the job, restores all operator state from the latest checkpoint, and resets the Kafka source offsets to the ones recorded in that checkpoint. Records since the checkpoint are replayed against the restored state. (A savepoint is the same mechanism triggered manually — for upgrades, rescaling, or migrations.)

Exactly-once: it's about effects, not delivery

"Exactly-once" is the most misunderstood phrase in streaming, so let me be precise. Checkpointing gives exactly-once state: because recovery restores state and rewinds sources together, each record affects internal state exactly once even though records are replayed after a failure. Replay means a record may be processed more than once — what's guaranteed is that the effect on state is as if it happened once.

Extending that guarantee to output needs more, because once you've emitted a result to an external system, replay could emit it again. Flink solves this with two-phase-commit sinks: the sink writes results in a pending/transactional state as it goes, and only commits them when the checkpoint that covers them completes. To Kafka or a transactional database, the output of a checkpoint either fully commits or is rolled back on recovery — end-to-end exactly-once.

Exactly-once isn't free, and it isn't always what you want. Two-phase-commit sinks add latency: results are only visible to consumers after the next checkpoint commits, so a 30-second checkpoint interval means up to ~30 seconds of output latency. For many pipelines, at-least-once with idempotent writes (upsert by key) is simpler, lower-latency, and just as correct in practice. Reach for end-to-end exactly-once when duplicates are genuinely unacceptable and you can't dedup downstream — not by reflex.

Event time and watermarks: handling messy reality

The last hard problem is time. Events arrive late and out of order — a mobile event from 10:00 might land at 10:05 because the phone was offline. If you bucket by processing time (when Flink sees the record), your 10:00 window already closed and the event lands in the wrong bucket. Flink lets you compute on event time — the timestamp in the event itself — so results are correct regardless of arrival order or delay.

But if you're keying windows by event time, how does an operator know when it has seen "enough" of 10:00's events to close that window? Through watermarks. A watermark is a marker flowing in the stream that asserts "event time has advanced to T — expect no more events older than T." When the watermark passes a window's end, Flink closes and emits that window. Watermarks are how Flink trades off latency against completeness: emit a watermark aggressively and you get fast results that might miss stragglers; allow more lateness and you wait longer for completeness. You tune that, and configure how to handle events that arrive even after the watermark (drop, or route to a side output).

Concept	What it gives you
Event time	Correct windowing by when events happened, not when they arrived
Watermark	A "event time has reached T" signal that decides when a window can close
Allowed lateness	Grace period to still update a window after its watermark passes
Windows	Tumbling (fixed, non-overlapping), sliding (overlapping), session (gap-based)

Backpressure: the stream self-regulates

One more piece worth knowing because it shows up in production. If a downstream operator can't keep up, Flink's credit-based flow control naturally slows the upstream operators and ultimately the source — backpressure. Rather than dropping data or running out of memory, the pipeline throttles itself to the speed of its slowest stage. When a Flink job's throughput is mysteriously capped, the first move is to find the backpressured operator — that's your bottleneck, exposed by the runtime.

What to carry away

Flink does true record-at-a-time streaming with correctness most "streaming" systems approximate. Its power is managed keyed state (RocksDB-backed, so it scales past memory), made fault-tolerant by checkpoint barrier snapshotting — consistent global snapshots taken without stopping the stream, restored with source offsets on failure for exactly-once state, extended to exactly-once output by two-phase-commit sinks. And event time plus watermarks let it compute correct results on late, out-of-order data, trading latency against completeness on your terms.

That capability is also its cost: state, checkpoints, and time are real concepts to operate, and exactly-once adds latency you don't always need. When Flink is the right tool — and how it compares to Kafka Streams and Spark Structured Streaming — is the subject of Flink vs Kafka Streams vs Spark Structured Streaming.