# The Dataflow Model: Event Time, Windows, Watermarks, and Triggers

The hardest thing about stream processing isn't throughput or fault tolerance — it's *time*. A user opens your app on a plane, taps around offline, and the events arrive three hours later when they reconnect. Did those taps happen "now," when you received them, or at 30,000 feet, when they actually occurred? Every meaningful streaming computation — sessions, hourly counts, billing windows — depends on the answer, and getting it wrong produces results that are subtly, unfixably wrong. The Dataflow model is the conceptual framework that finally made this tractable, and it's the theory sitting underneath [Flink](apache-flink-internals), Apache Beam, and essentially every modern streaming engine.

It came out of Google in 2015, and its lasting contribution is a reframing: stop treating batch and streaming as different paradigms, and instead answer four questions about any computation. I'll build up the pieces — **event time vs processing time**, **windows**, **watermarks**, and **triggers** — and then the four questions that tie them together.

## The two clocks: event time vs processing time

Everything starts with distinguishing two notions of time, and conflating them is the original sin of streaming:

- **Event time** — when the event actually happened, stamped at the source (the moment the user tapped).

- **Processing time** — when your system observed the event (the moment it arrived at the pipeline).

In a perfect world these are nearly equal. In the real world they diverge wildly and unpredictably: network delays, offline buffering, retries, backpressure, and reprocessing all mean events arrive **out of order** and **late**. The plane example is extreme, but seconds-to-minutes of skew is the everyday norm. The crucial decision: almost any computation you actually care about — "how many orders in the 9 a.m. hour," "did this user's session end" — is defined in **event time**, because that's when the things happened. Processing time is just an accident of your infrastructure.

**Windowing by processing time is the bug that looks like it works.** Counting "events per hour" by arrival time is trivial to implement and seems fine in testing — until a network blip shifts a thousand 8:59 events into the 9:00 bucket, or a backfill dumps yesterday's data into today. The counts are now wrong in a way no amount of compute fixes, because the information about when things happened was thrown away. If the question is about when events *occurred*, you must compute in event time — full stop.

## Windows: bounding the unbounded

A stream is infinite, but aggregates need boundaries — you can't sum "all events ever" and emit a result. **Windowing** slices the stream into finite chunks (in event time) that you can aggregate over. Three shapes cover almost everything:

| Window | Shape | Example |
| --- | --- | --- |
| **Tumbling (fixed)** | Adjacent, non-overlapping, fixed size | Count per clock hour |
| **Sliding** | Fixed size, overlapping by a step | 5-minute average, updated every minute |
| **Session** | Dynamic — closes after a gap of inactivity | A user's burst of activity, ended by 30 min idle |

Session windows are the one that's impossible to express cleanly without this model — their boundaries aren't fixed in advance; they're data-driven (the window ends when the user goes quiet). Being able to define "a session" declaratively, in event time, is a good test of whether a streaming system has a real windowing model or just a fixed-interval timer.

## Watermarks: how do you know a window is done?

Here's the deep problem. You're computing the count for the 9:00–10:00 event-time window. At 10:00 processing time, can you emit the result? No — because late events with 9:xx event-time timestamps might still be in flight. But you can't wait forever, or you'd never emit anything. You need an estimate of "event time has progressed far enough that I've probably seen everything up to here." That estimate is the **watermark**.

A watermark is the system's assertion: *"I believe no more events with event time earlier than T will arrive."* When the watermark passes the end of a window, the engine considers the window complete and can emit its result. Watermarks advance based on the timestamps the engine is seeing, and they are necessarily a **heuristic** — too aggressive and you close windows before late data arrives (dropping it); too conservative and results lag. This tension between **completeness and latency** is fundamental, not an implementation wart: you cannot have both perfectly, and the watermark is where you choose the trade.

```mermaid
graph LR
    E["Events arrive out of order(by processing time)"]
    ASSIGN["Assign to event-time windowse.g. 9:00–10:00"]
    WM["Watermark advances:'seen everything up to T'"]
    CLOSE["Watermark passes window end→ window fires, emit result"]
    LATE["Late event (event time < watermark)→ dropped, or handled by a late trigger"]
    E --> ASSIGN --> WM --> CLOSE
    ASSIGN -.-> LATE
          
```

Watermarks and late data. Events land out of order; each is placed in its event-time window. The watermark tracks how far event time has progressed; when it passes a window's end, the window fires. Anything arriving after the watermark is "late" — dropped by default, or, if you've configured allowed lateness, used to emit a corrected result via a late trigger.

## Triggers: deciding when to emit

Watermarks give you one natural firing point — "when the window is complete" — but sometimes that's not enough. For a long window (a daily aggregate) you may want **early** results every minute so a dashboard isn't blank all day; for late data you may want to emit a **corrected** result after the window already fired. **Triggers** decouple "what window" from "when to emit," letting you fire early (on a processing-time timer), on-time (at the watermark), and late (when stragglers arrive).

And when a window fires more than once, an **accumulation mode** says how successive firings relate: do you *accumulate* (each result supersedes the last, including all data so far) or emit *deltas* (each firing is just the new data)? Together, triggers and accumulation let one pipeline serve a fast-but-approximate early number *and* a slower-but-correct final number from the same window — exactly what real dashboards need.

## The four questions: batch and streaming, unified

The model's punchline is that any data processing — batch or streaming — is fully described by four questions. This is the reframing that mattered:

1. **What** results are computed? (the transformation: sum, count, join)

1. **Where** in event time? (windowing)

1. **When** in processing time are results emitted? (watermarks + triggers)

1. **How** do refinements relate? (accumulation mode)

The insight that collapses the batch/streaming divide: **batch is just a special case where the "when" is "once, at the end, when all data is present."** A bounded dataset is an unbounded one whose watermark jumps straight to infinity. Once you see batch and streaming as the same model with different answers to "when," the artificial wall between them — and the old "Lambda architecture" of running two separate systems — stops making sense. That realization is why a single API (Beam) can target both, and why Flink runs batch as bounded streaming rather than as a separate engine.

This is the theory my [Flink internals](apache-flink-internals) and [stream-processor comparison](stream-processing-flink-kafka-streams-spark) pieces are built on. When you see Flink talk about event-time watermarks and allowed lateness, or Beam's `Window`/`Trigger` API, or a [streaming database](streaming-databases) keeping a result current, they're all implementing answers to these four questions. Learn the model once and every streaming system's time-handling becomes legible instead of a pile of engine-specific knobs.

## What to carry away

Stream processing is hard because of time, and the Dataflow model is how the field tamed it. Separate **event time** (when it happened) from **processing time** (when you saw it), and compute in event time because that's what your questions are actually about. Use **windows** (tumbling, sliding, session) to bound the infinite stream; use **watermarks** to estimate when a window has seen enough data to fire, accepting the unavoidable completeness-vs-latency trade; and use **triggers and accumulation** to emit early, on-time, and corrected results from the same window.

Above all, hold the unifying idea: **What / Where / When / How** describes any computation, and batch is just streaming whose data is already all there. That collapses a decade of batch-vs-streaming dualism into one framework — the one beneath [Flink](apache-flink-internals), Beam, and the streaming systems that came after. Get the model, and the engines are just details.
