# Spark Real-Time Mode: Sub-300ms Streaming Without Leaving Structured Streaming

For a decade, the standard answer to "can Spark do real-time?" came with an asterisk. Spark [Structured Streaming](dataflow-model-windows-watermarks) is excellent, battle-tested, and runs an enormous share of the world's streaming pipelines — but its latency floor sat in the seconds, sometimes the high hundreds of milliseconds on a good day. If you needed single-digit-to-low-hundreds-of-milliseconds tail latency — fraud scoring before a transaction clears, ad bidding, instant operational alerts — the advice was to reach for [Apache Flink](apache-flink-internals) and accept a second framework, a second programming model, and a second thing to operate. Databricks' **Real-Time Mode** for Structured Streaming sets out to remove that asterisk: sub-300ms end-to-end tail latency while keeping the exact same Structured Streaming code, API, and exactly-once guarantees. This is how it works, why microbatch couldn't get there, and where I'd actually use it.

## Why classic microbatch has a latency floor

To understand Real-Time Mode you have to understand what it's replacing. Classic Structured Streaming is, under the hood, **microbatch**: a streaming query is executed as a relentless series of tiny batch jobs. Every trigger, Spark wakes up, figures out what new data arrived, plans a job, schedules tasks onto executors, runs them, writes results, commits a checkpoint, and goes back to sleep until the next trigger. It's an elegant idea — streaming as "a table that keeps growing, re-queried continuously" — and it gives you Spark's full SQL surface and rock-solid exactly-once.

But each microbatch pays fixed overhead that no amount of tuning erases:

- **Per-batch task scheduling.** Spark launches a fresh set of tasks every microbatch. The driver plans the job and dispatches tasks to executors — milliseconds of coordination, paid *every batch*, before a single record is touched.

- **Shuffle through disk.** Any operation that moves data between stages (a join, an aggregation) does a [shuffle](spark-performance) — and classic Spark shuffle writes intermediate data to disk on the map side, then the reduce side fetches it. Durable and fault-tolerant, but disk round-trips add latency you can feel.

- **Batch-boundary state and checkpoint work.** Stateful operators and the checkpoint commit happen at batch boundaries, so state maintenance can stall the start of the next batch.

Add those up and you get a floor. The "as fast as possible" default trigger still incurs the per-batch tax on every cycle, so you bottom out in the hundreds of milliseconds. (Spark did ship an experimental **Continuous Processing** mode years ago for lower latency, but it supported only a narrow set of map-like operations and weaker guarantees, so almost nobody adopted it for real work.) Microbatch's overhead is the price of its generality and reliability — and Real-Time Mode is an attempt to keep the generality while removing the overhead.

## The three changes that break the floor

Real-Time Mode keeps the Structured Streaming programming model and the dataframe API, but swaps the execution engine underneath for a low-latency one. Three architectural changes do the heavy lifting.

```mermaid
graph TD
    subgraph MB["Classic microbatch (per cycle)"]
        S1["Plan job + schedule tasks"]
        S2["Run stage 1"]
        S3["Shuffle to disk"]
        S4["Run stage 2"]
        S5["Commit checkpoint"]
        S1 --> S2 --> S3 --> S4 --> S5 --> S1
    end
    subgraph RT["Real-Time Mode (continuous)"]
        T1["Long-running tasks(scheduled once, stay up)"]
        T2["Streaming shuffle(pipelined over network,no disk round-trip)"]
        T3["Incremental state +async checkpoint(off the hot path)"]
        T1 --> T2 --> T3
    end
          
```

Left: every microbatch re-pays planning, scheduling, a disk-based shuffle, and a checkpoint commit — fixed overhead per cycle that sets the latency floor. Right: Real-Time Mode schedules long-running tasks once and keeps them alive, pipelines data between stages over the network instead of through disk, and moves state maintenance and checkpointing off the record-processing hot path. The record no longer waits for a batch boundary to exist.

### 1. Long-running tasks instead of per-batch scheduling

This is the biggest single win. Instead of launching fresh tasks each microbatch, Real-Time Mode schedules **long-running tasks once** and leaves them running, continuously consuming and processing records. The driver's per-batch planning-and-dispatch tax — milliseconds paid on every cycle in microbatch — is paid essentially once, at startup. Records flow through tasks that are already up and waiting, rather than waiting for the next batch's tasks to be born. Remove the scheduling round-trip from the hot path and a huge chunk of the floor disappears.

### 2. A pipelined streaming shuffle

The second change targets the disk round-trip. In Real-Time Mode, when data must move between stages, it's **streamed over the network directly from upstream to downstream tasks** rather than materialized to disk and fetched. Because the downstream long-running tasks are already alive, an upstream task can push records straight to them as it produces them — pipelining the stages instead of running them in lock-step with a disk barrier between. You trade some of microbatch shuffle's disk-backed fault-tolerance model for latency, which the checkpointing model compensates for.

### 3. Incremental state and asynchronous checkpointing

Stateful streaming — aggregations, joins, dedup — keeps state in a state store, and in microbatch the maintenance and checkpoint of that state happens at batch boundaries, where it can block the next batch. Real-Time Mode moves this off the hot path: state cleanup is done **incrementally** rather than in a big batch-boundary sweep, and checkpointing runs **asynchronously** so committing progress doesn't stall record processing. The newer checkpoint format (often called checkpoint v2) is designed to make this incremental, low-overhead progress tracking work without sacrificing the exactly-once guarantee — which is the non-negotiable that separates this from the old Continuous Processing experiment.

## Turning it on, and what stays the same

The deliberate design choice — and the reason this matters more than a faster competitor framework — is that **your code doesn't change**. The same readStream/writeStream pipeline, the same SQL and dataframe transformations. You opt into the low-latency engine through the trigger:

```python
# the same Structured Streaming query you already write...
df = (spark.readStream
      .format("kafka")
      .option("subscribe", "transactions")
      .load())

scored = score_for_fraud(df)   # your existing transformations, unchanged

# ...opted into the low-latency execution engine via the trigger
(scored.writeStream
   .format("kafka")
   .option("topic", "fraud-decisions")
   .trigger(realTime="5 seconds")   # Real-Time Mode (checkpoint interval), not a batch interval
   .option("checkpointLocation", "/chk/fraud")
   .start())
```

Note what the trigger duration means here: it's *not* a microbatch interval that batches up 5 seconds of data. In Real-Time Mode the tasks run continuously and records flow through immediately; the interval governs how often progress is checkpointed, not how often data is processed. That distinction is the whole mental shift — you stop thinking "how big is each batch" and start thinking "records flow continuously, I checkpoint periodically."

**Same guarantees, same ecosystem.** Real-Time Mode keeps exactly-once semantics, integrates with the same sources and sinks (Kafka first among them), and runs the same dataframe/SQL logic. That's the pitch in one sentence: you're not adopting a new streaming framework, you're switching the execution mode of the one you already use. The migration cost that pushed teams to run Flink alongside Spark is largely what this removes.

## Real-Time Mode vs. classic Structured Streaming

| Dimension | Classic microbatch | Real-Time Mode |
| --- | --- | --- |
| Execution model | Series of small batch jobs | Long-running, continuously processing tasks |
| Tail latency target | Hundreds of ms to seconds | Sub-300ms (p99) end-to-end |
| Task scheduling | Per microbatch | Once, at startup |
| Inter-stage data movement | Shuffle via disk | Pipelined streaming shuffle over network |
| State / checkpoint | At batch boundaries | Incremental cleanup + async checkpoint |
| Programming model | Structured Streaming | Identical Structured Streaming |
| Delivery guarantee | Exactly-once | Exactly-once |
| Best for | The vast majority of pipelines | Genuinely latency-critical paths |

## The use cases that actually justify it

Sub-300ms is not free, so the honest question is: which workloads have a business reason to need it? The pattern is the same in each — a *decision* sits in the critical path of something happening, and being late is the same as being wrong.

- **Fraud and risk scoring** while a payment or login is in flight: the decision has to come back before the transaction completes, or it's useless.

- **Ad bidding and real-time personalization:** auctions and page renders have hard tens-of-milliseconds budgets; a late bid is a lost impression.

- **Operational alerting and observability:** detecting an outage, a security event, or an SLA breach where every extra second of detection latency is extra blast radius.

- **Live gaming, leaderboards, and trading signals:** anything where users perceive the lag directly or money moves on it.

- **IoT and network anomaly detection:** reacting to a sensor or telemetry pattern fast enough to actuate, not just to record.

The common thread: the value of the result *decays in milliseconds*. If your output feeds a dashboard someone glances at, or a table queried minutes later, you do not have a Real-Time Mode use case — you have a microbatch use case that would only cost more to run continuously.

## Best practices and honest limits

**End-to-end latency is a chain, and Real-Time Mode is only one link.** Hitting sub-300ms p99 across the whole path means every hop cooperates. If your [Kafka](kafka-production-pipeline-patterns) producers batch with a high `linger.ms`, or your sink commits slowly, or a stateful join holds enormous state, the engine's low latency is wasted — you'll wait on the slowest link. And measure the *tail*: the entire point is p99/p999, not the average. A pipeline with a great average and an ugly p99 fails exactly the use cases that needed Real-Time Mode in the first place. Garbage-collection pauses, skewed keys, and a single hot partition all live in the tail, so profile there.

A few more things I'd hold a team to:

- **Keep state lean.** The lower the latency target, the more a fat stateful operator hurts. Bound state with watermarks and TTLs, watch for state that grows without cleanup, and prefer stateless or lightly-stateful transformations on the hot path where you can.

- **Provision steady capacity, not bursty.** Long-running tasks expect resources to be there continuously; the elastic, scale-with-the-batch posture that suits microbatch is a worse fit for a mode whose whole premise is "always running." Size for sustained throughput plus headroom.

- **Mind the source and sink.** Kafka is the natural pairing; a high-latency source or a sink that can't keep up will dominate your latency budget no matter how fast the engine is.

- **Don't default to it.** This is the most important one. Microbatch Structured Streaming is the right answer for the large majority of streaming work — it's simpler to reason about, cheaper, and plenty fast for anything that isn't on a millisecond decision path. Reach for Real-Time Mode for the specific critical path that needs it, often alongside microbatch pipelines for everything else.

### Where this leaves the Spark-vs-Flink question

For years the clean split was: Flink for true low-latency per-record streaming, Spark for unified batch-and-stream with a slightly higher latency floor. Real-Time Mode narrows that gap substantially — it brings Spark into latency territory that used to require Flink, without leaving the Spark ecosystem, the SQL surface, or the [unified batch/stream](stream-processing-flink-kafka-streams-spark) story. Flink still has its own deep strengths in native event-at-a-time processing and its maturity there. But the calculus changes: if you're already on Spark and a handful of pipelines need to go fast, you may no longer need to stand up and operate a second engine to get there. That's the real significance — not that Spark "beats" Flink, but that the cost of needing low latency on Spark dropped a lot.

## What to carry away

Classic Structured Streaming has a latency floor because microbatch re-pays fixed overhead every cycle: per-batch task scheduling, a disk-based shuffle, and batch-boundary state and checkpoint work. Real-Time Mode removes those from the hot path with three moves — long-running tasks scheduled once, a pipelined streaming shuffle that skips disk, and incremental state with asynchronous checkpointing — to reach sub-300ms tail latency while keeping the identical Structured Streaming API and exactly-once guarantees.

You opt in through the trigger, and the duration becomes a checkpoint cadence, not a batch size — records flow continuously. Use it for the genuinely latency-critical paths where a result's value decays in milliseconds: fraud, bidding, alerting, live experiences. Don't use it for everything; microbatch remains the right, cheaper default for the bulk of streaming. And remember the latency is end-to-end — the engine is one link in a chain that's only as fast as its slowest hop, so tune the source, the sink, and the state, and always measure the tail. The asterisk on "can Spark do real-time?" is finally gone; the discipline to know when you need it is what's left.
