For a decade, the standard answer to "can Spark do real-time?" came with an asterisk. Spark Structured Streaming is excellent, battle-tested, and runs an enormous share of the world's streaming pipelines — but its latency floor sat in the seconds, sometimes the high hundreds of milliseconds on a good day. If you needed single-digit-to-low-hundreds-of-milliseconds tail latency — fraud scoring before a transaction clears, ad bidding, instant operational alerts — the advice was to reach for Apache Flink and accept a second framework, a second programming model, and a second thing to operate. Databricks' Real-Time Mode for Structured Streaming sets out to remove that asterisk: sub-300ms end-to-end tail latency while keeping the exact same Structured Streaming code, API, and exactly-once guarantees. This is how it works, why microbatch couldn't get there, and where I'd actually use it.
Why classic microbatch has a latency floor
To understand Real-Time Mode you have to understand what it's replacing. Classic Structured Streaming is, under the hood, microbatch: a streaming query is executed as a relentless series of tiny batch jobs. Every trigger, Spark wakes up, figures out what new data arrived, plans a job, schedules tasks onto executors, runs them, writes results, commits a checkpoint, and goes back to sleep until the next trigger. It's an elegant idea — streaming as "a table that keeps growing, re-queried continuously" — and it gives you Spark's full SQL surface and rock-solid exactly-once.
But each microbatch pays fixed overhead that no amount of tuning erases:
- Per-batch task scheduling. Spark launches a fresh set of tasks every microbatch. The driver plans the job and dispatches tasks to executors — milliseconds of coordination, paid every batch, before a single record is touched.
- Shuffle through disk. Any operation that moves data between stages (a join, an aggregation) does a shuffle — and classic Spark shuffle writes intermediate data to disk on the map side, then the reduce side fetches it. Durable and fault-tolerant, but disk round-trips add latency you can feel.
- Batch-boundary state and checkpoint work. Stateful operators and the checkpoint commit happen at batch boundaries, so state maintenance can stall the start of the next batch.
Add those up and you get a floor. The "as fast as possible" default trigger still incurs the per-batch tax on every cycle, so you bottom out in the hundreds of milliseconds. (Spark did ship an experimental Continuous Processing mode years ago for lower latency, but it supported only a narrow set of map-like operations and weaker guarantees, so almost nobody adopted it for real work.) Microbatch's overhead is the price of its generality and reliability — and Real-Time Mode is an attempt to keep the generality while removing the overhead.
The three changes that break the floor
Real-Time Mode keeps the Structured Streaming programming model and the dataframe API, but swaps the execution engine underneath for a low-latency one. Three architectural changes do the heavy lifting.
graph TD
subgraph MB["Classic microbatch (per cycle)"]
S1["Plan job + schedule tasks"]
S2["Run stage 1"]
S3["Shuffle to disk"]
S4["Run stage 2"]
S5["Commit checkpoint"]
S1 --> S2 --> S3 --> S4 --> S5 --> S1
end
subgraph RT["Real-Time Mode (continuous)"]
T1["Long-running tasks
(scheduled once, stay up)"]
T2["Streaming shuffle
(pipelined over network,
no disk round-trip)"]
T3["Incremental state +
async checkpoint
(off the hot path)"]
T1 --> T2 --> T3
end
Left: every microbatch re-pays planning, scheduling, a disk-based shuffle, and a checkpoint commit — fixed overhead per cycle that sets the latency floor. Right: Real-Time Mode schedules long-running tasks once and keeps them alive, pipelines data between stages over the network instead of through disk, and moves state maintenance and checkpointing off the record-processing hot path. The record no longer waits for a batch boundary to exist.
1. Long-running tasks instead of per-batch scheduling
This is the biggest single win. Instead of launching fresh tasks each microbatch, Real-Time Mode schedules long-running tasks once and leaves them running, continuously consuming and processing records. The driver's per-batch planning-and-dispatch tax — milliseconds paid on every cycle in microbatch — is paid essentially once, at startup. Records flow through tasks that are already up and waiting, rather than waiting for the next batch's tasks to be born. Remove the scheduling round-trip from the hot path and a huge chunk of the floor disappears.
2. A pipelined streaming shuffle
The second change targets the disk round-trip. In Real-Time Mode, when data must move between stages, it's streamed over the network directly from upstream to downstream tasks rather than materialized to disk and fetched. Because the downstream long-running tasks are already alive, an upstream task can push records straight to them as it produces them — pipelining the stages instead of running them in lock-step with a disk barrier between. You trade some of microbatch shuffle's disk-backed fault-tolerance model for latency, which the checkpointing model compensates for.
3. Incremental state and asynchronous checkpointing
Stateful streaming — aggregations, joins, dedup — keeps state in a state store, and in microbatch the maintenance and checkpoint of that state happens at batch boundaries, where it can block the next batch. Real-Time Mode moves this off the hot path: state cleanup is done incrementally rather than in a big batch-boundary sweep, and checkpointing runs asynchronously so committing progress doesn't stall record processing. The newer checkpoint format (often called checkpoint v2) is designed to make this incremental, low-overhead progress tracking work without sacrificing the exactly-once guarantee — which is the non-negotiable that separates this from the old Continuous Processing experiment.
Turning it on, and what stays the same
The deliberate design choice — and the reason this matters more than a faster competitor framework — is that your code doesn't change. The same readStream/writeStream pipeline, the same SQL and dataframe transformations. You opt into the low-latency engine through the trigger:
# the same Structured Streaming query you already write...
df = (spark.readStream
.format("kafka")
.option("subscribe", "transactions")
.load())
scored = score_for_fraud(df) # your existing transformations, unchanged
# ...opted into the low-latency execution engine via the trigger
(scored.writeStream
.format("kafka")
.option("topic", "fraud-decisions")
.trigger(realTime="5 seconds") # Real-Time Mode (checkpoint interval), not a batch interval
.option("checkpointLocation", "/chk/fraud")
.start())
Note what the trigger duration means here: it's not a microbatch interval that batches up 5 seconds of data. In Real-Time Mode the tasks run continuously and records flow through immediately; the interval governs how often progress is checkpointed, not how often data is processed. That distinction is the whole mental shift — you stop thinking "how big is each batch" and start thinking "records flow continuously, I checkpoint periodically."
Same guarantees, same ecosystem. Real-Time Mode keeps exactly-once semantics, integrates with the same sources and sinks (Kafka first among them), and runs the same dataframe/SQL logic. That's the pitch in one sentence: you're not adopting a new streaming framework, you're switching the execution mode of the one you already use. The migration cost that pushed teams to run Flink alongside Spark is largely what this removes.
Real-Time Mode vs. classic Structured Streaming
| Dimension | Classic microbatch | Real-Time Mode |
|---|---|---|
| Execution model | Series of small batch jobs | Long-running, continuously processing tasks |
| Tail latency target | Hundreds of ms to seconds | Sub-300ms (p99) end-to-end |
| Task scheduling | Per microbatch | Once, at startup |
| Inter-stage data movement | Shuffle via disk | Pipelined streaming shuffle over network |
| State / checkpoint | At batch boundaries | Incremental cleanup + async checkpoint |
| Programming model | Structured Streaming | Identical Structured Streaming |
| Delivery guarantee | Exactly-once | Exactly-once |
| Best for | The vast majority of pipelines | Genuinely latency-critical paths |
The use cases that actually justify it
Sub-300ms is not free, so the honest question is: which workloads have a business reason to need it? The pattern is the same in each — a decision sits in the critical path of something happening, and being late is the same as being wrong.
- Fraud and risk scoring while a payment or login is in flight: the decision has to come back before the transaction completes, or it's useless.
- Ad bidding and real-time personalization: auctions and page renders have hard tens-of-milliseconds budgets; a late bid is a lost impression.
- Operational alerting and observability: detecting an outage, a security event, or an SLA breach where every extra second of detection latency is extra blast radius.
- Live gaming, leaderboards, and trading signals: anything where users perceive the lag directly or money moves on it.
- IoT and network anomaly detection: reacting to a sensor or telemetry pattern fast enough to actuate, not just to record.
The common thread: the value of the result decays in milliseconds. If your output feeds a dashboard someone glances at, or a table queried minutes later, you do not have a Real-Time Mode use case — you have a microbatch use case that would only cost more to run continuously.
Best practices and honest limits
End-to-end latency is a chain, and Real-Time Mode is only one link. Hitting sub-300ms p99 across the whole path means every hop cooperates. If your Kafka producers batch with a high linger.ms, or your sink commits slowly, or a stateful join holds enormous state, the engine's low latency is wasted — you'll wait on the slowest link. And measure the tail: the entire point is p99/p999, not the average. A pipeline with a great average and an ugly p99 fails exactly the use cases that needed Real-Time Mode in the first place. Garbage-collection pauses, skewed keys, and a single hot partition all live in the tail, so profile there.
A few more things I'd hold a team to:
- Keep state lean. The lower the latency target, the more a fat stateful operator hurts. Bound state with watermarks and TTLs, watch for state that grows without cleanup, and prefer stateless or lightly-stateful transformations on the hot path where you can.
- Provision steady capacity, not bursty. Long-running tasks expect resources to be there continuously; the elastic, scale-with-the-batch posture that suits microbatch is a worse fit for a mode whose whole premise is "always running." Size for sustained throughput plus headroom.
- Mind the source and sink. Kafka is the natural pairing; a high-latency source or a sink that can't keep up will dominate your latency budget no matter how fast the engine is.
- Don't default to it. This is the most important one. Microbatch Structured Streaming is the right answer for the large majority of streaming work — it's simpler to reason about, cheaper, and plenty fast for anything that isn't on a millisecond decision path. Reach for Real-Time Mode for the specific critical path that needs it, often alongside microbatch pipelines for everything else.
Where this leaves the Spark-vs-Flink question
For years the clean split was: Flink for true low-latency per-record streaming, Spark for unified batch-and-stream with a slightly higher latency floor. Real-Time Mode narrows that gap substantially — it brings Spark into latency territory that used to require Flink, without leaving the Spark ecosystem, the SQL surface, or the unified batch/stream story. Flink still has its own deep strengths in native event-at-a-time processing and its maturity there. But the calculus changes: if you're already on Spark and a handful of pipelines need to go fast, you may no longer need to stand up and operate a second engine to get there. That's the real significance — not that Spark "beats" Flink, but that the cost of needing low latency on Spark dropped a lot.
What to carry away
Classic Structured Streaming has a latency floor because microbatch re-pays fixed overhead every cycle: per-batch task scheduling, a disk-based shuffle, and batch-boundary state and checkpoint work. Real-Time Mode removes those from the hot path with three moves — long-running tasks scheduled once, a pipelined streaming shuffle that skips disk, and incremental state with asynchronous checkpointing — to reach sub-300ms tail latency while keeping the identical Structured Streaming API and exactly-once guarantees.
You opt in through the trigger, and the duration becomes a checkpoint cadence, not a batch size — records flow continuously. Use it for the genuinely latency-critical paths where a result's value decays in milliseconds: fraud, bidding, alerting, live experiences. Don't use it for everything; microbatch remains the right, cheaper default for the bulk of streaming. And remember the latency is end-to-end — the engine is one link in a chain that's only as fast as its slowest hop, so tune the source, the sink, and the state, and always measure the tail. The asterisk on "can Spark do real-time?" is finally gone; the discipline to know when you need it is what's left.