Most data frameworks bind two things together that don't have to be: what your pipeline does and which engine runs it. Write a Spark job and you've married Spark; write a Flink job and you've married Flink. Apache Beam's entire reason for existing is to annul that marriage. You write the pipeline once, in the Beam model, and then choose — at submit time, by a flag — whether it runs on Google Cloud Dataflow, on Apache Flink, on Spark, or on your laptop for a test. Apache Beam is a unified programming model plus a portability layer that decouples pipeline logic from the execution engine. This is how that decoupling actually works, and the honest limits of "write once, run anywhere."
One thing up front: Beam is the implementation of the Dataflow model — the event-time, windows, watermarks, and triggers theory I covered separately in The Dataflow Model. That piece is the "why time is hard" semantics. This one is the engineering on top: the SDKs, the runners, and the portability framework that make one pipeline run on many engines. If you want the windowing theory, start there; here I'll assume it.
The model: four abstractions
Beam pipelines are built from a deliberately small vocabulary, and the discipline of that small vocabulary is what makes portability possible — there's less surface area for runners to disagree on.
- Pipeline — the whole computation, a directed graph of transforms.
- PCollection — a distributed dataset flowing through the pipeline. It can be bounded (a finite batch) or unbounded (an endless stream). Crucially, the same transforms work on both — that's the "unified batch and streaming" promise made concrete.
- PTransform — an operation on PCollections: the core ones are
ParDo(parallel per-element processing, like a flatMap),GroupByKey, andCombine(aggregation), plus the I/O connectors that read and write external systems. - Runner — the adapter that takes your pipeline and executes it on a specific engine.
A pipeline reads like a chain of transforms applied to PCollections — here in the Python SDK:
import apache_beam as beam
with beam.Pipeline(options=opts) as p:
(p
| "Read" >> beam.io.ReadFromPubSub(subscription=sub) # unbounded source
| "Parse" >> beam.Map(parse_event)
| "Window" >> beam.WindowInto(beam.window.FixedWindows(60))
| "CountByKey">> beam.combiners.Count.PerKey()
| "Write" >> beam.io.WriteToBigQuery(table))
# the SAME code runs on Dataflow, Flink, or Spark — the runner is chosen in `opts`
Nothing in that pipeline names an execution engine. The engine is selected by the options you pass at submit time. That's the whole idea, expressed in one snippet.
The portability framework: how one pipeline runs on any engine
Here's the part that's genuinely clever, and that most people using Beam never look at. In the early days, every SDK-language/runner combination needed bespoke glue — a Python pipeline on the Flink runner was a different integration than a Java pipeline on Flink. That doesn't scale to N languages × M runners. The portability framework solved it with two language-neutral contracts.
graph TD
SDK["Your pipeline
(Java / Python / Go SDK)"]
RAPI["Runner API
(language-neutral plan,
serialized as protobuf)"]
RUNNER["Runner
(Dataflow / Flink / Spark)
orchestrates execution"]
FNAPI["Fn API
(runner ↔ worker protocol)"]
HARNESS["SDK harness
(container running YOUR code
in its native language)"]
SDK -->|"build + translate"| RAPI
RAPI --> RUNNER
RUNNER -->|"drive user code via"| FNAPI
FNAPI --> HARNESS
HARNESS -->|"results"| RUNNER
The portability framework's two contracts. The Runner API serializes any pipeline into a language-neutral protobuf plan, so a runner never has to understand the SDK language. The Fn API is the protocol by which a runner drives the actual user code, which runs in an SDK harness container in its native language (your Python ParDo really runs in a Python process the Flink runner talks to). Together they turn an N×M integration problem into N SDKs plus M runners that all speak the same two protocols.
This architecture is also what enables cross-language transforms — one of Beam's quietly powerful features. Because transforms are described in the language-neutral Runner API and executed in per-language harnesses, a Python pipeline can use a transform implemented in Java. In practice that's how Python pipelines reuse Beam's mature Java I/O connectors (Kafka, JDBC, and others) instead of waiting for a Python reimplementation. The portability layer isn't just runner-portability; it's language-interop.
The runners
A runner is where your portable pipeline meets a real execution engine. The ones that matter:
| Runner | What it runs on | When you'd choose it |
|---|---|---|
| Dataflow | Google Cloud's fully-managed service | On GCP and want zero cluster ops — autoscaling, no infra to run; Beam's flagship runner |
| Flink | An Apache Flink cluster (self-managed or hosted) | Want a true-streaming engine you control, on any cloud or on-prem |
| Spark | An Apache Spark cluster | Already invested in Spark infrastructure and want Beam pipelines on it |
| Direct | Your local machine | Development and testing — and it deliberately surfaces model violations early |
The Direct runner deserves a note: it's not just "local mode," it intentionally does things like reshuffle and re-fire to expose pipeline bugs (non-deterministic transforms, incorrect windowing assumptions) on your laptop rather than in production. Use it as a correctness check, not a performance proxy.
Where portability gets leaky
"Run anywhere" is true for the model, not for every feature on every runner. Beam publishes a capability matrix precisely because runners differ in what they support — a given state/timer feature, a specific windowing or trigger behavior, or a particular metric may be fully supported on Dataflow, partial on Flink, and limited on Spark. A pipeline that leans on advanced stateful processing can run beautifully on one runner and hit an unsupported-feature wall on another. So portability is real and valuable, but it is not a guarantee that any pipeline runs identically everywhere. Before you promise the business "we can switch engines anytime," check the capability matrix for the features you actually use, and test on the target runner — don't assume.
There's a second, subtler cost: the abstraction tax. Beam sits above the engine, so you're a layer removed from engine-specific tuning knobs, and the portability machinery (especially cross-language and the SDK-harness model) adds operational moving parts. If you are certain you'll only ever run on Flink, writing native Flink gives you the engine's full control surface with no intermediary. Beam earns its keep when portability or the unified model is worth that layer — not as a default wrapper around an engine you've already committed to.
When Beam is the right call — and when it isn't
The decision comes down to whether you value what the abstraction buys:
- Use Beam if you're on GCP and want managed streaming: Dataflow + Beam is the natural, batteries-included path — autoscaling, no clusters, and the unified model for batch and stream in one codebase.
- Use Beam if engine-independence has real value: you want to avoid lock-in, may move between clouds, or genuinely run the same logic on multiple engines. That optionality is exactly what Beam sells.
- Use Beam for cross-language needs: a Python shop that needs Beam's mature Java connectors gets them through cross-language transforms without rewriting.
- Skip Beam if you've committed to one engine: if it's Flink forever, write Flink. You'll get the full tuning surface and one less abstraction to operate. The stream-processor comparison is the right lens for that native-engine choice.
- Skip Beam for simple, single-engine batch: the model's power is in unified event-time streaming; for a plain batch job on one engine it can be more ceremony than payoff.
What to carry away
Apache Beam decouples what a pipeline does from the engine that runs it: you express the computation once with a small vocabulary — Pipelines, PCollections (bounded or unbounded), and PTransforms — and pick the runner (Dataflow, Flink, Spark, Direct) at submit time. The magic underneath is the portability framework: the Runner API serializes pipelines into a language-neutral plan, the Fn API lets a runner drive user code running in a per-language SDK harness, and together they enable both runner-portability and cross-language transforms.
But hold the promise honestly. Portability covers the model, not every feature on every runner — the capability matrix is real, and so is the abstraction tax of sitting above the engine. Beam shines when managed streaming on Dataflow, engine-independence, the unified batch/stream model, or cross-language reuse are things you actually need. If you've already wedded one engine and will never leave, writing to it natively is the simpler honest choice. The value of Beam is optionality and one model across batch and streaming — pay for it when you'll use it.