Spark Internals (2.x): RDDs, the DAG Scheduler, Catalyst, and Tungsten

Spark is the engine that ate the big-data world, and most people who use it daily can't explain what happens between calling .show() and getting rows back. That's fine until a job is slow or out of memory, at which point the black box becomes a liability. The good news is that Spark's execution model is genuinely learnable — there's a small set of concepts that, once they click, make performance behavior predictable instead of mysterious. This article is those concepts, for the Spark 2.x line.

I'll build it in layers: the RDD (the foundation everything compiles down to), the DAG scheduler (how your code becomes distributed work), the driver/executor runtime (where that work runs), and then the two engines — Catalyst and Tungsten — that make the DataFrame API fast. By the end you should be able to look at a job and reason about how many stages it has, where it shuffles, and why.

RDDs: the foundation

The Resilient Distributed Dataset is Spark's core abstraction, and even though you mostly write DataFrames now, they compile down to RDD operations, so it's worth understanding. An RDD is an immutable, partitioned collection spread across the cluster. Three properties matter:

  • Partitioned. The data is split into partitions; a partition is the unit of parallelism — one task processes one partition.
  • Immutable + lineage. You never modify an RDD; transformations produce new RDDs. Spark records the lineage — the graph of transformations that built each RDD — so if a partition is lost to a node failure, Spark recomputes just that partition from its lineage rather than replicating data. That's the "resilient" part.
  • Lazy. Transformations (map, filter, join) don't execute; they just extend the lineage graph. Nothing runs until an action (count, collect, write) forces it.

That laziness is what lets Spark see the whole computation before running it and optimize across it — the foundation for everything below.

The DAG scheduler: jobs, stages, tasks, and the shuffle

When an action fires, Spark's DAG scheduler turns the lineage into an execution plan with a precise three-level vocabulary. Getting this vocabulary exact is the whole game, because every performance discussion uses it:

UnitWhat it isBoundary
JobAll work triggered by one actionOne per action
StageA set of tasks that run without moving data across the networkA new stage begins at every shuffle
TaskThe unit of execution — one task processes one partitionOne per partition per stage

The pivotal concept is the stage boundary, and it's always a shuffle, which comes from the distinction between two kinds of dependency. Narrow transformations (map, filter) — each output partition depends on one input partition, so the work stays local on each executor. Wide transformations (groupByKey, join, reduceByKey) — each output partition depends on many input partitions, so data with the same key must be redistributed across the network to land together. That redistribution is the shuffle: Spark writes map-side output to disk partitioned by key, then each reducer fetches its partitions over the network. It's the most expensive thing Spark does, and it's why "how many shuffles does this job have?" is the first performance question.

graph LR
    subgraph S1["Stage 1 (narrow — local)"]
        T1["task: read+filter partition 1"]
        T2["task: read+filter partition 2"]
        T3["task: read+filter partition 3"]
    end
    SH(["SHUFFLE
redistribute by key
(disk write + network fetch)"]) subgraph S2["Stage 2 (after shuffle)"] T4["task: reduce key-group A"] T5["task: reduce key-group B"] end T1 --> SH T2 --> SH T3 --> SH SH --> T4 SH --> T5

A wide dependency forces a shuffle, which ends one stage and starts the next. Narrow transformations pipeline together inside a stage with no data movement. Almost every Spark performance problem is "too much shuffle" or "shuffle skewed onto one task" — so reasoning about where stage boundaries fall is reasoning about performance.

The runtime: driver and executors

Where does this run? A Spark application has one driver — it holds your code, builds the DAG, and schedules tasks — and a set of executors, JVM processes on worker nodes that actually run tasks and hold data in memory/disk. A cluster manager (YARN, Mesos, standalone, and increasingly Kubernetes) provisions the executors. The driver hands tasks to executors, each executor runs tasks on its cores (one task per core at a time), and results flow back.

The two classic failure modes live here. Pull too much data to the driver — collect() on a big dataset — and the driver OOMs, because it's a single JVM. Give executors too little memory for a shuffle or a wide aggregation and they spill to disk or OOM. Most "Spark is broken" tickets are really "the driver or an executor ran out of memory," and knowing which process failed tells you where to look.

Catalyst and Tungsten: why DataFrames beat RDDs

If you write raw RDD code, Spark runs your closures more or less as written — it can't see inside a Scala lambda to optimize it. The DataFrame and Dataset APIs changed that by making the computation declarative, which unlocked two engines that are the real story of Spark 2.x performance.

Catalyst: the query optimizer

Catalyst takes your DataFrame operations or SQL and treats them as a query to optimize, exactly like a database would. It builds a logical plan, applies rule-based rewrites, and produces a physical plan:

graph LR
    SQL["DataFrame / SQL"]
    LP["Logical plan"]
    OPT["Optimized logical plan
(predicate pushdown,
column pruning,
constant folding)"] PP["Physical plan
(join strategy:
broadcast vs sort-merge)"] EXE["RDDs + Tungsten codegen"] SQL --> LP --> OPT --> PP --> EXE

The Catalyst pipeline. Because DataFrame operations are declarative, Catalyst can rewrite them: push filters down to the data source, prune unread columns, fold constants, and choose join strategies (a broadcast join ships a small table to every executor and skips the shuffle; a sort-merge join shuffles both sides). The same query written as opaque RDD lambdas gets none of this.

A standout optimization: pushing filters and column projection all the way down into the file reader. Pair Catalyst with a columnar format like Parquet and the WHERE and SELECT turn into predicate pushdown and column pruning at the file level — Spark reads only the row groups and columns it needs. The optimizer and the file format cooperate.

Tungsten: the execution engine

Catalyst decides what to run; Tungsten makes the running fast by attacking JVM overhead head-on:

  • Whole-stage code generation. Instead of interpreting a chain of operators with a function call per row per operator, Tungsten generates fused Java bytecode for an entire stage — collapsing the operators into a single tight loop. Far fewer virtual calls, far better CPU branch prediction.
  • Off-heap, binary memory management. Tungsten stores data in compact off-heap binary format and manages that memory explicitly, sidestepping the garbage-collection pressure and per-object overhead of keeping millions of JVM objects on the heap.
  • Cache-aware computation — layouts and algorithms tuned for CPU cache behavior.

Together, Catalyst and Tungsten are why a DataFrame job routinely outruns the equivalent hand-written RDD job: the optimizer prunes the work and the engine runs what's left close to the metal.

On the horizon: Spark's optimization is still largely decided before the job runs, from static estimates. The community is working on Adaptive Query Execution — re-optimizing at runtime using actual stage statistics to coalesce shuffle partitions, flip to broadcast joins, and handle skew — slated for the Spark 3.0 line. It's not GA yet, but it's the clear next step, and it targets exactly the shuffle and skew pain this article keeps circling.

The whole model in four sentences

RDDs give you partitioned, immutable data with lineage, so Spark recovers from failure by recomputation and stays lazy until an action. The DAG scheduler compiles that lineage into jobs, stages, and tasks, with a shuffle at every stage boundary — and shuffles are where the cost and the skew live. The driver schedules; executors run, and that's where memory problems surface. Catalyst optimizes the declarative DataFrame plan and Tungsten executes it close to the hardware, which is why DataFrames beat raw RDD code.

Hold those four and Spark stops being a black box. Slow job? Count the shuffles and look for skewed partitions. OOM? Decide whether it's the driver or an executor. Surprisingly fast despite sloppy code? Catalyst and Tungsten covering for you. The engine rewards engineers who can see the stages — and once you can, tuning is just arithmetic.