# Native Spark Engines: DataFusion Comet vs Gluten+Velox vs Photon

Spark won the big-data execution war, and then spent a decade carrying a handicap it couldn't fully shake: it runs on the JVM. [Tungsten and whole-stage code generation](spark-internals-rdd-catalyst-tungsten) clawed back a lot — off-heap memory, generated code instead of an interpreter — but the JVM still can't do what a hand-written C++ or Rust columnar engine does: process data in tight, SIMD-vectorized loops over columnar batches that the CPU loves. So a whole category of **native execution engines** emerged to fix this without making anyone rewrite their Spark jobs. Databricks built **Photon**. The open-source world produced **Gluten** (with **Velox**) and **Apache DataFusion Comet**. They make the same bet, and understanding that shared bet is the key to choosing between them.

The bet: keep Spark's API, SQL surface, and Catalyst optimizer exactly as they are, but intercept the physical plan and run the heavy operators in a native vectorized engine instead of the JVM. Your code doesn't change; the execution underneath does.

## Why the JVM is the bottleneck

It helps to be precise about what "native is faster" actually means, because it's not magic. A modern analytical engine wants three things the JVM makes hard: **columnar batches** processed a vector at a time (not row objects), **SIMD** instructions that apply one operation to many values per CPU cycle, and **no garbage collector** pausing execution or boxing primitives. Spark's Tungsten got partway — it generates code and manages off-heap memory — but it's still JVM bytecode operating largely row-at-a-time within a stage, and it can't emit the kind of vectorized machine code a C++/Rust engine compiles to.
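To make the layout difference concrete, here is a minimal sketch in plain Python. The data and the function names (`revenue_rows`, `revenue_columnar`) are made up for illustration, and Python will not show the speed difference. It only shows the shape of the work: one pass per record over heterogeneous objects, versus whole-column passes over contiguous typed buffers, which is the form a native engine can turn into SIMD loops.

```python
from array import array

# Row layout: one object per record, fields looked up by name.
rows = [
    {"price": 10.0, "qty": 3},
    {"price": 4.5, "qty": 12},
    {"price": 7.25, "qty": 8},
]

def revenue_rows(rows, min_qty):
    total = 0.0
    for row in rows:                      # one record at a time
        if row["qty"] >= min_qty:
            total += row["price"] * row["qty"]
    return total

# Columnar layout: one contiguous, typed buffer per column.
price = array("d", [10.0, 4.5, 7.25])
qty = array("q", [3, 12, 8])

def revenue_columnar(price, qty, min_qty):
    # Pass 1: build a selection mask by scanning the qty column only.
    mask = [q >= min_qty for q in qty]
    # Pass 2: multiply-and-sum the selected positions across both columns.
    return sum(p * q for p, q, keep in zip(price, qty, mask) if keep)

print(revenue_rows(rows, 5), revenue_columnar(price, qty, 5))
```

Both functions compute the same answer. The difference a native engine exploits is that each columnar pass is a tight loop over one primitive type, with no per-row objects to allocate or collect.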

Native engines are written in C++ (Velox, Photon) or Rust (DataFusion), operate on [Apache Arrow](arrow-datafusion-internals)-style columnar batches, and exploit SIMD directly. On scan-and-filter-and-aggregate workloads — the bread and butter of analytics — that's commonly a 2–4× speedup, sometimes more, for the same cluster. The point of all three projects is to deliver that to Spark users transparently.

## How a native engine plugs into Spark

All three follow the same lifecycle, and it's worth tracing once because the trade-offs live in it. Spark parses your SQL/DataFrame, Catalyst optimizes it into a physical plan, and then — instead of executing that plan on the JVM — the plugin walks the plan and **replaces supported operators with native equivalents**. The native engine runs those operators on columnar batches; any operator it doesn't support stays on the JVM, with a **columnar-to-row transition** inserted at the boundary.

```mermaid
graph TD
    SQL["Spark SQL / DataFrame"]
    CAT["Catalyst optimizer<br/>(unchanged)"]
    PLAN["Physical plan"]
    PLUGIN["Native plugin walks the plan:<br/>which operators are supported?"]
    NATIVE["Native engine<br/>(Velox / DataFusion / Photon)<br/>columnar + SIMD, off-heap"]
    JVM["Fallback to JVM Spark<br/>(unsupported operators)"]
    TRANS["Columnar to row transition<br/>at every boundary"]
    SQL --> CAT --> PLAN --> PLUGIN
    PLUGIN -->|"supported"| NATIVE
    PLUGIN -->|"unsupported"| JVM
    NATIVE -.->|"hand-off"| TRANS -.-> JVM
```

The shared architecture. Catalyst and your code are untouched; the plugin rewrites the physical plan, sending supported operators to the native engine and leaving the rest on the JVM. The dotted path is the catch: every hand-off between native (columnar) and JVM (row) execution costs a format conversion, so a plan that bounces back and forth between supported and unsupported operators can be slower than staying on plain Spark.
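The plan-rewriting step can be sketched as a toy in Python. This is not how any of the three plugins is implemented: real plans are trees rather than lists, and the `NATIVE_SUPPORTED` set, the `Step` type, and `rewrite_plan` are hypothetical names invented here. It only illustrates the rule described above: supported operators go native, the rest stay on the JVM, and a conversion appears wherever the engine changes.

```python
from dataclasses import dataclass

# Hypothetical set of operators the native engine supports.
NATIVE_SUPPORTED = {"Scan", "Filter", "Project", "HashAggregate"}

@dataclass
class Step:
    op: str
    engine: str  # "native", "jvm", or "transition"

def rewrite_plan(operators, supported=NATIVE_SUPPORTED):
    """Assign each operator of a linear plan to an engine, inserting a
    format conversion wherever consecutive operators run on different engines."""
    steps = []
    prev = None
    for op in operators:
        engine = "native" if op in supported else "jvm"
        if prev is not None and prev != engine:
            conv = "ColumnarToRow" if prev == "native" else "RowToColumnar"
            steps.append(Step(conv, "transition"))
        steps.append(Step(op, engine))
        prev = engine
    return steps

plan = ["Scan", "Filter", "PythonUDF", "HashAggregate"]
for step in rewrite_plan(plan):
    print(f"{step.engine:<10} {step.op}")
```

In this example a single unsupported operator in the middle of the plan costs two conversions, one on the way out of the native engine and one on the way back in.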

That fallback boundary is the single most important thing to understand about this whole category, so hold onto it — I'll come back to why it decides real-world results.

## The three engines

### Databricks Photon

Photon is Databricks' native vectorized engine, written in C++, introduced around 2022 and now the default execution engine on Databricks compute. It's the most mature and polished of the three — deeply integrated, broad operator coverage, and you mostly just turn it on. The catch is the obvious one: it's **proprietary and Databricks-only**. You can't run it on open-source Spark, EMR, or Dataproc. If you're all-in on Databricks, Photon is the path of least resistance and it's very good; if you're not, it isn't an option at all. (For how it fits the rest of the platform, see [Databricks internals](databricks-internals).)

### Gluten + Velox

Gluten (Apache incubating) is a Spark plugin that offloads execution to a pluggable native backend — and the most common backend is **Velox**, Meta's open-source C++ vectorized execution library (the same engine that backs Presto/Prestissimo and other systems). Gluten translates Spark's physical plan into **Substrait** (more on that below), hands it to Velox, and Velox executes it. The appeal is that Velox is a serious, widely-used engine with strong coverage, and Gluten is open-source and runs on vanilla Spark. The cost is operational weight: you're deploying a C++ native library alongside the JVM, with its own memory management to tune, and it's younger and less turnkey than Photon.

### Apache DataFusion Comet

Comet is a Spark accelerator (originated at Apple, donated to the Apache Software Foundation) built on **Apache DataFusion**, the Rust columnar query engine. Like Gluten, it's a plugin that swaps supported Spark operators for native DataFusion-backed ones running on Arrow batches, with JVM fallback for the rest. Being Rust, it sidesteps a class of C++ memory-safety concerns, and it rides DataFusion's fast-moving ecosystem (the same engine inside many modern data tools). It's the youngest of the three, so coverage is narrower and it's maturing quickly — but it's a clean, open, and increasingly capable option, especially attractive if you already lean on the Arrow/DataFusion world.

|  | Photon | Gluten + Velox | DataFusion Comet |
| --- | --- | --- | --- |
| Native language | C++ | C++ (Velox) | Rust (DataFusion) |
| Open source | No (proprietary) | Yes (Apache incubating) | Yes (Apache) |
| Runs on | Databricks only | Any Spark (OSS, EMR, etc.) | Any Spark (OSS, EMR, etc.) |
| Plan hand-off | Internal | Via Substrait to Velox | Spark plan to DataFusion |
| Maturity (2026) | Most mature, default on Databricks | Mature backend, growing adoption | Youngest, fast-moving |
| Best fit | Databricks shops | OSS Spark wanting proven coverage | OSS Spark in the Arrow/Rust ecosystem |

## Substrait, Arrow, and ADBC: the interop layer

One of these engines leans on a piece of plumbing that deserves its own moment, because it's a quietly important idea: **Substrait**. Substrait is a cross-engine specification for serializing query plans, with relational algebra (scans, filters, joins, aggregates) expressed in a standard, language-neutral format. It's a *lingua franca* for query plans: one system can produce a plan and a completely different system can execute it, with no shared code. That's exactly what Gluten does when it lowers Spark's physical plan into Substrait so Velox (which speaks Substrait) can run it. (Per the table above, Photon's hand-off is internal and Comet maps the Spark plan to DataFusion directly.) The deeper promise is decoupling: optimizers and execution engines stop being welded together, and you can mix and match.
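The "plan as data" idea can be shown with a toy in Python. To be clear about the assumptions: the JSON shape below is invented for illustration and is not the Substrait schema (real Substrait plans are protobuf messages with a much richer type and function system), and `produce_plan` and `execute` are hypothetical names. The point is only that the producer and the consumer share a format, not code.

```python
import json

# Producer side: describe the query as data, not as code.
# This JSON shape is made up for illustration; it is NOT Substrait.
def produce_plan():
    plan = {
        "rel": "aggregate", "func": "sum", "column": "amount",
        "input": {
            "rel": "filter", "column": "region", "equals": "EU",
            "input": {"rel": "read", "table": "orders"},
        },
    }
    return json.dumps(plan)

# Consumer side: a separate "engine" that only sees the serialized plan.
def execute(serialized, tables):
    def run(node):
        if node["rel"] == "read":
            return tables[node["table"]]
        if node["rel"] == "filter":
            rows = run(node["input"])
            return [r for r in rows if r[node["column"]] == node["equals"]]
        if node["rel"] == "aggregate" and node["func"] == "sum":
            return sum(r[node["column"]] for r in run(node["input"]))
        # Anything the consumer does not understand is rejected outright.
        raise ValueError(f"unsupported plan node: {node}")
    return run(json.loads(serialized))

tables = {"orders": [
    {"region": "EU", "amount": 40},
    {"region": "US", "amount": 25},
    {"region": "EU", "amount": 15},
]}
result = execute(produce_plan(), tables)
print(result)
```

The `ValueError` branch is the toy equivalent of a fallback: a consumer can only run the parts of the plan vocabulary it implements.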

Underneath all of this sits **Apache Arrow**, the standardized columnar in-memory format. It's why these engines interoperate so readily: they can pass data as Arrow (or Arrow-compatible) batches, often zero-copy, instead of each inventing its own layout and paying to convert. Rounding out the standards picture is **ADBC (Arrow Database Connectivity)**, an Arrow-native answer to JDBC/ODBC: where the old APIs hand you data row-by-row (and force a columnar engine to re-rowify and then re-columnarize), ADBC moves results as Arrow columnar batches end-to-end. The theme connecting Substrait, Arrow, and ADBC is the same: **standardize the plan, the memory, and the wire so engines compose** instead of each being an island.

## The fallback trap (where real-world results are decided)

**Partial operator coverage can make a native engine *slower*, not faster.** None of these engines supports 100% of Spark's operators and functions. When a plan hits an unsupported operator (an exotic function, a UDF, an unsupported data type), execution falls back to the JVM, with a columnar-to-row conversion at the boundary and a row-to-columnar conversion to get back. A query that ping-pongs between native and JVM operators pays that conversion repeatedly, and can end up slower than just running on plain Spark. This is the number-one reason a native-engine proof-of-concept disappoints: someone benchmarks a query full of unsupported functions, sees constant fallback, and concludes "native isn't faster." Always check how much of your plan actually ran natively before judging. The physical plan shows which operators were replaced and where transitions were inserted, and Comet, for instance, can report fallback reasons when `spark.comet.explainFallback.enabled` is set. A query that's 60% fallback isn't testing the native engine, it's testing the transitions.
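A back-of-the-envelope cost model makes the trap visible. Everything numeric here is an assumption chosen for illustration (the per-operator costs, the 3× native speedup, the cost of one conversion), and `estimated_runtime` is a hypothetical helper, not something any engine exposes.

```python
def estimated_runtime(op_costs_jvm, native_ok, speedup=3.0, transition_cost=1.0):
    """Estimate the runtime of a linear plan under a native plugin.

    op_costs_jvm    -- per-operator cost on plain Spark (arbitrary units)
    native_ok       -- per-operator flag: True if the engine supports it
    speedup         -- assumed native speedup for supported operators
    transition_cost -- assumed cost of one columnar<->row conversion
    """
    total = 0.0
    prev = None
    for cost, ok in zip(op_costs_jvm, native_ok):
        if prev is not None and prev != ok:
            total += transition_cost      # engine changed: pay a conversion
        total += cost / speedup if ok else cost
        prev = ok
    return total

costs = [6.0, 3.0, 3.0, 3.0, 3.0]          # made-up operator costs
plain_spark = sum(costs)

all_native = estimated_runtime(costs, [True] * 5)
ping_pong = estimated_runtime(
    costs, [True, False, True, False, True], transition_cost=2.5
)

print(f"plain Spark: {plain_spark}, all native: {all_native}, ping-pong: {ping_pong}")
```

Under these assumed numbers the fully native plan is three times faster than plain Spark, while the alternating plan pays four conversions and lands slightly behind plain Spark even though three of its five operators were accelerated.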

This is also why the engines aren't interchangeable in practice even though they share the same architecture: **coverage is the differentiator**. Photon's maturity means more of a typical plan runs native; Velox's long pedigree gives Gluten strong coverage; Comet's is narrower but climbing fast. The right question isn't "which engine is fastest in a microbenchmark" but "which one runs the most of *my* workloads natively, with the fewest fallbacks." Your function mix decides the winner.
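That question can be turned into a small comparison. The sketch below is a toy under stated assumptions: the support sets for `engine_a` and `engine_b` are hypothetical placeholders (real coverage has to come from each engine's documentation and from your own query plans), and `native_coverage` is an illustrative helper that counts operators rather than weighting them by cost.

```python
def native_coverage(plan_ops, supported):
    """Return (fraction of operators run natively, number of engine switches)."""
    flags = [op in supported for op in plan_ops]
    switches = sum(1 for a, b in zip(flags, flags[1:]) if a != b)
    return sum(flags) / len(flags), switches

# Hypothetical support sets, for illustration only.
engines = {
    "engine_a": {"Scan", "Filter", "Project", "HashAggregate", "SortMergeJoin", "Window"},
    "engine_b": {"Scan", "Filter", "Project", "HashAggregate"},
}
workload = ["Scan", "Filter", "Window", "Project", "HashAggregate"]

def rank_key(name):
    coverage, switches = native_coverage(workload, engines[name])
    return (-coverage, switches)   # highest coverage first, then fewest switches

ranking = sorted(engines, key=rank_key)
for name in ranking:
    print(name, native_coverage(workload, engines[name]))
```

Swapping in a different workload can reverse the ranking, which is the whole point: the comparison is a property of the engine and the query mix together.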

### Turning one on

Adoption is deliberately low-friction: add the plugin and a few configs, and your existing jobs run faster where coverage allows. Comet, for example, is enabled with a plugin setting plus memory settings:

```ini
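; enable Apache DataFusion Comet on an existing Spark job
; (the Comet JAR must also be on the driver and executor classpath)
spark.plugins                         = org.apache.spark.CometPlugin
spark.comet.enabled                   = true
spark.comet.exec.enabled              = true
; Comet's native shuffle also needs Comet's shuffle manager
spark.shuffle.manager                 = org.apache.spark.sql.comet.execution.shuffle.CometShuffleManager
spark.comet.exec.shuffle.enabled      = true
; native engines use OFF-heap memory, so size it deliberately
spark.memory.offHeap.enabled          = true
spark.memory.offHeap.size             = 8g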
```

**Re-budget your memory for off-heap.** The native engine doesn't live in the JVM heap — it allocates off-heap (or native) memory. If you leave all your memory assigned to the JVM heap and give the native engine scraps, it spills or fails, and you'll wrongly blame the engine. When you switch on Photon, Gluten, or Comet, shift a meaningful share of executor memory to off-heap and tune from there. The execution model changed; the memory split has to change with it.
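As a starting point for that re-budgeting, here is a small sketch that splits a per-executor budget into heap and off-heap settings. The config keys are standard Spark settings, but the helper `split_executor_memory` and the 50/50 default are assumptions for illustration, not a recommendation from any of the engines. Treat the fraction as something to tune against spill metrics.

```python
def split_executor_memory(total_gb, offheap_fraction=0.5):
    """Split an executor's memory budget (whole GB) between JVM heap and off-heap.

    offheap_fraction is an assumed starting point, not a tuned value.
    """
    if not 0.0 < offheap_fraction < 1.0:
        raise ValueError("offheap_fraction must be strictly between 0 and 1")
    offheap_gb = int(total_gb * offheap_fraction)
    heap_gb = total_gb - offheap_gb
    if offheap_gb < 1 or heap_gb < 1:
        raise ValueError("budget too small to give each side at least 1 GB")
    return {
        "spark.executor.memory": f"{heap_gb}g",
        "spark.memory.offHeap.enabled": "true",
        "spark.memory.offHeap.size": f"{offheap_gb}g",
    }

conf = split_executor_memory(16, offheap_fraction=0.5)
for key, value in conf.items():
    print(f"{key}={value}")
```

Note that `spark.executor.memoryOverhead` is a separate setting and is not covered by this split, so the container request is larger than `total_gb`.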

## How to choose

The decision is mostly made for you by where you run and what you value:

- **You're on Databricks:** use Photon. It's the default, the most mature, and there's no integration work. The lock-in is real but you've already accepted it by being on Databricks.

- **You run open-source Spark (EMR, Dataproc, on-prem) and want proven coverage today:** Gluten + Velox. Velox is a battle-tested engine and Gluten gives you the broadest open-source coverage, at the cost of operating a C++ native library.

- **You run OSS Spark and live in the Arrow/Rust ecosystem, or want the cleanest momentum:** DataFusion Comet. Younger and narrower today, but moving fast, memory-safe by construction, and aligned with where a lot of the modern data stack is heading.

- **In every case:** measure native coverage on *your* queries before committing, because fallback — not the engine's peak speed — is what determines your actual numbers.

## What to carry away

Native Spark engines all make one bet: keep Spark's API and Catalyst untouched, intercept the physical plan, and run the heavy operators in a vectorized C++/Rust engine over columnar Arrow batches — because the JVM, even with Tungsten, can't match SIMD-vectorized native code on analytical workloads. Photon (proprietary, Databricks-only, most mature), Gluten+Velox (open, broad coverage, C++), and DataFusion Comet (open, Rust, youngest but fast-moving) are three implementations of that same idea, tied together by the standards underneath — Substrait for portable plans, Arrow for shared columnar memory, ADBC for columnar transport.

The trade-off that decides everything is operator coverage and the fallback it implies: the engine only accelerates the parts of your plan it supports, and bouncing between native and JVM costs conversions that can erase the win. So don't pick on benchmark headlines — pick on where you run, and on how much of *your* workload executes natively. The promise is real and large: the same cluster, often several times faster, with no change to your code. You just have to make sure your queries actually get to use it.
