Apache Arrow & DataFusion: The Columnar In-Memory Standard

Here's a cost that hid in plain sight for years: every time data moved between two systems — Spark to pandas, a database to your Python process, one service to another — it got serialized on one side and deserialized on the other. Two systems that both held the same table as columns would convert it to some wire format, ship the bytes, and rebuild it. For a lot of analytical pipelines, that conversion was burning more CPU than the actual computation. Apache Arrow is the boring, foundational fix for that, and it's the reason it now sits quietly under DuckDB, Spark, pandas, Polars, and most of the modern data stack without most users ever naming it.

Arrow is a language-independent columnar memory format — a precise specification for how to lay out a table's data in RAM. That sounds modest. It's the most important plumbing decision of the last decade in analytics, because it turns "move data between systems" from a copy-and-convert problem into a pointer-passing one. I'll explain the layout, why it vectorizes, the zero-copy and Flight pieces built on it, and then DataFusion — the query engine that shows what you get to build once Arrow exists.

What problem Arrow actually solves

Before Arrow, every system had its own in-memory representation of a table. Connecting two of them meant agreeing on a serialization format (CSV, JSON, a custom protocol), and paying to encode and decode on every hop. There was no shared way for two processes to look at the same columnar data. The Arrow project's bet was that if enough systems agreed on one memory layout, you could eliminate the conversion entirely: two systems speaking Arrow can share a buffer directly, with no serialization, because they already agree byte-for-byte on what's in it.

The definitional line: Arrow is a standard for representing columnar data in memory, designed so that any system can read another's data without copying or converting it. The file formats you already know — Parquet and ORC — are for data at rest on disk, optimized for compression. Arrow is for data in motion and in compute, optimized for the CPU. They're complementary, and in fact Arrow is the natural in-memory target you decode Parquet into.

The memory layout, and why it vectorizes

Arrow stores each column as a contiguous block of memory. A column of 64-bit integers is just a packed array of those integers, back to back. Nulls are tracked separately in a compact validity bitmap — one bit per value — so the data buffer itself stays dense and uniform. Variable-length data like strings uses an offsets buffer pointing into a single packed bytes buffer. The whole table is a set of these column buffers grouped into record batches.

graph TD
    RB["Record batch (a chunk of rows)"]
    C1["Column: int64
contiguous packed array
+ validity bitmap"] C2["Column: string
offsets buffer →
packed bytes buffer"] C3["Column: float64
contiguous packed array
+ validity bitmap"] SIMD["CPU: SIMD over a column
(one instruction, many values)"] RB --> C1 RB --> C2 RB --> C3 C1 --> SIMD C3 --> SIMD

Arrow's layout. Each column is a contiguous buffer with nulls in a separate bitmap, grouped into record batches. Because a column's values sit packed and adjacent in memory, the CPU can stream them through SIMD vector instructions and they fit cache lines cleanly — the same vectorization principle behind every fast columnar engine, but standardized so every tool shares it.

Why does this layout matter for speed? Because modern CPUs are starved for the right data shape. When a column's values sit packed and adjacent, the processor can load many of them into a vector register and apply one SIMD instruction to all of them at once, and the prefetcher and cache lines work in your favor. Row-oriented memory — where a value is interleaved with the other fields of its row — defeats all of that. This is the same reason columnar engines like ClickHouse are fast; Arrow's contribution is making that cache- and SIMD-friendly layout a shared standard rather than each engine's private internal detail.

Zero-copy: the payoff

Once two systems agree on the layout, the conversion disappears. If your Python process and an Arrow-native database both represent a result as Arrow, the database can hand Python a pointer to the buffers and Python reads them in place — no serialize, no deserialize, no second copy in RAM. That's zero-copy data sharing, and it's why an Arrow-backed handoff that used to dominate a pipeline's runtime can become effectively free.

The cleanest demonstration is something you can run now: read a Parquet file with one library and hand the result to another with no conversion cost, because both speak Arrow underneath. pandas historically reserialized everything at every boundary; Arrow-backed libraries pass buffers. (The forthcoming pandas 2.0 brings an Arrow-backed option to pandas itself — the ecosystem is converging on this layout from both ends.)

# Arrow as the shared currency between tools — no row-by-row conversion
import pyarrow.parquet as pq
import pyarrow as pa

# Decode Parquet (on-disk, compressed) into Arrow (in-memory, columnar)
table = pq.read_table("events.parquet")   # -> pyarrow.Table (Arrow buffers)

# Hand the SAME buffers to a query engine with no copy
import datafusion
ctx = datafusion.SessionContext()
ctx.register_record_batches("events", [table.to_batches()])
ctx.sql("SELECT country, count(*) FROM events GROUP BY country").show()

Arrow Flight: zero-copy over the network

The same idea extends across machines. Arrow Flight is an RPC framework (built on gRPC) that moves Arrow record batches between processes without ever leaving the Arrow format — no translating to some database wire protocol and back. The classic, painfully slow path of pulling a large result set out of a warehouse over ODBC/JDBC, row by row, is exactly what Flight (and Flight SQL, the database-protocol layer on top of it) is built to replace, with parallel streams of columnar batches. For anyone who's watched a "small" query take minutes purely to transfer results, this is the fix.

DataFusion: what you build once Arrow exists

Apache DataFusion is a query engine written in Rust that executes SQL (and a DataFrame API) directly over Arrow data. It's an Arrow project, and it exists to be embedded — you pull it into your own Rust application as a library and get a full query engine: a SQL parser, a logical planner, a cost-aware optimizer (predicate pushdown, projection pushdown, join reordering), and a vectorized execution engine that operates on Arrow record batches natively.

Why does it matter that a query engine is a library rather than a server? Because it lets people stop writing query engines from scratch. If you're building a new database, a custom data tool, or a specialized execution layer, DataFusion gives you the parser-planner-optimizer-executor spine, and you supply the parts that are actually novel to your problem. It's the same role Presto played as a standalone MPP service, but reframed as a composable building block. There's even a distributed scheduler, Ballista, that spreads DataFusion execution across a cluster.

graph LR
    SQL["SQL or DataFrame API"]
    LP["Logical plan"]
    OPT["Optimizer:
predicate & projection pushdown,
join reordering"] PP["Physical plan
(vectorized operators)"] EX["Execution over
Arrow record batches"] SQL --> LP --> OPT --> PP --> EX

DataFusion's pipeline — the same shape as any serious query engine, delivered as an embeddable Rust library. Every operator consumes and produces Arrow record batches, so there's no format conversion inside the engine and no impedance mismatch when it shares data with the Arrow-native world around it.

This is why Arrow is described as the substrate of a whole movement sometimes called the "composable data systems" idea: when the in-memory format (Arrow), the transport (Flight), and a reusable engine (DataFusion) are standardized and open, you can assemble a data system from interoperable parts instead of building a monolith. DuckDB, Polars, and others ride the same Arrow foundation, which is why they all interoperate so cleanly.

Arrow is not a database, and it's easy to expect too much of it. It's a memory format plus libraries — no storage engine, no durability, no transactions, no query optimizer on its own. The validity-bitmap-and-buffers layout that's perfect for scans is not what you want for high-rate single-row updates; this is analytical machinery. And "zero-copy" only holds while you stay inside the Arrow ecosystem — the moment you hand data to something that wants its own format, you pay the conversion you were avoiding. Treat Arrow as the lingua franca between analytical tools, not as the tool itself.

What to carry away

Apache Arrow is a standardized columnar memory layout, and standardizing it is the whole point: when many systems agree byte-for-byte on how a table sits in RAM, moving data between them stops requiring serialization and becomes zero-copy pointer-passing. The same packed-column layout that makes sharing free also vectorizes beautifully on modern CPUs, so it's fast for compute, not just transport. Arrow Flight carries that across the network; DataFusion shows what you build on top — an embeddable, vectorized SQL engine that never leaves the format.

You probably won't write Arrow buffers by hand. But knowing it's there explains why the modern stack interoperates so well, why a Parquet-to-DataFrame handoff got cheap, and why so many new engines are appearing as Rust libraries instead of monolithic servers — they're all standing on the same columnar foundation. For the on-disk counterpart that Arrow decodes from, the Parquet and ORC internals piece is the other half.