pandas vs Polars vs DuckDB vs Arrow: Choosing Your Analysis Engine

For fifteen years the answer to "how do I poke at this data in Python?" was one word: pandas. It's still the default in most people's fingers, and for good reason. But somewhere around the time datasets routinely outgrew "comfortably fits in a pandas DataFrame," a cluster of alternatives matured — Polars, DuckDB, and the Apache Arrow layer underneath them — and now the honest answer is "it depends on what you're doing." These four aren't really four competitors; they're overlapping tools with different shapes, and choosing well means understanding the shapes. I've used all of them in anger for exploratory analysis, data-quality work, and feeding pipelines, and this is the map I wish I'd had.

The one-sentence version: pandas is the familiar eager dataframe, Polars is a faster dataframe with a query optimizer, DuckDB is an embedded SQL analytics engine, and Arrow is the columnar memory format the others increasingly share. Two of them are how you express analysis; one is what they run on.

The four, and what each actually is

pandas — the eager incumbent

pandas is an eager, single-threaded dataframe library: each operation runs immediately and returns a result, and its in-memory model historically leaned on NumPy. That eagerness is exactly why it's so good for interactive exploration — you type a transformation, you see the result, you iterate. The costs show up at scale: it's largely single-core, memory-hungry (intermediate copies pile up), and it has no query optimizer, so a chain of operations executes literally as written, materializing every step. Past a few gigabytes on a laptop it starts to hurt, and past RAM it simply dies.

Polars — the dataframe that thinks before it runs

Polars is a dataframe library written in Rust, built on Arrow memory, with two features pandas lacks: it's multi-threaded by default, and it has a lazy API with a query optimizer. In lazy mode you build a plan, and Polars optimizes the whole thing before executing — pushing filters down to the scan, pruning unread columns, fusing operations — the same tricks a database planner uses. The result is commonly an order of magnitude faster than pandas on the same machine, with far lower memory use, and it can stream larger-than-memory data. The API is a deliberate break from pandas (expressions instead of bracket-indexing), which is a learning curve but, I'd argue, a cleaner mental model once it clicks.

DuckDB — SQLite for analytics

DuckDB is an embedded columnar SQL database — an in-process engine, no server, that you pip install and query with full analytical SQL. It's vectorized and multi-core, it spills to disk so it handles larger-than-memory gracefully, and — the killer feature — it queries Parquet and CSV files in place, and reads pandas/Polars/Arrow tables zero-copy. I covered its engine in DuckDB Internals; the point here is the interface: if you think in SQL, DuckDB gives you a warehouse-grade engine on your laptop with no infrastructure.

Apache Arrow — the layer underneath

Arrow is the odd one out because it's not how you do analysis — it's the standardized columnar in-memory format the others are built on or speak. Polars uses Arrow memory; DuckDB reads and writes Arrow zero-copy; pandas can back columns with Arrow. Its value is interoperability: because these tools share the Arrow layout, you can hand a dataset between them without copying or serializing. I rarely write raw Arrow code, but Arrow is the reason the other three compose so well — it's the plumbing, covered more fully in the Arrow & DataFusion piece.

graph TD
    subgraph IFACE["How you express analysis"]
        PD["pandas
(eager dataframe)"] PL["Polars
(lazy dataframe + optimizer)"] DD["DuckDB
(embedded SQL)"] end ARROW["Apache Arrow
(shared columnar memory format)"] FILES[("Parquet / CSV on disk")] PD -.->|"can back columns"| ARROW PL -->|"built on"| ARROW DD -->|"reads/writes zero-copy"| ARROW DD --> FILES PL --> FILES ARROW -->|"hand data between tools, no copy"| ARROW

The relationship that matters: pandas, Polars, and DuckDB are three ways to express analysis, while Arrow is the columnar memory format underneath that lets them share data zero-copy. This is why "pandas vs DuckDB" is a false binary — you can read a Parquet file with DuckDB's SQL, hand the Arrow result to Polars for a transformation, and never pay a conversion. They interoperate by design.

For exploratory data analysis (EDA)

EDA is interactive and iterative — you want low friction and instant feedback. Here pandas' eagerness is a genuine virtue, and its ecosystem (every tutorial, every plotting library, every Stack Overflow answer) is unmatched. For a few hundred MB of data and quick poking, pandas is still the path of least resistance, and I won't pretend otherwise.

But the moment the data is large enough that pandas operations make you wait, the calculus flips. Polars' speed keeps EDA interactive at sizes where pandas stalls, and DuckDB lets you explore a directory of Parquet files with SQL without loading anything fully into memory — SELECT ... FROM 'data/*.parquet' LIMIT 100 just works. My actual habit: pandas for small and familiar, DuckDB when the data lives in files and I think in SQL, Polars when I need dataframe ergonomics at speed.

For data quality (DQ) checks

Data-quality work — null rates, duplicates, range and uniqueness checks, distribution profiling — is where SQL shines, and it's the case I most often reach for DuckDB. Expressing "what fraction of this column is null, grouped by source" is a one-liner in SQL, and DuckDB runs it over Parquet files directly:

-- DQ profiling straight over files, no load step
SELECT
  count(*)                                            AS rows,
  count(*) - count(email)                             AS null_emails,
  count(*) - count(DISTINCT customer_id)              AS dup_ids,
  avg(CASE WHEN amount < 0 THEN 1 ELSE 0 END)         AS pct_negative_amount
FROM 'raw/orders/*.parquet';

Polars is excellent for DQ too, especially when the checks are programmatic (generated from a schema) rather than hand-written SQL — its expression API composes checks cleanly and runs them fast. pandas can do all of this, but on large data the profiling itself becomes the bottleneck. For DQ at scale my default is DuckDB for SQL-shaped checks, Polars for programmatic ones.

For performance and scale

This is the clearest separation, and the numbers aren't subtle. On analytical workloads over hundreds of MB to tens of GB on a single machine, Polars and DuckDB routinely run several times to an order of magnitude faster than pandas, using a fraction of the memory — because they're vectorized, multi-core, optimize the query, and (Polars lazy / DuckDB) can process larger-than-memory data by streaming or spilling.

pandasPolarsDuckDB
InterfaceEager dataframeDataframe (eager + lazy)SQL
ParallelismSingle-core (mostly)Multi-coreMulti-core, vectorized
Query optimizerNoneYes (lazy mode)Yes
Larger-than-memoryNo (dies)Yes (streaming)Yes (spills to disk)
Reads files in placeLoads fullyYes (scan)Yes (query Parquet/CSV)
Ecosystem maturityLargest by farGrowing fastGrowing fast
Best atSmall data, EDA, glueFast dataframe workSQL analytics & DQ over files

The benchmark you read is probably not your workload. "Polars is 10× faster than pandas" is true for the queries it was measured on — typically clean, columnar, group-by-heavy analytics. Your real job might be dominated by something else entirely: a messy CSV parse, string-heavy cleaning, a join that fans out, a row-wise Python function (a .apply() with arbitrary logic) that none of these engines can vectorize. The honest move is to benchmark your actual transformation on your actual data before switching, because the headline multipliers evaporate the moment your workload doesn't match the benchmark's shape. I've seen a "10× faster" rewrite come out slower because the bottleneck was a Python UDF the optimizer couldn't touch.

The lessons I actually carry

  • It's not either/or — compose them over Arrow. The biggest unlock isn't picking a winner; it's realizing they interoperate zero-copy. Query files with DuckDB SQL, hand the result to Polars as an Arrow table for a transformation, convert to pandas only at the edge where a library demands it. Use each for what it's best at in one workflow.
  • Learn Polars' lazy API, not just its eager one. People try Polars eagerly, see a modest speedup, and shrug. The real win is scan_parquet(...).filter(...).group_by(...).collect() — the optimizer only kicks in when you let it see the whole plan before executing.
  • DuckDB is the easiest on-ramp for a SQL team. No new dataframe API to learn — if your team thinks in SQL, DuckDB gives them warehouse-grade local analytics with zero infrastructure and a tiny learning curve.
  • pandas isn't going anywhere — keep it for the edges. Its ecosystem gravity (plotting, scikit-learn, every connector) means it remains the lingua franca at the boundaries. Use the fast engines for the heavy middle and pandas where the ecosystem demands it.
  • Watch the conversion boundaries. Zero-copy is only zero-copy when types line up; a careless .to_pandas() on a large Arrow table can copy and blow up memory. Stay in the columnar world as long as you can and convert deliberately, at the end.

What to carry away

pandas, Polars, DuckDB, and Arrow aren't four contestants for one throne — they're three ways to express analysis (eager dataframe, optimized dataframe, SQL) over one shared columnar substrate. pandas wins on familiarity and ecosystem for small interactive work; Polars wins when you want dataframe ergonomics with a query optimizer and multi-core speed; DuckDB wins for SQL analytics and data-quality checks over files with no infrastructure; and Arrow is the format that lets you move between all of them without paying to copy.

So stop asking which one to standardize on and start asking which one fits the task in front of you — and chain them over Arrow when a workflow spans several. Benchmark your real transformation rather than trusting the headline multiplier, learn Polars' lazy mode if you adopt it, and keep pandas for the edges where its ecosystem rules. The era of "just use pandas for everything" is over; the era of using the right engine for each step, with no copy between them, is the upgrade.