DuckDB Internals: The Embedded OLAP Engine That Runs Anywhere

The first time DuckDB clicked for me, I'd pip install duckdb'd it to avoid spinning up a warehouse for a one-off, pointed a SQL query straight at a folder of Parquet files, and got a grouped aggregate over tens of millions of rows back before I'd finished reaching for coffee. No server, no cluster, no connection string — a library in my Python process. That's the pitch people compress to "SQLite for analytics," and it's accurate, but it undersells what's going on. DuckDB is a genuinely modern columnar OLAP engine that happens to run in-process, and the engineering inside it is why the laptop experience feels like a warehouse.

DuckDB is an in-process analytical database: it runs inside your application — Python, R, a CLI, the browser via WebAssembly — with no separate server. The interesting question is how something embeddable delivers warehouse-class scan performance, and the answer is four design choices: columnar storage, vectorized execution, a single-file format with real transactions, and zero-copy interop with the Arrow and Parquet world. I'll take each, then be honest about the edges.

In-process: the SQLite model, aimed at analytics

SQLite won by being a library, not a server — no install, no daemon, no network hop, just a database that lives in your process and a file on disk. But SQLite is row-oriented and built for transactional, single-row-at-a-time access, which makes it slow at the scan-and-aggregate queries analytics demands. DuckDB took SQLite's deployment model and rebuilt the engine for the opposite workload. Same "it's just a library" ergonomics; a completely different engine underneath, tuned for reading lots of rows and few columns at a time.

Why does in-process matter beyond convenience? Because it kills the data-movement tax. When the engine runs inside your Python process and can read your DataFrame's memory directly, there's no serializing results across a socket, no ODBC round-trip, no copy. For the analyst-on-a-laptop and the "transform step inside a bigger pipeline" use cases, eliminating that boundary is most of the speed you feel.

Columnar storage + vectorized execution

DuckDB stores data by column, not by row, for the same reason every analytical engine does: queries touch a few columns out of many, and columnar layout means you read only those, and they compress well because adjacent values are similar. That's the storage half. The execution half is where DuckDB's design shows, and it's a specific choice worth naming: vectorized execution.

There are three broad ways an engine can run a query. The classic one processes a row at a time through the operators — simple, but the per-row interpreter overhead dwarfs the actual work. The fastest in theory is compiling the query to native code. DuckDB takes the middle path that gets most of the benefit with none of the compile latency: it pushes data through operators in vectors — batches of roughly a couple thousand values from one column at a time. The per-batch overhead is amortized across thousands of values, the data fits in CPU cache, and the tight inner loops let the compiler emit SIMD. It's the same principle behind ClickHouse's speed; DuckDB delivers it in an embeddable library.

graph LR
    SCAN["Scan: read column
in vectors (~2048 values)"]
    FILT["Filter operator
(applies to whole vector)"]
    AGG["Aggregate operator
(processes the vector)"]
    OUT["Result vectors"]
    SCAN -->|"vector of values"| FILT
    FILT -->|"vector of values"| AGG
    AGG --> OUT

Vectorized execution. Instead of one row crawling through the operator pipeline, DuckDB streams fixed-size vectors of a single column between operators. The interpreter cost is paid once per ~2,000 values instead of once per value, the working set stays in cache, and the inner loops vectorize on the CPU — warehouse-grade scan speed without a query compiler's startup cost.

Parallelism is layered on with morsel-driven scheduling: the input is sliced into small chunks ("morsels") handed to worker threads, so a query uses all your laptop's cores without you configuring anything. And DuckDB executes larger-than-memory: when a hash join or sort exceeds RAM, it spills intermediate state to disk and keeps going rather than falling over — so "my data doesn't fit in memory" isn't the hard wall people assume.

The single-file format: storage with real transactions

DuckDB can run purely in memory, but it also has its own on-disk format: a single file holding your tables, indexes, and metadata, much like a SQLite file. Inside, data is stored columnar and compressed, organized into row groups with per-column statistics (min/max and the like) so the engine can skip blocks that can't match a filter — the same "read the metadata, skip what you can" idea that Parquet uses, applied to its native storage.

Crucially, it's a real transactional database, not just a cache. DuckDB implements MVCC (multiversion concurrency control — the same mechanism behind Postgres) so you get ACID transactions: changes are atomic and durable, and a reader sees a consistent snapshot. That's a meaningful step up from "I have a pile of Parquet files" — you can UPDATE, DELETE, and transact, in a file you can copy around like any other.

A detail that makes DuckDB feel magical: it queries Parquet and CSV in place, without importing. SELECT * FROM 'data/*.parquet' reads the files directly, using their footer statistics to skip row groups and reading only the referenced columns. You can join a Parquet folder against a CSV against an in-memory DataFrame in one query. Often the fastest "ETL" is no load step at all — just point DuckDB at the files.

-- Query a folder of Parquet files directly — no load, no server
SELECT country, count(*) AS sessions, avg(duration_s) AS avg_dur
FROM 's3://bucket/events/*.parquet'
WHERE event_date >= DATE '2023-06-01'
GROUP BY country
ORDER BY sessions DESC;

-- Persist to DuckDB's own transactional single-file format when you want
-- ACID, indexes, and updates rather than immutable files:
CREATE TABLE events AS SELECT * FROM 'events/*.parquet';

Zero-copy interop: the quiet superpower

Because DuckDB runs in-process and speaks Apache Arrow natively, it can read and write a pandas or Polars DataFrame, or an Arrow table, without copying the data. Query a pandas DataFrame as if it were a SQL table; get results back as Arrow with no serialization. In a Python data pipeline that's transformative — DuckDB becomes the fast SQL engine sitting in the middle of your existing dataframes, not another system you have to move data into and out of. This interop, more than any single benchmark, is why DuckDB spread so fast through the Python data world.

Property	SQLite	DuckDB	Cloud warehouse
Deployment	In-process library	In-process library	Managed service / cluster
Storage orientation	Row	Columnar	Columnar
Built for	Transactional, single-row	Analytical scans + aggregates	Analytical at large scale
Concurrency	Single writer; embedded	Single process; MVCC snapshots	Many concurrent users
Scale ceiling	Modest analytics	One big machine (spills to disk)	Effectively unbounded

DuckDB is single-process, and that's the line you can't cross by tuning. It is not a concurrent multi-user server — one process owns the database file for writes, so it's wrong for a shared OLTP-style backend with many writers, and it doesn't replace a warehouse that hundreds of analysts hit at once. Its scaling story is "a bigger machine," not "more nodes." The right framing: DuckDB is the single-node analytical engine — superb for an analyst's laptop, an embedded analytics feature, a pipeline transform step, or crunching a few hundred GB on one fat box. When you genuinely need multi-user concurrency or horizontal scale-out, that's a warehouse's job. (The hosted service MotherDuck is emerging to stretch DuckDB toward the cloud, but the local engine is single-node by design.)

What to carry away

DuckDB took SQLite's "database as a library" model and built a modern columnar OLAP engine inside it. Columnar storage reads only the columns a query needs; vectorized, morsel-driven execution delivers warehouse-class scan speed across all your cores without a query compiler's startup cost, and spills to disk when data exceeds memory; a single-file format with MVCC gives you real ACID transactions, not just a cache; and zero-copy Arrow/Parquet interop lets it sit inside your existing dataframes and query files in place with no load step.

Hold the boundary clearly and DuckDB becomes one of the most useful tools in the stack: it's the single-node analytics engine, unbeatable for the laptop, the embedded feature, and the pipeline middle — and not a substitute for a multi-user, scale-out warehouse. Knowing exactly where that line sits is what separates "DuckDB everywhere" enthusiasm from sound architecture. For the on-disk format it reads so happily, see Parquet & ORC internals; for the memory standard behind its zero-copy trick, Arrow & DataFusion.