Parquet & ORC Internals: How Columnar Files Actually Store Data

Almost every analytical query you run touches a Parquet or ORC file, and almost nobody who runs those queries knows what's inside one. That's a shame, because the file format quietly decides a huge fraction of your query performance, often more than the engine on top of it. Switch a pipeline from CSV or JSON to Parquet and the same query can run an order of magnitude faster while the storage bill drops by more than half. That's not the query engine getting smarter; it's the file format doing work the engine no longer has to.

So let's open the file. I'll use Parquet as the main example because it's become the de facto standard, then contrast ORC, which shares the same ideas with different packaging. The goal is that by the end you can reason about why a query reads the bytes it reads.

Why columnar, in one paragraph

A row-oriented format (CSV, Avro, a database row store) stores all of record 1, then all of record 2. To read one column out of fifty, you still stream every byte of every row past your reader and discard the other forty-nine columns. A columnar format stores all the values of column A together, then all of column B. An analytical query that selects three columns reads only those three columns' bytes. And because each column holds values of one type with lots of repetition, it compresses far better than a row of mixed types. Less data read, and less data to begin with — the two wins compound. Everything below is the machinery that delivers those two wins.
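As a rough illustration of the "less data read" win, the toy sketch below lays the same three records out both ways and counts how many values a reader has to pass over to sum one column. The field names, values, and helper functions are invented for the example; real formats work on encoded bytes, not Python lists.

```python
# Toy comparison of row-oriented and column-oriented layouts.
# Field names and values are invented for illustration.
records = [
    {"date": "2024-01-01", "user_id": 1, "amount": 120},
    {"date": "2024-01-01", "user_id": 2, "amount": 80},
    {"date": "2024-01-02", "user_id": 1, "amount": 300},
]

# Row layout: every field of record 1, then every field of record 2, ...
row_layout = [value for record in records for value in record.values()]

# Columnar layout: all values of one column stored together.
column_layout = {name: [r[name] for r in records] for name in records[0]}


def sum_amount_row_layout(flat, fields_per_record, amount_index):
    """Sum 'amount' from the row layout; returns (total, values_scanned)."""
    total = 0
    scanned = 0
    for position, value in enumerate(flat):
        scanned += 1  # the reader streams past every value
        if position % fields_per_record == amount_index:
            total += value
    return total, scanned


def sum_amount_column_layout(columns):
    """Sum 'amount' from the columnar layout; returns (total, values_scanned)."""
    amounts = columns["amount"]
    return sum(amounts), len(amounts)


row_total, row_scanned = sum_amount_row_layout(row_layout, 3, 2)
col_total, col_scanned = sum_amount_column_layout(column_layout)
```

Both layouts give the same total, but the row layout visits every value of every record while the columnar layout visits only the `amount` values.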

The anatomy of a Parquet file

Parquet is a hybrid: it's columnar within horizontal slices of the data, not across the whole file. That hybrid structure is the key to understanding it. Working top down:

  • Row group — a horizontal partition of the rows (often ~128 MB of data). The file is a sequence of row groups. This is the unit of parallelism: one reader task typically processes one row group.
  • Column chunk — within a row group, all the values for a single column, stored contiguously. One column chunk per column per row group.
  • Page — within a column chunk, data is split into pages (~1 MB). The page is the smallest unit of encoding and compression, and it carries its own header with statistics.
  • Footer — at the end of the file, the metadata: the schema, and for every row group and column chunk, the byte offsets and the statistics (min, max, null count). Readers seek to the footer first.

graph TD
    subgraph FILE["Parquet file"]
        subgraph RG0["Row group 0 (~128 MB of rows)"]
            C0A["Column chunk: date<br/>pages..."]
            C0B["Column chunk: user_id<br/>pages..."]
            C0C["Column chunk: amount<br/>pages..."]
        end
        subgraph RG1["Row group 1"]
            C1A["date"]
            C1B["user_id"]
            C1C["amount"]
        end
        FOOT["Footer (at end)<br/>schema + per-row-group/column<br/>offsets & min/max/null stats"]
    end
    RG0 --> RG1 --> FOOT

A Parquet file: row groups split the data horizontally; within each, every column is a contiguous chunk of pages. The footer at the end indexes everything and stores min/max statistics. A reader parses the footer first to learn the layout — then reads only the column chunks it needs from only the row groups it can't rule out.
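The layout can be modelled in a few lines. This is a minimal sketch, not the real on-disk format: it skips pages, encodings, and byte offsets, and `build_toy_file` and its dictionary structure are hypothetical names chosen for the illustration.

```python
# Toy model of Parquet's layout: rows -> row groups -> column chunks,
# plus a footer of per-chunk statistics. Not the real on-disk format.

def build_toy_file(rows, rows_per_group):
    """Split rows into row groups and record min/max/null_count per column chunk."""
    row_groups = []
    footer = []
    for start in range(0, len(rows), rows_per_group):
        group_rows = rows[start:start + rows_per_group]
        # One column chunk per column: all of that column's values, contiguous.
        chunks = {name: [row[name] for row in group_rows] for name in group_rows[0]}
        row_groups.append(chunks)
        stats = {}
        for name, values in chunks.items():
            present = [v for v in values if v is not None]
            stats[name] = {
                "min": min(present) if present else None,
                "max": max(present) if present else None,
                "null_count": len(values) - len(present),
            }
        footer.append({"num_rows": len(group_rows), "columns": stats})
    return {"row_groups": row_groups, "footer": footer}


rows = [
    {"date": "2024-01-01", "user_id": 1, "amount": 120},
    {"date": "2024-01-01", "user_id": 2, "amount": None},
    {"date": "2024-01-02", "user_id": 1, "amount": 300},
    {"date": "2024-01-03", "user_id": 3, "amount": 45},
]
toy_file = build_toy_file(rows, rows_per_group=2)
```

With two rows per group, the four rows become two row groups of three column chunks each, and the footer holds one statistics entry per chunk.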

Encodings: why a column is so small

Before any general-purpose compression (Snappy, gzip, zstd) runs, Parquet encodes each column in a way that exploits its structure. These encodings are most of the storage win:

| Encoding | How it works | Great for |
|---|---|---|
| Dictionary | Build a dictionary of distinct values; store small integer codes instead of the values | Low-cardinality columns (status, country, category) |
| Run-length (RLE) | Store "value × N" instead of N copies | Long runs of repeated values (especially after sorting) |
| Bit-packing | Use only as many bits as the value range needs, not a full 32/64 | Small integers, dictionary codes |
| Delta | Store differences between consecutive values | Sorted IDs, timestamps |

Dictionary plus RLE plus bit-packing is the workhorse combination: a column of a few hundred distinct strings becomes a dictionary plus a stream of tiny bit-packed codes, with runs collapsed — often a 10×+ reduction before the codec even sees it. This is also why sorting your data before writing matters so much: sorted columns have long runs and tight value ranges, which both encode dramatically better and produce sharper statistics (next section).
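To make the encodings concrete, here is a small sketch of each idea on Python lists. It is an illustration under simplifying assumptions: real Parquet writes these as packed byte streams, and the function names and sample `status` column are made up for the example.

```python
# Toy versions of the four encodings. Real Parquet packs these into bytes;
# this only shows the ideas on plain Python lists.

def dictionary_encode(values):
    """Return (dictionary, codes): distinct values and one small int per value."""
    dictionary = []
    index_of = {}
    codes = []
    for value in values:
        if value not in index_of:
            index_of[value] = len(dictionary)
            dictionary.append(value)
        codes.append(index_of[value])
    return dictionary, codes


def run_length_encode(values):
    """Collapse consecutive repeats into [value, run_length] pairs."""
    runs = []
    for value in values:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1
        else:
            runs.append([value, 1])
    return runs


def bit_width(max_value):
    """Bits needed to represent integers in the range 0..max_value."""
    return max_value.bit_length()


def delta_encode(values):
    """Keep the first value, then the difference from each previous value."""
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]


status = ["ok", "ok", "ok", "error", "error", "ok", "ok", "ok", "ok"]
dictionary, codes = dictionary_encode(status)
runs = run_length_encode(codes)
width = bit_width(len(dictionary) - 1)

# The same column sorted first: fewer, longer runs.
sorted_runs = run_length_encode(dictionary_encode(sorted(status))[1])

deltas = delta_encode([1000, 1001, 1003, 1006])
```

Nine strings collapse to a two-entry dictionary and three runs of 1-bit codes. Sorting the column first leaves only two runs, which is the effect the paragraph above describes.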

The footer statistics: read less, the whole point

Here's where the file format starts skipping work for the engine. Because the footer stores the byte offsets of every column chunk along with its min, max, and null count, a query engine can do two powerful things before reading any data.

Column projection. SELECT date, amount reads only the date and amount column chunks; every other column's bytes are never touched. Free, automatic, and impossible in a row format.
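A minimal sketch of how footer offsets make projection possible, assuming a made-up toy layout: JSON-serialized column chunks, a JSON footer, and a 4-byte footer length at the very end. Real Parquet uses a Thrift-encoded footer and encoded pages; `write_toy_file` and `read_columns` are hypothetical helpers.

```python
import json
import struct

# Toy file: column chunks, then a JSON footer, then a 4-byte footer length.
# This mimics the "metadata at the end" idea only, not Parquet's real format.

def write_toy_file(columns):
    """Serialize each column as its own contiguous chunk and index it in a footer."""
    body = b""
    footer = {}
    for name, values in columns.items():
        chunk = json.dumps(values).encode("utf-8")
        footer[name] = {"offset": len(body), "length": len(chunk)}
        body += chunk
    footer_bytes = json.dumps(footer).encode("utf-8")
    return body + footer_bytes + struct.pack("<I", len(footer_bytes))


def read_columns(data, wanted):
    """Read the footer first, then only the byte ranges of the wanted columns."""
    (footer_len,) = struct.unpack("<I", data[-4:])
    footer = json.loads(data[-4 - footer_len:-4])
    bytes_read = 4 + footer_len
    result = {}
    for name in wanted:
        entry = footer[name]
        chunk = data[entry["offset"]:entry["offset"] + entry["length"]]
        bytes_read += len(chunk)
        result[name] = json.loads(chunk)
    return result, bytes_read


columns = {
    "date": ["2024-01-01", "2024-01-01", "2024-01-02"],
    "user_id": [1, 2, 1],
    "amount": [120, 80, 300],
    "comment": ["long free text " * 20] * 3,
}
data = write_toy_file(columns)
selected, bytes_read = read_columns(data, ["date", "amount"])
```

The wide `comment` column is never touched, so `bytes_read` stays well below the size of the file.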

Predicate pushdown. For WHERE amount > 1000, the engine checks each row group's amount min/max in the footer. If a row group's max is 500, it can't contain any matching row, so the engine skips the entire row group without reading it. On data sorted by the filter column, a selective predicate can prune most of the file.
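The min/max check itself is tiny. The sketch below assumes an invented footer structure (a list of per-row-group statistics dictionaries) and a hypothetical `row_groups_to_read` helper; only the comparison logic reflects the technique.

```python
# Toy predicate pushdown for "WHERE amount > threshold".
# The footer layout here is invented; only the min/max logic matters.

def row_groups_to_read(footer, column, threshold):
    """Return indexes of row groups that might contain a value > threshold."""
    keep = []
    for index, row_group in enumerate(footer):
        stats = row_group[column]
        if stats["max"] is None:
            continue  # chunk is all nulls, nothing can match
        if stats["max"] > threshold:
            keep.append(index)  # cannot be ruled out, must be read
    return keep


footer = [
    {"amount": {"min": 5, "max": 500}},      # max <= 1000: skipped
    {"amount": {"min": 200, "max": 4000}},   # may contain matches
    {"amount": {"min": 1500, "max": 9000}},  # min > 1000: every non-null row matches
]
to_read = row_groups_to_read(footer, "amount", 1000)
```

Statistics can only rule row groups out. Rows in the groups that survive still have to be read and filtered, because a max above the threshold does not mean every row matches.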

graph LR
    Q["Query:<br/>SELECT amount WHERE amount > 1000"]
    F["Read footer:<br/>row-group min/max for amount"]
    P{"Row group's<br/>max > 1000?"}
    SKIP["Skip row group<br/>(no bytes read)"]
    READ["Read only the<br/>amount column chunk"]
    Q --> F --> P
    P -->|no| SKIP
    P -->|yes| READ

Predicate pushdown plus column projection. The footer lets the engine rule out whole row groups by statistics and read only the referenced columns. The bytes a query reads are decided here, before data access — which is why this footer-and-statistics design is the heart of columnar query performance, and the same principle behind modern table formats and engines.

The practitioner's lever: statistics are only useful if they're selective. If a column's values are scattered randomly across row groups, every row group's min/max spans the whole range and nothing gets pruned. Sort (or at least cluster) your data by the columns you filter on before writing, and keep row groups a sensible size. That single habit turns predicate pushdown from theoretical into transformative.
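A quick way to see the effect of sorting is to take the same hundred values, once jumbled and once sorted, split them into row groups of ten, and count how many groups a `value > 90` filter has to read. The group size, threshold, and jumbling formula are arbitrary choices for this sketch.

```python
# Toy demonstration: the same values, scattered vs sorted, split into row groups.
# Sizes and the scattering formula are arbitrary choices for the example.

def min_max_per_group(values, rows_per_group):
    """Compute the (min, max) statistics a writer would store per row group."""
    groups = [values[i:i + rows_per_group] for i in range(0, len(values), rows_per_group)]
    return [(min(g), max(g)) for g in groups]


def groups_read(stats, threshold):
    """Count row groups that cannot be skipped for 'value > threshold'."""
    return sum(1 for _, high in stats if high > threshold)


scattered = [(i * 37) % 100 for i in range(100)]  # 0..99 in a jumbled order
clustered = sorted(scattered)

scattered_reads = groups_read(min_max_per_group(scattered, 10), 90)
sorted_reads = groups_read(min_max_per_group(clustered, 10), 90)
```

With sorted input, only the last row group can contain values above 90, so one group is read. With the jumbled input, the few matching values are spread across most groups, and most of them have to be read.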

ORC: same ideas, different packaging

ORC (Optimized Row Columnar) grew out of the Hive world and shares Parquet's DNA — columnar within horizontal slices, encodings, and statistics — with different names and a few different choices:

| Concept | Parquet | ORC |
|---|---|---|
| Horizontal slice | Row group | Stripe (~64 MB) |
| Statistics granularity | Row group + page | Stripe + row index (every 10k rows) |
| Lightweight indexes | Column statistics | Built-in indexes; optional bloom filters |
| Ecosystem lean | Spark, broad/neutral | Hive, Presto |

ORC's finer-grained row indexes (statistics every 10,000 rows within a stripe) and optional bloom filters can prune more aggressively for point lookups. Parquet has since gained optional page-level indexes and bloom filters of its own, though writer and engine support for them varies. Parquet's broader tool support and nested-data handling (the Dremel-style repetition/definition levels for representing nested and repeated fields) have made it the more common default outside the Hive ecosystem. In practice the choice is usually decided by your engine and ecosystem rather than a deep technical gap: both are excellent, and both beat row formats decisively for analytics.
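The row-index idea can be sketched the same way as row-group pruning, just at a finer grain. This is a toy model, assuming one stripe of sorted integers and the 10,000-row stride mentioned above; `build_row_index` and `blocks_for_lookup` are hypothetical names, not ORC APIs.

```python
# Toy ORC-style row index: min/max recorded every `stride` rows inside a stripe.
# The stride mirrors the 10,000-row figure above; the data is invented.

def build_row_index(values, stride):
    """Return (start_row, min, max) for each run of `stride` rows."""
    index = []
    for start in range(0, len(values), stride):
        block = values[start:start + stride]
        index.append((start, min(block), max(block)))
    return index


def blocks_for_lookup(row_index, target):
    """Return start rows of the blocks whose min/max range could hold target."""
    return [start for start, low, high in row_index if low <= target <= high]


stripe = list(range(30_000))  # one stripe, values sorted 0..29999
stripe_min, stripe_max = min(stripe), max(stripe)
row_index = build_row_index(stripe, stride=10_000)
candidates = blocks_for_lookup(row_index, 12_345)
```

Stripe-level statistics alone cannot rule this stripe out for a lookup of 12,345, but the row index narrows the read to a single 10,000-row block.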

What to carry away

Three things. Columnar layout means you read only the columns a query references, and each column compresses far better because it's one type with repetition. Encodings (dictionary, RLE, bit-packing, delta) do most of the size reduction before the codec, and they reward sorted data. The footer's offsets and statistics drive column projection and predicate pushdown, letting the engine skip whole row groups before reading: the bytes a query reads are decided by the file's metadata, not just the query.

That last idea is the one that keeps paying off. The "read the metadata, skip what you can, then read only what's left" pattern in Parquet is exactly the pattern that open table formats, layered on top of these files, extend to the whole dataset. Get the file format right and you've solved half the performance problem before the engine starts.