# Hash Tables in Query Execution: Hash Joins, Hash Aggregation, and Why They Spill to Disk

The same query ran in four seconds every day for three months, then one morning took ninety. Nothing about the SQL changed. What changed was the data volume on one side of a join crossing the threshold where the query engine's hash table no longer fit comfortably in the memory it had been allocated — and the engine, correctly, started spilling to disk to keep executing rather than fail outright. That cliff is one of the most common "why did this suddenly get slow" stories in query performance work, and understanding it means understanding a structure that has nothing to do with the on-disk indexes covered in [hash, bitmap, and inverted indexes](index-types-hash-bitmap-inverted) — this is a hash table that exists for the duration of a single query, entirely in memory, built and torn down fresh every time.

Worth stating plainly up front, because the name overlap causes real confusion: a **hash index** is a durable, on-disk structure that persists between queries and speeds up point lookups against stored data. The hash table this article covers is **ephemeral and execution-time** — built fresh inside the query engine for one join or one aggregation, discarded the moment that operation finishes, and never touching disk at all unless memory pressure forces it to.

## How does a hash join actually work?

A **hash join** answers an equi-join (`WHERE a.id = b.id`) in two phases. The **build phase** picks the smaller of the two relations (or the one the optimizer estimates is smaller) and hashes every row into an in-memory hash table, keyed on the join column. The **probe phase** streams the larger relation row by row, hashing each row's join key and looking it up in the hash table built in phase one — a match means a joined output row, a miss means that row simply doesn't participate. The entire algorithm's efficiency rests on one assumption: the build side's hash table fits comfortably in the memory budget the query engine has available for that operation.

## When does hash join beat sort-merge join, and when does it lose?

**Sort-merge join** takes the opposite approach: sort both relations by the join key, then walk them together in lockstep, advancing whichever side is behind and emitting matches as the sorted keys align. Hash join usually wins for equi-joins when the smaller side genuinely fits in memory, because building a hash table is cheaper than a full sort, and the probe phase is a single streaming pass over the larger side with no sort required on it at all. Sort-merge wins in two specific situations worth naming precisely: when the inputs are *already sorted* on the join key (from a prior operation, or because the underlying storage is naturally ordered that way — see how [ClickHouse](clickhouse-architecture-internals) and other sorted-storage engines can make this the common case rather than the exception), which eliminates the sort cost entirely and leaves sort-merge with nothing to lose against; and under genuine memory constraints, because external sort's disk-friendly, sequential-access pattern degrades far more gracefully than a hash join whose build side has outgrown memory — which leads directly to the spill mechanism below.

```mermaid
graph TD
    B["Build phase:hash smaller relationinto in-memory hash table"]
    P["Probe phase:stream larger relation,hash + lookup each row"]
    B --> P
    P -->|"match"| OUT["Emit joined row"]
    P -->|"miss"| SKIP["Row doesn't participate"]
    B -.->|"build side too largefor memory"| SPILL["Partition both sides to disk(grace hash join)"]
          
```

The build/probe hash join in its simple, all-in-memory form — fast, and the default assumption most query optimizers make when choosing between hash and sort-merge join. The spill path only activates once the build side outgrows the memory budget, and it's a fundamentally different, more expensive execution mode, not a minor slowdown of the same algorithm.

## What is hash aggregation, and how is it different from sorting for a GROUP BY?

**Hash aggregation** computes a `GROUP BY` by hashing each incoming row's grouping-column values into a hash table, using that hash as the key to find (or create) an accumulator entry, and updating the running aggregate state (sum, count, min, whatever the query needs) for that group as each row streams through. This needs exactly one pass over the input and no sort at all, which is why it's usually the faster default for aggregation. The alternative, **sort-based aggregation**, sorts the input by the grouping columns first so that all rows belonging to the same group become physically adjacent, then aggregates each contiguous run in a single streaming pass with essentially no extra memory beyond the current group's accumulator — a real advantage when the number of distinct groups is enormous, because sort-based aggregation never needs to hold more than one group's state in memory at a time, where hash aggregation needs a hash-table entry for every distinct group simultaneously.

## What actually happens when the hash table doesn't fit in memory?

This is where the "why did my query suddenly get ten times slower" story comes from. When the build side of a hash join (or the distinct-key set of a hash aggregation) grows past the memory budget the query engine allocated for that operation, the engine has to **spill to disk** rather than simply fail or run out of memory. The classic algorithm for this is the **grace hash join** (with real production implementations more often running a *hybrid* hash join, adaptively deciding how much to keep in memory versus spill): both relations get partitioned into buckets using a hash function on the join key, chosen so that any two rows that could possibly join are guaranteed to land in the same pair of partitions — then each partition pair, now small enough to fit in memory individually, gets joined the ordinary in-memory way, one pair at a time.

The reason this shows up as such a dramatic performance cliff rather than a gradual slowdown: the moment spilling activates, the operation goes from one sequential read of each input straight through memory to writing every row to disk during partitioning and then reading it all back again during the per-partition join phase — the I/O cost roughly doubles at minimum, and depending on how many partitioning passes are needed for genuinely large skewed data, it can compound further. This is precisely why a query that's comfortably fast right up until a data volume threshold can fall off a cliff rather than degrade smoothly — spilling isn't a slower version of the same algorithm, it's qualitatively different work.

```sql
-- A join that's fine at yesterday's volume can spill today purely
-- because the build side (the smaller-seeming table) crossed the
-- engine's memory threshold for that operation
EXPLAIN ANALYZE
SELECT o.order_id, c.customer_name
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id;
-- look for a spill/disk-based indicator in the plan output —
-- Spark, Snowflake, and DuckDB all surface this differently,
-- but every one of them has a way to tell you it happened
```

Different engines handle this pressure differently, and the difference is worth knowing before you're debugging it live. [Spark](spark-performance-optimization) spills shuffle and join data to local disk and surfaces this in its UI as spill metrics, directly tunable via executor memory and shuffle partition count. [Snowflake](snowflake-internals) spills to local SSD first and then to remote storage under more extreme pressure, with query profile explicitly showing "bytes spilled to local storage" and "bytes spilled to remote storage" as distinct, escalating tiers. [DuckDB](duckdb-internals), built for single-node execution with a defined memory limit, spills to disk using its own out-of-core join and aggregation implementations specifically so it can process datasets larger than available RAM without the caller having to manually partition anything.

**The join-order assumption an optimizer makes about which side is "smaller" is a statistics-driven guess, and stale statistics are the single most common reason a hash join spills that shouldn't have.** I've debugged a sudden regression that turned out to be exactly this: a table's row count had grown 40x since the last time statistics were refreshed, the optimizer still believed it was the smaller side of the join based on stale numbers, built the hash table on what was now actually the larger relation, and the operation spilled hard. The fix wasn't a bigger cluster — it was refreshing statistics so the optimizer picked the correct build side. Before tuning memory or scaling compute in response to a sudden join slowdown, check whether the optimizer's build-side choice still matches reality.

## What to carry away

Hash join and hash aggregation both trade a one-time in-memory build cost for a fast, single-pass probe or accumulate phase — hash join beats sort-merge whenever the build side fits comfortably in memory and the inputs aren't already sorted, and hash aggregation beats sort-based aggregation whenever the number of distinct groups is manageable, losing that edge only when group cardinality gets enormous enough that sort-based aggregation's near-zero memory footprint per group starts to matter more than avoiding a sort.

The spill-to-disk path — grace or hybrid hash join partitioning both sides when the build side outgrows memory — is qualitatively different work, not a slower version of the same algorithm, which is exactly why performance falls off a cliff rather than degrading gracefully once a data volume threshold is crossed. Before assuming a sudden join slowdown needs more compute, check whether the optimizer's build-side and statistics assumptions still match the actual data — a stale row-count estimate choosing the wrong side to hash is a far more common root cause than genuine data growth outpacing the cluster.
