# BigQuery Internals: Dremel, Colossus, and Separating Storage from Compute

The first time you run a query in BigQuery that scans a few terabytes and get results in seconds — with no cluster to provision, no indexes to build, no vacuum to schedule — it feels like a different category of system from the warehouses that came before. It is. BigQuery isn't a faster version of a traditional MPP database; it's a fundamentally different architecture that happens to speak SQL. Understanding how it works underneath is what turns it from magic into a tool you can use deliberately (and cost-effectively, which is the same thing in BigQuery).

The whole design rests on one decision that everything else follows from: **storage and compute are completely separate systems, connected by a network fast enough that the separation doesn't cost you**. Let's build up from there — the storage (Colossus + Capacitor), the engine (Dremel and its serving tree), the network (Jupiter) that makes it all hang together, and slots, the unit you actually pay for.

## The foundational split: storage is not compute

In a classic MPP warehouse, each node owns a slice of the data on its local disks and processes that slice. Storage and compute are coupled: to store more you add nodes, to compute more you add the same nodes, and resizing the cluster means physically reshuffling data. BigQuery throws this out. Your tables live in **Colossus**, Google's distributed file system, entirely separate from the machines that run queries. When you submit a query, BigQuery grabs compute from a shared pool, that compute reads your data from Colossus over the network, processes it, and releases the compute.

This is why there's no cluster to manage and why it scales so elastically: storage grows independently and cheaply, and a query can recruit thousands of workers for a few seconds and then give them back. It only works because the network is extraordinary — more on Jupiter shortly.

```mermaid
graph TD
    subgraph C["Compute (Dremel) — ephemeral, shared pool"]
        DR["Dremel workers (slots)"]
    end
    subgraph NET["Jupiter network — petabit bisection bandwidth"]
        J["..."]
    end
    subgraph S["Storage (Colossus) — persistent, independent"]
        CAP["Tables in Capacitor columnar format"]
    end
    DR <-->|"read columns over the network"| J
    J <--> CAP
          
```

BigQuery's defining split: a persistent storage layer (Colossus) and an ephemeral, shared compute layer (Dremel) joined by the Jupiter network. Compute is allocated per query and released after — so you never size or manage a cluster, and storage scales on its own axis. The same storage/compute decoupling later reshaped the whole warehouse market.

## Colossus and Capacitor: the storage layer

**Colossus** is Google's successor to the original Google File System — the distributed, replicated storage that handles durability, replication, and the file management BigQuery never makes you think about. On top of it, BigQuery stores your tables in **Capacitor**, a columnar format.

Capacitor applies the same principles as [Parquet and ORC](parquet-orc-internals) — column-oriented layout so a query reads only the columns it references, heavy encoding and compression of each column, and per-column statistics — with Google-specific optimizations (it invests serious effort at write time to find near-optimal encodings, because data is written once and read many times). The takeaway is the same: when you `SELECT` three columns out of a hundred, BigQuery reads only those three columns' bytes from Colossus, and BigQuery bills you on bytes scanned — so the columnar format is directly visible in your invoice.

**The cost model is the architecture, exposed.** On-demand BigQuery charges by bytes scanned, and bytes scanned is decided by Capacitor's columnar layout: select fewer columns, scan fewer bytes, pay less. `SELECT *` on a wide table is the single most expensive habit in BigQuery. The internals aren't trivia here — they're the line item.

## Dremel: the engine and its serving tree

The query engine is **Dremel**, and its signature is the **multi-level serving tree** — the same shape used in distributed search. A query enters at the **root**, which rewrites it and pushes work down through intermediate **mixer** levels to a wide base of **leaf** workers. The leaves do the heavy lifting: they read the actual columns from Colossus in parallel and apply filters and partial aggregation. Partial results flow back *up* the tree, with each level aggregating its children's output, until the root assembles the final answer.

```mermaid
graph TD
    ROOT["Root server(receives query, returns result)"]
    M1["Mixer"]
    M2["Mixer"]
    L1["Leaf: scan + filter + partial agg(reads columns from Colossus)"]
    L2["Leaf"]
    L3["Leaf"]
    L4["Leaf"]
    ROOT --> M1
    ROOT --> M2
    M1 --> L1
    M1 --> L2
    M2 --> L3
    M2 --> L4
          
```

Dremel's serving tree. The query fans out from the root through mixers to many leaf workers that scan columns in parallel from Colossus; partial aggregates flow back up and combine at each level. Massive parallelism at the leaves over a columnar scan is what lets BigQuery brute-force terabytes in seconds without indexes — it doesn't need an index because it can afford to scan the relevant columns in parallel across thousands of workers.

For operations that need to redistribute data across workers — large `JOIN`s and `GROUP BY`s, the equivalent of a shuffle — Dremel uses an **in-memory shuffle** tier that decouples the producing and consuming stages through fast distributed memory rather than writing to local disk between stages. This is a meaningful departure from the disk-bound MapReduce shuffle and a big part of why complex BigQuery queries stay fast.

## Jupiter: the network that makes it possible

Everything above depends on one thing that's easy to overlook: the network. If compute is separate from storage, every query reads all its data across the network — so the network has to be effectively as fast as local disk used to be. Google's **Jupiter** data-center network fabric delivers petabit-per-second bisection bandwidth within a data center, enough that thousands of Dremel leaves can stream columns from Colossus simultaneously without the network becoming the bottleneck.

This is the quiet enabler. The storage/compute separation that defines BigQuery is only practical because the interconnect removed the penalty that separation would normally impose. No fast network, no serverless warehouse.

## Slots: the unit of compute you actually buy

From the user's side, all that distributed machinery is abstracted into one concept: the **slot**, a unit of compute capacity (roughly, a share of a worker). When a query runs, BigQuery works out how many slots it needs and schedules its execution tree across the slots available to you. Two purchasing models:

| Model | You pay for | Fits |
| --- | --- | --- |
| **On-demand** | Bytes scanned per query | Spiky / unpredictable workloads; getting started |
| **Flat-rate** (reserved slots) | A fixed number of slots, by the month/hour | Steady, heavy workloads wanting predictable cost |

Because slots are shared and dynamically scheduled, a query that needs more capacity than is momentarily free simply queues for slots rather than failing — the fair-scheduling of a multi-tenant compute pool, again a consequence of compute being separate and shared rather than statically owned.

## What to take away

BigQuery makes sense once you hold the one idea everything descends from: **storage (Colossus + Capacitor) and compute (Dremel) are separate systems joined by a network (Jupiter) fast enough to make the separation free**. From that flows serverless elasticity (compute is borrowed per query and returned), the columnar cost model (you pay for bytes scanned, set by the format), brute-force speed without indexes (thousands of leaves scanning columns in parallel down a serving tree), and slots as the abstraction over it all.

That architecture — decouple storage from compute, lean on a great network, scale compute elastically per query — is not staying inside Google. It's the template the rest of the cloud-warehouse world is converging on, because it solves the coupling that made the previous generation rigid and expensive. BigQuery just got there first, by building it on infrastructure Google already had.
