Presto Internals: MPP Query Execution and the Coordinator/Worker Model

Presto occupies a specific and valuable niche: interactive, ad-hoc SQL over very large data, often spread across multiple systems, with latency measured in seconds rather than the minutes a batch engine takes. It was built at Facebook because Hive-on-MapReduce was too slow for analysts who wanted to explore data conversationally. The design choices that make it fast for that use case are different from a batch engine's, and they come with real trade-offs. Understanding them is how you know when to reach for Presto and when not to.

A quick note on names, because 2020 is a confusing year for it: in early 2019 the original creators left Facebook and forked the project, so there are now two lineages — PrestoDB (Facebook/Linux Foundation) and PrestoSQL (the founders' Presto Software Foundation). The internals I'm describing are common to both. The PrestoSQL side has also signaled it will rebrand shortly to avoid the naming confusion, so don't be surprised if "Presto" becomes "Trino" by the end of the year. I'll say Presto throughout.

The two big ideas

Presto's character comes from two decisions. First, it's a pure query engine with no storage of its own — it reads data from wherever it lives (HDFS, S3, a relational database, a NoSQL store) through pluggable connectors. Second, it executes queries as a pipelined, in-memory MPP computation that streams data between stages rather than writing intermediate results to disk. Hold those two and the rest follows, including the limitations.

Coordinator and workers

A Presto cluster is one coordinator and many workers. The coordinator is the brain: it parses SQL, plans and optimizes the query, breaks it into a distributed execution plan, and schedules the pieces onto workers. The workers are the muscle: they pull data through connectors, run the actual processing, and exchange intermediate data with each other. The coordinator doesn't process data itself; it orchestrates.

graph TD
    CLIENT["SQL client / BI tool"]
    COORD["Coordinator
parse → plan → optimize →
schedule stages, track tasks"] W1["Worker
tasks + operators"] W2["Worker
tasks + operators"] W3["Worker
tasks + operators"] SRC["Connectors → data sources
(S3/HDFS, MySQL, Kafka, ...)"] CLIENT --> COORD COORD --> W1 COORD --> W2 COORD --> W3 W1 <-->|"exchange (stream)"| W2 W2 <-->|"exchange"| W3 W1 --> SRC W2 --> SRC W3 --> SRC

The coordinator/worker model. The coordinator plans and schedules; workers fetch data through connectors and stream intermediate results to one another via exchanges. Crucially, data flows worker-to-worker in memory between stages — it is never landed to disk and re-read, which is the core difference from MapReduce and the source of Presto's interactive speed.

How a query becomes distributed work

The coordinator turns a SQL statement into a hierarchy with a precise vocabulary:

UnitWhat it is
StageA logical step of the plan (e.g. scan, partial aggregate, final aggregate). Stages are connected by exchanges.
TaskA stage's work on one worker — one task per worker per stage.
SplitA chunk of the input data (e.g. a file or file range) that a task processes. Parallelism = splits processed concurrently.
Driver / operatorWithin a task, a pipeline of operators (scan, filter, join, aggregate) that data flows through. Drivers run operators on splits.

Between stages sit exchanges — the mechanism that moves data from the tasks of one stage to the tasks of the next, including the repartitioning (by join or group key) that's the equivalent of a shuffle. The defining detail: this exchange streams in memory and over the network. A producing stage hands rows to the consuming stage as they're ready; there's no intermediate materialization to disk between stages. This pipelined execution is exactly why Presto returns first rows quickly and finishes interactive queries in seconds.

Connectors: one query, many sources

Because Presto has no storage, every data source is reached through a connector — a plugin that knows how to list splits, read data, and expose the source's schema as tables. There's a Hive connector for data lakes, connectors for relational databases, for Kafka, for many NoSQL stores. Two things make this powerful:

  • Federation. Because every source looks like SQL tables, a single query can JOIN a table in your data lake against a dimension table in a relational database — Presto pulls from both and joins in memory. One SQL dialect over many systems.
  • Pushdown. A good connector pushes work down to the source — predicates, column projection, sometimes partial aggregation — so Presto reads less. Against a columnar lake format like Parquet/ORC, that means pruning row groups and reading only needed columns before data ever reaches a worker.

The pushdown caveat worth knowing: federation is only as fast as the pushdown. If a connector can't push a filter down, Presto must pull the rows over and filter them itself — fine for a small table, painful for a large one. When a federated query is slow, the question is almost always "what got pushed to the source, and what got dragged back to a worker?"

The trade-off: speed bought with no mid-query fault tolerance

Here's the cost of that streaming, in-memory design, and it's the most important thing to internalize about Presto in 2020. Because stages stream to each other in memory with nothing materialized in between, there's no checkpoint to recover from if a worker dies mid-query. Lose a worker and the whole query fails; you re-run it. A batch engine like Spark, which writes shuffle output to disk between stages, can recompute a lost partition and survive — Presto trades that resilience away for latency.

For Presto's target workload — interactive queries that run for seconds — that's a sensible trade: a query short enough to just re-run doesn't need fault tolerance, and avoiding disk materialization is what makes it fast. But it means Presto has historically been a poor fit for very long, ETL-scale batch jobs, where a multi-hour query failing near the end because one worker hiccupped is unacceptable. (The community is actively working on optional fault-tolerant execution to widen Presto into batch territory, but as I write this the standard execution model is all-or-nothing per query.) The other face of the same coin is memory: since everything is in memory, a query whose hash tables or sorts exceed the cluster's memory limits is killed rather than spilling — so big joins and aggregations need either enough memory or a rewrite.

Where Presto fits

Put the pieces together and Presto's niche is sharp:

Great atNot built for
Interactive, ad-hoc SQL exploration (seconds)Multi-hour ETL needing mid-query fault tolerance
Federated queries across many sourcesBeing a system of record (it has no storage)
SQL over a data lake without loading data firstQueries whose working set exceeds cluster memory
A shared query layer for analysts and BI toolsHigh-concurrency point lookups (use a database)

What to carry away

Presto is a storage-free, MPP SQL engine whose speed comes from pipelined in-memory execution — a coordinator plans the query into stages, tasks, and splits; workers stream data to each other through exchanges with no disk materialization between stages; and connectors let one SQL query federate across many systems with pushdown to read less. The same design that makes it fast for interactive work removes mid-query fault tolerance and caps you at cluster memory, which is why it shines for exploration and federation and steps aside for long batch ETL.

Reach for Presto when analysts need to ask fast questions across your data — including across systems — without waiting on a batch engine or pre-loading everything into one warehouse. Reach for something else when the job is a long, must-not-fail transformation. Know which one you're doing, and the engine behaves exactly as designed.