ZooKeeper & Consensus: Paxos, Raft, and How Distributed Systems Agree

Here's a problem that sounds trivial and is genuinely one of the hardest in computer science: get a handful of servers to agree on a single fact — say, "who is the leader?" — when any of them can crash at any moment and the network can drop or delay messages. Every distributed system you rely on has to solve this. Kafka has to agree on which broker owns a partition; HBase on which server hosts a region; a database cluster on which node accepts writes. Get it wrong and you get the nightmare scenario: two nodes both believe they're in charge, both accept writes, and your data forks. This is the consensus problem, and the machinery that solves it — ZooKeeper, Paxos, Raft — is the quiet bedrock under most of the systems on this blog.

I'll start with ZooKeeper, the coordination service that so many systems outsourced this problem to, then go down a layer to the consensus algorithms — Paxos and the more teachable Raft — that make agreement possible, and finish with the practical rules (quorums, odd member counts) that fall out of the theory.

ZooKeeper: outsourcing the hard part

Most systems don't want to implement consensus themselves — it's subtle and easy to get fatally wrong — so for years they delegated it to Apache ZooKeeper. ZooKeeper presents a deceptively simple interface: a small, in-memory, hierarchical key-value store that looks like a filesystem. The nodes are called znodes, arranged in paths like /kafka/brokers/ids/1, and a cluster of ZooKeeper servers (an ensemble) keeps this tree strongly consistent across all of them.

What makes it a coordination tool rather than just a tiny database are two special features:

Ephemeral znodes exist only as long as the client that created them keeps its session alive. If that client crashes or disconnects, ZooKeeper automatically deletes the node. This is how membership and liveness work — a broker creates an ephemeral node to say "I'm alive"; when it dies, the node vanishes.
Watches let a client subscribe to a znode and get notified the instant it changes. No polling — you're told when the thing you care about happens.

From those two primitives, plus sequential znodes (ZooKeeper appends a monotonically increasing number), you can build the classic coordination patterns: leader election (whoever creates the lowest-numbered sequential node wins; everyone else watches the node ahead of them), distributed locks, configuration management (store config in a znode, watch it for live updates), and service discovery. ZooKeeper itself keeps its tree consistent using its own consensus protocol, ZAB (ZooKeeper Atomic Broadcast) — which, like the algorithms below, is built around a leader and a majority.

This is why, as of 2019, a Kafka or HBase cluster comes with a ZooKeeper ensemble bolted on: those systems delegate "who's the leader / who's alive / where does this live" to ZooKeeper rather than solving consensus internally. It's also a real operational cost — another distributed system to run, tune, and keep healthy — which is exactly why some systems are starting to move consensus in-house (Kafka's own metadata quorum is on the horizon). Understanding what ZooKeeper does makes it obvious why a project would eventually want to absorb it.

The consensus problem, precisely

Strip away the products and the core problem is this: a set of nodes must agree on a value (or an ordered sequence of values — a replicated log) such that they all agree on the same thing, they only agree on a value someone actually proposed, and once a value is decided it never changes — all while tolerating crashes and an unreliable network. The famous, unavoidable catch (the FLP result) is that in a fully asynchronous network you can't guarantee both perfect safety and guaranteed progress; practical algorithms stay safe always and make progress as long as a majority is up and the network behaves enough of the time.

The load-bearing word is majority (a quorum). Every workable consensus algorithm requires more than half the nodes to agree before anything is committed. That single requirement is what prevents two leaders from both succeeding — and it's why cluster sizes and failure tolerance work out the way they do.

Paxos and Raft: agreeing on a log

Paxos (Leslie Lamport, 1998) was the first proven-correct consensus algorithm and ran in production systems for years — but it's notoriously hard to understand and even harder to implement correctly, which became a real engineering problem. So in 2014, Raft was designed explicitly for understandability, and it has largely become the algorithm of choice (etcd, Consul, and many newer systems use it). Because Raft is the one you can actually reason about, it's the one worth knowing.

Raft breaks consensus into two clear pieces:

Leader election. Time is divided into terms. Nodes are followers by default; if a follower hears nothing from a leader within a timeout, it becomes a candidate and requests votes for a new term. A node that wins a majority of votes becomes leader. Randomized timeouts make it unlikely two candidates tie repeatedly, so a leader emerges quickly.
Log replication. All client writes go to the leader, which appends them to its log and replicates them to followers. Once a majority has stored an entry, the leader marks it committed and applies it; followers apply committed entries in the same order. Every node's log converges to the same sequence.

graph TD
    C["Client write"]
    L["Leader (current term)
append to log"]
    F1["Follower 1
replicate"]
    F2["Follower 2
replicate"]
    F3["Follower 3"]
    F4["Follower 4"]
    COMMIT["Majority stored it →
entry committed & applied"]
    C --> L
    L --> F1
    L --> F2
    L --> F3
    L --> F4
    F1 --> COMMIT
    F2 --> COMMIT

Raft log replication. The leader appends a client write and ships it to followers; the moment a majority (here 3 of 5, including the leader) has the entry, it's committed and applied — and stays committed forever. Requiring a majority is what guarantees a newly elected leader always already has every committed entry, so no decided value is ever lost or reversed.

Quorums and the odd-number rule

The majority requirement explains the operational rules people follow by rote. With N nodes, a quorum is ⌊N/2⌋+1, and the cluster keeps working as long as a quorum is alive — so it tolerates the failure of ⌊(N−1)/2⌋ nodes. That's why you see odd cluster sizes:

Nodes	Quorum	Failures tolerated
3	2	1
4	3	1
5	3	2

Notice that 4 nodes tolerate the same single failure as 3 — the extra node buys you nothing and costs you another machine to coordinate, so you always pick odd. The deeper reason for the majority rule is preventing split-brain: if a network partition splits the cluster, at most one side can hold a majority, so at most one side can elect a leader and accept writes. The minority side, unable to reach quorum, correctly refuses to act. That's the whole point — it's better for the minority to stop than for both halves to proceed and fork the data.

Consensus protects against split-brain only if you actually run a quorum. The most common self-inflicted outage is running an even number of coordination nodes, or spreading exactly two across two data centers — then a single partition leaves neither side with a majority and the whole system halts, or (worse, with a misconfiguration) both sides act. Run odd-sized ensembles, and if you span data centers, think hard about where the majority can survive a site loss. The math isn't optional; it's the guarantee.

What to carry away

Agreement under failure is the hard problem at the bottom of distributed systems, and it's solved by a small set of ideas. ZooKeeper packages them into a coordination service — znodes, ephemeral nodes, and watches — that systems like Kafka and HBase lean on for leader election, membership, and configuration instead of building consensus themselves. Underneath, consensus algorithms (Paxos, and the more understandable Raft with its leader election and majority-committed replicated log) provide the actual guarantee. And the recurring rule everything reduces to is the quorum: a majority must agree, which is why clusters are odd-sized and why a partitioned minority correctly stops rather than risking split-brain.

You may never implement Raft, but you operate systems that depend on it daily. Knowing why your coordination layer wants three or five nodes, why a two-node setup is a trap, and why a partition makes the minority go quiet turns a class of baffling outages into expected behavior. It's the foundation under Kafka, HBase, and every cluster that has to pick a leader and mean it.