Why Kafka Is So Fast — and How It Scales: Zero-Copy, Page Cache, KRaft, and Tiered Storage

People are surprised that Kafka, written largely in Java and persisting every message to disk, routinely pushes millions of messages a second per broker. The intuition says "disk + JVM = slow," and the intuition is wrong — not because Kafka cheats, but because it's engineered around how disks, the kernel, and the network actually behave rather than how we imagine they do. The performance is not one clever trick; it's a handful of decisions that each remove a layer of overhead, and they compound.

This is a companion to Kafka Internals, which covers the log abstraction, partitions, replication and the ISR, consumer groups, and exactly-once. Here I assume that foundation and answer two specific questions: why is Kafka so fast, and how does it scale as you throw more load and more hardware at it. If "what's a partition" is fuzzy, read that one first.

Speed, part 1: the disk is not the enemy you think

The belief that "disk is slow" conflates two very different things: random access and sequential access. Random reads and writes — seeking all over the platter, or scattering across an SSD — are slow. But sequential access is enormously faster, often within an order of magnitude of memory bandwidth, because it streams contiguous blocks with no seeking and lets the hardware prefetch aggressively.
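To put rough numbers on that gap, here is a back-of-envelope calculation for a spinning disk. The three inputs are illustrative round figures I picked for the sketch, not measurements of any particular drive, and transfer time for the random reads is ignored.

```python
# Back-of-envelope comparison of random vs sequential reads on a spinning disk.
# All three inputs are illustrative assumptions, not measurements.
POSITIONING_S = 0.008             # assumed seek + rotational delay per random read
RANDOM_READ_BYTES = 4096          # assumed size of each random read (4 KiB)
SEQUENTIAL_BYTES_PER_S = 150e6    # assumed sustained sequential throughput


def random_throughput(bytes_per_read: int, positioning_s: float) -> float:
    """Bytes per second when every read pays a full positioning delay."""
    return bytes_per_read / positioning_s


random_bps = random_throughput(RANDOM_READ_BYTES, POSITIONING_S)
speedup = SEQUENTIAL_BYTES_PER_S / random_bps

print(f"random:     {random_bps / 1e6:.3f} MB/s")
print(f"sequential: {SEQUENTIAL_BYTES_PER_S / 1e6:.0f} MB/s")
print(f"sequential is roughly {speedup:.0f}x faster under these assumptions")
```

The exact ratio depends entirely on the assumed inputs. The point is only that paying a positioning delay per small read costs orders of magnitude, which is the cost an append-only layout avoids.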

Kafka's core data structure — the append-only commit log from the internals piece — is built precisely to make all I/O sequential. Producers append to the end of a partition's log; consumers read forward from an offset. Writes are appends; reads are linear scans. There are no in-place updates and no random seeks in the hot path. Kafka turned the one thing disks are good at into its entire storage model, which is why it can persist everything to disk and still be fast.
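A toy version of that storage model fits in a few lines. This is a minimal sketch, assuming a single file with length-prefixed records and an in-memory offset index; the class and method names are mine, and real Kafka segments, indexes, and record formats are considerably more involved.

```python
import os
import struct
import tempfile


class AppendOnlyLog:
    """Toy single-partition log: length-prefixed records in one file."""

    def __init__(self, path: str) -> None:
        self._path = path
        self._positions: list[int] = []   # offset -> byte position in the file
        # "ab" opens for appending, so every write lands at the end of the file.
        self._file = open(path, "ab")

    def append(self, payload: bytes) -> int:
        """Write one record at the tail and return its offset."""
        self._positions.append(self._file.tell())
        self._file.write(struct.pack(">I", len(payload)) + payload)
        self._file.flush()
        return len(self._positions) - 1

    def read_from(self, offset: int):
        """Yield (offset, payload) pairs by scanning forward from offset."""
        if offset >= len(self._positions):
            return
        with open(self._path, "rb") as reader:
            reader.seek(self._positions[offset])   # one seek, then a linear scan
            for current in range(offset, len(self._positions)):
                (length,) = struct.unpack(">I", reader.read(4))
                yield current, reader.read(length)

    def close(self) -> None:
        self._file.close()


with tempfile.TemporaryDirectory() as tmp:
    log = AppendOnlyLog(os.path.join(tmp, "partition-0.log"))
    offsets = [log.append(m) for m in (b"a", b"bb", b"ccc")]
    tail = list(log.read_from(1))
    log.close()

print(offsets, tail)
```

Nothing in the write path ever moves backwards in the file, and a reader does one positioning step and then streams. That is the whole trick.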

Speed, part 2: let the OS page cache do the caching

Most databases build an elaborate in-process cache in heap memory. Kafka deliberately doesn't. It writes to and reads from files, and leans on the operating system page cache to keep hot data in RAM. This is a bigger deal than it sounds, for several reasons.

No duplicated cache. Data isn't held once in a JVM cache and again in the OS cache — there's one copy, in the page cache, managed by the kernel.
No garbage-collection pressure. The cached data lives outside the JVM heap, so it doesn't create GC work. A big heap full of cached messages would mean punishing GC pauses; Kafka sidesteps that entirely.
Warm restarts. The page cache survives a broker process restart (it's the kernel's, not the process's), so a restarted broker doesn't start cold.
The common case never hits disk. Consumers usually read recently written data — which is still in the page cache from the write. So a "read from disk" is frequently a read from RAM, with no physical I/O at all.

Speed, part 3: zero-copy with sendfile

This is the famous one, and it's worth seeing exactly what it removes. Picture a consumer fetching messages. The naive path to get bytes from a file out to a network socket copies data four times and crosses the user/kernel boundary repeatedly: disk → kernel read buffer → application buffer → kernel socket buffer → network card. Each copy burns CPU and memory bandwidth, and the data never even needed to be looked at by the application — Kafka is just shipping bytes it already has.

Kafka uses the sendfile system call (via Java's FileChannel.transferTo) to tell the kernel: send these bytes from this file straight to this socket. The data goes from the page cache to the network interface without ever being copied into the application's memory. This is zero-copy. It eliminates the two copies through user space and most of the user/kernel transitions, and it's a primary reason a single broker can saturate a network link while barely touching the CPU.

graph TD
    subgraph N["Naive path — 4 copies, multiple context switches"]
        D1["Disk / page cache"] --> K1["Kernel read buffer"]
        K1 --> A1["Application buffer (user space)"]
        A1 --> S1["Kernel socket buffer"]
        S1 --> NIC1["Network card"]
    end
    subgraph Z["Kafka with sendfile — zero-copy"]
        D2["Page cache"] --> NIC2["Network card<br/>(kernel copies directly; app never touches the bytes)"]
    end

Zero-copy collapses four copies and several user/kernel transitions into a direct page-cache-to-NIC transfer. Because the broker is forwarding bytes it doesn't need to inspect, skipping the trip through application memory is pure win — and it pairs with the page cache: data written moments ago is still in cache, so serving a consumer is often RAM → NIC with no disk read.
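The same kernel facility is reachable from most languages. Below is a small Python analogue of the file-to-socket handoff, not Kafka's code: it assumes a local socket pair standing in for broker and consumer, and a made-up segment file. Python's socket.sendfile() uses os.sendfile() where the platform supports it and quietly falls back to ordinary read/send where it doesn't, so the sketch runs either way.

```python
import os
import socket
import tempfile


def send_file_zero_copy(path: str, sock: socket.socket) -> int:
    """Send a whole file over sock and return the number of bytes sent.

    On platforms with os.sendfile(), the bytes move kernel-side and never
    pass through a Python buffer; elsewhere Python falls back to send().
    """
    with open(path, "rb") as segment_file:
        return sock.sendfile(segment_file)


# Small payload (about 1.9 KB) so it fits in the socket buffer without a reader thread.
payload = b"kafka-record-batch " * 100

with tempfile.TemporaryDirectory() as tmp:
    segment_path = os.path.join(tmp, "segment.log")   # hypothetical segment file
    with open(segment_path, "wb") as f:
        f.write(payload)

    # A connected socket pair stands in for the broker and consumer ends.
    broker_side, consumer_side = socket.socketpair()
    with broker_side, consumer_side:
        sent = send_file_zero_copy(segment_path, broker_side)
        received = b""
        while len(received) < sent:
            chunk = consumer_side.recv(4096)
            if not chunk:
                break
            received += chunk

print(sent, received == payload)
```

Notice that the sending side never holds the payload in a variable. It hands the kernel a file and a socket and gets back a byte count.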

Speed, part 4: batching and compression end to end

The last big lever is amortization. Sending one message at a time means one network round trip and one set of per-request overhead per message. Kafka producers instead batch: they accumulate messages destined for the same partition and send them as one request. Two settings control it — linger.ms (wait briefly to let a batch fill) and batch.size (cap the batch). A few milliseconds of linger.ms can multiply throughput by turning thousands of tiny requests into a handful of fat ones.
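The interplay of the two settings is easy to see in a toy accumulator. This is a sketch under simplifying assumptions: the class is hypothetical, time is passed in explicitly so the behavior is deterministic, and a batch is sent as soon as it reaches batch_size, which is simpler than what the real producer does.

```python
from collections import defaultdict


class BatchAccumulator:
    """Toy per-partition accumulator driven by an explicit clock (in ms)."""

    def __init__(self, batch_size: int, linger_ms: int) -> None:
        self.batch_size = batch_size
        self.linger_ms = linger_ms
        self._buffers: dict[int, list[bytes]] = defaultdict(list)
        self._first_append_ms: dict[int, int] = {}
        self.sent_requests: list[tuple[int, list[bytes]]] = []

    def append(self, partition: int, record: bytes, now_ms: int) -> None:
        buf = self._buffers[partition]
        if not buf:
            self._first_append_ms[partition] = now_ms   # linger clock starts here
        buf.append(record)
        if sum(len(r) for r in buf) >= self.batch_size:
            self._flush(partition)                      # batch is full: send now

    def poll(self, now_ms: int) -> None:
        """Send any batch that has waited at least linger_ms."""
        for partition, started in list(self._first_append_ms.items()):
            if now_ms - started >= self.linger_ms:
                self._flush(partition)

    def _flush(self, partition: int) -> None:
        self.sent_requests.append((partition, self._buffers.pop(partition)))
        del self._first_append_ms[partition]


acc = BatchAccumulator(batch_size=64, linger_ms=10)
for i in range(8):
    acc.append(0, b"x" * 16, now_ms=i)   # eight 16-byte records for partition 0
acc.append(1, b"y" * 16, now_ms=8)       # a lone record for partition 1
acc.poll(now_ms=20)                      # linger has expired for partition 1

records_sent = sum(len(batch) for _, batch in acc.sent_requests)
print(f"{records_sent} records in {len(acc.sent_requests)} requests")
```

Partition 0 fills two batches by size alone, and the lone record on partition 1 goes out only when its linger window expires. Nine records cost three requests instead of nine.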

Compression then rides on top, and Kafka's design here is elegant: the producer compresses the whole batch (lz4, zstd, snappy, gzip), and the batch stays compressed all the way through, across the network, on the broker's disk, and back out to the consumer, which decompresses it. With the broker's default compression.type=producer, the broker doesn't recompress; it stores and serves the compressed batch as-is, which is exactly what makes zero-copy possible on the read side. Compressing a batch also gets far better ratios than compressing messages individually, because there's more shared structure to exploit.

# Producer throughput essentials
# Wait up to 10 ms to let a batch fill.
linger.ms=10
# 64 KB upper bound per partition batch.
batch.size=65536
# Compress the batch; it stays compressed end to end.
compression.type=lz4
# Durability; see Kafka Internals for the ISR story.
acks=all
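The claim that a compressed batch beats individually compressed messages is easy to check for yourself. This sketch uses zlib from the Python standard library as a stand-in codec, since lz4 and zstd need third-party packages. The message shape is invented, and newline-joining is only a placeholder for Kafka's record-batch framing.

```python
import json
import zlib

# Hypothetical messages with shared structure, as is typical for event streams.
messages = [
    json.dumps({"user_id": i, "event": "page_view", "path": "/products"}).encode()
    for i in range(200)
]

# Option A: compress every message on its own.
individual_total = sum(len(zlib.compress(m)) for m in messages)

# Option B: compress the whole batch once.
batch_total = len(zlib.compress(b"\n".join(messages)))

raw_total = sum(len(m) for m in messages)
print(f"raw: {raw_total} B, per-message: {individual_total} B, batched: {batch_total} B")
```

Short messages barely compress on their own, because each one carries the codec's fixed overhead and has no earlier data to refer back to. In a batch, the repeated field names and values are exactly what the compressor feeds on.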

The unifying idea: every one of these removes work rather than doing work faster. Sequential I/O removes seeks. The page cache removes a redundant cache and GC. Zero-copy removes memory copies and context switches. Batching removes per-message overhead. Kafka is fast because, in the hot path, it does almost nothing per message — it appends, it forwards, it gets out of the way.

Scaling, part 1: the partition is the unit of everything

Kafka's horizontal scaling story is, fundamentally, the partition. As the internals piece covers, a topic is split into partitions, each an independent log that lives on a broker. Partitions are simultaneously the unit of parallelism (producers write to many partitions at once; each partition is consumed by exactly one consumer in a group, so consumer throughput scales with partition count), the unit of distribution (partitions and their replicas spread across brokers, so adding brokers adds capacity), and the unit of ordering (order is guaranteed within a partition, not across).

So you scale a Kafka workload by adding partitions and brokers, and rebalancing partitions across the enlarged broker set. There's a genuine tension baked in, though, and it's worth stating plainly: more partitions buy more parallelism, but each partition has real cost — open file handles, memory, and metadata the cluster must track and the controller must manage on failover. Historically this set a practical ceiling on total partitions per cluster, and that ceiling was set by the metadata layer — which is exactly what changed.
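The "one consumer per partition per group" rule is what ties consumer throughput to partition count, and a toy assignment makes the consequence visible. This is a simplified round-robin sketch with a hypothetical function name; Kafka's real assignors (range, round-robin, sticky, cooperative sticky) handle multiple topics and try to minimize movement.

```python
def assign_round_robin(partitions: int, consumers: list[str]) -> dict[str, list[int]]:
    """Give each partition to exactly one consumer in the group."""
    assignment: dict[str, list[int]] = {c: [] for c in consumers}
    for p in range(partitions):
        assignment[consumers[p % len(consumers)]].append(p)
    return assignment


# Six partitions, three consumers: everyone gets work.
balanced_group = assign_round_robin(6, ["c1", "c2", "c3"])

# Two partitions, three consumers: the third consumer has nothing to read.
oversized_group = assign_round_robin(2, ["c1", "c2", "c3"])

print(balanced_group)
print(oversized_group)
```

Adding consumers beyond the partition count buys nothing, which is why partition count is effectively the ceiling on a group's parallelism.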

Scaling, part 2: KRaft removes the ZooKeeper ceiling

For most of Kafka's life, cluster metadata — which brokers exist, which partitions they host, who leads each partition — lived in ZooKeeper, a separate system you had to deploy and operate. ZooKeeper worked, but it was the scaling bottleneck: the controller loaded metadata from it, and on a controller failover or a large cluster change, propagating that metadata got slow. It also capped how many partitions a cluster could reasonably hold and made operations heavier.

KRaft (Kafka Raft) replaces ZooKeeper by storing metadata in Kafka itself, in an internal metadata log managed by a quorum of controller nodes via the Raft consensus protocol. Production-ready since Kafka 3.3, and the only mode left since Kafka 4.0 removed ZooKeeper support, KRaft changes the scaling picture concretely:

Far more partitions per cluster. Metadata as a log scales to millions of partitions, well past the old ZooKeeper-era limits.
Much faster failover. The new controller already has the metadata log; recovery is reading a log tail, not reloading state from an external store — failover drops from seconds toward sub-second.
One system to operate. No separate ZooKeeper ensemble to deploy, secure, and tune. Fewer moving parts, simpler scaling.
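Why a log makes failover cheap can be shown with a toy model. In this sketch, cluster state is nothing but the result of applying log records in order, so a standby that has been following the log only needs the tail. The record shape (partition name, leader id) and the class are hypothetical, and Raft replication, elections, and KRaft's actual record types are left out entirely.

```python
from dataclasses import dataclass, field


@dataclass
class MetadataState:
    """In-memory view built purely by applying log records in order."""
    leaders: dict[str, int] = field(default_factory=dict)  # partition -> broker id
    applied: int = 0                                       # records applied so far

    def apply(self, record: tuple[str, int]) -> None:
        partition, leader = record
        self.leaders[partition] = leader
        self.applied += 1

    def catch_up(self, log: list[tuple[str, int]]) -> int:
        """Apply only the records not yet seen; return how many were applied."""
        pending = log[self.applied:]
        for record in pending:
            self.apply(record)
        return len(pending)


metadata_log = [("orders-0", 1), ("orders-1", 2), ("orders-0", 3)]

standby = MetadataState()
standby.catch_up(metadata_log)            # the standby follows the log as it grows

metadata_log.append(("orders-1", 3))      # one more change, then the active node dies
replayed_on_failover = standby.catch_up(metadata_log)

print(replayed_on_failover, standby.leaders)
```

The standby replays one record to take over, however large the cluster's total metadata is. In this picture, the cost of failover tracks how far behind the standby is, not how big the state is.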

Scaling, part 3: tiered storage decouples storage from compute

The other historical scaling constraint was that a broker's capacity coupled compute and storage: to retain more data, you added brokers (and their local disks) even if you had plenty of CPU and network. Long retention got expensive fast, and rebalancing a broker holding terabytes of local log was slow and risky.

Tiered storage (KIP-405) breaks that coupling. Recent "hot" log segments stay on fast local disk; older segments are offloaded to cheap object storage (S3, GCS, Azure Blob) while remaining transparently readable through the normal consumer API. The effects on scaling:

Retention is no longer a broker-count decision. You can keep weeks or months of data without provisioning local disk for all of it — storage scales independently and cheaply.
Brokers get lighter and more elastic. Less local data per broker means faster rebalancing and quicker recovery, so the cluster is easier to scale up and down.
Compute and storage scale on separate axes — the same principle that reshaped the data-warehouse world, arriving for streaming.
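From the consumer's side nothing changes, because the broker decides per fetch which tier holds the requested offset. A minimal sketch of that lookup follows. The Segment type, the tier labels, and the offsets are all hypothetical, and the real implementation tracks remote segments through its own metadata rather than a flat list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    base_offset: int   # first offset stored in the segment
    end_offset: int    # last offset stored in the segment (inclusive)
    tier: str          # "local" or "remote" (hypothetical labels)


def locate(segments: list[Segment], offset: int) -> Segment:
    """Return the segment holding offset, whichever tier it lives on."""
    for segment in segments:
        if segment.base_offset <= offset <= segment.end_offset:
            return segment
    raise KeyError(f"offset {offset} is outside the retained log")


partition_segments = [
    Segment(0, 999, "remote"),       # old data, offloaded to object storage
    Segment(1000, 1999, "remote"),
    Segment(2000, 2999, "local"),    # recent data, still on broker disk
]

cold = locate(partition_segments, 150)    # a backfill starting near the beginning
hot = locate(partition_segments, 2500)    # a consumer reading near the tail

print(cold.tier, hot.tier)
```

The offset space stays continuous across tiers, which is what lets a backfill start in object storage and roll forward onto local disk without the client noticing the seam.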

Available as early access from Kafka 3.6 and declared production-ready in Kafka 3.9 (late 2024), tiered storage turns Kafka from "a fast transport you drain quickly" into a system that can also be a durable, long-retention event store without the cost spiraling.

The honest limits

None of this makes Kafka infinitely scalable or free of sharp edges, and pretending otherwise sets teams up to get burned:

Partition count is hard to reduce. You can add partitions; you effectively can't remove them. And adding them changes key-to-partition mapping, which breaks per-key ordering — so over- or under-provisioning partitions is a decision that follows you. (Kafka Internals has the ordering trap in full.)
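Hot partitions cap real throughput. A single partition is read by one consumer per group, and all of its writes go through one leader broker. A skewed key (everything hashing to one partition) means one partition does all the work while the rest idle, so your scaling is only as good as your key distribution.
Rebalances still hurt. Adding or removing consumers triggers a group rebalance, and during it consumption pauses for at least the partitions that change hands. Adding brokers has its own cost: existing partitions stay where they are until you reassign them, and reassignment copies their data across the network. Scaling out is not free of disruption.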
Tiered-storage reads are slower for cold data. Serving an old segment from object storage is fine for backfills, but it's not the page-cache-and-zero-copy hot path — don't expect cold reads at hot-read latency.
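The first of those limits, the key-to-partition remapping, is worth seeing concretely. This sketch assumes the usual "hash the key, take the remainder" scheme. It uses zlib.crc32 as a stand-in hash because it is deterministic and in the standard library, whereas Kafka's default partitioner uses murmur2. The key names and partition counts are invented.

```python
import zlib


def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition by hashing, then taking the remainder.

    zlib.crc32 is a stand-in hash; Kafka's default partitioner uses murmur2.
    """
    return zlib.crc32(key.encode()) % num_partitions


keys = [f"customer-{i}" for i in range(1000)]

# Which keys land on a different partition after growing from 6 to 8?
moved = [k for k in keys if partition_for(k, 6) != partition_for(k, 8)]

print(f"{len(moved)} of {len(keys)} keys change partition when going from 6 to 8")
```

With a uniform hash, a key keeps its partition here only when its hash modulo 24 is below 6, so roughly three quarters of keys move. Records for a moved key written before the change sit in the old partition, records written after it land in the new one, and nothing orders the two relative to each other.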

The whole thing in three sentences

Kafka is fast because it does almost nothing per message in the hot path — sequential appends, the OS page cache instead of an in-heap cache, zero-copy sendfile from cache to socket, and batching with end-to-end compression. Kafka scales because the partition is the unit of parallelism, distribution, and ordering — and the two constraints that used to cap it have been lifted: KRaft replaced the ZooKeeper metadata ceiling, and tiered storage decoupled retention from broker count. What's left to get right is yours: pick partition counts and keys deliberately, because those are the levers Kafka can't fix for you after the fact.

For the mechanics underneath all of this — what a partition and an offset actually are, how replication and the ISR keep data safe, and why a consumer group gets stuck — see Kafka Internals: Partitions, Replication, and Why Your Consumer Group Is Stuck. And for one of the most common things people point Kafka at, the ClickHouse Kafka-engine streaming pattern shows this throughput feeding a real-time analytics database.