LLM Inference Internals: KV Cache, PagedAttention, and vLLM

The first time you self-host a model, the bill makes no sense. You provisioned a GPU that does hundreds of teraflops, you're serving one user, and tokens dribble out like the thing is asleep. Then you batch a hundred users onto the same GPU and throughput barely drops. That contradiction — idle on one request, fine on a hundred — is the whole story of LLM inference, and it's not a tuning detail. It falls out of how autoregressive generation works, and once you see why, every serving optimization that matters (KV cache, continuous batching, PagedAttention, speculative decoding, quantization) lines up as an answer to the same problem.

The one sentence to anchor on: LLM decoding is memory-bandwidth-bound, not compute-bound. The GPU spends most of generation waiting to move data — model weights and cached state — through memory, not doing math. Almost every technique below is about feeding that bandwidth better. I'll start from the two phases of inference, which is where the asymmetry begins. (For how the model itself works, the Transformer explainer is the prerequisite; this is about running one.)

Two phases: prefill and decode

Generating a response happens in two distinct phases with completely different performance characteristics, and conflating them is the root of most confusion.

Prefill processes your entire prompt at once. Every token of the input is pushed through the model in parallel in a single forward pass to produce the first output token. Because it's one big batched matrix operation over all prompt tokens, prefill is compute-bound — it uses the GPU's math units well. This is the latency you feel as "time to first token."

Decode then generates the rest, one token at a time. Each new token requires a full forward pass through the model — but with only a single token as input, because generation is sequential: token N+1 depends on token N. So each step loads the entire multi-gigabyte set of model weights from memory just to compute one token's worth of math. The math is trivial relative to the data movement, so decode is memory-bound. This is the "tokens per second" you feel, and it's why a lone request leaves the GPU's compute mostly idle — it's waiting on memory.

graph LR
    P["PREFILL
whole prompt in one pass
compute-bound
→ time to first token"]
    D["DECODE
one token per pass, sequential
memory-bound
→ tokens per second"]
    KV["KV cache
(built in prefill,
grown each decode step)"]
    P --> D
    P --> KV
    KV --> D

The two phases. Prefill ingests the prompt in a single parallel pass and is compute-bound; decode emits tokens one at a time and is memory-bound. Both lean on the KV cache — prefill fills it, every decode step reads all of it and appends one entry. The decode phase is where serving efficiency is won or lost.

The KV cache: the central data structure

Why don't we recompute the whole sequence each decode step? Because attention lets every new token attend to all previous tokens, and the per-token "key" and "value" vectors for those previous tokens don't change once computed. Recomputing them every step would make generation quadratic in length. So we store them — that store is the KV cache: the keys and values for every token processed so far, kept in GPU memory and reused on each subsequent step. Prefill populates it; each decode step reads the whole thing and appends one token's worth.

The KV cache is the single most important object in LLM serving because it dominates memory after the weights themselves, and its size is brutal: it grows linearly with sequence length and with the number of concurrent sequences. A long context times many users can need tens of gigabytes — frequently more than the model weights. And reading this ever-growing cache every single decode step is a big part of why decode is memory-bound. Manage the KV cache well and you can serve many users; manage it badly and you run out of memory or waste most of it.

Here's the lever that explains batching. In decode, the GPU loads all the model weights from memory to generate one token. If you batch many sequences together, you load those same weights once and reuse them to generate a token for every sequence in the batch — the expensive memory read is amortized across the whole batch. That's why throughput barely drops as you add users until you hit a wall: the wall is KV-cache memory, not compute. Serving more users is a memory-management problem.

Continuous batching: stop waiting for the slowest request

Naive ("static") batching collects a group of requests, runs them together, and waits for all of them to finish before starting the next batch. The problem is obvious once stated: requests have wildly different output lengths, so the whole batch is held hostage by its longest generation while finished sequences sit idle, wasting GPU.

Continuous batching (also called in-flight batching) fixes this by scheduling at the granularity of a single decode step instead of a whole request. When any sequence in the batch finishes, it's evicted and a waiting request is slotted into its place on the very next step — the batch is continuously refilled. The GPU stays saturated with useful work instead of waiting on stragglers. This one scheduling change is among the biggest real-world throughput wins in serving, and it's standard in every serious engine now.

PagedAttention and vLLM: paging the KV cache

The clever, OS-flavored idea that powers vLLM is PagedAttention. The problem it solves: classic serving reserved one big contiguous block of memory per sequence, sized for the maximum possible length. Since most generations are far shorter, that reservation wasted enormous amounts of KV-cache memory to fragmentation and over-allocation — by some measures the majority of it. Less usable KV memory means fewer sequences you can batch, which means lower throughput.

PagedAttention borrows virtual memory paging from operating systems. The KV cache is split into fixed-size blocks (pages), and a sequence's cache need not be contiguous — it's a list of blocks, allocated on demand as the sequence grows, with a block table mapping logical positions to physical blocks. Almost no memory is wasted, so you fit far more concurrent sequences in the same GPU and throughput jumps. As a bonus, blocks can be shared across sequences — multiple requests with the same prompt prefix (a shared system prompt, or parallel samples from one prompt) point at the same physical blocks via copy-on-write, saving both memory and the cost of recomputing that prefix.

graph TD
    SEQ["Sequence's KV cache
(logical token positions)"]
    BT["Block table
logical → physical mapping"]
    B1["Physical block 7"]
    B2["Physical block 3"]
    B3["Physical block 9"]
    SHARE["Shared prefix block
(reused by many requests,
copy-on-write)"]
    SEQ --> BT
    BT --> B1
    BT --> B2
    BT --> B3
    BT --> SHARE

PagedAttention. A sequence's KV cache is a list of fixed-size physical blocks rather than one contiguous reservation, mapped through a block table — like OS virtual memory. This nearly eliminates the wasted reservation of classic serving, so more sequences fit in GPU memory, and identical prompt prefixes can share the same physical blocks. More usable KV memory is directly more throughput.

Two more levers: speculative decoding and quantization

Speculative decoding attacks the sequential bottleneck of decode. A small, fast "draft" model proposes several tokens ahead; the big model then verifies all of them in a single forward pass (which it can do cheaply, since verification is parallel like prefill). Accepted tokens are kept, the first rejected one is corrected, and you've produced multiple tokens for roughly the cost of one big-model pass — with identical output to the big model alone, because every token is verified. When the draft model guesses well, it's a clean multiplier on decode speed.

Quantization attacks the memory-bound nature directly: store weights (and sometimes the KV cache) in fewer bits — 8-bit or 4-bit instead of 16-bit. Smaller weights mean less data to move through memory every decode step, which is exactly the bottleneck, so it speeds up decode and frees memory for more KV cache and bigger batches. The trade is some loss of accuracy, which good quantization schemes keep small. It's also what lets a large model fit on a smaller GPU at all.

Technique	Bottleneck it attacks	What it buys
KV cache	Quadratic recompute	Linear-cost decode (the enabling structure)
Continuous batching	GPU idle on stragglers	Higher throughput, steady GPU utilization
PagedAttention (vLLM)	Wasted KV memory	More concurrent sequences; prefix sharing
Speculative decoding	Sequential decode latency	Multiple tokens per big-model pass
Quantization	Memory bandwidth + footprint	Faster decode, bigger batches, smaller GPUs

Latency and throughput are in tension — decide which you're optimizing before you tune. Bigger batches raise tokens-per-second across all users (throughput) but can raise the latency any single user feels. Batch-friendly settings that look great on a throughput dashboard can make an interactive chat feel sluggish, and vice versa. There's no universally "fast" config; there's a config for your target. Pin down whether you're serving a latency-sensitive chat or a throughput-sensitive batch job first — and remember the offline metrics here don't capture quality drift, which is a separate observability problem entirely.

What to carry away

LLM inference splits into a compute-bound prefill (the prompt in one pass; time-to-first-token) and a memory-bound decode (one token at a time; tokens-per-second). Decode is slow because each step drags the whole model through memory to produce a single token — so inference is fundamentally a memory-bandwidth problem, and serving many users is a memory-management problem. The KV cache makes decode linear instead of quadratic but then dominates memory and grows with users and context.

The serving stack is a stack of answers to that. Continuous batching keeps the GPU busy by scheduling per-step instead of per-request; PagedAttention (vLLM) pages the KV cache so almost none is wasted and prefixes can be shared; speculative decoding produces several tokens per big-model pass; quantization shrinks the data that has to move. Reach for them in roughly that order, and always with a clear target — throughput or latency, not both at once. Get the mental model right and the GPU bill finally starts making sense.

Frequently asked questions

Why is LLM inference memory-bound rather than compute-bound?

During decode the model generates one token at a time, and each step must read the entire multi-gigabyte set of weights plus the growing KV cache from memory to produce a single token. The math is tiny relative to that data movement, so throughput is limited by memory bandwidth, not compute.

What is the KV cache and why does it dominate memory?

The KV cache stores the key and value vectors for every token processed so far, so attention doesn't recompute them each step. It grows with sequence length and the number of concurrent requests, often exceeding the model weights — which is why serving many users is fundamentally a memory-management problem.

How does PagedAttention in vLLM increase throughput?

It manages the KV cache in fixed-size blocks like operating-system virtual-memory pages, instead of one contiguous reservation per sequence, which nearly eliminates wasted memory. More usable KV memory means more sequences batched together, and identical prompt prefixes can share blocks — both raise throughput.