# How Vector Search Works: HNSW, IVF, and Product Quantization

Everyone's bolting a vector database onto their app this year, and almost nobody using one can answer the question that decides whether it works: when you ask for the 10 most similar items to a query, how does it avoid comparing your query against all ten million stored vectors? Because the naive version — measure the distance to every vector and sort — is exactly what you can't do at scale and at interactive latency. The whole field of vector search is built around dodging that comparison while still returning answers that are good enough. Understanding how is the difference between tuning a vector store and cargo-culting its defaults.

The thing to internalize up front: **vector search is approximate on purpose.** These systems do *approximate* nearest-neighbor search (ANN) — they trade a small, controllable amount of accuracy for orders of magnitude in speed. The three ideas that make that trade are the **HNSW** graph, **IVF** inverted lists, and **product quantization**. I'll build up from why exact search fails, through each technique, to the recall-vs-latency knobs you actually turn.

## Why exact nearest-neighbor doesn't scale

An embedding model turns text (or an image, or audio) into a vector — a list of, say, 768 or 1536 floating-point numbers — positioned so that similar meanings sit close together in that high-dimensional space. "Similar" usually means small cosine distance or Euclidean distance. So search becomes geometry: find the stored vectors closest to the query vector.

Exact search (brute force, or "flat" indexing) computes the distance from the query to *every* stored vector. For a million 1536-dimensional vectors that's a million dot products of length 1536 per query — fine for a prototype, ruinous at scale and concurrency. Worse, the "curse of dimensionality" means the clever low-dimensional tricks (kd-trees and friends) collapse to no better than brute force once you're in the hundreds of dimensions. We need structures designed for high dimensions, and we have to give up the guarantee of finding the *exact* top-k to get speed. Hence approximate.

Brute-force isn't always wrong, and saying so saves teams from over-engineering. If you have a few hundred thousand vectors or fewer, a flat index (exact search) is simple, has perfect recall, and is plenty fast — `pgvector`'s IVFFlat and even a numpy loop will serve you. ANN indexes earn their complexity at millions of vectors and tight latency budgets. Reach for them when you measure a problem, not before.

## HNSW: search as navigation through a graph

**HNSW** (Hierarchical Navigable Small World) is the algorithm most production vector engines default to, and the mental model is delightfully physical: it builds a graph where each vector is a node connected to its near neighbors, then *navigates* that graph toward the query like following a chain of "who's closer than me?" referrals.

The "hierarchical" part is the trick that makes it fast. HNSW builds multiple layers. The top layer is sparse — a few nodes with long-range links that let you cross the whole space in a few hops. Each layer down is denser and more local. A search starts at the top, greedily hops toward the query through the long-range links to land in roughly the right region, then drops a layer and refines, and repeats — coarse to fine. It's the same idea as a skip list applied to spatial navigation.

```mermaid
graph TD
    Q["Query vector enters at top layer"]
    L2["Layer 2 (sparse)long-range hops →cross the space fast"]
    L1["Layer 1 (denser)navigate toward the region"]
    L0["Layer 0 (all nodes)refine to the closest neighbors"]
    Q --> L2
    L2 -->|"drop down"| L1
    L1 -->|"drop down"| L0
    L0 --> R["Return approximate top-k"]
          
```

HNSW search descends a layered graph. Sparse top layers with long links get you into the right neighborhood in a few hops; dense lower layers refine to the actual nearest neighbors. Each step just asks neighbors "are any of you closer to the query?" and moves there — greedy navigation, not a scan.

HNSW gives excellent recall at low latency, which is why it's the default in FAISS, hnswlib, Weaviate, Qdrant, and Milvus. Its costs are the trade you're accepting: the graph lives in memory and is sizable (all those neighbor links), and building it is more expensive than a flat index. Two parameters govern it — `M` (links per node; higher = better recall, more memory) and `efConstruction`/`efSearch` (how many candidates to explore when building/searching; higher `efSearch` = better recall, slower query). That last one, `efSearch`, is your live recall-vs-latency dial.

## IVF: don't search everywhere, search the right cluster

**IVF** (Inverted File index) takes a different route to the same goal: instead of comparing against everything, partition the space and only search the relevant partitions. At build time, IVF runs k-means to cluster all the vectors into, say, a few thousand cells, each with a centroid. Every vector is filed under its nearest centroid — an inverted list per cell, exactly analogous to the inverted index behind [Elasticsearch](elasticsearch-internals), but over geometry instead of words.

At query time you find the few centroids closest to the query and search only the vectors in those cells. The parameter `nprobe` says how many cells to scan: `nprobe=1` is fastest but misses neighbors that fell just over a cell boundary; raising it improves recall at the cost of scanning more. That boundary problem is IVF's characteristic weakness — a true nearest neighbor sitting in an adjacent unprobed cell is simply never seen.

| Index type | How it prunes | Strength | Main cost / weakness |
| --- | --- | --- | --- |
| Flat (exact) | Doesn't — compares all | Perfect recall, trivial to run | Linear in dataset size; doesn't scale |
| HNSW | Graph navigation | High recall at low latency | Memory-hungry; slower builds |
| IVF | Search only nearby clusters | Fast, tunable via `nprobe` | Misses neighbors across cell boundaries |
| IVF + PQ | Clusters + compressed vectors | Tiny memory footprint at huge scale | Quantization lowers precision |

## Product quantization: making vectors small

The third idea attacks a different bottleneck: memory. A million 1536-dimensional float32 vectors is about 6 GB raw, and billions of vectors blow past any reasonable RAM budget — a real problem since both HNSW and IVF want vectors in memory. **Product quantization (PQ)** compresses each vector dramatically, often 10–50×, so far more of them fit.

PQ works by splitting each vector into chunks — say a 1536-dim vector into 8 sub-vectors of 192 dims each — and running k-means within each chunk position to learn a small codebook (commonly 256 centroids per chunk). Each sub-vector is then replaced by the *id* of its nearest centroid, a single byte. The whole vector becomes 8 bytes instead of 6,144. Distances are then estimated from the codebooks with precomputed lookup tables, fast and approximate. You lose precision — you're representing each chunk by its nearest codebook entry, not its true value — but you can store and scan an enormous corpus. In practice PQ is layered onto IVF (the well-known `IVF-PQ` combination: cluster to prune, quantize to fit).

```mermaid
graph LR
    V["Vector: 1536 float32(~6 KB)"]
    SP["Split into 8 sub-vectorsof 192 dims each"]
    CB["Each sub-vector →nearest centroid id(codebook of 256)"]
    C["Compressed code: 8 bytes(~750x smaller)"]
    V --> SP --> CB --> C
          
```

Product quantization. Each vector is chopped into sub-vectors; each sub-vector is replaced by the id of its nearest codebook centroid. Storage collapses from kilobytes to a handful of bytes, so billions of vectors fit in memory — at the cost of representing each chunk approximately. PQ is what makes web-scale vector search affordable.

## The knob you're really turning: recall vs latency

Every one of these techniques exposes the same fundamental dial under a different name. Recall here means "of the true top-k nearest neighbors, what fraction did we actually return?" — and you trade it against speed and memory:

- **HNSW:** raise `efSearch` → explore more candidates → higher recall, higher latency.

- **IVF:** raise `nprobe` → scan more cells → higher recall, higher latency.

- **PQ:** more sub-vectors / bigger codebooks → less compression loss → higher recall, more memory.

So the right way to choose an index is to fix a recall target your application actually needs (95%? 99%?), then find the configuration that hits it at the lowest latency and memory for your data size. That's why benchmarks quote recall@10 against queries-per-second, not a single "fast/slow" number — the curve is the answer, and the operating point is a product decision.

**The benchmark that lies to you is the one without filtering.** Real queries almost always combine vector similarity with metadata predicates — "similar documents *where tenant = X and date > Y*." Filtering wrecks the clean ANN story: pre-filtering can leave a cluster or graph region with too few candidates to return k results, and post-filtering can throw away most of what the index found, forcing it to over-fetch. How an engine handles filtered search (and whether its published recall numbers were measured *with* filters) is the single most important thing to test on your own data, and it's exactly what vendor benchmarks quietly omit.

## What to carry away

Vector search is fast because it refuses to be exact. **HNSW** turns search into greedy navigation down a layered proximity graph — high recall, low latency, hungry for memory. **IVF** clusters the space and searches only the nearest cells — fast and tunable, but blind across cell boundaries. **Product quantization** compresses vectors 10–50× by replacing chunks with codebook ids, so billions fit in RAM at the cost of precision, and it pairs naturally with IVF. All three expose the same lever: spend more candidates, cells, or bits to buy recall, and pay in latency or memory.

Pick the index by setting a recall target and minimizing cost to reach it on data shaped like yours — and test it *with* your real metadata filters, because that's where the textbook numbers fall apart. If you want the product-level comparison of the engines that implement these algorithms, the [vector databases comparison](vector-databases) is the companion piece, and [RAG from the ground up](rag-fundamentals) covers where this retrieval sits in a larger system.
