# Probabilistic Data Structures at Scale: Bloom Filters, HyperLogLog, Count-Min Sketch

A junior engineer once asked me why we'd deliberately ship code that gives wrong answers. We were sizing a Bloom filter to skip disk reads for keys that definitely weren't in a table, and he'd just noticed the false-positive-rate parameter in the config. The honest answer: we weren't choosing wrong over right, we were choosing *bounded, cheap wrong* over *exact, expensive right* — and at the volume that system ran at, exact was never actually on the table. That trade is the whole reason probabilistic data structures exist, and once you see it, you start noticing it everywhere: in your database's storage engine, in Redis, in every analytics warehouse you've ever queried.

Three structures do almost all of the real work in production systems, and they answer three different questions: is this element in the set (Bloom filter), how many distinct elements are there (HyperLogLog), and how often does this element appear (Count-Min Sketch). Same underlying trade in all three — sacrifice a small, mathematically bounded amount of accuracy for space that doesn't grow with the size of what you're tracking.

## What is a Bloom filter, and why does it never produce a false negative?

A **Bloom filter** is a space-efficient structure for testing whether an element is *possibly* in a set — it can say "definitely not in the set" with certainty, or "probably in the set" with some chance of being wrong. It's a bit array of size `m`, all zeros to start, plus `k` independent hash functions. Inserting an element hashes it `k` times and sets the corresponding `k` bits to 1. Checking membership hashes the query the same `k` ways and checks whether all `k` bits are set — if any single bit is 0, the element is guaranteed never to have been inserted, because insertion always sets every one of its bits. That's the asymmetry that makes Bloom filters safe to use as a pre-filter: false positives happen (bits set by other elements can coincidentally line up), but false negatives are mathematically impossible.

The false-positive rate is a closed-form function of three knobs you control: `p ≈ (1 − e^(−kn/m))^k`, where `n` is the number of elements inserted and `m` is the bit array size. For a given `m` and `n`, the optimal number of hash functions is `k = ln(2) × (m/n)`, and at that optimum roughly 9.6 bits per element gets you a 1% false-positive rate, while 14.4 bits per element gets you 0.1% — the relationship is logarithmic, so chasing very low false-positive rates gets expensive fast, and most production filters settle for 1% because the marginal cost of 0.1% rarely pays for itself. In practice you pick a fast, well-distributed non-cryptographic hash (MurmurHash3 is the standard choice) and derive the `k` hash values cheaply from two base hashes rather than running `k` genuinely independent hash functions, which is a real implementation detail worth knowing before you write one from scratch.

| Target false-positive rate | Bits per element |
| --- | --- |
| 10% | ~4.8 |
| 1% | ~9.6 |
| 0.1% | ~14.4 |

Where does this actually show up? LSM-tree storage engines — Cassandra, HBase, RocksDB — keep a Bloom filter per SSTable file specifically to answer "could this key possibly be in this file" without touching disk; a negative answer skips the read entirely, and since a read for a key that doesn't exist would otherwise have to check every SSTable on the read path, this is a huge win in exactly the workload where LSM trees already carry a read-amplification cost (see [B-trees vs LSM-trees](btree-vs-lsm-storage-engines) for that trade-off in full). CDNs and browsers use Bloom filters for malicious-URL and safe-browsing lists — millions of URLs, checked on every navigation, in a structure small enough to ship to a client. Cache existence checks ("don't bother hitting the cache for a key we know isn't there") are the same pattern one layer up the stack.

```mermaid
graph LR
    subgraph INSERT["Insert element x"]
        H1["hash1(x)"] --> B1["set bit"]
        H2["hash2(x)"] --> B2["set bit"]
        H3["hash3(x)"] --> B3["set bit"]
    end
    subgraph QUERY["Query element y"]
        Q1["hash1(y)"] --> C1{"bit set?"}
        Q2["hash2(y)"] --> C2{"bit set?"}
        Q3["hash3(y)"] --> C3{"bit set?"}
        C1 --> ALL{"all bits set?"}
        C2 --> ALL
        C3 --> ALL
        ALL -->|"no"| NO["definitely not in set"]
        ALL -->|"yes"| MAYBE["probably in set(bounded false-positive rate)"]
    end
          
```

The asymmetry that makes a Bloom filter safe as a pre-filter. Any unset bit proves the element was never inserted — that answer is exact. Only the "all bits set" branch is probabilistic, and the false-positive rate is a known, tunable function of the bit-array size and hash count, not a mystery.

## How does HyperLogLog count billions of distinct values in 12 kilobytes?

**HyperLogLog (HLL)** estimates the number of distinct elements in a stream — cardinality — using memory that doesn't grow with the cardinality being counted, which is the property that makes it interesting: exact `COUNT(DISTINCT)` needs memory proportional to the number of distinct values (you have to remember what you've already seen), while HLL needs a fixed handful of kilobytes whether you're counting thousands or billions of unique visitors.

The mechanism: hash every incoming element to a uniformly distributed binary string, then route it into one of `2^p` buckets using the first `p` bits of the hash. Within each bucket, track only the maximum number of leading zeros seen among the hashes routed there. The intuition is probabilistic — seeing a hash with a long run of leading zeros is rare, and the longer the longest run you've observed, the more distinct hashes you must have thrown at that bucket to find it. Averaging that "longest run" signal across all `2^p` buckets gives an estimate of cardinality, but averaging it with a simple arithmetic mean is thrown off badly by any one bucket that got an unusually long run by chance — which is why HLL uses a **harmonic mean** instead, since it dampens the influence of outliers far more than an arithmetic mean does, followed by bias correction for known small-sample skew.

Redis's `PFADD`/`PFCOUNT`/`PFMERGE` implementation is the reference point most engineers encounter first: default precision 14 (2^14 = 16,384 buckets), capped at roughly 12KB regardless of how many elements you've added, with a standard error around 0.81%. That's the trade in one sentence — a fixed 12KB structure, ~2% error, independent of whether you're counting ten thousand or ten billion distinct users. The same idea powers `APPROX_COUNT_DISTINCT` in Snowflake and BigQuery, which exist specifically because an exact `COUNT(DISTINCT)` over a multi-billion-row fact table is often the single most expensive operation in a dashboard query, and most dashboards don't actually need the fourth significant digit.

## What does Count-Min Sketch get you that HyperLogLog doesn't?

HyperLogLog answers "how many distinct things," **Count-Min Sketch (CMS)** answers "how many times did this specific thing occur" — frequency estimation rather than cardinality estimation, and it's the structure behind "find the heavy hitters in this stream" (top URLs by traffic, top products by clicks, trending hashtags) without keeping an exact counter per item.

Structurally it's a 2D array — `d` rows, each of width `w`, each row using an independent hash function. Incrementing an item's count hashes it into one cell per row and increments all `d` cells. Estimating an item's frequency hashes it the same `d` ways and takes the *minimum* of the `d` cell values — not the average, the minimum, because any individual cell can be inflated by hash collisions with other items, and the true count can never be lower than what actually landed in the least-collided row. This is why CMS has a distinctive, useful error property: it only ever **overestimates**, never underestimates. The estimated count `f̂` satisfies `f ≤ f̂ ≤ f + εN` with high probability, where `ε` is set by the sketch width and `N` is the total stream size — a one-sided error bound, which matters for heavy-hitter detection because you'll never miss a genuinely frequent item due to undercounting, you might occasionally flag a borderline one.

**The one-sided-error property is the reason CMS is usually paired with a second, exact structure rather than used alone for a top-K list.** A common production pattern is CMS plus the Space-Saving or Misra-Gries algorithm: CMS gives a cheap, fast frequency estimate for candidate filtering, and the paired algorithm maintains an exact, bounded-size list of the actual top-K items. Using CMS alone to report "the top 10 by count" can surface a false heavy hitter that got unlucky with hash collisions — know that failure mode before you ship a "trending now" feature on CMS output alone.

## Which one do you actually reach for?

All three are variations on the same trade — bounded error for space independent of scale — but they answer different questions, so picking the wrong one wastes the trade for nothing.

| Structure | Question answered | Space | Error direction |
| --- | --- | --- | --- |
| **Bloom filter** | Is this element in the set? | Bits per element (logarithmic in target FP rate) | False positives only, never false negatives |
| **HyperLogLog** | How many distinct elements? | Fixed (~12KB at Redis defaults), independent of cardinality | Symmetric, bounded standard error (~2%) |
| **Count-Min Sketch** | How many times has this element occurred? | d × w cells, independent of stream length | Overestimates only, never undercounts |

The real-world tell for "you need one of these" is almost always the same shape: someone asks for an exact answer over a dataset that's grown past the point where an exact answer is cheap, and the business need doesn't actually require the fourth decimal place. Approximate aggregation is exactly this pattern applied inside an OLAP engine — [Druid and Pinot](druid-pinot-realtime-olap) both lean on HLL-style sketches internally for fast distinct-count queries over real-time data precisely because exact distinct counts at that ingestion rate would blow the latency budget the whole system is built around.

**The trap isn't picking the wrong structure — it's forgetting that "approximate" has to be a decision someone signed off on, not a default nobody noticed.** I've seen a finance-adjacent metric quietly served from an HLL-backed approximate count because it lived next to genuinely approximate analytics dashboards, and nobody flagged that this particular number needed to reconcile exactly to a downstream billing system. Before reaching for any of these three, confirm the consumer of the number actually tolerates a bounded error — and if the answer is "no, this feeds billing/compliance/an SLA," these structures are the wrong tool regardless of how well they'd perform.

## What to carry away

All three structures make the same bet: give up an exact, guaranteed-correct answer in exchange for space that stays flat as the data grows, and get a mathematically bounded error in return rather than an unpredictable one. Bloom filters trade space for one-sided membership certainty (no false negatives, ever) and are the right call whenever you need to cheaply rule out "definitely not here" before paying for an expensive lookup. HyperLogLog trades space for cardinality accuracy (~2% standard error at ~12KB, flat regardless of scale) and is the right call for distinct counts over data too large to hold in memory exactly. Count-Min Sketch trades space for frequency accuracy with a one-sided overestimate bound, and pairs naturally with an exact top-K structure for heavy-hitter detection.

None of the three are a substitute for exact computation when exactness is the actual requirement — the discipline is knowing, explicitly, which of your numbers can tolerate a bounded error and which can't, and never letting that decision get made by default just because a fast approximate function happened to be sitting right there in the query engine.
