# Designing a Deduplication System at Scale: Idempotency Keys, Content Hashing, and Bloom Filters

"Just make the consumer idempotent" is correct advice and an incomplete answer — it tells you the goal, not how to check whether you've already seen a given event when you're doing it trillions of times a day and can't afford to remember every one forever. [At-least-once delivery](kafka-production-pipeline-patterns) guarantees duplicates will arrive; the actual system-design problem is building something that can answer "have I seen this before" cheaply, at volume, without either missing real duplicates or accumulating unbounded memory to guarantee it never does.

This is that system: the two ways to define what counts as a duplicate in the first place, why a Bloom filter is the standard structure behind a dedup check at scale, the dedup-window and TTL trade-off that determines whether the system actually catches what it's supposed to, and where in a pipeline dedup logic should live.

## Why is "have I seen this before" a genuinely hard question at scale?

The naive answer — keep a set of every event ID you've ever processed, check membership before processing a new one — works fine at low volume and breaks down exactly when it matters most: a system processing trillions of events needs a structure that doesn't grow linearly with total volume forever, or the membership check itself becomes the bottleneck and the memory footprint becomes unbounded. The real design problem splits into two genuinely separate questions that get conflated too often: **what defines a duplicate** (identity, not just timing), and **how do you check membership cheaply** at the volume the system actually sees.

## What actually defines a duplicate — an idempotency key or content hashing?

An **idempotency key** is a unique identifier attached to a request or event by whoever produced it — typically containing a random component and a time component — with the guarantee that reprocessing the same key always produces the same result. This is the right approach when the producer is cooperative and can generate a stable key per logical operation: a payment API where the client attaches an idempotency key to a charge request, so a network retry that resends the identical request is recognized as "the same operation" rather than a second charge. The dedup check becomes a lookup on that key, and correctness depends entirely on the producer generating keys consistently — the same logical retry must carry the same key every time, or the whole mechanism is defeated at the source.

**Content hashing** is the fallback when there's no reliable producer-supplied key — hash the event's actual content (or a normalized subset of fields that define its identity) and use that hash as the dedup key instead. This is what you reach for with uncooperative or unreliable producers, or when the "same event" question is really "are these two payloads identical" rather than "did the same logical operation happen twice." The trade-off: content hashing is more expensive per event (you have to hash the payload, and normalize it consistently first) and it can't distinguish two genuinely different events that happen to produce identical content — which matters if that distinction is meaningful for your domain.

|  | Idempotency key | Content hashing |
| --- | --- | --- |
| **Requires** | A cooperative producer generating stable keys | Nothing from the producer — computed on receipt |
| **Cost per event** | Low — direct key lookup | Higher — hashing + normalization |
| **Fails when** | The producer generates a new key for what should be the same retry | Two genuinely distinct events produce identical content |
| **Best for** | APIs you control the client SDK for (payments, order submission) | Ingested data from external or unreliable sources |

## Why does a Bloom filter show up in almost every dedup system at scale?

Once you have a key — idempotency key or content hash — the remaining problem is the membership check itself: has this exact key been seen before, checked against a set that could hold billions of entries. An exact set (a hash set, a database unique-key lookup) works but costs real memory or a real round-trip per check at that volume. A [Bloom filter](probabilistic-data-structures) is the standard structure that trades a small, bounded false-positive rate for checking membership in a fixed, small amount of memory regardless of how many keys have been inserted — which is exactly the shape a high-volume dedup check wants: cheap, fast, and its error is one-sided in the safe direction, because a Bloom filter *never* produces a false negative. A "definitely new" answer from the filter is always trustworthy; a "possibly seen before" answer needs a secondary, more expensive check (an actual lookup against a real store) only for the minority of cases the filter flags as candidates.

```mermaid
graph TD
    EVT["Incoming event"]
    KEY["Compute key(idempotency key or content hash)"]
    BF{"Bloom filter:possibly seen before?"}
    NEW["Definitely new —process it"]
    CHECK["Secondary exact check(real store lookup)"]
    DUP["Confirmed duplicate —skip"]
    EVT --> KEY --> BF
    BF -->|"no"| NEW
    BF -->|"yes"| CHECK
    CHECK -->|"confirmed"| DUP
    CHECK -->|"false positive"| NEW
          
```

The Bloom filter does the cheap, high-volume triage — "definitely new" is trusted immediately, and only the smaller set of "possibly seen before" candidates pay for an exact secondary check. This is what keeps the common case (the vast majority of events genuinely are new) fast, while still guaranteeing no true duplicate slips through as a false negative.

## What is a dedup window, and why does its size decide whether the system actually works?

No dedup system remembers every key forever — that's the unbounded-memory problem this whole design exists to avoid. A **dedup window** is the bounded time range (or bounded key count) the system actually checks against, with keys aging out via a TTL once they're old enough that a legitimate duplicate arriving that late is considered out of scope. The window size is a direct trade-off against two failure modes on opposite sides: too short, and a duplicate that arrives after a slow retry, a delayed replay, or a downstream system's own backlog gets treated as new and processed twice — the exact failure the system was built to prevent. Too long, and the filter (or the exact backing store) grows unnecessarily large for keys that will never actually see a duplicate arrive that late, wasting memory and, for a Bloom filter specifically, degrading its false-positive rate as it fills past its designed capacity.

The right window size is a function of your actual retry and replay behavior, not a round number picked without data: a system whose worst-case replay delay is measured in seconds needs a much shorter window than one where a downstream consumer might legitimately reprocess a day-old backlog after an outage. Size the window to your observed worst-case delay, with margin, and revisit it when upstream retry behavior changes — a producer's retry policy getting more aggressive, or a new downstream consumer with a slower processing SLA, can silently invalidate a window that was sized correctly when it was first set.

```python
# A dedup check against a TTL'd key store — Redis's SET NX pattern
# is the standard building block: atomic "set if not exists"
def is_duplicate(redis_client, key, window_seconds):
    # returns True if the key was newly set (i.e., genuinely new),
    # False if it already existed (i.e., a duplicate)
    was_new = redis_client.set(key, "1", nx=True, ex=window_seconds)
    return not was_new
```

**The dedup-window bug that's hardest to catch in testing is the one where the window is sized correctly for normal operation and silently wrong the one time it matters — during an incident.** A downstream outage that causes a backlog to replay after your dedup window's TTL has already expired means every event in that backlog looks "new" again, and you get exactly the duplicate processing the system exists to prevent, precisely during the recovery from an incident, which is the worst possible time for it to fail. Size the window against your worst plausible replay delay, not your typical one, and treat any incident that involves a backlog replay as a signal to re-examine whether the window held — don't assume it did just because it usually does.

## Where in the pipeline should dedup logic actually live?

As early as possible, and specifically before the expensive or side-effecting work happens — deduplicating after a payment has already been charged or a write has already landed defeats most of the purpose. The common pattern: a lightweight Bloom-filter check at ingestion (cheap, catches the overwhelming majority of duplicates before they consume any real processing), backed by an exact check (database unique constraint, Redis `SET NX`) immediately before the actual side-effecting operation, as the final guarantee for the minority of cases the filter flagged as candidates or couldn't rule out. Relying on the Bloom filter alone is a mistake specific to forgetting its false-positive property applies in one direction only — it never lets a true duplicate through as "new," but it also never gives you a hard guarantee against a false positive being treated as new when it shouldn't have been skip-checked; the exact check is what actually enforces the guarantee, and the filter is what makes checking cheap enough to run on every event in the first place.

## What to carry away

Deduplication at scale splits into two separable design decisions: what defines a duplicate (a producer-supplied idempotency key when the producer is cooperative, content hashing when it isn't), and how you check membership cheaply against a huge and growing set of previously-seen keys (a Bloom filter for the fast, high-volume triage, backed by an exact check for the smaller set of candidates it flags). Neither decision is optional at real volume — an exact-only check doesn't scale, and a Bloom-filter-only check has no way to enforce a hard guarantee on its own.

The dedup window's size is the parameter that actually determines whether the system works in practice, not just in the common case — size it against your worst plausible replay delay, not your average one, because the failure mode (a window too short) shows up exactly when a backlog replays after an incident, which is the single worst time for a "solved" duplicate problem to reappear.
