# Skip Lists: The Structure Behind Redis Sorted Sets and LSM Memtables

Balanced binary search trees are correct and famously fiddly — a red-black tree's insertion logic branches into a half-dozen rebalancing cases, and getting every one of them right, especially under concurrent access, is the kind of code that ends up with a comment warning "do not touch this without really understanding it." Skip lists solve the same problem — keep a sorted structure searchable in O(log n) — with an approach so different in spirit it barely feels like the same category of algorithm: instead of a deterministic rule that keeps the tree balanced no matter what, a skip list just flips coins, and gets the same expected performance *on average*, with dramatically simpler code. That trade — probabilistic balance for implementation simplicity — is why two very different systems, Redis and RocksDB, both reach for a skip list rather than a tree.

## What is a skip list, mechanically?

A **skip list** is a linked list with extra "express lane" levels layered on top. The bottom level contains every element, in sorted order, exactly like an ordinary linked list — which alone would only give O(n) search. Each level above the bottom skips over a random subset of the elements below it, so searching starts at the topmost, sparsest level and drops down a level each time the next node would overshoot the target — the higher levels let a search skip large chunks of the list in a few hops, only dropping to slower, denser levels to narrow in on the final answer, the same intuition an express train plus local stops gives you over a single, uniform local train.

The randomization is the entire trick: when a new element is inserted, its height (how many levels it participates in) is decided by a coin flip — participate in the next level up with some fixed probability, and if it does, flip again for the level after that, and so on. There's no rebalancing step, no rotation logic, no cases to enumerate — the expected O(log n) search/insert/delete performance falls out of probability theory rather than a deterministic invariant the code has to actively maintain. Redis specifically uses a random level between 1 and 32 with roughly 1/2 probability per additional level (some analyses suggest 1/e is closer to theoretically optimal, but the simplicity of a clean fractional probability like 1/2 or 1/4 is a real, deliberate engineering trade against a marginal theoretical gain).

```mermaid
graph LR
    subgraph L3["Level 3 (express)"]
        A3["1"] --> D3["25"]
    end
    subgraph L2["Level 2"]
        A2["1"] --> C2["10"] --> D2["25"]
    end
    subgraph L1["Level 1 (base, every element)"]
        A1["1"] --> B1["4"] --> C1["10"] --> E1["17"] --> D1["25"] --> F1["30"]
    end
    L3 -.->|"drop down when overshooting"| L2
    L2 -.->|"drop down when overshooting"| L1
          
```

Searching for 17: start at the top express lane (1 → 25, overshoots 17), drop to level 2 (1 → 10 → 25, still overshoots after 10), drop to level 1 and walk 10 → 17. Each level up is a randomly-thinned subset of the level below it — no rebalancing operation maintains that structure, it emerges from the coin flip made at each element's insertion time.

## Why does probabilistic balance matter more under concurrency than it looks like it should?

A balanced tree's rebalancing operations — rotations after an insert or delete — touch a genuinely unpredictable set of nodes, potentially far from the node that was actually modified, which makes fine-grained locking (locking only the nodes an operation actually needs) hard to reason about correctly: a rotation can cascade changes upward through the tree in ways that are difficult to isolate safely from a concurrent reader elsewhere in the structure. A skip list's insert touches only the new node and the predecessor pointers at each level it participates in — a strictly local, predictable set of modifications, with no cascading restructuring anywhere else in the list. That locality is exactly what makes **lock-free and highly-concurrent skip list implementations** practical in ways a comparably concurrent balanced tree implementation genuinely is not — it's a big part of why skip lists show up disproportionately often in high-concurrency systems code rather than balanced trees, despite both nominally offering the same asymptotic complexity on paper.

## Why does Redis pair a skip list with a hash table for its sorted set?

[Redis's](redis-internals) sorted set (`ZSET`) type needs two genuinely different capabilities at once: fast **rank and range queries** by score (give me the top 10 by score, or everyone between score 50 and 100 — inherently an ordered-traversal operation) and fast **membership and score lookup by member** (what's this specific member's score — inherently a point-lookup operation with no ordering requirement). A skip list alone gives you the first efficiently but not the second (finding an arbitrary member by value in a skip list ordered by score means a linear scan, since the list's ordering is by score, not by member identity). A plain hash table alone gives you the second but has no notion of order at all. Redis's actual implementation runs both structures over the same underlying data — a hash table mapping member → score for O(1)-ish point lookups, and a skip list ordered by score for O(log n) rank and range operations — which is the concrete answer to "why not just pick one," and it's a clean illustration of a broader pattern: reaching for two purpose-built structures over the same data is often simpler and faster than forcing one structure to serve two genuinely different access patterns adequately.

```text
ZADD leaderboard 1500 "alice"
ZADD leaderboard 2200 "bob"
ZRANK leaderboard "alice"        # rank query -> skip list traversal
ZSCORE leaderboard "alice"       # point lookup -> hash table, O(1)-ish
ZRANGEBYSCORE leaderboard 1000 2000   # range query -> skip list traversal
```

## Why do RocksDB and LevelDB use a skip list as the default memtable?

The **memtable** in an [LSM tree](btree-vs-lsm-storage-engines) is the in-memory staging area that absorbs writes before they're flushed to an immutable, sorted SSTable on disk — and it needs three things simultaneously: fast ordered insertion (writes arrive continuously and need to land in roughly the right sorted position), efficient in-order iteration (flushing to disk requires walking the memtable in sorted order to produce a sorted SSTable), and reasonable behavior under concurrent access (multiple threads may be writing to the active memtable at once). A skip list happens to be a close-to-ideal fit for exactly this combination: insertion is fast and doesn't require a global rebalancing pass, in-order iteration is a straightforward walk of the bottom level, and the concurrency-friendliness discussed above matters directly here, because memtable writes are squarely a concurrent-access hot path in a busy LSM system. This is why RocksDB and LevelDB both ship a skip-list-based memtable as their default implementation — not because a different structure couldn't technically work, but because a skip list happens to satisfy all three requirements at once with comparatively little implementation complexity.

**The pattern worth internalizing generalizes past skip lists specifically: when a structure needs to serve two genuinely different access patterns well, the answer is often two purpose-built structures over the same data, not one clever structure trying to do both adequately.** Redis's ZSET (hash table plus skip list) and a search engine pairing a B-tree index with a separate full-text inverted index over the same underlying rows are both the same move — accept the redundancy of maintaining two structures in exchange for each one being genuinely good at its own job, rather than reaching for one structure that's a mediocre compromise at both.

## What to carry away

A skip list gets to the same expected O(log n) search, insert, and delete performance a balanced tree offers, but through randomization — a coin flip per element decides how many "express lane" levels it participates in — rather than deterministic rebalancing rules. That trade buys real simplicity (no rotation logic, no cases to enumerate) and, less obviously but arguably more importantly in production systems, much friendlier behavior under concurrent access, because a skip list's modifications stay local to the inserted node instead of cascading through the structure the way a tree rebalance can.

Redis pairs a skip list with a hash table specifically because a sorted set needs both ordered range/rank queries and O(1) point lookups by member, and no single structure serves both well — two purpose-built structures over the same data beats one compromise structure. RocksDB and LevelDB default their memtable to a skip list because it happens to satisfy fast ordered insertion, easy in-order iteration for flushing, and concurrent-write friendliness all at once, which is a specific and non-obvious combination of requirements that a skip list turns out to fit unusually well.
