Database Internals

The Probabilistic Gamble:
Why Redis Bets Your Data on a Coin Flip

Redis stores every sorted set in a data structure that flips coins to decide its own shape. And it turns out that's the smart move.

Imloul Anas · 18 min read · Apr 2026

If you ever trace how ZADD works in Redis, follow the call stack into the sorted set implementation, you expect to land on a balanced tree or some well-known variant. What you find instead is a function called zslInsert with a loop that calls random() to decide how tall to build each new node.

Redis. The key-value store sitting behind a huge chunk of the internet’s caching layer. Deciding the shape of its index with a random number generator.

My first reaction was that I’d misread something. Then that it must be isolated to some edge case. But the more I dug, the clearer it became: this was deliberate. Not a shortcut, a tradeoff. One that gives up the deterministic guarantees of a balanced tree in exchange for simpler code, better cache behavior, and concurrency that scales well under real load.

That structure is the skip list.

The Problem

What a Sorted Index Actually Needs to Do

Redis sorted sets (ZSET) are not a simple key-value store. They are a fully ordered index that must support all of these efficiently:

Insert a member with a score: ZADD leaderboard 4200 "alice"
Delete a member by name: ZREM leaderboard "alice"
Look up a member’s rank: ZRANK leaderboard "alice"
Retrieve members in a score range: ZRANGEBYSCORE leaderboard 1000 5000
Retrieve members by rank range: ZRANGE leaderboard 0 99

The combination is the hard part. A sorted array handles range scans but needs $O(n)$ to insert. A hash map gives $O(1)$ lookup but has no ordering. A binary heap inserts in $O(\log n)$ but can’t do range queries without scanning everything.

You need both: fast single-item operations ($O(\log n)$) and fast range scans ($O(\log n + k)$, where $k$ is the number of results returned). That combination is what narrows the field. The textbook answer is a balanced binary search tree: a red-black tree, an AVL tree, or a B-tree. These work, but they carry costs that matter in Redis’s context, and those costs are what the skip list was designed to avoid.

The Classical Answer

What Balanced Trees Get Right, and What They Get Wrong

A red-black tree gives you everything the sorted set needs. Insertion, deletion, and search are all $O(\log n)$. An in-order traversal (visiting nodes in sorted order by walking left subtree, then node, then right subtree) gives you range scans in $O(\log n + k)$. It’s what most standard libraries reach for when they need a sorted map.

So why doesn’t Redis use one?

The algorithmic complexity is essentially the same. The real differences are implementation complexity, memory layout, and how well the structure holds up under concurrent writes. Redis itself is single-threaded for command processing, so the concurrency angle doesn’t bite it directly. But Redis inherited the skip list from a lineage of systems that do run concurrent writers (LevelDB, RocksDB, Java’s standard library), and got a free win in the process: the properties that make skip lists easy to make concurrent also make the code simpler to write and reason about. The critique that follows applies to balanced trees in general rather than to Redis specifically, but that simplicity payoff is what Redis is taking advantage of.

RBT

Red-Black Tree

A self-balancing binary search tree that guarantees $O(\log n)$ operations by enforcing structural rules after every insert and delete. Correct, but hard to implement and harder to make concurrent. Wikipedia

Red-black tree at a glance: a binary search tree whose nodes are colored red or black so the tree stays approximately balanced (same black-node count on every path from the root, no two consecutive reds, and similar rules). Dark nodes are black, bright nodes are red. For the full invariant list and history, see the Wikipedia article linked in the definition above.

When you insert into a red-black tree, the tree fixes itself with rotations (local swaps that change which node sits above which) and recoloring (flipping nodes between red and black to restore the balance rules). These fixups can touch nodes far from the insertion point, sometimes cascading all the way to the root. To insert safely when multiple threads are writing at once, you either take a single coarse-grained lock (one lock covering the whole tree, which forces writes to happen one at a time) or implement a protocol where threads hand off locks as they descend the tree, called lock coupling. Most implementations take the easy path and use the coarse lock.

Where the concurrency cost actually lands

Because Redis processes commands one at a time on a single thread, no two commands ever touch a sorted set simultaneously. The concurrency argument against balanced trees doesn’t apply to Redis directly. But systems like LevelDB, RocksDB, and many in-memory databases need real concurrent writes, and they’ve all found balanced trees painful to make lock-free (meaning threads coordinate without blocking each other). The skip list has a natural concurrent variant that only locks a small, bounded set of nodes per operation. That property is what made it the default choice across this whole family of systems, and it’s why the idea migrated to Redis.

There is a second problem, and this one does apply to Redis: memory access patterns. A red-black tree node holds left, right, and parent pointers, a color bit, and the key, roughly 32 bytes of overhead on a 64-bit system. Worse, each node is allocated separately and ends up scattered across memory. Scattered nodes are slow to visit, for reasons worth pinning down precisely.

Cache Line

The CPU doesn’t fetch memory one byte at a time. It grabs a 64-byte chunk called a cache line and keeps recently-used chunks in a small, fast on-chip memory called the CPU cache. Accessing data already in cache is roughly 100x faster than fetching it from main memory. When two pieces of data you need sit on different cache lines, you pay the full fetch cost for each one.

In a red-black tree, each pointer jump probably lands on a different cache line, forcing the CPU to wait for a fresh memory fetch. A traversal that visits 20 nodes likely incurs 20 of those waits. Skip list nodes are heap-allocated too, so per-node locality is roughly a wash between the two. The win shows up at the access-pattern level instead, for reasons we’ll get to.

The Invention

William Pugh’s Insight: Approximate Balance Is Enough

The skip list comes from a 1990 paper by William Pugh, “Skip Lists: A Probabilistic Alternative to Balanced Trees.” The core idea: you don’t need to guarantee balance. You just need to make imbalance vanishingly unlikely.

Start with a sorted linked list. Search is $O(n)$, you walk forward until you find your key. Too slow. But add a second, sparser layer on top, a “fast lane” that skips over roughly half the nodes. Scan the fast lane until you overshoot, drop down, scan the last few nodes. Two layers cut search time roughly in half.

Now add a third layer that skips half the second. And a fourth. If each layer contains roughly half the nodes of the layer below, you get something that behaves like binary search: each level roughly halves the search space, giving $O(\log n)$ expected time.

More generally, if each node is promoted to the next layer with probability $p$, each level cuts the search space by a factor of $1/p$. With $p = 0.5$ that’s a halving per level. With $p = 0.25$ each level cuts the space by 4, which means fewer levels overall but slightly more comparisons per level.

Skip List

A probabilistic data structure built as a hierarchy of sorted linked lists. The bottom layer (level 0) contains every element. Each higher layer contains a random subset of the layer below, where each element is independently promoted with probability p (typically 0.25 or 0.5). Searches descend through the layers, using higher layers to skip large portions and lower layers to refine the position.

The word “random” is doing critical work here. A balanced tree forces balance through explicit rotations after every insert or delete. A skip list achieves approximate balance by having each node decide its own height with a biased coin flip (a random draw where the probability of “heads” is p, not necessarily 50/50). No rebalancing is ever needed. The randomness is enough, almost all of the time, to keep the heights distributed the way a balanced tree would.

What 'with high probability' actually means

For a skip list with $n$ elements, the chance of ending up badly unbalanced shrinks so fast as $n$ grows that it stops mattering in practice. With one million elements, the odds of hitting pathological height on any given operation are less than one in a million. Skip lists handle billions of operations in production without ever tripping a worst case.

The Anatomy

Inside a Skip List Node

Instead of left and right child pointers, each skip list node contains a tower of forward pointers, one per level it participates in. A node at level 3 has three forward pointers: one for each level, each pointing to the next node at that level.

The interactive skip list below has every element on the bottom layer (level 0), with each higher layer a probabilistic subset of the one beneath it. Searching for 42: start at the top level of the head node. At level 3, the next node is 42, done. Searching for 55: advance to 42 at level 3, next is NULL, drop to level 2, next is 61 (overshoots), drop to level 1, 61 again, drop to level 0, find 55. The higher levels act as an express lane, skipping large chunks of the list in a single pointer jump.

Try it yourself. Search 55 to retrace that exact descent across the express lanes, or insert a value to watch the coin flips decide its tower height before it’s spliced in. No rebalancing ever runs.

Interactive skip list. Insert a value to watch the biased coin flips choose its height, then search to trace the level-by-level descent. Drag horizontally on smaller screens.

Forward Pointer Array

Each skip list node stores an array of next-pointers, one per level. A node at height h has h forward pointers. Most nodes never get promoted past level 0; with $p = 0.25$, only a quarter of the rest survive each coin flip to reach the next level. The expected height works out to $1/(1-p)$, so the average node carries about 1.33 forward pointers, only slightly larger than a regular linked list node.

The Operations

Search, Insert, and Delete: How the Coin Flip Fits In

Now that you know what a node looks like, here’s how the three core operations use that tower structure. All three share the same descent pattern, and keeping it in mind will make the code below easy to read: start at the top level, walk forward until you overshoot, drop down a level, repeat. The only thing that changes between operations is what you do once you get there.

The examples below are written in Go. If you don’t know the language, don’t worry about the syntax: the comments and the overall structure carry the idea, and that’s all you need to follow along.

Search

Start at the top level of the head node. At each level, advance forward while the next node’s key is less than the target. When advancing would overshoot, drop one level. At level 0, the next node is either your target or it doesn’t exist. Expected comparisons: $O(\log n)$, each level cuts the search space by a factor of $1/p$.

Notice the outer loop descending levels and the inner loop walking forward. That’s the two-dimensional search pattern the whole structure is built around.

Search
func search(list *SkipList, target int) *Node {
  node := list.head
  for level := list.maxLevel; level >= 0; level-- {
    for node.next[level] != nil && node.next[level].key < target {
      node = node.next[level] // advance along this level
    }
    // overshoots: drop down
  }
  node = node.next[0]
  if node != nil && node.key == target {
    return node
  }
  return nil // not found
}

Insert

Insert starts with a search, but along the way it records the last node visited at each level in an update array: a list of predecessors, one per level, whose forward pointers will need to be rewired to include the new node.

Then the coin flip. The new node’s height is chosen by repeatedly flipping a biased coin (probability p): each heads promotes it one level higher, each tails stops. Expected height works out to $1/(1-p)$, about 1.33 levels on average with the probability Redis uses. Most nodes stay short. A few grow tall enough to serve as express-lane stops on the upper levels.

The new node is then spliced in at every level up to its height, using the update array to rewire the forward pointers. No rotations. No recoloring. No cascade. The insert touches only the nodes in the update array.

Insert
// coinFlip picks a node's height: keep flipping while we're under p, capped at maxLevel.
func coinFlip(p float64) int {
  height := 0
  for random() < p && height < maxLevel { 
    height++
  }
  return height
}

func insert(list *SkipList, key int, value string) {
  update := make([]*Node, list.maxLevel+1) // rightmost node visited at each level
  node := list.head
  for level := list.maxLevel; level >= 0; level-- {
    for node.next[level] != nil && node.next[level].key < key {
      node = node.next[level]
    }
    update[level] = node // record the predecessor at this level
  }

  height := coinFlip(list.p) // coin flip determines how many levels to occupy
  newNode := &Node{key: key, value: value, next: make([]*Node, height+1)}

  for level := 0; level <= height; level++ {
    newNode.next[level] = update[level].next[level] // splice in
    update[level].next[level] = newNode
  }
}

Why the update array matters

The update array has at most $O(\log n)$ entries, one per level. The number of nodes modified by any insert is logarithmic and known in advance. This bounded footprint is what makes skip lists easy to lock correctly under concurrency: lock exactly the update array nodes, nothing else.

Delete

Delete mirrors insert. Search for the target, populate the update array, then bypass the target’s forward pointers at every level. The node becomes unreachable and can be freed.

No rebalancing needed. The remaining nodes’ random heights still produce a probabilistically balanced structure.

Delete
func delete(list *SkipList, key int) bool {
  update := make([]*Node, list.maxLevel+1)
  node := list.head
  for level := list.maxLevel; level >= 0; level-- {
    for node.next[level] != nil && node.next[level].key < key {
      node = node.next[level]
    }
    update[level] = node // record the predecessor at each level
  }

  target := node.next[0]
  if target == nil || target.key != key {
    return false // key does not exist
  }

  for level := 0; level < len(target.next); level++ {
    update[level].next[level] = target.next[level] // bypass the target node
  }
  return true // target is now unreachable and can be freed
}

The Complexity

The Numbers, Side by Side

With the structure understood, here’s how it compares. Two columns mention augmentation, meaning extra metadata stored on each node (like a subtree-size counter on a tree, or the span values Redis attaches to skip list pointers, covered later in this article) to unlock queries the plain structure can’t answer efficiently.

Operation	Sorted Array	Red-Black Tree	Skip List
Search	$O(\log n)$	$O(\log n)$	$O(\log n)$ expected
Insert	$O(n)$	$O(\log n)$	$O(\log n)$ expected
Delete	$O(n)$	$O(\log n)$	$O(\log n)$ expected
Range scan	$O(\log n + k)$	$O(\log n + k)$	$O(\log n + k)$ expected
Rank lookup	$O(\log n)$	$O(\log n)$ with augmentation	$O(\log n)$ with span augmentation
Concurrent insert	Full lock required	Complex lock coupling	Lock only update array nodes
Implementation complexity	Low	High	Low

The “expected” qualifier is the real difference. A red-black tree guarantees $O(\log n)$ worst case. A skip list guarantees it in expectation: the probability of exceeding it shrinks exponentially as $n$ grows, but it’s not zero. For most production workloads, this is irrelevant. For hard real-time systems with strict latency bounds, it matters.

Concurrency

Why Skip Lists Scale Under Concurrent Writes

When multiple threads try to modify a data structure at once, you need rules to stop them from corrupting each other’s work. Those rules (locks, atomic operations, ordering guarantees) are the hard part of building concurrent systems, and skip lists make them easier than trees do. Redis never exercises this directly, since it processes commands on a single thread, so it isn’t a Redis win in its own right. It’s the reason the structure took hold in systems that do run concurrent writers (LevelDB, RocksDB, Java’s standard library), and that lineage is what carried the skip list to Redis in the first place.

Remember the update array: the only nodes whose pointers change during an insert. In a concurrent setting, you lock exactly those nodes, splice in the new node, and release. Two inserts into different parts of the list touch disjoint update arrays and proceed in parallel without contention.

This is localized modification. The structural change lives in a small, known set of nodes instead of rippling outward. Compare it to a red-black tree, where a rotation at a leaf can cascade upward through ancestors that are shared by every traversal in the tree.

Localized Modification

The update array has at most $O(\log n)$ entries, all identified during the search phase before any locks are acquired. Simple, predictable locking.

No Rebalancing Cascade

No operation triggers unbounded structural changes. The worst case for any insert or delete is bounded by the node’s height, which is logarithmic with high probability.

Lock-Free Variants

The bounded modification footprint enables lock-free implementations using compare-and-swap, a CPU instruction that atomically updates a pointer only if it still holds an expected value (letting threads coordinate without explicit locks). Java’s ConcurrentSkipListMap uses this approach, so throughput keeps climbing as you add threads instead of plateauing.

Memory Layout

The Cache Behavior of a Skip List

A fair criticism of skip lists: variable-height nodes with individually allocated pointer arrays are harder to lay out in memory than fixed-size tree nodes. On 10 million members, the overhead is measurable.

But this misses an access pattern advantage. Recall from earlier that anything already in CPU cache is roughly 100x faster to read than main memory. CPUs actually have several cache levels (L1, L2, and L3, in increasing size and decreasing speed), and L3 is typically tens of megabytes on modern server CPUs. So anything small enough to fit in L3 and touched often enough to stay there is effectively free to access.

The higher levels of a skip list fit that profile exactly. The top few levels are touched on virtually every operation, so they stay hot in cache. With p = 0.25, level-3 nodes appear at roughly 1.5% of positions. A 10-million-element dataset has about 150,000 level-3 nodes, a few megabytes at most, small enough to sit comfortably in L3 cache under sustained load.

The upper levels behave like a cache-resident index

Just as the upper levels of a B-tree tend to stay in the database’s page cache because they’re accessed on every query, the upper levels of a skip list tend to stay in CPU cache because they’re visited on every search. The probabilistic structure incidentally recreates the locality properties of a deliberately designed hierarchical index.

The Redis Implementation

How Redis Actually Uses It

Redis doesn’t use a skip list for every sorted set. Small ones use a listpack, a compact encoding that stores everything sequentially in contiguous memory. Faster to scan, less memory. Once a sorted set crosses a threshold (more than 128 members, or any member longer than 64 bytes, by default), Redis automatically converts it in place to the skip list structure, transparently, on the write that crosses the limit.

One detail worth flagging before the definitions below: Redis sorted sets allow two members to have the same score (two players tied at 4200 points, for example). When that happens, Redis breaks the tie by the lexicographic order (dictionary order) of the member string, so the ordering is always deterministic.

Redis zskiplist

Redis’s skip list implementation, defined in t_zset.c. Each node stores the member string, its floating-point score, and a backward pointer for reverse traversal, alongside the forward pointer array. Sorted primarily by score, secondarily by dictionary order of the member string, so two members with identical scores still have a stable, deterministic ordering.

Redis zset (the dual structure)

Above the listpack threshold, a sorted set is two structures in parallel: a skip list for ordered traversal and range queries, and a hash table mapping member strings to scores for $O(1)$ lookup. Every ZADD writes to both. Every ZREM removes from both.

The skip list answers “give me the top 100 by score.” The hash table answers “what is alice’s score?” Neither alone handles both efficiently, so Redis maintains both and pays the memory cost.

Why Redis uses p = 0.25

Most textbook examples use p = 0.5, but Redis chose 0.25. Sparser upper levels mean each node carries fewer forward pointers on average (about 1.33 vs 2), saving memory at the cost of slightly more comparisons per search. For millions of members, the memory savings win. The maximum level is capped at 64, enough for $2^{64}$ elements.

The Rank Trick

How Redis Answers ZRANK in $O(\log n)$

A plain skip list can tell you whether a key exists, but not its rank (ordinal position) without scanning the entire base layer. Redis solves this by adding a span value to every forward pointer.

Span

An integer on each forward pointer recording how many base-layer nodes it skips over. A level-2 pointer jumping from A to C, skipping B1 and B2, has a span of 3. By summing spans along the search path, Redis gets the exact rank without touching the nodes in between.

A worked example helps. Imagine a level-2 forward pointer from A to C with a span of 5, meaning C sits 5 base-layer positions to the right of A. Now insert a new node X somewhere between A and C. If X stays at level 0 (the common case), it slips under the level-2 pointer and increments its span to 6, since the same pointer now jumps over one extra node. If X gets promoted to level 2, the original A-to-C pointer splits in two: A points to X with the span up to that point, and X points to C with the remainder. The two new spans sum to 6, preserving the total.

That accounting happens during the splice step, using the same update array the structural rewiring uses, so spans cost only a constant factor on top of insert and delete. The ranks themselves fall out of the search: as you descend, sum the spans of every pointer you traverse, and the running total is the target’s rank when you arrive.

The result: ZRANK runs in $O(\log n)$ with no extra data structure and no base-layer scan. It’s a clean example of augmenting a traversal structure (attaching small pieces of derived metadata to each node) to unlock a new query type at minimal cost.

Where Skip Lists Are Used

Beyond Redis: Where This Structure Appears

Redis is the most visible user, but skip lists show up in more places than you’d expect.

LevelDB and RocksDB (embedded storage engines used inside databases like Cassandra, CockroachDB, and many others) use a skip list as their in-memory write buffer. Writes land in the skip list first, queryable immediately, then get flushed to disk in batches. The skip list supports point lookups, range scans, fast inserts, and sorted iteration for flushing, everything the write buffer needs in one structure. RocksDB uses a lock-free variant for concurrent access.

Apache Lucene (the full-text search library behind Elasticsearch and Solr) uses skip lists to speed up boolean query evaluation. When intersecting or unioning result lists from different search terms, skip pointers let the merge algorithm jump forward over irrelevant entries instead of walking every position.

Java’s ConcurrentSkipListMap is the standard library’s concurrent sorted map, a lock-free implementation that avoids the bottleneck of wrapping a regular map in a synchronized block.

A pattern worth noticing

In every case, the skip list is chosen not for better big-O behavior (it’s the same) but because it is simpler to implement correctly under concurrency. The randomness that feels unprincipled is exactly what eliminates the rebalancing logic that makes concurrent balanced trees so hard to get right.

The Trade-offs

What the Skip List Gives Up

Three real tradeoffs worth knowing.

Worst-case complexity is probabilistic, not guaranteed. A red-black tree guarantees $O(\log n)$ for every operation. A skip list guarantees it in expectation. For most workloads, this is academic. For hard real-time systems or financial matching engines with strict latency requirements, the deterministic guarantee may matter.

Memory overhead is higher, but not by as much as it first looks. The per-node comparison depends on what you count. A height-4 Redis node is around 56 bytes total (4 forward pointers, a backward pointer, a score, a member pointer) against roughly 32 bytes for a red-black tree node. But height 4 is not the average: with $p = 0.25$, the average node is only about 1.33 levels tall, so the average skip list node carries roughly 10.7 bytes of forward-pointer overhead on top of the key/value payload, competitive with a red-black tree once you account for the tree’s color bit and three pointers. Put plainly: most nodes are small, a few tall outliers are genuinely larger, and the total memory footprint is modestly higher than a tree but not dramatically so. On a sorted set with 10 million members, that works out to a handful of extra megabytes, not gigabytes.

No on-disk variant. The skip list is an in-memory structure. Variable-length nodes and pointer-heavy layout make it unsuitable for disk storage. This is why LevelDB and RocksDB use skip lists for the in-memory write buffer but B-trees for the on-disk files.

Putting It Together

What This Means for the Redis Commands You Use

With the structure in hand, here’s what actually happens when you run the commands you already know.

ZADD and ZREM are $O(\log n)$. Descend the skip list, populate the update array, splice, update spans. At a million members, roughly 20 pointer comparisons.

ZRANGE and ZRANGEBYSCORE are $O(\log n + k)$. Descend to the start position in $O(\log n)$, then walk the base layer forward for $k$ results. The base-layer walk follows next pointers one at a time, and since nodes are individually heap-allocated and scattered across memory, this is ordinary pointer chasing with the same cache behavior as any linked list.

ZRANK is $O(\log n)$ thanks to span augmentation. Without spans, this would be an $O(n)$ base-layer scan. The span values accumulated during descent give the exact rank as a byproduct.

Efficient: $O(\log n + k)$ range scan
-- Return the top 100 players with scores between 4000 and 5000
ZRANGEBYSCORE leaderboard 4000 5000 LIMIT 0 100
✓ Skip list descends to the score boundary in $O(\log n)$, then walks the base layer for $k$ results. Cost scales with the result set, not the total member count.

Avoid on large sets: $O(n)$
-- Returns every member in the set, sorted by score
-- Catastrophic on large sorted sets
ZRANGE leaderboard 0 -1 WITHSCORES
⚠ Without a score or rank bound, Redis must walk the entire base layer. Because Redis processes commands on a single thread, a long-running command like this blocks every other client until it finishes. On a set with millions of members, that can mean seconds of stalled traffic.

The listpack threshold matters for memory. Below 128 members (configurable via zset-max-listpack-entries), Redis uses compact listpack encoding instead of the skip list. If your sorted sets stay small, you never pay the skip list overhead. A sorted set per user with a few dozen entries is far more memory-efficient than one global set with millions of entries, if your access pattern allows it.

Conclusion

The Bet That Paid Off

What looks like an unprincipled shortcut, deciding structure with a coin flip, turns out to be the point. The randomness replaces the most fragile part of balanced trees: the rebalancing logic. Accept probabilistic balance instead of deterministic balance and an entire category of implementation bugs and concurrency hazards disappears, at no cost to the big-O numbers anyone actually cares about.

That’s why the same structure keeps showing up whenever someone needs a sorted index that’s simple to write, simple to reason about, and simple to make concurrent: Redis, LevelDB, RocksDB, Java’s standard library.

The next time you call ZADD or ZRANK, a coin is being flipped somewhere in the call stack. In the decades since Pugh’s paper, it has never come up tails often enough to matter.