Consistent Hashing

Why mod-N sharding melts down the day you add a node, how the hash ring, virtual nodes, and bounded loads fix it, and where Cassandra, Dynamo, Envoy, and CDNs actually rely on it.

22 min readupdated 2026-06-28

On this page

There is a specific kind of outage that only happens when things are going well. Traffic is up, the cache tier is humming, so you add a node to keep ahead of demand — the most routine capacity operation there is. And the instant that node joins, the hit rate falls through the floor, every request stampedes to the origin, and the database you were protecting with the cache falls over. You scaled up and triggered an outage in the same motion.

The culprit is almost always hash(key) % N. It is the first sharding scheme everyone writes because it is one line and it distributes evenly. It is also a trap with a delay timer on it: the assignment of every key depends on the total number of nodes, so the day N changes, nearly every key moves. Consistent hashing is the fix, and it is one of the genuinely load-bearing ideas in distributed systems — the thing that lets Cassandra rebalance, DynamoDB spread partitions, and a CDN add edge servers without cold-starting the world.

This article is the working model: the precise failure mod-N causes, how the ring and virtual nodes bound the damage, what bounded loads do that vnodes can’t, and where the idea actually shows up in production systems. It underpins Sharding & Partitioning, the token ring in Cassandra and DynamoDB, the slot map in Redis Cluster, and session affinity in Load Balancing. For the failure-tolerance backdrop, see CAP Theorem & Tradeoffs; upcoming work is on the roadmap.

The one thing to carry out of here: consistent hashing solves rebalancing cost, and only rebalancing cost. It is not a load balancer, it is not a hot-key fix, and it will route a million requests for one key to a single node by design. Knowing exactly what it does and does not buy you is the difference between using it well and being surprised by it at 3am.

A motivating failure

A commerce team runs a memcached tier in front of Postgres. Eight nodes, client-side sharding with hash(key) % 8, hit rate steady at 96%. The origin sees 4% of read traffic and is bored. For a year, nobody touches it.

Black Friday is coming, so two weeks out they do the responsible thing and add four nodes to the pool — N goes from 8 to 12. The deploy is clean. No errors. And then the dashboards light up.

Because the shard for every key is hash(key) % N, changing N from 8 to 12 remaps roughly two-thirds of all keys to a different node. From the cache’s point of view, those keys simply vanish — they’re now looked up on a node that has never seen them. Hit rate craters from 96% to the low 30s in under a minute. Every miss falls through to Postgres. The origin, sized to absorb 4% of reads, suddenly eats 60%+. Connection pool exhausts, p99 goes from 5ms to multiple seconds, and the “capacity increase” becomes a full outage during the exact week it was meant to prevent one.

Nothing was broken. memcached did its job, the new nodes were healthy, the deploy was correct. The outage lived entirely in the sharding math: a scheme where adding capacity invalidates most of the existing cache. The fix that shipped that night was a consistent-hashing client ring, after which adding the same four nodes moved about a third of keys instead of two-thirds, and the hit rate dipped by single digits instead of collapsing. The lesson the team wrote down: with mod-N, the cost of scaling is proportional to how much you already have, which is exactly backwards from what you want.

The one-sentence mental model

Map both keys and nodes onto the same circular hash space, and a key belongs to the next node you hit walking clockwise — so adding or removing a node only disturbs the keys in one arc, not all of them.

flowchart TB
  subgraph Ring[Hash ring 0..2^32-1]
    A[Node A\npos 100]
    B[Node B\npos 8000]
    C[Node C\npos 24000]
  end
  K1[cart:42\nhash 150] --> A
  K2[user:9\nhash 9000] --> C
  K3[sku:7\nhash 24050] --> A

Every clause is an operational constraint, so unpack it slowly:

Same hash space → keys and nodes are hashed into the same range (commonly 0 to 2^32 - 1), which makes them directly comparable. A node is just a point on the circle, same as a key.
Next node clockwise → ownership is a range, not a modulus. Each node owns the arc from the previous node up to its own position. That arc is the unit that moves.
Only one arc moves → a topology change is a local event. The whole reason the ring exists is to convert a global rehash into a local handoff.

Here is the math that makes it matter. With mod-N, changing the node count remaps about (N-1)/N of all keys — for N=8 that’s 87.5%, which is why the story above collapsed. With consistent hashing, adding one node to a cluster of N moves only about 1/(N+1) of keys — the slice the new node now owns, lifted off a single neighbor. Removing a node folds its arc into the next node clockwise, again touching only that node. The blast radius is one arc, independent of how big the cluster already is. That single property is the entire reason the technique exists.

How it actually works

The mod-N trap, precisely

shard = hash(key) % N distributes keys evenly while N is fixed, and that’s the seduction — in a static benchmark it looks perfect. The poison is the % N: the output for a given key is a function of the current node count. hash("user:9") % 8 and hash("user:9") % 12 are essentially uncorrelated. So when N changes, the mapping for nearly every key changes with it.

For a cache, that means a near-total miss storm and the thundering herd from the opening story. For a database, it’s worse: rebalancing means physically moving almost the entire dataset between nodes — hours of streaming I/O, a window where data is in flight, and a real risk of inconsistency while it lands. mod-N makes the cost of any membership change scale with your total data, which is the one thing you can’t afford as you grow.

flowchart TB
  Add[Add one node\nN -> N+1] --> Q{Scheme?}
  Q -->|mod-N| MN[~all keys\nremap]
  Q -->|ring| RG[one arc\nmoves]
  MN --> Bad[miss storm\norigin overload]
  RG --> Good[bounded\nhandoff]

The ring and clockwise ownership

Consistent hashing places both nodes and keys on a circular keyspace. You hash each node’s identity — hash("10.0.1.7:6379") or a logical token — to a position on the circle. To place a key, hash it and walk clockwise to the first node position you encounter. That node owns the key. Lookups are a search for the successor point, which a sorted structure (a tree or a sorted array with binary search) does in O(log V) where V is the number of points on the ring.

Now watch a node join. Add node D at position 16000, sitting between B (8000) and C (24000). Before, C owned everything in (8000, 24000]. After, the keys in (8000, 16000] belong to D and the rest still belong to C. Only that one arc changed hands. Every key outside (8000, 16000] maps exactly where it did before — A’s keys, B’s keys, and the upper half of C’s keys never move. Remove a node and the reverse happens: its arc merges into the next node clockwise, and nothing else is disturbed.

sequenceDiagram
  participant Client
  participant Ring
  Client->>Ring: place key (hash=150)
  Ring->>Ring: walk clockwise from 150
  Ring-->>Client: owner = Node A
  Note over Ring: add Node D at 16000
  Client->>Ring: place key (hash=12000)
  Ring->>Ring: walk clockwise from 12000
  Ring-->>Client: owner = Node D now
  Note over Ring: only arc 8000..16000 moved

Virtual nodes

The naive ring has an ugly statistical flaw. With only a handful of nodes hashed to random positions, the arcs between them are wildly uneven — hashing eight nodes onto a circle does not produce eight equal slices. One node might own 35% of the ring by pure luck while another owns 4%. Worse, when a node dies, its entire arc dumps onto the single neighbor clockwise of it, instantly doubling that neighbor’s load. You’ve traded the mod-N rehash for a hot-spot lottery.

The fix is virtual nodes (vnodes, or “tokens”): each physical node is hashed to many positions on the ring instead of one. Cassandra historically defaulted to 256 tokens per node; Dynamo-style systems use a few hundred. With 256 small arcs per physical node, the law of large numbers smooths the distribution so each physical node owns close to its fair share. And when a node dies, its many small arcs scatter across all the remaining nodes rather than piling onto one — the recovery load is spread, not concentrated.

Vnodes also give you weighting for free. Heterogeneous hardware is the norm — you replace nodes over years and the new boxes have more RAM and faster disks. Give a bigger node more tokens and it owns proportionally more of the ring. Want a node to carry 2× the load of its peers? Give it 2× the vnodes. No special code, just more points on the circle.

flowchart LR
  PA[Physical A] --> VA1[token a1]
  PA --> VA2[token a2]
  PA --> VA3[token a3]
  PB[Physical B] --> VB1[token b1]
  PB --> VB2[token b2]
  PB --> VB3[token b3]
  VA1 --- R((ring))
  VA2 --- R
  VA3 --- R
  VB1 --- R
  VB2 --- R
  VB3 --- R

Replica placement on the ring

Real distributed stores don’t keep one copy of anything, and this is where the ring does double duty. To place RF replicas (replication factor), you don’t stop at the first node clockwise — you keep walking and pick the next RF distinct physical nodes. The ring therefore defines both the primary owner and the replica set in one mechanism, which is elegant and also a sharp edge.

The sharp edge: because each physical node owns many vnodes, a careless walk can pick two tokens of the same physical node as two “replicas,” silently dropping your real replication factor to 1 for those keys. A correct walk skips any physical node already chosen. And for multi-AZ durability you want the walk to be rack/zone-aware — Cassandra’s NetworkTopologyStrategy does exactly this, ensuring replicas land in distinct failure domains so a single rack or availability-zone loss doesn’t take all copies with it. Getting this walk right is more subtle than the ring itself, and it’s where most home-grown implementations have bugs.

Bounded loads

Even a perfectly balanced ring with thousands of vnodes can’t save you from one nasty reality: a single popular key. A celebrity’s profile, a viral product, the homepage object — that key hashes to exactly one point and lands on exactly one node, and the ring will faithfully route all of its traffic there while neighbors sit idle. Vnodes balance keyspace; they do nothing for load skew within a key.

Consistent hashing with bounded loads (the Google/Vimeo variant) addresses this when the ring routes requests to capacity-limited backends rather than data to storage. You set a capacity ceiling — say 1.25× the cluster mean. When a key’s natural owner is already at its bound, the algorithm walks further clockwise to the next under-capacity node and places the request there. You trade a little locality (the key’s load now lands in two places) for a hard ceiling on per-node load, which is precisely what you want for a load balancer feeding backends that fall over past a fixed RPS. Envoy implements this as the ring_hash and maglev load balancers; the maglev variant trades a touch of consistency for a much faster, fixed-size lookup table.

Where it actually shows up

Once you know the shape, you see it everywhere:

Cassandra and ScyllaDB partition the entire keyspace onto a token ring; the partition key’s hash picks the primary, and replicas are the next RF distinct nodes clockwise, optionally zone-aware.
DynamoDB descends directly from the 2007 Dynamo paper, which is where most of this vocabulary (ring, vnodes, preference list) entered the mainstream.
Redis Cluster is the interesting exception — it uses a fixed 16384 hash slots (CRC16(key) mod 16384) rather than a continuous ring, which is a bucketed cousin of consistent hashing: moving a slot moves only that slot’s keys, and the slot count never changes.
Distributed caches — memcached client libraries (ketama), older client-side Redis sharding — ring-hash keys so adding a node doesn’t cold-start the tier (the fix from the opening story).
Load balancers — Envoy’s ring_hash/maglev, HAProxy, Nginx’s consistent hash directive — use it for session affinity so a client keeps hitting the same backend across pool changes.
CDNs map object keys to edge cache servers so the cache survives capacity changes.

The common thread: a membership set that changes over time, where you need a stable mapping and cheap rebalancing.

The tradeoffs that bite

These are the choices that look free when you adopt the ring and bill you later:

Even distribution vs metadata cost. More vnodes means smoother balance but a larger ring to gossip, store, and search. Cassandra’s old 256-token default made bootstrap, repair, and gossip noticeably slower on large clusters; modern guidance dropped to 16 tokens paired with a smarter token-allocation algorithm precisely to cut that overhead.
Locality vs hot-key protection. Bounded loads spread an overloaded key’s overflow to neighbors, but now that key’s data or session lives in more than one place — you traded a hot spot for reduced cache locality and a fan-out on reads.
Replica overlay complexity. The ring defines replica placement, so the clockwise walk must skip already-chosen physical nodes and respect failure domains. This is the subtlest part to implement and the easiest to get silently wrong.
Per-request walk vs precomputed routing. Walking the ring is O(log V) per lookup. At LB-scale request rates you may materialize the assignment into a flat table for O(1) routing (maglev’s whole pitch), refreshed only on topology change.
Hash quality matters. A weak or non-uniform hash function reintroduces skew that no number of vnodes can fix. Use a well-distributed hash (Murmur3, which Cassandra uses, or xxHash) — never a naive string hash.

Scheme	Keys moved adding a node	Balance	Hot-key handling	Lookup cost
`mod-N`	~all (`(N-1)/N`)	even while static	none	`O(1)`
Ring, no vnodes	`~1/(N+1)`	poor (luck-dependent)	none	`O(log N)`
Ring + vnodes	`~1/(N+1)`	good	none	`O(log V)`
Ring + vnodes + bounded loads	`~1/(N+1)`	good	capped per node	`O(log V)` + walk
Fixed slots (Redis)	only moved slots	good	none	`O(1)` table

Performance and distribution analysis

The thing worth quantifying about consistent hashing is how well it balances and what it costs to look up, because both drive real decisions.

Distribution quality. With a single token per node, the expected load is fair but the variance is brutal — the standard deviation of arc sizes is on the order of the mean itself, so a small cluster routinely sees nodes carrying 2–3× their fair share. Adding vnodes shrinks that variance: with k vnodes per node, the load variance drops roughly proportional to 1/sqrt(k). This is why 100–256 tokens is the sweet spot — at k=256, per-node load typically lands within a few percent of the mean, while k=1 is a coin-flip. Concretely, if you measure steady-state per-node load variance above ~20%, you almost certainly have too few tokens or a bad hash function.

Lookup cost. A ring with N physical nodes and k vnodes has V = N·k points. A successor search over a sorted array is O(log V) — for a 100-node cluster at 256 tokens that’s log2(25600) ≈ 15 comparisons per lookup, which is nothing. The real cost isn’t the search, it’s keeping the ring current: every node must learn about membership changes (via gossip or a coordination service like ZooKeeper), and a 256-token ring is 256× more state to propagate than a single-token one. That’s the actual reason Cassandra walked the default back down.

Rebalance cost. This is the number that matters most operationally. Adding the (N+1)th node moves ~1/(N+1) of the data. For a 1 TB-per-node, 20-node cluster, adding one node streams roughly 1000/21 ≈ 48 GB onto the new node — bounded and predictable. The same operation under mod-N would move close to the entire 20 TB. The ring turns an unbounded migration into a bounded one, and that bound is the whole value proposition.

Bounded-loads overhead. The bounded-loads variant adds a “walk until under capacity” step, which in the worst case touches several nodes per placement. In practice, with a capacity factor around 1.25, the average walk is barely longer than one hop, because overflow is rare when load is already near-uniform. The cost only shows up under genuine skew — which is exactly when you want it spending cycles.

Failure modes

How consistent-hashing systems break in production, as symptom → root cause → prevention:

Skew without enough vnodes. Symptom: one node runs hot on CPU and disk while peers idle; per-node load variance of 2–3× at steady state. Root cause: too few tokens per node leaves uneven arcs, or a poor hash function clusters keys. Prevention: enough vnodes for the cluster size, a well-distributed hash (Murmur3/xxHash), and an alert on per-node ownership skew.

Rebalance storm on node loss. Symptom: a node dies and cluster-wide read latency spikes; the network saturates. Root cause: the lost node’s arcs move to neighbors and the system streams replicas to restore RF — a Cassandra replacement node bootstrapping can saturate links and starve foreground traffic. Prevention: throttle streaming throughput (stream_throughput_outbound_megabits_per_sec), bring replacements up during low traffic, and alert on streaming bandwidth.

Hot key defeats the ring. Symptom: one shard is pegged while the rest of the cluster is idle, and adding nodes does nothing. Root cause: the unit of distribution is the key; a single key is one point on the ring no matter how many vnodes exist. Prevention: bounded loads, key splitting (counter:{0..N} summed on read), or a fronting cache. Do not add more vnodes expecting relief — it can’t help.

Ring disagreement / split mapping. Symptom: reads and writes for the same key land on different nodes; transient inconsistency. Root cause: nodes gossip stale membership and disagree about who owns an arc during a topology change. Prevention: a converging membership protocol with a version/epoch, alerting on prolonged disagreement, and for strong cases a coordinator like ZooKeeper as the source of truth for the ring.

Silent replica collision. Symptom: data loss after a single node failure that “should” have been survivable at RF=3. Root cause: a buggy replica walk picked two vnodes of the same physical node, so the real replication factor was 1 for those keys. Prevention: the walk must dedupe by physical node and be zone-aware; test it explicitly with a topology that has few physical nodes and many tokens.

Consistent hashing solves rebalancing cost, not load skew from individual keys. If one key is 1000× hotter than the rest, the ring routes all 1000× to one node — that is the design working as intended, not a bug. Reach for bounded loads or split the key. Adding more vnodes to fix a hot key is the single most common mistake people make with the ring, and it accomplishes exactly nothing.

Scaling it

What changes as the cluster grows by 10× and 100×:

Scale out incrementally. The headline property is that adding capacity moves only ~1/(N+1) of data. Add nodes one at a time and let each finish absorbing its arc (and replica streams) before adding the next — adding ten at once multiplies the concurrent streaming load and can saturate the network.
Re-tune vnode count with cluster size. What’s smooth at 10 nodes may be wasteful gossip overhead at 500. Re-evaluate token count when the cluster grows an order of magnitude; this is exactly the journey behind Cassandra’s 256 → 16 default change.
Make the replica walk failure-domain-aware. At scale you span racks and zones. Define replica placement as the next RF distinct physical nodes clockwise, constrained to distinct zones, so a zone loss never takes all copies. See Database Replication for the durability side.
Bounded loads at the LB tier. When the ring routes requests to capacity-limited backends (not data to storage), cap per-node load so a hot key can’t overwhelm one server while others idle. See Load Balancing.
Precompute routing where rates demand it. At very high request rates, materialize the ring into a lookup table (maglev-style) refreshed on topology change rather than walking it per request. You trade a little memory and a touch of consistency-on-change for O(1) routing.
Mind the gossip ceiling. The wall you eventually hit isn’t the hashing — it’s propagating membership across hundreds of nodes fast enough that the ring stays consistent. That’s when a dedicated coordination layer earns its keep.

When to reach for it (and when not to)

Reach for consistent hashing whenever the set of backends changes over time and you need a stable key-to-node mapping with cheap rebalancing: distributed caches (memcached/Redis client rings), partitioned databases (Cassandra, DynamoDB), sticky load balancing across a changing pool, sharded queues, and CDN edge placement. The signal is “I will add and remove nodes, and I can’t afford to remap everything when I do.”

Don’t reach for it when the node set is genuinely fixed and tiny — plain mod-N is simpler and you’ll never resize, so the ring is just overhead. And don’t expect it to serve range queries: hashing deliberately destroys key ordering, so “all orders between two timestamps” becomes a scatter-gather across every node on the ring. If ordered scans dominate your access pattern, use range partitioning instead (split by key range, keep order) — see Sharding & Partitioning. The honest tradeoff is that hash partitioning gives you even distribution and kills range scans; range partitioning preserves scans and risks hot ranges. Pick the one that matches your dominant query.

When to consider alternatives

Ordered/range scans dominate → range partitioning, not a hash ring → Sharding & Partitioning.
Fixed, small, never-resized node set → plain mod-N is simpler and fine.
One key is the bottleneck (hot key) → bounded loads, key splitting, or a fronting cache in Redis — the ring alone won’t help.
You need strong agreement on who owns what → a coordination service as the source of truth → ZooKeeper.
Per-request O(1) routing at LB scale → maglev-style precomputed tables → Load Balancing.
Even data spread with no resize concern, plus secondary indexing → a managed store that hides the ring → DynamoDB.

Operational checklist

Set vnodes/tokens per node deliberately for your cluster size; measure steady-state per-node load variance and re-tune if it exceeds ~20%.
Use a well-distributed hash function (Murmur3, xxHash); never a naive string hash that clusters keys.
Make replica placement walk to distinct physical nodes, and zone/rack-aware for multi-AZ durability.
Test the replica walk on a small-physical-node, many-token topology to catch silent replica collisions.
Alert on per-node ownership skew and on rebalance/streaming throughput during topology changes.
Throttle bootstrap/decommission streaming so a single node replacement can’t saturate the network.
Identify hot keys explicitly (request sampling, key-level metrics); apply bounded loads or key splitting rather than over-provisioning vnodes.
Verify the ring converges after membership changes (gossip epoch / state version), and alert on prolonged disagreement.
Add nodes one at a time; let each absorb its arc and replica streams before the next.
For cache rings specifically, confirm hit-rate impact of a node addition in staging before doing it in production.

Summary

Consistent hashing exists to answer one question: how do you add or remove a node without remapping everything? mod-N fails that question catastrophically — changing the node count moves ~(N-1)/N of all keys, which is why adding a cache node can melt your origin. The ring fixes it by mapping keys and nodes into the same circular space and assigning each key to the next node clockwise, so a topology change disturbs only one arc — ~1/(N+1) of keys. Virtual nodes smooth the otherwise-lottery distribution and let you weight heterogeneous hardware; the clockwise replica walk must dedupe physical nodes and respect failure zones; and bounded loads cap per-node load when the ring routes requests to finite backends. The one belief to internalize: the ring solves rebalancing cost and nothing else — it will route a hot key’s entire load to one node by design, so reach for bounded loads or key splitting for skew, and range partitioning when you need ordered scans. Get the vnode count, the hash function, and the replica walk right, and the ring becomes the quiet machinery under Cassandra, Dynamo, Envoy, and every CDN — invisible until someone tries to scale with mod-N and learns why it isn’t.

Appendix: the hashing math, briefly

For readers who want the fundamentals restated:

A hash function maps an arbitrary key to a fixed-size integer, ideally uniformly — every output equally likely, no clustering. Consistent hashing’s balance depends entirely on this uniformity; a biased hash means biased arcs that no number of vnodes can rescue.
The keyspace is the range of those integers, treated as a circle: 2^32 (or 2^64, or Cassandra’s signed 2^64 token range) values where the maximum wraps back to the minimum. “Clockwise” is just increasing integer value, wrapping at the top.
Successor lookup is “find the smallest node position ≥ hash(key), wrapping around” — a binary search over the sorted node positions, O(log V).
1/(N+1) intuition: when a new node claims its arc, it takes a fair slice from the circle, and a fair slice of N+1 equal owners is 1/(N+1) of the whole. That’s the expected fraction of keys that move — bounded and shrinking as the cluster grows, the exact opposite of mod-N.

The deeper data-distribution treatment is in Sharding & Partitioning; the coordination that keeps a ring consistent across nodes is in Consistency & Consensus.

Incidents & deep-dives

Where this system breaks in production — and how it comes back.

No incident deep-dives yet. See the roadmap for what's coming.