Practical rebuilds of these systems — real failovers & chaos drills — are in production onYouTube, soon.

Consistency & Consensus

Consistency models and consensus in production: linearizable vs causal vs eventual, quorum overlap, Raft and Paxos, leader election, multi-raft scaling, and where consensus is overkill.

23 min readupdated 2026-06-28
On this page

“Just make it strongly consistent” is the most expensive sentence a senior engineer can say without flinching. It sounds like asking for correctness. What you are actually ordering is a system where every write coordinates across a majority of machines, where latency is floored by the distance between your nodes, and where a network partition that leaves no side with a majority will stop accepting writes on purpose. Consistency is not a feature you turn on. It is a budget you spend, and most teams spend it in the wrong places.

The confusion starts because two different ideas wear the same word. A consistency model is a contract about what a read is allowed to observe — what counts as “up to date.” Consensus is a protocol that lets independent, failure-prone machines agree on a single ordered sequence of decisions despite crashes, slow links, and lost messages. Consensus is the machinery that implements the strongest consistency contracts. You can have weak consistency with no consensus at all, and you can run consensus and still hand clients stale reads if you wire it up carelessly. Keeping the two separate in your head is the first real skill here.

This article is the long-form mental model: enough to reason about consensus under real load, not just to recite the Raft paper. I lean on Database Replication throughout, because consensus is what you reach for when async replication’s silent data-loss window becomes unacceptable, and on CAP Theorem & Tradeoffs, because every consensus decision is a CAP decision wearing a different hat. The systems that actually ship this — ZooKeeper & etcd — are where the abstractions meet a config file.

The blunt reality up front: consensus does not make your system more available. It makes it more correct under failure, and it does that by being less available during partitions, on purpose. If you do not need that trade, you are paying for an expensive guarantee you will only ever see on your latency graphs.

A motivating failure

A platform team runs Kubernetes for a few hundred services. The control plane’s brain is a 3-node etcd cluster, all in one region, sub-millisecond between members. It is boring and fast — exactly what a control plane’s datastore should be.

Then a well-meaning reliability push lands: “etcd is a single point of failure for the whole cluster, let’s make it survive a region loss.” Someone adds two more voting members in a second region, 70ms away, and calls it a 5-node cluster spanning two regions. On paper it tolerates more failures. In practice it just changed the physics of every write.

etcd commits a write only when a majority — 3 of 5 — has persisted it to disk. With three members local and two remote, the third-fastest acknowledgment now frequently comes from across the region link. Every Kubernetes write — every pod status update, every lease renewal, every kubectl apply — started paying +70ms. The API server, which hammers etcd constantly, got sluggish but stayed up. Tolerable, barely.

The real outage came two weeks later when the inter-region link degraded to 200ms with packet loss. The remote members fell behind, heartbeats started missing their window, and etcd triggered a leader election. The new leader couldn’t keep a stable majority either, so it triggered another. Terms climbed. For eleven minutes the control plane couldn’t commit anything: no pods scheduled, no deployments progressed, no leases renewed, controllers that depend on etcd leases started flapping. Running pods kept serving traffic — the data plane was fine — but the cluster was brain-dead. They couldn’t even roll back the change, because rolling back is a write.

Nothing was buggy. etcd did exactly what a consensus system must: refuse to commit without a stable majority. The outage lived entirely in a misunderstanding — that adding voting members across a slow link improves availability. It does the opposite. The fix was to move the voting set back to three low-latency members and add the remote region as non-voting learners, which replicate the log and serve reads without slowing down a single commit.

The one-sentence mental model

Consensus is getting a set of independent, failure-prone nodes to agree on a single ordered sequence of decisions, such that once a decision is committed every node agrees on it and no failure can un-commit it — and the price of admission is that you need a majority of nodes alive and able to talk to each other before you can decide anything at all.

flowchart TD
  C[Client proposes] --> L[Leader]
  L -->|replicate| F1[Follower 1]
  L -->|replicate| F2[Follower 2]
  F1 -->|ack| L
  F2 -->|ack| L
  L -->|majority acked\ncommit| S[Apply to\nstate machine]
  S --> C

Every clause is a constraint you will meet operationally:

  • A single ordered sequence of decisions → this is why consensus gives you linearizability. Every node replays the same log in the same order, so they all converge on the same state. The log is the source of truth; the database is just a cache of replaying it.
  • Once committed, no failure can un-commit it → this is the durability guarantee you are actually paying for. A committed entry survives any number of leader changes, crashes, and restarts, as long as a majority survives.
  • You need a majority → a 3-node cluster survives 1 failure and needs 2 to make progress; a 5-node cluster survives 2 and needs 3. A partition where neither side holds a majority commits nothing on either side. That is not a bug; it is the entire point.
  • Alive and able to talk → “alive” is not enough. A node that is up but unreachable, or reachable but slow, is as good as dead to the quorum. This is why network jitter, not crashes, causes most consensus pain.

The word majority (strict majority, more than half) is doing enormous work. It is what makes split-brain impossible: two disjoint sets of nodes cannot both contain a majority of the same cluster, so at most one side can ever commit. Every safety property downstream rests on that arithmetic.

How it actually works

Quorum overlap is the whole trick

Before any protocol, understand the math underneath. With N replicas, if every write must land on W nodes and every read must consult R nodes, and you choose them so that W + R > N, then any read set and any write set are guaranteed to overlap in at least one node. That overlapping node has seen the latest committed write, so the read can find it.

Consensus protocols are a disciplined, automated way of running write quorums over an ordered log rather than over individual keys. Set W to a majority and require reads to also touch a majority (or go through the leader), and you get linearizability. This is the same quorum idea that leaderless stores like Cassandra and DynamoDB expose as tunable R and W knobs — consensus just fixes the knobs at “majority” and adds an elected leader to impose a global order.

Consistency models, strongest to weakest

Not every system needs the strongest contract. The art is picking the weakest model that is still correct for the operation, because cost drops sharply as you move down this ladder.

flowchart LR
  LIN[Linearizable\nreal-time order] --> SEQ[Sequential\nsome global order]
  SEQ --> CAU[Causal\ncause before effect]
  CAU --> EVE[Eventual\nconverges later]
  • Linearizable — every operation appears to take effect at a single instant between its invocation and its return, and that ordering respects real (wall-clock) time. A read always sees the most recent committed write. This is the only model where “read the latest value” means what your intuition thinks it means. It requires consensus or a full quorum on every operation, and it is the most expensive thing on this list.
  • Sequential — all nodes agree on some total order of operations, but that order need not match real time. Two clients can disagree about which of their concurrent writes “happened first” as long as everyone agrees on the same answer. Rarely the exact guarantee you want, but cheaper than linearizable.
  • Causal — operations that are causally related (a reply must come after the comment it replies to) are observed in that order everywhere; operations with no causal link can be seen in any order. This is the sweet spot for a huge class of collaborative and social systems. It is dramatically cheaper than linearizable because unrelated writes never coordinate, and far safer than eventual because it never shows you an effect before its cause.
  • Eventual — replicas converge eventually, assuming writes stop long enough. It says nothing about what you read in the meantime; you might read your own write, an old value, or a value that later loses a conflict resolution. Cheapest and most available — the model behind leaderless, partition-tolerant stores.

The jump from causal to linearizable is where coordination cost explodes, because linearizability is the first model that forces a global, real-time agreement on every operation. In my experience most “we need strong consistency” requirements are actually satisfied by causal consistency plus read-your-writes, which you can often get by routing a user’s reads to wherever their writes went, no consensus required.

Raft, at a level you can operate

Two famous protocols solve the “agree on an ordered log” problem: Paxos (older, proven, notoriously hard to reason about; Multi-Paxos powers Google’s Chubby and parts of Spanner) and Raft (designed for understandability; powers etcd, Consul, CockroachDB, TiKV). In production you will almost always meet Raft, because a protocol you can implement correctly beats one you can only admire. It has three moving parts.

1. Leader election. Every node starts as a follower. Each runs a randomized election timeout — in etcd, controlled by --election-timeout (default 1000ms). If a follower hears no heartbeat from a leader within that window, it increments the term number, becomes a candidate, and requests votes from everyone else. A candidate that wins votes from a majority becomes leader and starts sending heartbeats (etcd’s --heartbeat-interval, default 100ms). Randomizing each node’s timeout makes simultaneous candidacies rare, so elections usually converge in one round.

2. Log replication. All client writes go to the leader — there is exactly one writer at a time. The leader appends the write to its own log and replicates it to followers. Once a majority has persisted the entry to disk, the leader marks it committed, applies it to its state machine, and tells the client “done.” Followers that receive a write from a client just redirect it to the leader.

sequenceDiagram
  participant Cl as Client
  participant Le as Leader
  participant F1 as Follower 1
  participant F2 as Follower 2
  Cl->>Le: write x=5
  Le->>Le: append to log
  Le->>F1: AppendEntries
  Le->>F2: AppendEntries
  F1-->>Le: ack
  Note over Le: majority reached
  Le->>Le: commit and apply
  Le-->>Cl: ok
  F2-->>Le: ack (late, fine)

3. Safety. This is the part that makes Raft correct rather than merely live. A node only grants its vote to a candidate whose log is at least as up-to-date as its own. This guarantees that any entry already committed survives every future leader change — a node missing committed entries can never win an election. Combined with the majority rule, it means a committed write is never lost while a majority survives.

The single most important operational relationship in the whole protocol is the ratio between the heartbeat interval and the election timeout. The election timeout must be comfortably larger than the heartbeat interval — etcd recommends the election timeout be at least 10× the heartbeat, and the heartbeat should be roughly the round-trip time between members. Get this ratio wrong relative to your real network jitter and healthy leaders get voted out by followers who simply didn’t hear a heartbeat in time. That is the failure in the opening story, and it is the most common self-inflicted consensus wound there is.

ZooKeeper, the other primitive you will meet, uses its own protocol (ZAB) but the operational shape is identical: a tickTime (default 2000ms) sets the heartbeat unit, and initLimit/syncLimit bound how many ticks a follower may lag before it’s considered out of sync. Same tuning problem, different knob names.

The tradeoffs that bite

These are the decisions that look free in a design review and bill you in production.

The latency floor is non-negotiable. Every linearizable write needs a majority round-trip. With all voters in one availability zone, that round-trip is sub-millisecond and nobody notices. Spread the quorum across regions for durability and every write now waits on the round-trip to the second-closest voter — easily +60–100ms per write, on every write, forever. You bought survival of a region loss and you pay for it on every keystroke. There is no caching your way around it: the coordination is the product.

Availability collapses under the wrong partition. When a partition leaves no side with a majority, the cluster stops committing writes. This is the CP corner of CAP: given a partition, consensus chooses Consistency over Availability. An eventually-consistent system makes the opposite bet — it keeps accepting writes on both sides and reconciles later, trading correctness for uptime. Neither is wrong. They are different products for different jobs, and pretending one is strictly better than the other is how people pick the wrong one.

Cluster size cuts both ways. More nodes tolerate more failures but require a larger majority to coordinate, which makes every write slower. Five nodes (tolerates 2 failures) is the common sweet spot; seven rarely pays for itself; and even node counts are strictly worse than the odd number below them — a 4-node cluster still needs 3 for a majority, so it tolerates the same single failure as a 3-node cluster while coordinating more machines per write. Running an even number of voters is almost always a misconfiguration.

ModelCoordination costReads see latest write?Survives no-majority partition?
LinearizableMajority quorum per opAlwaysNo — minority side blocks
SequentialGlobal order, no real-timeSome agreed orderDepends on impl
CausalTrack dependenciesCausally consistentYes
EventualNoneNo guaranteeYes

The cost of agreement: latency and throughput

Consensus performance is mostly a story about two numbers: how far a majority round-trip travels, and how often the cluster has to elect a new leader.

Write latency is bounded below by the time for the leader to reach a majority and get an fsync-durable acknowledgment back. That is network RTT to the median voter + disk fsync time. On local NVMe with same-AZ voters you can see commit latencies under a millisecond. The instant a voter sits across a region, your floor jumps to that link’s RTT. This is why the geography of your voting set is the single biggest performance decision you will make — bigger than CPU, bigger than the size of the dataset.

Write throughput does not scale by adding voters. Adding a sixth and seventh node to a 5-node cluster makes writes slower, because the leader fans out more replication traffic and waits on a larger majority. A single Raft group tops out at what one leader can push and one majority can fsync. For etcd that’s typically in the low tens of thousands of writes per second on good hardware — plenty for a control plane, nowhere near enough for high-volume application data. That ceiling is exactly why you shard into many groups (below) rather than growing one.

Read performance has more room, and it’s where freshness-vs-cost shows up. There are three flavors:

  • Linearizable reads must confirm the leader is still the leader (a “read index” round-trip to a majority, or a full quorum read like etcdctl get --consistency=l). They pay roughly write-level latency.
  • Leader-local reads trust the current leader’s state. Faster, but a deposed-but-doesn’t-know-it leader can serve a stale value, so they’re only safe with a lease mechanism.
  • Follower / learner reads are the cheapest and most scalable, served from any replica, at the cost of bounded staleness.

The lever almost everyone misses: don’t pay linearizable-read cost for a read that can tolerate 100ms of staleness. Route those to learners and reserve quorum reads for the operations whose correctness genuinely depends on real-time recency. Watch etcd_disk_wal_fsync_duration_seconds and etcd_network_peer_round_trip_time_seconds — when commit latency climbs, it is almost always one of those two, not CPU.

Failure modes

The recurring ways consensus bites, each as symptom → root cause → prevention.

Leader flapping under network jitter. Symptom: the term number climbs steadily, the leader changes every few seconds, and write latency spikes in lockstep with each election. Root cause: the election timeout is tuned for the latency you measured on a calm day, so when real jitter arrives, followers stop hearing heartbeats in time and vote out a perfectly healthy leader; the new leader hits the same jitter and the cycle repeats. Each election stalls writes for the duration of the term change. Prevention: tune the election timeout against p99 inter-node latency under load, not the median on an idle cluster; keep heartbeats well under it; and never run voters on noisy-neighbor hosts that introduce scheduling jitter.

If your consensus cluster’s election timeout is tuned for the latency you saw on a good day, your first bad network day looks like a total write outage. Tune against p99 inter-node latency under real load, give yourself an order of magnitude of headroom over the heartbeat interval, and treat the hosts your voters run on as latency-critical infrastructure — not spare capacity to pack other workloads onto.

Split-brain from a misconfigured quorum. Symptom: after a partition heals, two sides have accepted conflicting writes. Root cause: the quorum size was misconfigured (or an even node count let a 2-2 split feel “balanced”), so a strict majority wasn’t actually enforced. Prevention: trust the algorithm’s majority rule and never override it; run odd voter counts; and verify that a minority partition genuinely refuses writes before you rely on it.

Quorum loss freezes all writes. Symptom: the cluster is up, reads of stale data may still work, but every write hangs or errors. Root cause: a majority is gone — two nodes down in a 3-node cluster — so no leader can be elected and nothing can commit. Prevention: this is correct behavior, so the prevention is operational: size the cluster to tolerate your expected concurrent failures, spread voters across failure domains that don’t fail together, and budget recovery time. Teams that treat the consensus store as “just config” get blindsided here.

The consensus store is a hidden single point of failure. Symptom: an unrelated control plane, scheduler, or service mesh goes dark, and the root cause traces back to an overloaded etcd or ZooKeeper. Root cause: everything that depends on that store for leader election, locks, or membership inherits its availability. A small, quiet quorum can have an enormous blast radius. Prevention: monitor the consensus store as a tier-0 dependency, give it dedicated capacity, and explicitly model what breaks when its quorum is lost — because in a Kubernetes cluster, that’s everything in the control plane (see Kubernetes).

Scaling it

Consensus does not scale by adding voters — that makes writes slower, as we covered. You scale it with three moves, in roughly this order.

1. Keep the voting set small and stable, add learners for reads. Pin the voting membership at 3 or 5 low-latency nodes and add non-voting learners (etcd’s learner feature, or read replicas in CockroachDB) anywhere you need read scale-out or geographic reach. Learners replicate the full log and serve reads but never participate in the commit quorum, so they add read capacity and remote-region presence without slowing a single write. This is exactly the fix from the opening story.

2. Shard the keyspace into many independent consensus groups. Do not run one giant Raft group over your whole dataset. Instead run thousands of small Raft groups, each owning a slice of the keyspace, each doing its own independent consensus. This is the multi-raft model behind CockroachDB and TiKV: the system scales write throughput by adding groups (and the leaders spread across machines), not by adding voters to any one group. This is where consensus meets Sharding & Partitioning — each shard is its own little Raft cluster, and the partition map decides which group owns which keys.

flowchart TD
  KS[Key space] --> P1[Partition A]
  KS --> P2[Partition B]
  KS --> P3[Partition C]
  P1 --> G1[Raft group A\n3 voters]
  P2 --> G2[Raft group B\n3 voters]
  P3 --> G3[Raft group C\n3 voters]
  G1 --> N[Scale by adding\ngroups not voters]
  G2 --> N
  G3 --> N

3. Route reads by required freshness. Send linearizable reads through the leader’s read-index path; send everything that tolerates a little staleness to learners. A surprising fraction of read traffic — dashboards, list views, analytics — is perfectly happy with 100ms-old data and should never touch a quorum.

The general lesson: a single consensus group is a coordination primitive, not a database engine. You build a scalable system out of many small groups, each kept fast and small, and you push the scaling problem up into partitioning and routing.

When to reach for it (and when not to)

Reach for consensus when you need a single source of truth that survives node failure with zero lost committed writes, and where the cost of two nodes disagreeing is worse than the cost of a brief write outage. The canonical jobs: leader election, distributed locks that must actually be mutually exclusive, cluster membership, configuration that must be globally agreed, service discovery, and money-movement ledgers where a lost or duplicated entry is a real-world financial event. This is precisely what ZooKeeper & etcd exist for.

Reach for linearizability specifically when correctness depends on real-time recency: a uniqueness constraint, a lock acquisition, a “has this exact request already been processed” idempotency guard (API Design & Idempotency), or a fencing token that must strictly increase. These are the cases where reading a slightly stale value produces a wrong answer, not just an old one.

Don’t reach for consensus for high-volume application data that tolerates eventual or causal consistency: user feeds, product catalogs, view counts, session state, analytics, caches. Wrapping a counter in Raft buys you a latency tax and an availability dependency you did not need. A cache belongs in Redis; a feed belongs in something built for throughput, not agreement.

Don’t run your own consensus implementation. Implementing Raft correctly — including the membership-change, snapshotting, and log-compaction edge cases — is a multi-year effort with a long tail of subtle bugs that only appear under partition. Use etcd, ZooKeeper, or a database that embeds consensus, and spend your complexity budget on your actual product.

Don’t confuse durability with consensus. If your real requirement is just “don’t lose the last few writes on failover,” semi-synchronous Database Replication is far cheaper and may be all you need. Full consensus is for when you need agreement on order, not merely don’t lose data.

When to consider alternatives

  • High-throughput data that tolerates eventual consistencyCassandra or DynamoDB with tunable quorums, not a single Raft group.
  • A fast, lossy coordination cache (rate-limit windows, ephemeral locks)Redis, accepting that its locks are an optimization, not a correctness guarantee.
  • Durable ordered event log at scaleKafka, which gives you per-partition order and durability without per-operation global consensus.
  • “Don’t lose the last few writes” durability → semi-synchronous Database Replication instead of full consensus.
  • Off-the-shelf coordination primitives (locks, leader election, membership)ZooKeeper & etcd rather than anything hand-rolled.
  • Understanding the underlying availability tradeCAP Theorem & Tradeoffs.

Operational checklist

  • Run an odd voting count (3 or 5); never an even number — it tolerates the same failures while coordinating more machines per write.
  • Keep all voting members on a low-latency network (same region, ideally spread across AZs within it); push remote regions to non-voting learners.
  • Tune the election timeout against p99 inter-node latency under load, with a heartbeat well under it (etcd: heartbeat ≈ RTT, election timeout ≥ 10× heartbeat); widen both at the first sign of leader flapping.
  • Alert on term-number churn and leader-change rate (etcd_server_leader_changes_seen_total); rising terms mean elections are firing, which means write stalls.
  • Watch commit latency drivers: etcd_disk_wal_fsync_duration_seconds and peer round-trip time, not just CPU.
  • Place voters on dedicated, low-jitter hosts; never co-locate them with noisy-neighbor workloads that steal scheduling time.
  • Budget the consensus store’s own availability as a tier-0 dependency; explicitly model what breaks when its quorum is lost.
  • Scale reads with learners and writes with many small consensus groups — never by adding voters to one group.
  • Reserve linearizable reads for operations that truly need real-time recency; route stale-tolerant reads to learners.

Summary

Consistency and consensus are two ideas that share a word: one is a contract about what a read may observe, the other is the protocol that enforces the strongest such contracts. Consensus works by getting a majority of nodes to agree on a single ordered log, which buys you linearizability and zero lost committed writes — and costs you a majority round-trip on every write plus a deliberate refusal to make progress when no side holds a majority. The numbers that govern your life are the geography of your voting set (it sets the latency floor) and the ratio of heartbeat to election timeout (get it wrong and a jittery network becomes a write outage). Keep the voting set small, odd, and local; add learners for reads; shard into many small groups for write scale; and reserve the whole expensive apparatus for the handful of decisions — locks, leader election, ledgers, uniqueness — where two machines disagreeing is genuinely worse than a brief outage. For everything else, a weaker model is cheaper, faster, and correct enough. The skill is not knowing how Raft works. It is knowing the smallest place to put it.

Appendix: quorum arithmetic refresher

If the body moved fast over the math, here is the floor.

  • A quorum is a minimum number of nodes that must participate in an operation. A majority quorum is more than half: floor(N/2) + 1.
  • Why majorities prevent split-brain: any two majorities of the same set must share at least one member (you can’t split N nodes into two groups that each have more than half). So two disjoint partitions can never both reach a write quorum — at most one side commits.
  • Failure tolerance: a cluster of N voters tolerates floor((N-1)/2) simultaneous failures and still has a majority. N=3 tolerates 1; N=5 tolerates 2; N=7 tolerates 3.
  • Why even is wasteful: N=4 needs 3 for a majority and so tolerates only 1 failure — identical to N=3, but with an extra machine to coordinate and pay for.
  • Read/write overlap: if writes touch W nodes and reads touch R nodes and W + R > N, every read overlaps every write by at least one node, guaranteeing the read can observe the latest committed value. Consensus fixes W to a majority and reads through the leader or a majority to achieve linearizable reads.

That arithmetic — strict majorities and guaranteed overlap — is the entire safety foundation. Every Raft and Paxos safety proof is, underneath, a careful argument that these overlaps hold across every possible sequence of failures.

Further reading

Incidents & deep-dives

Where this system breaks in production — and how it comes back.

Documenting next

  • 🔒 Raft Leader Flapping Under Network Jitterroadmap →