Database Replication
Replication isn't a backup feature — it's a distributed system you opt into. Sync vs async, leader topologies, lag, read-your-writes, and the failover that loses your writes.
On this page
Replication gets sold as “high availability, for free — just add a replica.” Then a failover quietly drops the last forty seconds of writes, a read replica serves a profile edit that hasn’t landed yet, and one bad afternoon two nodes both decide they’re the leader. Replication is not a checkbox you tick on the way to production. It is the precise moment your single database stops being a single database and becomes a distributed system, with every consistency problem that implies dragged along for the ride.
The whole subject collapses into one question that every replication setup answers, whether the team meant to answer it or not: when do you tell the client the write succeeded — before the copies land on other nodes, or after? That single decision sets your durability, your write latency, and the size of the data-loss window you’ll eat on failover. Everything else — topologies, quorums, lag dashboards, fencing — is downstream of it.
This is the long-form context article. It’s the thing I wish someone had walked me through before my first 2am “the database is fine but the writes are gone” page. It leans on PostgreSQL for the concrete WAL mechanics, on Consistency & Consensus and CAP Theorem & Tradeoffs for the theory underneath the operational choices, on Sharding & Partitioning for what happens when replicas stop being enough, and on Cassandra and DynamoDB for the leaderless model. If you want the in-memory cousin of all of this, Redis makes the same async-failover bet with even sharper edges.
One framing to carry through the whole article: a replica is not a backup. A backup is a point-in-time copy you restore from deliberately. A replica is a live, lagging mirror that will faithfully replicate your DROP TABLE to every node in under a second. They solve different problems. Confusing them is the first mistake, and it’s a surprisingly common one.
A motivating failure
A logistics company runs Postgres behind their dispatch system. After an async-failover scare lost a few minutes of delivery updates, leadership made a reasonable-sounding decree: no more data loss, turn on synchronous replication. The platform team obliged. They set synchronous_commit = on and pointed synchronous_standby_names at their one standby in another availability zone. Tested it, writes still worked, latency was up a couple milliseconds, everyone moved on.
Three weeks later, that standby’s EBS volume hit a noisy-neighbor I/O stall. Not a crash — just a standby that slowed to a crawl, taking 400ms to fsync each batch instead of 2ms. On the primary, nothing was wrong. CPU was fine, disk was fine. But every single commit was now blocking until that one sick standby acknowledged, because that’s exactly what synchronous replication promises.
Write latency went from 4ms to 400ms across the entire fleet. The dispatch API’s connection pool filled with requests all stuck waiting on commits, the pool exhausted, health checks started failing, and the load balancer pulled API nodes out of rotation. A storage hiccup on a machine that wasn’t even serving traffic had become a full write outage on the primary. The fix that was supposed to prevent an outage caused a bigger one.
The root cause wasn’t the I/O stall — that’s just weather. It was a single named synchronous standby with no fallback, which is strictly less available than async: you’ve added a second machine that can take you down without removing the first. The lesson the team carried out of the postmortem is the spine of this article: synchronous replication couples your write availability to your slowest acknowledging replica, and “one sync standby” is the worst of both worlds. The right config was ANY 1 (s1, s2, s3) — durable, but able to lose any single standby without blocking a thing.
The one-sentence mental model
Replication is copying a stream of writes from one node to others, and choosing how long the writer waits for those copies to land — wait for nothing and you’re fast but lossy, wait for everyone and you’re durable but you’ve welded your write latency to your slowest replica and your availability to your flakiest network link.
flowchart LR App[Client] -->|write| L[Leader] L -->|ack now\nor after?| App L -->|write stream| R1[Replica 1] L -->|write stream| R2[Replica 2] R1 -. stale reads .-> App R2 -. stale reads .-> App
Three independent knobs hide inside that one sentence, and almost every replication mistake is two of them getting confused for each other:
- How many copies the leader waits for before acking — the durability knob. Zero copies (async), one copy (semi-sync), all copies (full sync). This sets your data-loss window on a leader crash.
- Which topology carries the stream — the write-availability knob. One leader, many leaders, or no leader. This sets where writes are even allowed and whether you have conflicts to resolve.
- What a read sees off a replica that’s behind — the consistency knob. Every replica you add for read throughput is a replica that can hand back stale data.
The trap is treating these as one dial. A team “tunes for safety” by cranking durability to full sync and accidentally destroys write availability (the motivating failure). Another team “scales reads” by adding replicas and accidentally breaks read-your-writes for every user. The knobs are separate. Reason about them separately.
How it actually works
It’s all log shipping underneath
Strip away the marketing and every replication mode is the same primitive: ship a log of changes from one node and replay it on the others. Postgres ships its Write-Ahead Log (WAL); MySQL ships the binlog; MongoDB has the oplog; Cassandra streams mutations and hints. The leader appends each change to its log before applying it (that’s the “write-ahead” part, and it’s the same log that gives you crash recovery), then replicas pull or receive that log and replay it in order.
Two details matter operationally. First, the log has finite retention. Postgres keeps WAL segments around until replicas confirm they’ve consumed them (wal_keep_size, or a replication slot that pins them). A replica that falls far enough behind can fall off the end of retained WAL and need a full re-sync from a base backup — hours of pain. Second, replay is itself work. A replica isn’t a passive tape recorder; it’s executing every change the leader did. A replica on weaker hardware, or one running heavy read queries, can replay slower than the leader writes, and then lag grows without bound.
Synchronous, asynchronous, and the middle that you actually want
The durability knob has three settings, and contrary to most people’s instinct, the middle one is where the majority of production systems should live.
sequenceDiagram participant App participant Leader participant Replica App->>Leader: write Note over Leader: async path Leader-->>App: OK (own WAL durable) Leader->>Replica: stream later Note over Leader,Replica: gap = data-loss window
- Asynchronous. The leader acks as soon as its own WAL is durable on local disk, then streams to replicas whenever it gets around to it. Fastest possible writes. The cost: if the leader crashes with un-streamed changes in flight, those acknowledged writes are gone, and the client was told they succeeded. This is the Postgres default, the MySQL default, and the Redis default.
- Synchronous. The leader blocks the client until at least one replica confirms the change is durable on its disk too. Zero loss on a single-node failure. The cost: every write now pays a network round-trip to a replica, and — the part that bites — if that replica stalls, your writes stall with it. One sick standby becomes a write outage, as the motivating failure showed.
- Semi-synchronous. Wait for one replica to confirm, let the rest follow asynchronously. You get a guaranteed second durable copy without coupling latency to your slowest node. Postgres expresses this as
synchronous_commit = onwithsynchronous_standby_names = 'ANY 1 (s1, s2, s3)'. For systems that need durability without heroics, this is almost always the right answer.
The non-obvious rule, learned the hard way: ANY N over a pool, never a single named standby. synchronous_standby_names = 's1' means losing s1 blocks every write. ANY 1 (s1, s2, s3) means any one of three confirming is enough, so you can lose two and keep serving. Same durability guarantee, wildly different availability.
Single-leader, multi-leader, leaderless
The topology knob decides where writes are allowed to happen, and therefore whether you have conflicts to resolve at all.
flowchart TD
subgraph Single
SL[Leader] --> SR1[Replica]
SL --> SR2[Replica]
end
subgraph Multi
ML1[Leader A] <--> ML2[Leader B]
end
subgraph Leaderless
Q1[Node] --- Q2[Node]
Q2 --- Q3[Node]
Q1 --- Q3
end
| Topology | Writes accepted at | Conflicts? | Typical home |
|---|---|---|---|
| Single-leader | One node only | None — total order | Postgres, MySQL, most OLTP |
| Multi-leader | Several nodes | Yes — must resolve | Multi-region active-active |
| Leaderless | Any node (quorum) | Yes — read-repair | Dynamo-style, Cassandra |
Single-leader is the default for a deeply good reason: one writer means one authoritative order of events, so there is nothing to reconcile. Ever. You scale reads by adding replicas behind the leader. You do not scale writes this way — every write still funnels through the one leader, and that leader’s capacity is your write ceiling.
Multi-leader lets two regions both accept writes locally, which is wonderful for write latency and survives a region partition — right up until both regions update the same row and you must decide who wins. Last-write-wins (by wall-clock timestamp) is the seductive default and it silently discards data, made worse by clock skew between regions. The honest options are conflict-free replicated data types (CRDTs) or application-level merge logic, and both are real engineering, not a config flag.
Leaderless drops the leader entirely: clients write to W nodes and read from R nodes out of N total, and when W + R > N the write set and read set are guaranteed to overlap, so a read sees the latest write. This is the Dynamo model — see DynamoDB and Cassandra. It trades the single point of failure for the ongoing burden of quorum tuning, read-repair, and reasoning about consistency without a single source of truth.
Read-your-writes: the routing problem nobody plans for
Add a read replica and you’ve created a subtle correctness bug for free. A user updates their email, the app immediately reads it back to render the confirmation page, that read happens to hit a replica that’s 300ms behind, and the page shows the old email. The write succeeded. The read lied. Multiply by every “submit then redirect to view” flow in your app.
sequenceDiagram participant User participant App participant Leader participant Replica User->>App: update email App->>Leader: write (returns LSN) Leader-->>App: OK + LSN 0/3A2F User->>App: view profile App->>Replica: read if replayed >= LSN Replica-->>App: caught up, serve fresh
The two production-grade fixes: leader-pinning (route a user’s reads to the leader for a few seconds after they write — simple, costs the leader some read load) and LSN tracking (capture the write’s log position, and only serve the read from a replica that has replayed at least that far, otherwise fall back to the leader). LSN tracking is more precise and scales better but needs plumbing through your data layer. Pick one before you ship replicas, because retrofitting it after the bug reports come in is miserable archaeology.
The tradeoffs that bite
These are the decisions that look free in the design review and send you the invoice in production.
| Tradeoff | The free-looking choice | What it actually costs |
|---|---|---|
| Durability vs write latency | ”Turn on sync replication to be safe” | Every write pays a replica RTT; cross-region = +60–150ms |
| Availability vs durability | One named sync standby | Losing that standby blocks all writes |
| Read scale vs read consistency | Route reads to replicas | Stale reads; read-your-writes silently breaks |
| Failover speed vs split-brain | Aggressive auto-failover | Two leaders accept writes during a partition |
| Write scale vs simplicity | Multi-leader “for scale” | Conflict resolution you have to design and own |
| Replica as DR vs replica as backup | ”We have a replica, we’re covered” | Replicas faithfully copy your DROP TABLE |
Two deserve a second look. Durability vs write latency is where good intentions go to die: cross-AZ sync adds 1–2ms per write, which is tolerable; cross-region sync adds 60–150ms per write, which means a leader in Virginia waits on the Atlantic for every commit. Teams flip it on for safety, watch p99 write latency 10x, and only then learn geography is now in their commit path.
And replica-as-backup is the one that ends careers. A replica protects you against hardware failure. It does nothing against logical failure — a bad migration, a buggy DELETE without a WHERE, ransomware. All of those replicate to every node in milliseconds. You need real point-in-time backups (pg_basebackup + WAL archiving, or equivalent) in addition to replicas. They are not the same tool and one does not cover for the other.
Replication lag: what’s fast, what stalls, and the levers
The headline performance metric in any replication setup is lag — the gap between when the leader commits a change and when a replica has it durable and replayed. It shows up in two flavors you must not conflate: write_lag/flush_lag (how far behind the replica is at receiving and flushing WAL) and replay_lag (how far behind it is at applying WAL to become visible to reads). Read-your-writes correctness depends on replay_lag, not flush_lag. Watch the right one.
What’s fast: under normal load on healthy hardware on the same network, lag is single-digit milliseconds. Streaming WAL is cheap; the leader is mostly shipping bytes it already wrote.
What stalls it, in rough order of how often I’ve seen it:
- Single-threaded replay. Postgres replays WAL with one process (recovery is largely serial). The leader generated that WAL using many concurrent backends. So a leader doing heavy parallel writes can outpace a replica that’s replaying serially — lag climbs even though both machines look healthy. This is the most common surprise.
- A long query on the replica blocking replay. On a hot-standby replica, a long-running read query can conflict with WAL replay (it needs rows the incoming WAL wants to vacuum away). Default behavior pauses replay until the query finishes, so one analyst’s 20-minute report makes the whole replica lag 20 minutes.
max_standby_streaming_delayandhot_standby_feedbackare the knobs; both have costs. - Undersized replica I/O. Cheaper disks on replicas than the leader. WAL replay is write-heavy; if the replica’s disk can’t keep up, lag grows under load and shrinks at night, sawtoothing forever.
- Network saturation. A write-heavy burst can saturate the replication link, especially cross-region. WAL compression (
wal_compression) and adequate bandwidth headroom matter.
The number that matters for risk is the product of throughput and lag. If the leader sustains 5,000 writes/sec and replay_lag averages 200ms at peak, a hard leader crash loses roughly 5,000 × 0.2 = 1,000 un-replicated writes. That’s not a vibe — it’s a figure you put in front of a product owner: “async on this table means up to ~1,000 lost writes on a crash; is that a shrug or a headline?” If it’s a headline, you move that table to semi-sync (ANY 1). Putting a real number on the loss window is the single most useful thing this article can teach you to do.
Failure modes
How replication actually breaks at 2am — symptom → root cause → prevention.
Stale reads after a write. Symptom: users report “my change didn’t save,” then it appears on refresh; a write-then-read-to-confirm flow reports failure for a write that succeeded. Root cause: the read hit a replica behind the leader by the replication lag. Prevention: read-your-writes via leader-pinning or LSN tracking; alert on replay_lag before it’s user-visible.
Lost writes on async failover. Symptom: acknowledged writes vanish after a promotion; the client believes they succeeded. Root cause: leader acked locally, hadn’t streamed yet, then died — pure async’s unavoidable window. Prevention: size and document the loss window; move loss-intolerant tables to semi-sync; keep money movement off async-only paths and behind idempotency keys (see API Design & Idempotency).
Write outage from a sync stall. Symptom: primary is healthy but every commit hangs; connection pools exhaust; API falls over. Root cause: a synchronous standby stalled (I/O, network) and the leader is dutifully waiting for it — the motivating failure. Prevention: ANY N over a pool, never a single named sync standby; alert on commit latency, not just CPU.
Split-brain. Symptom: after a network partition heals, two divergent write histories with no clean merge. Root cause: a partition isolated the leader from the failover controller, which promoted a replica while the old leader kept taking writes. Prevention: fencing — covered in the blockquote below.
Replica falls off the WAL and needs a full re-sync. Symptom: a lagging replica suddenly errors that required WAL is gone; rebuild takes hours. Root cause: lag exceeded wal_keep_size retention (or a replication slot wasn’t used / was removed). Prevention: use replication slots so the leader retains WAL until the replica consumes it — but then alarm on slot-pinned WAL bloat, because a dead replica with a slot can fill the leader’s disk.
Cascading failover under load. Symptom: failover promotes a replica, the full load hits the new leader, it tips over, and you flap between nodes. Root cause: the real problem was capacity, and failover just relocated the fire. Prevention: don’t auto-failover on latency alone; ensure any node can carry full load before promoting it.
If your failover system can promote a new leader without fencing the old one — STONITH (“shoot the other node in the head”), a lease that provably expires, or a quorum that the minority side cannot win — then you don’t have failover, you have a coin flip that occasionally lands on two leaders. The fence must come before the promotion, never after. Fence first, promote second. Everything else is theater.
Scaling it
At 10x, single-leader with a handful of async read replicas carries it comfortably. Two things you do on day one, not day ninety: wire in read-your-writes routing (leader-pin or LSN), and alert on replay_lag at a threshold like > 5s sustained, before it’s user-visible. Add a semi-sync standby (ANY 1) the moment a bounded loss window stops being acceptable for some table.
At 100x, the single leader’s write ceiling is the wall, and replicas cannot help you climb it — replicas scale reads, full stop. The move is sharding: partition the data so each shard has its own leader and its own independent replication domain. See Sharding & Partitioning. The price is that a transaction spanning two shards now spans two replication domains, and you’ve inherited distributed-transaction problems (two-phase commit, sagas, or designing them away). Sharding multiplies your operational surface — N shards is N leaders, N replica sets, N failover plans.
Geo-scale forces the topology question you’ve been avoiding. If users in two regions must both write with low latency and survive a full region outage, single-leader can’t do it — the far region eats a cross-ocean RTT on every write, or loses writes when the leader’s region goes dark. Your honest options are multi-leader with real conflict resolution, or leaderless quorums tuned per-region. Both are a step-change in complexity, not a config toggle, and both push you firmly into CAP and consistency/consensus territory. A common pragmatic middle ground: shard by region (each region is the single leader for its own users’ data) so cross-region writes are rare by design.
When to reach for it (and when not to)
Reach for single-leader async replication when you want read scale-out plus a warm standby for failover and you can tolerate a small, bounded, documented data-loss window. That’s the overwhelming majority of OLTP systems, and it’s the right default. Layer semi-sync (ANY 1) on top for the specific tables where the loss window isn’t acceptable.
Reach for leaderless quorums when write availability during node and region failures matters more to you than having a single authoritative copy — high-write, partition-tolerant workloads where the data model tolerates eventual consistency and read-repair.
Don’t reach for synchronous cross-region replication to “be safe.” You will couple every write to inter-region latency and convert a routine network blip into a write outage. Use async cross-region with a documented loss window, or shard by region.
Don’t reach for multi-leader unless you have genuinely designed conflict resolution and can name what happens when both sides edit the same row. Active-active with last-write-wins is a data-loss generator wearing a high-availability badge.
Don’t reach for replicas as your backup strategy. They replicate your mistakes perfectly. Keep real point-in-time backups alongside them.
When to consider alternatives
- Durable, transactional source of truth on one node → PostgreSQL — get replication right here before distributing.
- Horizontal write scale past one leader → Sharding & Partitioning, then a sharded engine.
- Tunable consistency, partition-tolerant, write-heavy at scale → Cassandra or DynamoDB.
- Strong agreement / leader election / locks that must be correct across failures → Consistency & Consensus and ZooKeeper & etcd; replication alone won’t give you this.
- Fast, rebuildable read layer in front of the source of truth → Caching Strategies and Redis.
- Understanding which guarantee you’re even allowed to want → CAP Theorem & Tradeoffs.
The pattern: replication makes one logical database survive node failure and serve more reads. The moment the requirement becomes “scale writes horizontally” or “guarantee agreement across nodes,” replication is the wrong layer and you want sharding or consensus.
Operational checklist
- Alert on replication lag (
pg_stat_replication.replay_lag, MySQLSeconds_Behind_Master) before it’s user-visible — page at> 5ssustained, and distinguishflush_lagfromreplay_lag. - Configure synchronous replication as
ANY Nover a pool, never a single named standby you can’t afford to lose. - Implement read-your-writes (leader-pinning or LSN tracking) for any flow that reads back its own write — decide this before shipping replicas.
- Require fencing (STONITH / expiring lease / quorum) in your failover controller; never promote without guaranteeing the old leader is dead.
- Size and document the async data-loss window as
throughput × lag; keep money movement and idempotency-critical writes off async-only paths. - Use replication slots so the leader retains WAL for lagging replicas — then alarm on slot-pinned WAL disk usage so a dead replica can’t fill the leader’s disk.
- Keep real point-in-time backups in addition to replicas; a replica is not a backup.
- Test failover on a schedule, under load. An untested failover is a hypothesis, not a safety net.
Summary
Replication is log shipping with three independent knobs, and almost every incident is two of them getting confused. The durability knob (async / semi-sync / full sync) sets how many writes you lose on a leader crash and how much latency you pay — semi-sync ANY N is the sweet spot for most systems that can’t shrug off lost writes. The topology knob (single / multi / leaderless) sets where writes are allowed and whether you’ve signed up to resolve conflicts — single-leader unless you’ve truly designed for more. The consistency knob decides whether reads off a replica can lie to you — they can, so build read-your-writes before you ship replicas. Put a real number on your loss window, fence before you promote, never trust a single sync standby, and never mistake a replica for a backup. Get those right and replication is the quiet workhorse it’s sold as. Get one wrong and it’s the page that ruins your night.
Appendix: terms worth pinning down
If the body assumed vocabulary you’d like restated:
- Leader / primary — the node that accepts writes and is the authoritative ordering of changes. Replica / standby / follower — a node replaying the leader’s change log.
- WAL / binlog / oplog — the ordered log of changes the leader writes first and replicas replay. Replication is shipping this log.
- LSN (Log Sequence Number) — a monotonic position in the WAL. “Has this replica replayed past LSN X?” is the precise question behind read-your-writes.
- Lag — how far a replica trails the leader.
flush_lag= received and on disk;replay_lag= actually applied and visible to reads. The second one governs correctness. - Quorum — in leaderless systems, the rule
W + R > Nthat forces read and write sets to overlap so reads see the latest write. - Fencing / STONITH — provably disabling a failed leader before promoting a new one, so two nodes never both accept writes (split-brain).
- Replication slot — a leader-side marker that retains WAL until a specific replica consumes it, preventing “fell off the end of the log” re-syncs (at the risk of WAL pile-up if the replica dies).
Further reading
- Related: PostgreSQL · Sharding & Partitioning · Consistency & Consensus · CAP Theorem & Tradeoffs · Cassandra · DynamoDB · Redis · upcoming work on the roadmap
- PostgreSQL docs — High Availability, Load Balancing, and Replication
- MySQL docs — Replication
Incidents & deep-dives
Where this system breaks in production — and how it comes back.
Documenting next
- 🔒 Replication Lag & Stale Read Bugsroadmap →