Load Balancing

A production tour of load balancing: L4 vs L7, the algorithms that actually shed load, stale health checks, gray failures, retry storms, draining, and scaling the balancer itself.

23 min read · updated 2026-06-28

A load balancer gets drawn as a friendly box that “spreads traffic evenly,” and that picture is fine right up until the night a backend goes slow-but-not-dead. The balancer keeps the node in rotation because its health check still returns 200. Real requests pile up behind the slow node. Clients time out and retry. Retries land on an already-drowning fleet. A degradation that should have stayed a blip becomes a full outage, and the box you installed to absorb failure is now the thing amplifying it.

That is the core thing to internalize: a load balancer is not a passive splitter. It is an active control system making routing decisions on information that is always a little bit out of date. It believes things about your backends — who is up, who is fast, who has capacity — and it acts on those beliefs. Every interesting failure mode lives in the gap between what the balancer believes and what is actually true right now.

The boring questions (“round-robin or least-connections?”) almost never matter. The questions that decide whether you sleep through the night are: how does it decide a backend is healthy, what does it do when one is half-healthy, and how does it avoid multiplying a small failure across the whole fleet.

This is the long-form context article — the mental model and the failure modes, not a config tutorial. Load balancing sits next to a lot of other machinery on this site: it leans on consistent hashing when backends hold per-key state, it pairs with rate limiting and the API gateway at the edge, it pushes session state into Redis so backends can stay stateless, and the retry discipline it demands is the same idempotency story told in API design. The transport-layer details underneath it live in networking essentials.

The single most expensive mistake is trusting the green dot on the health dashboard. A backend that answers /healthz is not the same as a backend that can serve traffic, and the distance between those two facts is where most of this article lives.

A motivating failure

A checkout service runs twelve stateless backends behind an L7 balancer. p99 sits at 40ms, the dashboards are green, and nobody has touched the load balancer config in a year.

One afternoon a single backend’s disk starts failing slowly. Not dead — slow. Reads that took 2ms now take 900ms because the kernel is retrying bad sectors. The balancer’s health check hits GET /healthz, which only checks “is the process up and can it allocate a response,” and that path never touches the failing disk. So the check returns 200 in 3ms, every five seconds, cheerfully. The node stays in rotation.

Real checkout requests, though, read product data off that disk. They crawl. The balancer is configured for least-connections, and here the algorithm is almost right: the slow node’s requests don’t finish, so its open-connection count climbs and new work is steered toward the other nodes. Almost right is not enough, because the node is still in rotation and still receives some traffic. The client-side library has a 1-second timeout and retries twice. Every request that hits the slow node times out at 1s and retries, and roughly one in twelve of those retries lands on the slow node again.

Each timed-out request now costs up to three attempts instead of one. The eleven healthy nodes, carrying the retry overflow plus their normal traffic, cross their own latency knee. Their p99 climbs past the 1s client timeout too. Now everything retries, and offered load heads toward three times normal. Within ninety seconds the entire fleet is in a metastable collapse, serving almost nothing, and it stays collapsed even after someone finally ejects the node with the bad disk, because the retry backlog keeps the fleet pinned.

The disk was one node out of twelve. The outage was global. Nothing was a bug: the health check did what it was written to do, least-connections did what it promised, and the retries were “best practice.” The failure lived entirely in the gap between “answers /healthz” and “can actually serve a request,” widened by a retry policy with no brakes.

The one-sentence mental model

A load balancer is a routing decision made on stale health information: it sends each request to the backend it believed, a few hundred milliseconds ago, was the best choice — and every failure mode is the gap between that belief and what is true now.

Unpack each clause, because each is a constraint you will eventually meet at 3am:

Stale health information — health checks run on an interval. For up to one full interval, the balancer will route to a backend that just died, just filled its disk, or just started timing out. The green dot is always a little bit in the past.
A few hundred milliseconds ago — load and latency signals lag reality. By the time “least-loaded” is computed and acted on, the node you picked may already be the most-loaded, because every other balancer instance picked it too (the thundering-herd-of-one problem).
The best choice — “best” is only as good as the signal. Connection count is not load. A 200 on /healthz is not “can serve traffic.” Picking the wrong signal is picking the wrong backend with confidence.
Every failure mode is the gap — retry storms, gray failures, sticky-session hotspots, draining bugs: all of them are the balancer acting correctly on a belief that has gone stale.

flowchart LR
  C[Clients] --> LB[Load balancer\nVIP]
  LB -->|algorithm picks| B1[Backend 1]
  LB --> B2[Backend 2]
  LB --> B3[Backend 3]
  HC[Health checker\nevery 5s] -.probe.-> B1
  HC -.probe.-> B2
  HC -.probe.-> B3
  HC -.marks up/down.-> LB

The dotted line is the whole game. The balancer routes on the solid lines in real time, but it only learns about backend health on the dotted line, on a delay. Widen that delay, or let the probe measure the wrong thing, and you have built an amplifier instead of a shield.

How it actually works

L4 vs L7: the fork that decides everything

The biggest design decision is which layer the balancer operates at, because that determines what it can see, and what it can see determines what it can do.

Layer 4 (transport) balances TCP and UDP connections. It sees source/destination IPs and ports and nothing above that — no URLs, no headers, no HTTP at all. It picks a backend when the connection opens and pins the entire connection there for its lifetime. This makes it blisteringly fast and protocol-agnostic (it’ll balance Postgres, Redis, gRPC, anything), and it can do TLS passthrough so the backend terminates encryption. The price: it cannot route on a path, cannot retry an individual request, and cannot see that a connection is carrying garbage.

Layer 7 (application) terminates the connection, parses HTTP, and routes per request. It can send /api/* to one pool and /static/* to another, retry an idempotent request that failed, terminate TLS to inspect and rewrite, inject tracing headers, and enforce per-route limits. The cost is CPU — it parses and re-serializes every request — and the fact that it is now a stateful proxy sitting in the critical path, with its own connection tables and its own ways to fall over.

flowchart TD
  REQ[Incoming request] --> Q{Which layer?}
  Q -->|L4| L4[See IP + port\npin connection]
  Q -->|L7| L7[Terminate TLS\nparse HTTP]
  L4 --> FAST[Fast\nprotocol-agnostic]
  L4 --> BLIND[No per-request\nrouting or retry]
  L7 --> SMART[Path routing\nper-request retry]
  L7 --> COST[More CPU\nstateful proxy]

The practical rule: use L4 when you need raw throughput, non-HTTP protocols, or TLS passthrough and you don’t need per-request decisions. Use L7 when you need path/host routing, per-request retries, TLS termination, or header work — which is to say, most web APIs. Many real stacks run both: an L4 balancer spreading connections across a tier of L7 proxies.
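
To make the visibility difference concrete, here is a minimal sketch of the two decisions. It is an illustration, not how any particular balancer is implemented: the `Flow` tuple, the `ROUTES` table, and the pool names are all hypothetical, and the L7 picker uses plain rotation inside the pool only to keep the example short.

```python
import hashlib
from typing import NamedTuple


class Flow(NamedTuple):
    """Everything an L4 balancer can see about a connection."""
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int


def pick_l4(flow: Flow, backends: list[str]) -> str:
    """L4: hash the addresses and ports once; the whole connection stays pinned."""
    key = f"{flow.src_ip}:{flow.src_port}->{flow.dst_ip}:{flow.dst_port}".encode()
    digest = hashlib.sha256(key).digest()
    return backends[int.from_bytes(digest[:8], "big") % len(backends)]


# Hypothetical route table; the longest matching path prefix wins.
ROUTES = {"/api/": ["api-1", "api-2"], "/static/": ["static-1"]}


def pick_l7(path: str, request_number: int, default_pool: list[str]) -> str:
    """L7: the parsed request is visible, so the choice is made per request."""
    pool, best = default_pool, ""
    for prefix, candidates in ROUTES.items():
        if path.startswith(prefix) and len(prefix) > len(best):
            best, pool = prefix, candidates
    return pool[request_number % len(pool)]
```

The L4 function never receives a path because it never has one; that single missing argument is the whole tradeoff.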

The balancing algorithms

The algorithm answers “which backend gets the next request.” The naive ones spread evenly; the good ones spread by load, and the difference shows up exactly when a node gets sick.

Round-robin — rotate through backends in order. Dead simple and completely blind: it hands the same share to a struggling node as to an idle one. Fine for uniform, stateless backends; dangerous the moment request cost varies.
Weighted round-robin — round-robin with per-node weights, so a bigger box gets proportionally more. Useful in heterogeneous fleets, but the weights are static and don’t react to live load.
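Least-connections — pick the backend with the fewest active connections. Better when request durations vary. Its trap is that connection count is only a proxy for load: a slow node piles up connections and looks busy, which is roughly right, but a node that fails fast closes its connections immediately, looks idle, and attracts even more traffic. Under gray failure the signal can invert.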
Least-response-time / EWMA — track an exponentially-weighted moving average of each backend’s recent latency and route to the fastest. This is the one that actually saves you: a node going slow watches its EWMA rise and automatically sheds traffic without anyone intervening. It needs tuning (the smoothing factor) and keeps more state, but it is the default I reach for on anything with variable latency.
Power of two choices (P2C) — pick two backends at random, send to the less-loaded of the two. This sounds too simple to work, and it is shockingly good: it avoids the herd problem where every balancer piles onto the single “least-loaded” node, while still steering away from hot ones. Several modern proxies and RPC libraries ship P2C, and some (Linkerd, for example) pair it with an EWMA latency signal by default.
Consistent hashing — hash a request key (user ID, cache key) so the same key consistently hits the same backend, and a topology change moves only ~1/N of keys instead of reshuffling everything. Essential when backends hold per-key state. This is the same machinery covered in consistent hashing and sharding & partitioning.

| Algorithm | Routes on | Best for | Weakness |
| --- | --- | --- | --- |
| Round-robin | position | uniform, stateless backends | ignores actual load |
| Least-connections | open conns | variable request duration | conn count inverts under gray failure |
| EWMA / least-time | latency history | mixed latency, slow-node shedding | needs tuning, more state |
| Power of two | two random samples | avoiding herd on “best” node | slightly more variance |
| Consistent hashing | request key | sticky-to-shard state | hot keys overload one node |
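
The P2C-plus-EWMA combination is small enough to sketch in full. What follows is one plausible shape under stated assumptions, not the implementation any specific proxy uses: the class name, the `alpha` smoothing factor of 0.2, and the cost formula (EWMA latency multiplied by in-flight requests plus one) are all illustrative choices.

```python
import random


class EwmaP2CBalancer:
    """Power of two choices over an EWMA latency score (illustrative sketch)."""

    def __init__(self, backends, alpha=0.2, rng=None):
        self.alpha = alpha                      # smoothing factor: higher reacts faster
        self.rng = rng or random.Random()
        # Unmeasured backends start at 0 ms, so they look fast and get probed early.
        self.ewma_ms = {b: 0.0 for b in backends}
        self.in_flight = {b: 0 for b in backends}

    def _cost(self, backend):
        # Recent latency scaled by outstanding work; lower is better.
        return self.ewma_ms[backend] * (self.in_flight[backend] + 1)

    def pick(self):
        # Sample two distinct backends and keep the cheaper one.
        a, b = self.rng.sample(list(self.ewma_ms), 2)
        chosen = a if self._cost(a) <= self._cost(b) else b
        self.in_flight[chosen] += 1
        return chosen

    def record(self, backend, latency_ms):
        # Call when the request completes; folds the sample into the average.
        self.in_flight[backend] -= 1
        prev = self.ewma_ms[backend]
        self.ewma_ms[backend] = self.alpha * latency_ms + (1 - self.alpha) * prev
```

One gap worth naming: in this sketch a backend that has been measured as slow stops being chosen, so its score never improves. A production version would also decay stale scores over time so that a recovered node gets re-probed.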

Health checks: active, passive, and the lie in between

Health checking is where most balancers earn or lose their keep. There are two kinds and you want both.

Active checks hit an endpoint (GET /healthz) on a fixed interval and mark a backend down after a threshold of failures. They are predictable and catch a fully-dead node fast. Their weakness is the opening story: a shallow health endpoint returns 200 while the real request path is broken. A good health check exercises the actual dependencies the request path uses — a cheap query against the database pool, a touch of the disk — not just “is the process alive.” But make it too deep and a single slow dependency flaps your whole fleet at once.
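
The "threshold of failures" logic is a two-state machine, sketched below. The field names `unhealthy_threshold` and `healthy_threshold` and their default values are assumptions for illustration; real balancers differ in naming and in how they count.

```python
from dataclasses import dataclass


@dataclass
class ActiveCheckState:
    """Tracks one backend's up/down state from a stream of probe results."""
    unhealthy_threshold: int = 3   # consecutive failures needed to mark down
    healthy_threshold: int = 2     # consecutive successes needed to mark up again
    healthy: bool = True
    _streak: int = 0               # consecutive probes that disagree with current state

    def observe(self, probe_ok: bool) -> bool:
        """Feed in one probe result; returns whether the backend is in rotation."""
        if probe_ok == self.healthy:
            self._streak = 0       # probe agrees with current state: reset the streak
            return self.healthy
        self._streak += 1
        needed = self.unhealthy_threshold if self.healthy else self.healthy_threshold
        if self._streak >= needed:
            self.healthy = not self.healthy
            self._streak = 0
        return self.healthy
```

Requiring a streak in both directions is what keeps a single dropped probe from ejecting a node, and a single lucky probe from readmitting one.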

Passive checks (outlier detection) watch real traffic and eject a backend after N consecutive errors or timeouts on live requests. They catch the failures active checks miss, because they measure the thing users actually experience. Their weakness is that they only react after real users have hit the problem.

sequenceDiagram
  participant Client
  participant L7 as L7 LB
  participant B1 as Backend A
  participant B2 as Backend B
  Client->>L7: GET /api/orders
  L7->>B1: forward (idempotent)
  B1--xL7: 503
  Note over L7: passive check counts a failure for A
  L7->>B2: retry within budget
  B2-->>L7: 200 OK
  L7-->>Client: 200 OK
  Note over L7,B1: 3 consecutive 5xx ejects A for a cooldown

The combination is what works: active checks to remove dead nodes quickly and predictably, passive checks to catch the gray failures where the node lies about its own health. Run only active checks and you trust the green dot. Run only passive and you only learn after the damage.
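
The passive half can be sketched as a counter per backend. This is a minimal illustration assuming ejection on consecutive 5xx responses with a fixed cooldown; the class name, the 30-second cooldown, and the choice to pass `now` in explicitly (which keeps the sketch testable) are all hypothetical.

```python
class OutlierDetector:
    """Ejects a backend after N consecutive 5xx responses on live traffic."""

    def __init__(self, consecutive_5xx=3, cooldown_s=30.0):
        self.consecutive_5xx = consecutive_5xx
        self.cooldown_s = cooldown_s
        self._failures = {}        # backend -> current run of consecutive 5xx
        self._ejected_until = {}   # backend -> time at which it may return

    def record(self, backend, status, now):
        """Call once per completed request with the response status code."""
        if status >= 500:
            run = self._failures.get(backend, 0) + 1
            self._failures[backend] = run
            if run >= self.consecutive_5xx:
                self._ejected_until[backend] = now + self.cooldown_s
                self._failures[backend] = 0
        else:
            self._failures[backend] = 0   # any success breaks the run

    def is_routable(self, backend, now):
        """True unless the backend is still serving out an ejection cooldown."""
        return now >= self._ejected_until.get(backend, 0.0)
```

A fuller version would count timeouts as failures too and cap the fraction of the pool that can be ejected at once, so that a shared dependency outage does not empty the rotation.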

The tradeoffs that bite

These look free at design time and bill you in production.

| Tradeoff | The free-looking choice | What it actually costs |
| --- | --- | --- |
| Check frequency | Probe infrequently to save load | Route to a dead node for a full interval |
| Check aggressiveness | Probe hard, eject on one failure | One dropped probe shrinks the pool, overloads the rest |
| Check depth | Shallow `/healthz` for speed | Node lies green while the real path is broken |
| Sticky sessions | Pin users for in-memory state | Hotspots, broken balance, painful draining |
| Retry on failure | Retry everything to be safe | Retry storm; double-charges on non-idempotent calls |
| Single balancer | One LB, simple topology | It is now your single point of failure |

Three of these deserve more than a table row.

Health-check tuning is a latency-vs-flapping dial. Check too infrequently (interval 30s) and you route to a corpse for half a minute. Check too aggressively (interval 1s, unhealthy_threshold 1) and one dropped probe — a momentary GC pause, a blip — ejects a healthy node, shrinking the pool and overloading everyone else, which causes more blips, which ejects more nodes. The sane compromise for HTTP services: interval 5s, unhealthy_threshold 3, healthy_threshold 2, timeout 2s (well under the interval). Slower to eject, far less likely to flap your fleet into a hole.
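
It helps to turn those settings into a number: how long can a dead node stay in rotation? The helper below is a back-of-the-envelope sketch that assumes probes start on a fixed schedule and that ejection needs `unhealthy_threshold` consecutive failures. The function name is hypothetical.

```python
def detection_window_s(interval_s, unhealthy_threshold, timeout_s):
    """Return (best_case, worst_case) seconds between a node dying and its ejection.

    Best case: the node dies just before a probe and fails fast, so only the
    gaps between the required probes count.
    Worst case: the node dies just after a passing probe and then hangs, so a
    full interval is lost first and the last probe runs to its timeout.
    """
    if timeout_s >= interval_s:
        raise ValueError("probe timeout should be shorter than the interval")
    best = (unhealthy_threshold - 1) * interval_s
    worst = unhealthy_threshold * interval_s + timeout_s
    return best, worst
```

With interval 5s, threshold 3, and timeout 2s this gives a window of 10 to 17 seconds; at interval 30s with the same threshold it stretches to between 60 and 92 seconds. That stretch of time is what passive checks are there to cover.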

Sticky sessions trade balance for state. Pinning a user to a backend (by cookie or source-IP hash) lets that backend cache session state in memory — convenient, fast. But it defeats even balancing, concentrates a noisy user on one node, and makes draining miserable: you can’t remove a node without dropping its sessions or waiting them out. The cleaner answer is almost always stateless backends with session state in Redis, so any node can serve any request and you can deploy without ceremony. Reach for stickiness only when the in-memory state is genuinely expensive to externalize.

Connection draining trades deploy speed for correctness. When you remove a backend (deploy, scale-in), draining lets in-flight requests finish before the node dies. Set the drain timeout too low and you sever live requests (users see 502s during every deploy); too high and rollouts crawl. 30–60s covers typical HTTP; long-lived connections (WebSockets, gRPC streams) need longer or a deliberate forced cutover.
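
The draining contract is: stop taking new requests, then wait for in-flight ones to finish or for the timeout, whichever comes first. Here is a minimal sketch of that bookkeeping with hypothetical names; a real proxy tracks this per connection rather than with a single counter.

```python
class DrainingBackend:
    """Bookkeeping for removing one backend without severing live requests."""

    def __init__(self):
        self.accepting = True
        self.in_flight = 0
        self.deadline = None   # set when draining begins

    def start_request(self) -> bool:
        """Returns False once draining has begun; the caller routes elsewhere."""
        if not self.accepting:
            return False
        self.in_flight += 1
        return True

    def finish_request(self) -> None:
        self.in_flight -= 1

    def begin_drain(self, now: float, drain_timeout_s: float = 30.0) -> None:
        self.accepting = False
        self.deadline = now + drain_timeout_s

    def may_stop(self, now: float) -> bool:
        """Safe to stop when idle, or forced to stop once the deadline passes."""
        if self.deadline is None:
            return False
        return self.in_flight == 0 or now >= self.deadline
```

If `may_stop` returns True while `in_flight` is still above zero, those are the requests a too-short timeout severs.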

Performance: where the latency and capacity go

A balancer adds a hop, and that hop has a cost profile worth understanding before you tune anything.

What’s cheap: L4 forwarding. Once the backend is chosen, an L4 balancer is essentially shuffling packets — microseconds of added latency, and it scales to millions of packets per second on commodity hardware because it isn’t parsing anything. Connection setup is the only real cost, and with long-lived connections that amortizes to nothing.

What’s expensive: L7 work, and specifically TLS termination. Parsing HTTP, matching routes, and re-serializing costs CPU per request. TLS handshakes cost real CPU (asymmetric crypto on every new connection), which is why connection reuse and session resumption matter so much — a balancer doing fresh handshakes for every request can be CPU-bound at a fraction of its packet-forwarding ceiling. Keep-alive and HTTP/2 connection reuse are the biggest levers here.

The levers that actually move performance, in rough order of impact:

Connection reuse. Keep-alive between balancer and backends, and HTTP/2 multiplexing, eliminate per-request connection and handshake overhead. This is usually the largest single win on an L7 tier.
Pick a load-aware algorithm. EWMA or P2C over round-robin means a node trending slow sheds traffic before it falls over, which protects tail latency far more than any timeout tuning.
Right-size health-check cost. A health check that runs a heavy query every 5s against every backend from every balancer instance can itself become meaningful load. Make checks cheap and stagger them.
TLS session resumption and offload. Resumption avoids full handshakes; dedicated TLS hardware or a separate termination tier removes crypto from the hot path on huge fleets.
Buffering vs streaming. A balancer that fully buffers request and response bodies protects slow backends from slow clients but adds latency and memory; streaming is lower-latency but couples client speed to backend connection hold time. Choose deliberately per route.

Watch the right signals. Per-backend p99 latency and error rate (not just fleet averages, which hide a single sick node), connection-table utilization on the balancer, active vs healthy backend count over time (a sawtooth means flapping), retry rate as a fraction of requests, and the spread of load across backends (a widening spread is your early warning of a hot node or a pinning imbalance). Fleet averages are where dying nodes go to hide — always break metrics down per backend.

Failure modes

A load balancer’s failure modes are mostly amplification: it takes a small, local problem and multiplies it across the fleet. Symptom → root cause → prevention.

The retry storm. Symptom: a small degradation (one slow node) becomes a fleet-wide collapse that persists even after the trigger is gone. Root cause: a backend slows, requests time out, clients and/or the balancer retry, retries are additional load on an already-struggling fleet, so more requests slow, so more retries fire — load multiplies by the retry count at every layer that retries independently, and the system locks into a metastable failure state. Prevention: the four defenses below, non-negotiable in any retrying system.
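
The multiplication is worth doing once by hand. The two helpers below are a rough model, assuming every attempt fails independently with the same probability and every layer retries the same way; the function names are hypothetical.

```python
from fractions import Fraction


def worst_case_attempts(retries_per_layer):
    """Attempts reaching the innermost service when every attempt fails.

    Each layer that retries r times turns one call into (1 + r) calls, and
    layers that retry independently multiply together.
    """
    total = 1
    for retries in retries_per_layer:
        total *= 1 + retries
    return total


def expected_attempts(p_fail, retries):
    """Average attempts per request at one layer with failure probability p_fail.

    Attempt i is only made if the previous i attempts all failed, so the
    expectation is 1 + p + p^2 + ... + p^retries.
    """
    return sum(p_fail ** i for i in range(retries + 1))
```

Two retries with a 1-in-12 failure rate costs `expected_attempts(Fraction(1, 12), 2)`, which is 157/144, or about 9% extra load. Once every attempt fails, the same policy costs 3x, and if a client layer and a balancer layer each retry twice, `worst_case_attempts([2, 2])` is 9x. The policy looks harmless until the moment it matters.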

flowchart TD
  S[One node\ngoes slow] --> T[Requests\ntime out]
  T --> R[Clients retry]
  R --> L[Offered load\nmultiplies]
  L --> M[More nodes\ncross latency knee]
  M --> T
  L --> COLLAPSE[Metastable\ncollapse]
  style COLLAPSE fill:#e11d48,color:#fff
  style L fill:#171717,color:#fff

The defenses, in order of importance:

Retry budgets — cap retries fleet-wide to a small fraction of requests (e.g. 10%). When the budget is spent, fail fast instead of retrying. This is the single most important brake, because it bounds the multiplier no matter how bad things get.
Circuit breakers — after N consecutive failures to a backend, trip open and stop sending it traffic for a cooldown, giving it room to recover instead of piling on a node that’s already down.
Exponential backoff with jitter — never retry immediately, and never on a synchronized schedule, or every client retries in lockstep and you get a thundering herd that hammers the recovering fleet in waves.
Retry only idempotent requests — a retried non-idempotent POST is a double-charge waiting to happen. Use idempotency keys; see API design.
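
A retry budget is usually some form of token bucket: normal requests earn retry credit, retries spend it. The sketch below is one way to express the 10% cap under simple assumptions. The class, its parameter names, and the cap on banked credit are hypothetical, and it uses integer units so that ten requests at 10% add up to exactly one retry.

```python
class RetryBudget:
    """Allows retries only up to a fixed percentage of recent requests."""

    RETRY_COST = 100   # one retry costs 100 units; each request earns `percent` units

    def __init__(self, percent: int = 10, max_retries_banked: int = 10):
        self.percent = percent
        self.cap = max_retries_banked * self.RETRY_COST   # limits saved-up credit
        self.units = 0

    def on_request(self) -> None:
        """Call for every first-attempt request; earns a fraction of a retry."""
        self.units = min(self.cap, self.units + self.percent)

    def try_retry(self) -> bool:
        """Spend one retry if the budget allows it; otherwise fail fast."""
        if self.units >= self.RETRY_COST:
            self.units -= self.RETRY_COST
            return True
        return False
```

This instance only bounds the process it lives in. The fleet-wide bound comes from every client and proxy enforcing the same ratio, which keeps the aggregate multiplier near 1.1x no matter how many requests are failing.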

A retry without a budget, a circuit breaker, and jitter is not resilience; it is a loaded gun pointed at your own backends. The retry storm is the most common way a load balancer turns one slow node into a full outage, and it is the one failure mode I check for first in any design review. If the system retries and you can’t tell me the budget, the design is already broken.
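
The jitter and idempotency rules fit in a few lines. This sketch uses the "full jitter" variant, where the delay is drawn uniformly between zero and an exponentially growing ceiling. The base delay, the cap, and the function names are illustrative assumptions; the method set follows the HTTP definition of idempotent methods.

```python
import random

# Methods HTTP defines as idempotent; POST and PATCH are deliberately absent.
IDEMPOTENT_METHODS = {"GET", "HEAD", "PUT", "DELETE", "OPTIONS", "TRACE"}


def backoff_delay_s(attempt, base_s=0.1, cap_s=5.0, rng=random):
    """Delay before retry number `attempt` (0-based), with full jitter.

    The ceiling doubles each attempt up to cap_s; the actual delay is random
    within [0, ceiling] so clients that failed together do not retry together.
    """
    ceiling = min(cap_s, base_s * 2 ** attempt)
    return rng.uniform(0.0, ceiling)


def may_retry(method, has_idempotency_key):
    """Retry only when repeating the request cannot apply its effect twice."""
    return method.upper() in IDEMPOTENT_METHODS or has_idempotency_key
```

Passing a seeded `random.Random` as `rng` makes the delays reproducible in tests, while production callers can leave the default.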

Gray failure / the lying health check. Symptom: a node is “up” on every dashboard but real requests to it fail or crawl. Root cause: the health endpoint doesn’t exercise the broken dependency (the opening story). Prevention: health checks that touch the real request path, plus passive outlier detection that ejects on actual traffic errors.

The balancer as a single point of failure. Symptom: the balancer dies and everything behind it is unreachable, regardless of backend health. Root cause: a single balancer in the critical path. Prevention: at least two in active-active or active-passive with health-checked failover — a floating VIP, anycast, or DNS — and remember DNS failover is slow because clients cache records well past the TTL.

Timeout misalignment. Symptom: duplicated backend work, or exhausted connection pools. Root cause: if the balancer’s timeout is shorter than the backend’s, it gives up and retries while the backend is still working (duplicate effort); if it is far longer than anything the backend enforces, slow requests hold connection slots until the pool is exhausted. Prevention: timeouts must descend the stack (client > LB > backend > database), with each outer timeout longer than the one beneath it but not by much, so the innermost layer gives up first.
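
The ordering rule is easy to check mechanically, for example in a config lint step. The function below is a small sketch; its name and the example values are hypothetical, and it relies on the dictionary listing layers from outermost to innermost.

```python
def check_timeout_order(timeouts_s: dict[str, float]) -> list[str]:
    """Return a description of every adjacent pair that breaks the ordering.

    Layers must be listed outermost first (client, then LB, and so on). Each
    layer's timeout must be strictly longer than the layer beneath it.
    """
    names = list(timeouts_s)
    problems = []
    for outer, inner in zip(names, names[1:]):
        if timeouts_s[outer] <= timeouts_s[inner]:
            problems.append(
                f"{outer} ({timeouts_s[outer]}s) should exceed "
                f"{inner} ({timeouts_s[inner]}s)"
            )
    return problems
```

An empty list means the stack descends correctly. For `{"client": 10, "lb": 4, "backend": 5, "database": 3}` it reports the LB giving up before the backend does, which is the duplicate-work case.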

Connection-pinning imbalance. Symptom: new nodes sit idle while old nodes are saturated, even though all are healthy. Root cause: L4 balancers (and HTTP/2 at L7) pin long-lived connections at connect time; a node that joins after connections are established gets zero traffic until clients reconnect. Prevention: periodic connection recycling (max connection age), or L7 per-request balancing over HTTP/2 with bounded connection lifetimes.

Scaling it

At small scale, one L7 balancer in front of a handful of stateless backends is plenty and you should not overthink it. The decisions change as both traffic and the fleet grow.

Scaling the balancer itself. A single balancer instance hits a ceiling — CPU from TLS termination, or its connection table. You scale out to multiple balancer nodes, which raises a new question: how does traffic get spread across the balancers? At modest scale, DNS round-robin across their IPs. At large scale, ECMP / anycast at the network layer, where multiple balancers advertise the same IP and the router hashes flows across them. Now the balancer tier is itself horizontally scalable and self-healing — pull a node and its flows rehash to the survivors.

flowchart TD
  C[Clients] --> ANY[Anycast VIP\nsame IP]
  ANY --> R[Router ECMP\nhash flows]
  R --> LB1[Balancer 1]
  R --> LB2[Balancer 2]
  R --> LB3[Balancer 3]
  LB1 --> POOL[Backend pool]
  LB2 --> POOL
  LB3 --> POOL

Global load balancing. Across regions, DNS-based or anycast global balancing routes users to the nearest healthy region. Health is now a cross-region signal with real propagation delay, and you must pick a failover policy explicitly: fail to another region (added latency, possible data-locality and replication-lag issues — see database replication) or fail closed in-region. Getting this wrong means a regional blip becomes a global one as all traffic stampedes to the survivors.

Autoscaling interplay. The balancer’s health and load signals often drive autoscaling, and the coupling can oscillate viciously: a latency spike adds nodes, new nodes start cold (empty caches, JIT not warm) and serve slowly, latency stays high, more nodes spin up, and you’ve built a feedback loop. Scale on a smoothed signal — request rate or p95 over a window — not instantaneous latency, and give new nodes a warm-up period (slow-start) where the balancer ramps their traffic gradually instead of dumping full load on a cold process. The broader orchestration of this lives in Kubernetes, and the signals you scale on come from observability.
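
Slow-start is typically a weight that ramps with the node's age. The linear ramp below is one simple way to do it; the 60-second warm-up, the 10% floor, and the function name are assumptions, and real balancers may use different curves.

```python
def slow_start_weight(age_s, warmup_s=60.0, floor=0.1):
    """Fraction of a full traffic share a new backend should receive.

    Ramps linearly from `floor` to 1.0 over the warm-up window. The floor keeps
    a brand-new node from getting zero traffic, which would prevent it from
    ever warming its caches.
    """
    if warmup_s <= 0 or age_s >= warmup_s:
        return 1.0
    return max(floor, max(0.0, age_s) / warmup_s)
```

The balancer multiplies the node's normal weight by this value when picking, so a cold process sees 10% of a full share at first, half at 30 seconds, and the full share after a minute.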

When to reach for it (and when not to)

You essentially always need load balancing once you have more than one backend instance — the real question is which kind, not whether.

Reach for an L4 balancer when you need raw throughput, protocol-agnostic balancing (non-HTTP: databases, message brokers, gRPC at the transport level), or TLS passthrough to the backend, and you don’t need per-request routing decisions.

Reach for an L7 balancer when you need path or host routing, per-request retries, TLS termination, header manipulation, or per-route rate limiting (see rate limiting). Most web APIs and anything fronted by an API gateway want L7.

Reach for consistent hashing when backends hold per-key state — a distributed cache, a sharded store — and you need the same key to reliably hit the same node with minimal reshuffling on topology change.
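
A hash ring with virtual nodes shows why the reshuffling stays small. This is a minimal sketch with hypothetical names, using SHA-256 only because it is in the standard library; production rings add replication and weighting.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    """Map a string to a 64-bit position on the ring."""
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


class HashRing:
    """Consistent hashing with virtual nodes to even out each node's share."""

    def __init__(self, nodes, vnodes=100):
        # Each node is placed at `vnodes` pseudo-random points on the ring.
        self._ring = sorted(
            (_hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def lookup(self, key: str) -> str:
        """Return the node owning the first ring point after the key's hash."""
        index = bisect.bisect_right(self._points, _hash(key)) % len(self._ring)
        return self._ring[index][1]
```

Adding a fifth node to a four-node ring only inserts new points, so every key either keeps its owner or moves to the new node. No key moves between the old nodes, which is the property that plain `hash(key) % N` lacks.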

Don’t put a dedicated balancer in the path where client-side balancing is simpler — the client picks a backend from a service-discovery list and connects directly. Inside a service mesh or an internal RPC fabric, this avoids an extra network hop and an extra single point of failure, at the cost of pushing the balancing logic into every client. For internal, high-trust, latency-sensitive service-to-service traffic, that’s often the better trade.

When to consider alternatives

Per-key sticky routing with minimal reshuffling → consistent hashing rather than naive stickiness or round-robin.
Edge concerns: auth, rate limits, request transformation → an API gateway, which is an L7 balancer with a job description.
Session state that’s forcing you toward sticky sessions → externalize it to Redis and keep backends stateless.
Spreading load across data partitions, not just replicas → sharding & partitioning; load balancing replicas and sharding data are different problems.
Internal service-to-service traffic at scale → client-side balancing via service discovery, often inside Kubernetes, to drop a hop.

The pattern: a load balancer spreads requests across interchangeable backends. The moment backends stop being interchangeable — they hold distinct state, or the routing decision is really about data location — you’ve crossed into hashing, sharding, or a gateway’s territory.

Operational checklist

Run at least two balancer instances with health-checked failover; never a single balancer in the critical path.
Tune active health checks to interval 5s, unhealthy_threshold 3, healthy_threshold 2, timeout 2s; make the check exercise the real request path, not a shallow /healthz.
Run passive checks (outlier detection) alongside active ones; eject a backend after a few consecutive 5xxs on live traffic.
Enforce a fleet-wide retry budget (cap ~10% of requests) and per-backend circuit breakers — this is your retry-storm insurance.
Always retry with exponential backoff plus jitter, and never retry non-idempotent requests without an idempotency key.
Order timeouts so they descend the stack: client > LB > backend > database.
Prefer a load-aware algorithm (EWMA or power-of-two) over round-robin for any workload with variable request cost.
Prefer stateless backends with session state in Redis over sticky sessions; if you must pin, set drain timeouts to 30–60s and plan for it.
For HTTP/2 / gRPC and L4 pinning, recycle connections (max age) so new nodes actually receive traffic.
Monitor per-backend p99 and error rate (never just fleet averages), retry rate, and load spread across backends; alarm on a widening spread.

Summary

A load balancer is an active control system that routes on stale beliefs about backend health, and almost every way it hurts you is amplification — it takes one sick node and multiplies the damage across the fleet. Decide L4 vs L7 first, because that fixes what the balancer can see and do. Pick a load-aware algorithm so a slow node sheds traffic instead of accumulating it. Run both active and passive health checks, because the green dot lies. And above all, treat retries as dangerous: a retry budget, circuit breakers, and jittered backoff are the difference between a balancer that absorbs a failure and one that detonates it. Get those right, run the balancer itself in pairs, and keep your backends stateless and interchangeable — then the load balancer becomes the quiet, boring layer it was always supposed to be.

Appendix: balancing vs proxying vs gateways

These three get conflated; the distinctions matter when someone asks “do we need a load balancer or a gateway?”

Reverse proxy — anything that terminates a client connection and forwards to a backend on the client’s behalf (TLS termination, caching, compression). A load balancer is a reverse proxy that also chooses among many backends.
Load balancer — a reverse proxy whose defining job is spreading requests across interchangeable backends and reacting to their health. Can be L4 (connection-level) or L7 (request-level).
API gateway — an L7 load balancer specialized for edge policy: authentication, rate limiting, request/response transformation, API key management, routing by API version. It does load balancing plus a pile of cross-cutting concerns. See API gateway.
Service mesh — pushes balancing, retries, and health checking into a sidecar next to every service, so balancing happens client-side without a central hop. Common in Kubernetes.

The continuum is: proxy (forwards) → load balancer (forwards + chooses + heals) → gateway (all that + edge policy) → mesh (all that, decentralized). Pick the least machinery that covers your actual requirements.
