API Gateway

A production tour of the API gateway: the request pipeline, auth and rate limits at the edge, retry storms, bulkheading, BFFs, distributed rate limiting, and the outages a front door causes.

23 min read · updated 2026-06-28

An API gateway gets pitched as “the thing that routes requests to services.” That description is true and useless, the same way “a plane is a thing that moves people” is true and useless. What the gateway actually is, operationally, is the single front door every external request passes through — and the moment you put it there, you have made a deliberate decision to concentrate your cross-cutting concerns and your blast radius into one tier. That is the whole trade. When the gateway is healthy nobody thinks about it. When it is degraded, your entire product is down at once, and your backend dashboards are all green, which makes the on-call engineer doubt their own eyes for the first ten minutes.

I have a specific scar from this. We ran a gateway in front of about forty services, and for two years it was the most boring box in the fleet. Then one afternoon a single recommendations service got slow — not down, just slow — and within ninety seconds checkout, login, and search were all timing out. None of those paths called recommendations. They just shared a connection pool with it inside the gateway, and a slow downstream had quietly eaten every worker. That afternoon taught me the thing this article is built around: the gateway’s failures are not the failures of the things behind it. They are their own category, and they live at the edge.

This is the long-form mental model for that front door: what it is genuinely responsible for, how it differs from a load balancer and a service mesh (people conflate all three constantly), what the request pipeline actually does in order, and the edge-specific ways it falls over that simply do not exist deeper in the stack. It assumes you have operated something behind a gateway and been surprised by it.

The concerns here overlap heavily with Load Balancing for how traffic gets spread, Rate Limiting for the algorithms the gateway enforces, API Design & Idempotency for why retries are safe (or aren’t), and Observability because the edge is where you see trouble first. When state has to live somewhere, it lands in Redis. If you want the relational counterpoint to the data behind all this, read PostgreSQL.

The single biggest mistake teams make is treating the gateway as plumbing — a passive pipe that requests flow through. It is not a pipe. It is an active tier that runs code on every request on the critical path, holds connections, makes decisions, and fails in ways a pipe never could.

A motivating failure

A retail platform runs a Kong gateway in front of its services. Black Friday morning, traffic is 4x a normal peak, and everything is holding. Then the /recommendations service — which decorates product pages with “you might also like” — starts garbage-collecting badly and its p99 climbs from 40ms to 9 seconds. It never returns an error. It just gets slow.

The gateway has a default route timeout of 60 seconds and a single shared upstream connection pool. Each request to /recommendations now holds a gateway worker and a connection for up to 9 seconds instead of 40ms: roughly a 225x increase in occupancy. The pool has a fixed number of slots. Within about a minute, nearly every slot is occupied by a request parked on the slow recommendations backend.
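
The occupancy jump is Little's law at work: slots in use equal the arrival rate times how long each request holds a slot. Here is a rough back-of-envelope sketch of that arithmetic. The request rate and pool size are made-up numbers for illustration, not figures from the incident.

```python
def busy_slots(requests_per_second: float, seconds_held: float) -> float:
    """Little's law: average slots in use = arrival rate x time each request holds a slot."""
    return requests_per_second * seconds_held


RPS = 50.0       # assumed /recommendations traffic (illustrative only)
POOL_SIZE = 200  # assumed size of the shared pool (illustrative only)

healthy = busy_slots(RPS, 0.040)  # p99 at 40ms
degraded = busy_slots(RPS, 9.0)   # p99 at 9 seconds

print(f"healthy: {healthy:.0f} slots, degraded: {degraded:.0f} slots")
print(f"occupancy ratio: {degraded / healthy:.0f}x")
print(f"pool exhausted: {degraded > POOL_SIZE}")
```

Under those assumptions one route goes from needing a couple of slots to needing more than the whole pool holds, which is why every other route starts queueing.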

Now /checkout comes in. It has nothing to do with recommendations. The checkout service is healthy, responding in 30ms. But the gateway has no free worker to forward the request, so checkout queues at the edge. Then it times out. So does /login. So does /search. The product is effectively down, and every backend service dashboard is green, because the backends are fine — they just never receive the requests.

The on-call engineer restarts the gateway. The pool clears, traffic floods back in, recommendations is still slow, and within ninety seconds the pool fills again. Restarting made it worse, because the reconnect stampede added load.

Nothing here is a bug. The timeout was a default. The shared pool was a default. The recommendations slowness was a normal degradation. The outage lived entirely in the absence of isolation — in the fact that one non-critical, slow backend could consume the shared resource that every critical backend depended on. That is the failure this article exists to prevent, and the fix is three settings, not a rewrite.

The one-sentence mental model

An API gateway is a reverse proxy that terminates the client connection, runs an ordered pipeline of cross-cutting policy — TLS, auth, rate limits, validation, routing, transformation — against every request, and then forwards a clean, trusted, time-bounded request to exactly one backend.

Each clause is a job you would otherwise duplicate, inconsistently, in every service you own:

Terminates the client connection → the TLS handshake, HTTP/2 or HTTP/3 negotiation, and slow-client keep-alive all happen here, so your services speak plain, fast internal protocols and never see a raw socket from the internet.
Runs an ordered pipeline → authentication, authorization, rate limiting, request validation, and observability live in one place. The order is a correctness property, not a style choice — get it wrong and you can rate-limit before you know who the caller is, or forward a header the caller forged.
Against every request → the gateway is on the critical path of 100% of external traffic. Every millisecond it spends, every dependency it calls, every pool it holds, is multiplied by your entire request volume.
Forwards to exactly one backend → by the time a service sees the request, the caller is authenticated, the request is shaped, and a timeout is set. The service gets to trust what it receives.

flowchart LR
  CL[Clients\nweb mobile\npartners] -->|TLS| GW[API Gateway]
  GW --> AUTH[Auth\nintrospection]
  GW --> RL[Rate-limit\nstore]
  GW -->|/orders| S1[Order svc]
  GW -->|/users| S2[User svc]
  GW -->|/pay| S3[Payment svc]
  GW -. traces .-> OBS[Observability]

The gateway is the seam where the untrusted internet meets your trusted network. That seam is the entire point. Everything in front of it is hostile and unverified; everything behind it has been authenticated, rate-limited, and shaped into something a service can trust without re-checking. The instant you let unshaped or unverified traffic slip past the seam — a forged X-User-Id header, an unauthenticated path, a request with no timeout — you have defeated the reason the gateway exists.

How it actually works

The request pipeline, and why order is a correctness property

A request does not “pass through” the gateway. It runs a pipeline of filters (Envoy calls them filters, Kong calls them plugins, AWS API Gateway splits the same idea across authorizers, request validators, and integrations), and the sequence is load-bearing.

The rules that have actually burned people:

Authenticate before you rate-limit per user. If you rate-limit first, you are limiting by IP or by nothing, and a single NAT’d corporate network or mobile carrier looks like one abusive client.
Strip client-supplied trust headers before routing. If your services trust an X-User-Id header that the gateway injects after auth, you must delete any incoming X-User-Id at ingress first. Otherwise a caller sets X-User-Id: 1 themselves and walks straight in as the admin. I have seen this in a real pentest finding; it is embarrassingly common.
Validate before you forward. Reject malformed requests, oversized bodies, and bad content-types at the edge so backends never spend a cycle on garbage.
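
The three rules above can be sketched as one ordered function. This is a minimal illustration, not any gateway's plugin API: the `Request` type, the `Reject` exception, the header names, and the in-memory token and counter dicts are all assumptions made up for the example.

```python
from dataclasses import dataclass, field

TRUST_HEADERS = {"x-user-id", "x-scopes"}  # hypothetical headers only the gateway may set


@dataclass
class Request:
    path: str
    headers: dict[str, str] = field(default_factory=dict)
    body: bytes = b""


class Reject(Exception):
    def __init__(self, status: int, reason: str):
        super().__init__(reason)
        self.status = status


def strip_trust_headers(req: Request) -> None:
    # Lower-case header names and drop anything the caller must not be able to set.
    req.headers = {k.lower(): v for k, v in req.headers.items()
                   if k.lower() not in TRUST_HEADERS}


def validate(req: Request, max_body: int = 1_000_000) -> None:
    if not req.path.startswith("/") or len(req.body) > max_body:
        raise Reject(400, "malformed or oversized request")


def authenticate(req: Request, tokens: dict[str, str]) -> str:
    token = req.headers.get("authorization", "").removeprefix("Bearer ")
    user = tokens.get(token)  # stand-in for real token validation
    if user is None:
        raise Reject(401, "invalid token")
    return user


def rate_limit(user: str, counts: dict[str, int], limit: int) -> None:
    counts[user] = counts.get(user, 0) + 1
    if counts[user] > limit:
        raise Reject(429, "rate limit exceeded")


def run_pipeline(req: Request, tokens: dict[str, str],
                 counts: dict[str, int], limit: int = 100) -> Request:
    strip_trust_headers(req)          # 1. remove forged trust headers at ingress
    validate(req)                     # 2. reject garbage before spending more work
    user = authenticate(req, tokens)  # 3. learn who the caller is
    rate_limit(user, counts, limit)   # 4. limit per identity, not per IP
    req.headers["x-user-id"] = user   # 5. inject the gateway's own trusted header
    return req
```

A caller who sends `X-User-Id: 1` alongside a valid token for `alice` comes out the other side with `x-user-id: alice`, because the forged value is deleted before the gateway writes its own.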

sequenceDiagram
  participant C as Client
  participant GW as Gateway
  participant A as Auth
  participant S as Backend
  C->>GW: TLS + HTTP request
  GW->>GW: terminate TLS, parse
  GW->>A: validate token
  A-->>GW: identity + scopes
  GW->>GW: rate-limit per identity
  GW->>GW: strip forged headers
  GW->>S: forward (timeout + budget)
  S-->>GW: 200 or 5xx
  GW-->>C: response + limit headers

Read that sequence as a checklist. Every arrow into the gateway box is a place latency and failure can be introduced. The validate token call to the auth service, in particular, is a synchronous dependency on the hot path — hold that thought, it comes back as a failure mode.

Routing: more than matching a path

Routing looks trivial — match /orders/*, send it to the order service. In production it is the place you express deployment strategy. A mature gateway routes on path, host, method, header, and weight:

# weighted routing for a canary (illustrative schema, not a specific gateway's)
route: /checkout
backends:
  - name: checkout-v1
    weight: 95
  - name: checkout-v2
    weight: 5    # 5% canary

That weight split is how blue/green and canary deploys work without DNS changes. The gateway is also where you do header-based routing for things like “internal beta users go to v2,” which means routing decisions can depend on identity, which means routing must come after auth. The pipeline order shows up again.
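
One way a weight split can be turned into a decision is sketched below. It hashes a sticky key (for example the authenticated user id) so the same caller keeps landing on the same version. That stickiness is my assumption for the example; many gateways simply pick at random per request. The `ROUTES` table and `pick_backend` name are hypothetical.

```python
import hashlib

# Hypothetical route table mirroring the 95/5 canary split.
ROUTES = {"/checkout": [("checkout-v1", 95), ("checkout-v2", 5)]}


def pick_backend(path: str, sticky_key: str) -> str:
    """Map a caller deterministically onto a backend in proportion to its weight."""
    backends = ROUTES[path]
    total = sum(weight for _, weight in backends)
    digest = hashlib.sha256(f"{path}:{sticky_key}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % total  # 0 .. total-1
    for name, weight in backends:
        if bucket < weight:
            return name
        bucket -= weight
    raise AssertionError("unreachable: weights cover every bucket")
```

Because the key is an identity, this only works after auth has run, which is the ordering point again.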

Request aggregation and the BFF pattern

A single mobile screen often needs data from three services: profile, order history, recommendations. You have two choices. Let the client fan out three calls over a flaky, high-latency mobile link — three TLS setups, three chances to fail, triple the radio wake time and battery. Or aggregate at the edge: one client call, the gateway (or a service just behind it) fans out internally over the fast network and stitches one response.

flowchart TD
  M[Mobile\none request] --> B[BFF\nmobile]
  B --> P[Profile svc]
  B --> O[Order svc]
  B --> R[Recs svc]
  P --> B
  O --> B
  R --> B
  B --> M2[One shaped\nresponse]

Aggregation cuts round trips, which on mobile is the dominant cost. But it couples the aggregator to backend schemas — every backend change now risks the aggregator. That coupling is exactly why the BFF (Backend-for-Frontend) pattern exists. Instead of one generic gateway aggregating for everybody, you run a thin per-client edge: one BFF for web, one for mobile, one for partners. Each BFF shapes responses for exactly its client, so the mobile team changes the mobile BFF without breaking web. The price is more services to run; the payoff is that client-specific logic stops leaking into either a shared gateway or your core services. The rule I use: if two clients keep wanting different shapes of the same data, that is the signal to split into BFFs.
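
A minimal sketch of the fan-out a mobile BFF performs, using `asyncio`. The three `fetch_*` coroutines are fake backends invented for the example, and the timeouts are arbitrary. The design choice it illustrates: required data fails the request, optional data (recommendations) degrades to a fallback instead of holding the screen hostage.

```python
import asyncio


async def fetch_profile(user_id: str) -> dict:
    await asyncio.sleep(0.01)  # fake fast backend
    return {"id": user_id, "name": "Ada"}


async def fetch_orders(user_id: str) -> list:
    await asyncio.sleep(0.01)  # fake fast backend
    return [{"order": 1}]


async def fetch_recs(user_id: str) -> list:
    await asyncio.sleep(1.0)   # fake slow backend
    return ["slow-item"]


async def optional(coro, timeout: float, fallback):
    """Await a non-critical call, returning a fallback if it is too slow."""
    try:
        return await asyncio.wait_for(coro, timeout)
    except asyncio.TimeoutError:
        return fallback


async def home_screen(user_id: str) -> dict:
    # Fan out concurrently over the internal network, then stitch one response.
    profile, orders, recs = await asyncio.gather(
        asyncio.wait_for(fetch_profile(user_id), 0.2),
        asyncio.wait_for(fetch_orders(user_id), 0.2),
        optional(fetch_recs(user_id), 0.05, fallback=[]),
    )
    return {"profile": profile, "orders": orders, "recommendations": recs}


screen = asyncio.run(home_screen("u1"))
print(screen)
```

The slow recommendations call is cut off after its own short timeout, so the client still gets profile and orders in one round trip.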

Where it sits relative to the other boxes

The gateway, the load balancer, and the service mesh get confused because all three “move traffic around.” They operate at different layers and solve genuinely different problems:

	Load balancer	API gateway	Service mesh
Primary job	Spread connections	L7 edge policy	Service-to-service traffic
Direction	North-south (in)	North-south (in)	East-west (internal)
Layer	L4 or L7	L7	L7 via sidecars
Knows about	Hosts, ports, health	Routes, auth, limits	Service identity, mTLS
Example	NLB, HAProxy	Kong, Envoy, AWS APIGW	Istio, Linkerd

The common production shape is all three stacked, not one replacing another: a load balancer spreads incoming connections across gateway instances; the gateway applies edge policy (auth, rate limits, routing) and forwards to services; a service mesh then handles mTLS, retries, and traffic shifting between those services. North-south traffic hits the first two; east-west traffic between services is the mesh’s job. Routing internal service calls back out through your public gateway is a classic anti-pattern — it adds a hop, a choke point, and exposes internal traffic to edge policy it does not need.

The tradeoffs that bite

These are the decisions that look free on the architecture diagram and bill you in production.

Tradeoff	The free-looking choice	What it actually costs
Centralization vs blast radius	One tier for all policy	Every added feature can take down everything
Latency vs consistency	An extra hop + filter pipeline	10–50ms per request if introspection isn’t cached
Aggregation vs coupling	Shape responses at the edge	Backend schema changes now risk the front door
Smart edge vs thin edge	Push logic into the gateway	A distributed monolith every team coordinates through
Shared pool vs isolation	One upstream connection pool	One slow backend starves all the others
Sync auth vs availability	Introspect every request live	Auth outage = 100% gateway failure

Two of these deserve emphasis because they cause the most incidents.

Smart edge vs thin edge is a cultural trap, not just a technical one. The gateway is right there on every request, so it is tempting to put “just one more thing” in it — a discount calculation, a feature flag check, a data transformation. Each addition is reasonable in isolation. The sum is a gateway that encodes business logic, which means every team that wants to ship is now blocked on the gateway team’s review and deploy cycle. Keep the edge thin: it enforces policy (who are you, how often, where do you go), never product (what does this feature do). The line is not always crisp, but “would a second client want this differently?” is a good test — if yes, it is product, and it belongs in a service or a BFF.

Shared pool vs isolation is the opening story. The default in most gateways is one connection pool shared across upstreams, and it is fine until one upstream gets slow. The fix — bulkheading, an isolated pool per backend — costs a little more memory and config and buys you the property that a slow /recommendations can only ruin /recommendations.

Performance: the latency tax and where it hides

The gateway sits on 100% of your request path, so its performance is your product’s floor. You cannot be faster than your front door. Three places the time goes, in rough order of impact:

TLS termination. The handshake is CPU-heavy — asymmetric crypto on the first connection, session resumption after. At high request rates with lots of new connections (mobile clients reconnecting, no keep-alive), TLS can dominate gateway CPU. Levers: enable TLS session resumption and HTTP keep-alive so you amortize handshakes, prefer ECDSA certs over RSA (cheaper handshakes), and at very high scale push TLS termination out to a CDN/edge layer so the gateway itself does less of it.

The auth call. If every request synchronously calls a token-introspection endpoint, you have added that endpoint’s latency — often 10–50ms — to every single request, plus a hard dependency on it being up. The fix is caching: validate the token once, cache the result for the token’s remaining TTL (or a capped window), and serve subsequent requests from cache. A self-contained JWT you can verify with a local public key avoids the network call entirely — verify the signature in-process, check exp and scopes, done in microseconds. The trade is revocation: a cached or JWT decision is valid until it expires, so you cannot instantly revoke a token. For most systems a short TTL (a few minutes) is the right balance.
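
The caching fix can be sketched as a small TTL cache in front of the introspection call. Everything here is an assumption for illustration: `introspect` is a hypothetical callable that returns an identity plus the token's remaining lifetime in seconds (or `None` for an invalid token), and the 300-second cap is an arbitrary stand-in for "a few minutes".

```python
import time


class TokenCache:
    """Cache positive introspection results for min(token lifetime, max_ttl)."""

    def __init__(self, introspect, max_ttl: float = 300.0, clock=time.monotonic):
        self._introspect = introspect  # hypothetical: token -> (identity, seconds_left) or None
        self._max_ttl = max_ttl
        self._clock = clock
        self._entries: dict[str, tuple[str, float]] = {}

    def validate(self, token: str):
        now = self._clock()
        hit = self._entries.get(token)
        if hit is not None and hit[1] > now:
            return hit[0]  # served from cache, no network call
        result = self._introspect(token)
        if result is None:
            self._entries.pop(token, None)
            return None
        identity, seconds_left = result
        # The cap bounds how long a revoked token can keep working.
        self._entries[token] = (identity, now + min(seconds_left, self._max_ttl))
        return identity
```

The `max_ttl` cap is the revocation trade made explicit: lower it and revocations take effect sooner, at the price of more introspection calls.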

The filter pipeline itself. Every plugin runs per request. A regex-heavy request-validation rule, a verbose access log written synchronously, a Lua plugin doing real work — each adds microseconds-to-milliseconds, multiplied by your full traffic. Profile the pipeline under load; an innocuous-looking logging plugin writing to a slow disk has stalled more than one gateway.

The numbers to watch are not the same as your backends’. Track gateway p99 latency added (response time minus upstream time — the gateway’s own overhead), connection pool utilization per upstream (the early-warning metric for the bulkhead story), active connections, TLS handshakes/sec, and auth-cache hit rate. A creeping pool utilization on one upstream is the signal that a backend is slowing before it ever shows as an error.

Failure modes

How the front door breaks, each as symptom → root cause → prevention. These are edge-specific; they do not exist inside a single service.

Connection-pool starvation from a slow backend. Symptom: requests to healthy services time out at the edge while their backend dashboards stay green. Root cause: a shared upstream pool, one slow backend, no per-route timeout — workers pile up parked on the slow service (the opening story). Prevention: a tight per-route timeout shorter than the client’s, an isolated connection pool per backend (bulkheading), and a circuit breaker on the slow one so it fails fast.
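
Bulkheading boils down to one bounded set of slots per backend, with a fast rejection when a backend's own slots are gone. The sketch below uses plain semaphores to show the idea. The `Bulkheads` class and the slot counts are invented for the example and are not how any particular gateway implements its pools.

```python
import threading
from contextlib import contextmanager


class BulkheadFull(Exception):
    """This backend's slots are exhausted; fail fast instead of queueing."""


class Bulkheads:
    def __init__(self, limits: dict[str, int]):
        # One independent semaphore per backend instead of one shared pool.
        self._slots = {name: threading.BoundedSemaphore(n) for name, n in limits.items()}

    @contextmanager
    def slot(self, backend: str):
        sem = self._slots[backend]
        if not sem.acquire(blocking=False):
            raise BulkheadFull(backend)
        try:
            yield
        finally:
            sem.release()


pools = Bulkheads({"recommendations": 2, "checkout": 50})  # illustrative sizes
```

With this shape, a saturated `recommendations` compartment raises `BulkheadFull` for recommendations requests only, while `with pools.slot("checkout"):` keeps succeeding.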

Retry storms. Symptom: a backend has a minor hiccup and instantly gets buried under 2–3x its normal load. Root cause: the gateway retries failed requests, retries pile onto an already-struggling backend, and the amplification arrives exactly when the backend can least take it. Prevention: a retry budget — cap retries to a small fraction of total requests (e.g. 10%), so retries can never multiply load without bound — plus only retrying idempotent methods (see API Design & Idempotency), and exponential backoff with jitter.

flowchart TD
  B[Backend slow] --> R{Retry on\nfailure?}
  R -->|no budget| AMP[Retries pile on\n2-3x load]
  AMP --> DEAD[Backend buried\nfull outage]
  R -->|budget 10%| CAP[Retries capped]
  CAP --> CB[Circuit breaks\nfail fast]
  style DEAD fill:#e11d48,color:#fff
  style AMP fill:#171717,color:#fff
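
A minimal sketch of the three prevention measures together: a budget, an idempotency check, and backoff with full jitter. It is simplified on purpose. The budget counts over the process lifetime, whereas a production budget would use a sliding window, and all names here are hypothetical.

```python
import random

IDEMPOTENT_METHODS = {"GET", "HEAD", "PUT", "DELETE", "OPTIONS"}


class RetryBudget:
    """Allow retries only while they stay under a percentage of total requests."""

    def __init__(self, percent: int = 10):
        self.percent = percent
        self.requests = 0
        self.retries = 0

    def record_request(self) -> None:
        self.requests += 1

    def try_spend(self) -> bool:
        # Integer math: would one more retry exceed percent% of requests seen?
        if (self.retries + 1) * 100 > self.requests * self.percent:
            return False
        self.retries += 1
        return True


def should_retry(method: str, budget: RetryBudget) -> bool:
    # Non-idempotent methods never retry and never consume budget.
    return method.upper() in IDEMPOTENT_METHODS and budget.try_spend()


def backoff_delay(attempt: int, base: float = 0.1, cap: float = 2.0) -> float:
    """Exponential backoff with full jitter: uniform in [0, min(cap, base * 2^attempt)]."""
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))
```

Once the budget is spent, failures pass straight back to the caller, so the extra load a struggling backend sees is capped at the budget rather than doubling or tripling.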

Auth dependency outage. Symptom: 100% of requests fail with auth errors though the backends are healthy. Root cause: synchronous introspection on every request against an auth service that just went down. Prevention: cache validation within token TTL, prefer locally-verifiable JWTs, and decide deliberately whether you fail-open or fail-closed when auth is unreachable. For a payments path, fail-closed (reject) is correct. For a public read endpoint, fail-open (serve without identity) may be acceptable. The wrong default is no decision at all.
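
Making the fail-open versus fail-closed decision explicit can be as small as a per-route table with a conservative default. The sketch below is one hypothetical way to write it down. The routes, the `AuthUnavailable` exception, and the status codes chosen are assumptions for illustration.

```python
from enum import Enum


class OnAuthDown(Enum):
    FAIL_CLOSED = "fail_closed"  # reject when auth is unreachable
    FAIL_OPEN = "fail_open"      # serve without identity


class AuthUnavailable(Exception):
    """Raised by the (hypothetical) introspection client when auth cannot be reached."""


# Hypothetical per-route policy; anything unlisted fails closed.
ROUTE_POLICY = {
    "/pay": OnAuthDown.FAIL_CLOSED,
    "/catalog": OnAuthDown.FAIL_OPEN,
}


def resolve_identity(path: str, token: str, introspect):
    """Return (status, identity); identity is None for anonymous or rejected requests."""
    try:
        identity = introspect(token)
    except AuthUnavailable:
        policy = ROUTE_POLICY.get(path, OnAuthDown.FAIL_CLOSED)
        if policy is OnAuthDown.FAIL_OPEN:
            return 200, None  # proceed as anonymous
        return 503, None      # auth is down and this route requires it
    if identity is None:
        return 401, None
    return 200, identity
```

Note that an invalid token is still a 401 on every route. Fail-open only applies when the auth service cannot answer at all.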

Config rollout takes everything down. Symptom: a routine config push and instantly every request errors or routes wrong. Root cause: a bad route or filter applies to all traffic the moment it is loaded. Prevention: stage config rollouts (canary the config the way you canary code), validate config in CI, and keep a one-command rollback. A typo in a routing rule should not be a global outage.
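
Validating config in CI can start as a short script that refuses obviously dangerous route tables before they ever reach a gateway. The sketch below checks a hypothetical route schema (a `timeout_s` plus weighted `backends`). The schema, the known-backend list, and the 10-second ceiling are all assumptions, not any gateway's real format.

```python
KNOWN_BACKENDS = {"checkout-v1", "checkout-v2", "recommendations"}  # hypothetical registry


def validate_routes(routes: dict, max_timeout_s: float = 10.0) -> list[str]:
    """Return a list of human-readable errors; an empty list means the config passes."""
    errors = []
    for path, route in routes.items():
        if not path.startswith("/"):
            errors.append(f"{path}: route path must start with '/'")
        timeout = route.get("timeout_s")
        if timeout is None or not 0 < timeout <= max_timeout_s:
            errors.append(f"{path}: timeout_s must be set and at most {max_timeout_s}s")
        backends = route.get("backends", [])
        if sum(b["weight"] for b in backends) != 100:
            errors.append(f"{path}: backend weights must sum to 100")
        for b in backends:
            if b["name"] not in KNOWN_BACKENDS:
                errors.append(f"{path}: unknown backend {b['name']!r}")
    return errors
```

A CI job would fail the build whenever the returned list is non-empty, which catches the typo class of outage before the staged rollout even begins.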

The single point of failure — by design. Symptom: one gateway instance or one AZ dies and the whole product is down despite healthy backends. Root cause: the gateway tier is not horizontally scaled across AZs. Prevention: run many stateless gateway instances behind a load balancer across multiple availability zones; never one box.

The gateway turning a single slow backend into a total outage is the classic edge incident, and it is worth saying plainly: your gateway’s default config is almost certainly wrong for production. A shared connection pool and a 60-second timeout mean any one backend that gets slow — not down, just slow — can starve every other route through the same gateway. Set a tight per-route timeout, give each backend its own pool, and circuit-break the slow ones. Those three changes convert “one service degrades the whole product” into “one service degrades only itself,” which is the entire reason you isolate.

Scaling it

The gateway scales differently from the services behind it because it is stateful in one specific, dangerous way: it holds connections and, naively, counters.

Stateless instances, shared state externalized. The gateway must keep no state in local memory that outlives a single request, so you can run many identical instances behind a load balancer across AZs and lose any one without consequence. Anything that must be shared (rate-limit counters, sessions, auth cache) lives in an external store like Redis, not in gateway RAM. This is the same horizontal-scaling discipline as a stateless web tier; the gateway just has more reasons to violate it accidentally.

Distributed rate limiting is the hard part. Here is the trap: you want a global limit of “1000 requests/minute per API key.” You run 10 gateway instances. If each instance enforces 1000 locally, your real limit is 10,000 — ten times what you intended. The load balancer spreads a client’s requests across instances, so no single instance sees the whole picture.

flowchart TD
  C[Client key=abc] --> LB[Load balancer]
  LB --> G1[Gateway 1]
  LB --> G2[Gateway 2]
  LB --> G3[Gateway 3]
  G1 --> RS[(Shared counter\nRedis)]
  G2 --> RS
  G3 --> RS
  RS --> D{Over limit?}

Two ways out, and you pick your poison. Centralized counters: every instance does an atomic INCR against a shared Redis, so the limit is exact — at the cost of a network hop on every rate-limited request and a hot dependency on that store. Local division: give each instance 1000 / N of the budget, no hop, fully accurate only if the load balancer distributes perfectly evenly (it doesn’t) and N is stable (it isn’t, during a deploy or autoscale). Most large systems use a hybrid: local counters synced periodically, accepting small overage for the latency win. The full algorithm menu — token bucket, sliding window — lives in Rate Limiting.
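
The difference between a shared counter and per-instance counters is easy to see in a toy simulation. The sketch below uses an in-memory class as a stand-in for the shared store, with `incr` mimicking an atomic increment-and-return. It is a fixed-window counter chosen for brevity, and a real deployment would also expire old window keys.

```python
class InMemoryStore:
    """Stand-in for a shared counter store; incr mimics an atomic increment-and-return."""

    def __init__(self):
        self._counts: dict[str, int] = {}

    def incr(self, key: str) -> int:
        self._counts[key] = self._counts.get(key, 0) + 1
        return self._counts[key]


def allow(store: InMemoryStore, api_key: str, limit: int, now: float, window_s: int = 60) -> bool:
    """Fixed-window limiter: one counter per key per window."""
    bucket = int(now // window_s)
    return store.incr(f"rl:{api_key}:{bucket}") <= limit


def admitted(stores: list, limit: int, requests: int) -> int:
    """Send requests round-robin across gateway instances and count how many get through."""
    count = 0
    for i in range(requests):
        if allow(stores[i % len(stores)], "abc", limit, now=0.0):
            count += 1
    return count


shared = InMemoryStore()
central = admitted([shared] * 10, limit=1000, requests=20_000)
local = admitted([InMemoryStore() for _ in range(10)], limit=1000, requests=20_000)
print(central, local)
```

Ten instances sharing one counter admit exactly the intended 1000 requests in the window. Ten instances each enforcing 1000 on their own admit 10,000.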

Cache aggressively at the edge. The cheapest backend call is the one the gateway answers itself. Cache auth introspection within token TTL, cache hot idempotent GET responses, cache routing decisions. Every cache hit is a hop and a backend load you didn’t pay.

Circuit breaking per backend. At scale, trip a circuit when a backend’s error rate or latency crosses a threshold, so the gateway returns a fast cached or degraded response instead of piling requests onto a dying service. This is bulkheading’s active sibling — isolation contains the damage, circuit breaking stops feeding it.
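
A minimal sketch of the breaker as a three-state machine: closed, open, half-open. It trips on consecutive failures only. A latency-based trip, as mentioned above, would need a rolling window and is left out. The class name, thresholds, and method names are hypothetical.

```python
import time


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, cooldown_s: float = 30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self._clock = clock
        self.state = "closed"
        self._failures = 0
        self._opened_at = 0.0

    def allow_request(self) -> bool:
        if self.state == "closed":
            return True
        if self.state == "open" and self._clock() - self._opened_at >= self.cooldown_s:
            self.state = "half_open"  # cooldown over: let exactly one probe through
            return True
        return False  # still cooling down, or a probe is already in flight

    def record_success(self) -> None:
        self.state = "closed"
        self._failures = 0

    def record_failure(self) -> None:
        self._failures += 1
        # A failed probe reopens immediately; otherwise trip at the threshold.
        if self.state == "half_open" or self._failures >= self.failure_threshold:
            self.state = "open"
            self._opened_at = self._clock()
```

The gateway would keep one breaker per backend, call `allow_request()` before forwarding, and return its cached or degraded response whenever that comes back `False`.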

The wall you hit is the gateway tier’s own CPU (TLS) and the shared rate-limit store becoming a hot dependency. At very high scale you push TLS to a CDN/edge layer, keep the gateway’s per-request work minimal, and shard or replicate the rate-limit store. Past that, you are running a distributed system at the edge with all the consistency tradeoffs that implies.

When to reach for it (and when not to)

Reach for an API gateway when you have multiple services exposed to external clients and want consistent auth, rate limiting, TLS termination, and routing enforced in one place instead of reimplemented twelve times. Reach for it when you expose public or partner APIs that need keys, quotas, and usage plans. Reach for it when client-specific shaping (a BFF) keeps your clients decoupled from your core services so teams can move independently.

Don’t reach for it when you have a single service or a small monolith — a load balancer with TLS termination does everything you need, and a gateway is just an extra hop and another tier to operate and page on. Don’t route service-to-service (east-west) traffic through it; that is a service mesh’s job, and sending internal calls out through the public front door adds latency and a needless choke point. Don’t stuff business logic into it; the moment product behavior lives in the gateway, every team is blocked on the gateway team, and you have built a distributed monolith with a reverse proxy at its heart.

The honest framing: a gateway earns its keep when the number of services times the number of cross-cutting concerns is large enough that duplicating policy is worse than centralizing it. Below that line, it is overhead.

When to consider alternatives

Pure internal service-to-service traffic, mTLS, retries → a service mesh (Istio, Linkerd), discussed alongside Observability.
Simple north-south with no policy needs → a plain L7 load balancer with TLS termination.
Serverless or managed-first stacks → the cloud provider’s managed gateway (AWS API Gateway) so you don’t operate the tier at all.
The rate-limiting algorithms themselves → Rate Limiting.
Where shared edge state lives → Redis for counters, sessions, and the auth cache.
Async ingestion rather than synchronous request/response → put a queue at the edge; see Message Queues.

The pattern: the gateway is the synchronous, north-south, policy-enforcing front door. The moment your need is east-west, or async, or pure connection-spreading, the right tool is something narrower, and the gateway is just extra moving parts.

Operational checklist

Run the gateway tier horizontally across multiple AZs behind a load balancer — never a single instance, ever.
Set a per-route timeout shorter than the client’s timeout, and give each backend an isolated connection pool (bulkhead) so one slow service can’t starve the rest.
Enforce a retry budget (cap retries to ~10% of requests), only retry idempotent methods, and use backoff with jitter; pair with per-backend circuit breakers.
Strip client-supplied trust headers (X-User-Id and friends) at ingress before injecting your own, and authenticate before rate-limiting per identity.
Cache token introspection within token TTL or use locally-verifiable JWTs; explicitly decide and document fail-open vs fail-closed when auth is unreachable.
Keep rate-limit and session state in a shared store (Redis), never gateway memory, and account for the per-instance overage problem: N instances each enforcing a limit of R admit up to N×R.
Stage config and route rollouts with CI validation and one-command rollback — a bad route applies globally and instantly.
Amortize TLS with session resumption and keep-alive; watch gateway-added p99, per-upstream pool utilization, TLS handshakes/sec, and auth-cache hit rate.
Emit per-route metrics, logs, and traces (latency, error rate, retry count) into your observability pipeline — the edge is where you see problems before the backends do.

Summary

An API gateway is the single front door every external request passes through, and almost all of its sharp edges trace back to that one fact: it concentrates your cross-cutting policy and your blast radius into one tier on 100% of the request path. That concentration is its value — consistent auth, rate limits, TLS, and routing in one place — and its danger, because a slow backend, a retry storm, an auth outage, or a bad config push all become product-wide outages even when every service behind it is healthy. Run it stateless and horizontal across AZs, externalize shared state, give every backend an isolated pool and a tight timeout, cap retries with a budget, cache the auth call, and keep the edge thin — policy, never product. Do that and the gateway is the most boring tier you own. Forget the isolation settings and it is the tier that turns one service’s bad afternoon into your whole company’s incident.

Appendix: reverse proxy fundamentals

If the body assumed terms you’d like restated:

Reverse proxy — a server that accepts client requests and forwards them to backend servers, returning the backend’s response as if it were the origin. The client never talks to the backend directly. (A forward proxy is the mirror image: it sits in front of clients and forwards their requests out to the internet.)
TLS termination — decrypting HTTPS at the proxy so backends receive plain HTTP over the trusted internal network. The CPU cost of the handshake lives at the terminator.
North-south vs east-west — north-south is traffic in and out of your system (clients ↔ services); east-west is traffic between your services internally. Gateways handle north-south; meshes handle east-west.
Bulkhead — borrowed from ship design: isolated compartments so a breach in one doesn’t sink the whole vessel. In a gateway, an isolated connection pool per backend so one slow upstream can’t drown the others.
Circuit breaker — a state machine that, after enough failures to a backend, “opens” and fails requests fast for a cooldown instead of waiting on a dying service, then “half-opens” to test recovery.
Idempotent method — a request that has the same effect whether applied once or many times (GET, PUT, DELETE), which is what makes it safe to retry. POST usually is not, which is why the gateway must not blindly retry it; see API Design & Idempotency.
