Sharding & Partitioning
Sharding buys write scale and hands you routing, rebalancing, and cross-shard queries as the rent. Hash vs range vs directory, hot partitions, global indexes, and resharding without a % N reshuffle.
Sharding is what you do when one machine can no longer hold the data or absorb the writes. It is also the moment the operations you took for granted — a join, a unique constraint, a COUNT(*), a transaction across two rows — stop being free and start being engineering projects. The pitch is seductive: “split the data across N nodes and scale linearly.” The reality is that you have traded a capacity problem for a routing and coordination problem, and unlike the capacity problem, the new one never fully goes away. You will be paying its rent for as long as the system exists.
The single decision that defines everything downstream is the partition key — the field you split on. Choose it well and load spreads evenly, most queries touch one shard, and the cluster scales the way the brochure promised. Choose it badly and you get hot partitions, scatter-gather queries that get slower as you add nodes, and a resharding migration that haunts a quarter and a postmortem. There is no schema migration to fix a bad partition key cheaply once you have a few terabytes; you re-key the whole dataset, live, while it serves traffic.
This is the long-form context article. It assumes you have already run a sharded system in production and been bitten, or you are about to build one and want to be bitten less. It leans on Consistent Hashing for the ring math that makes resharding survivable, Database Replication for the read-scaling step you should exhaust first, and Database Indexing for what indexes cost once they are themselves partitioned. The systems that live and die by these tradeoffs are DynamoDB, Cassandra, Kafka, and Elasticsearch; the relational system you are usually outgrowing when you reach this page is PostgreSQL.
The blunt truth up front: sharding does not make your system faster. It makes it bigger. Per-request latency on a well-placed query stays about the same; what changes is that you can now do many more of those requests in parallel across machines. Everything that cannot be answered from a single shard gets slower, more fragile, or both. Internalize that asymmetry and most sharding design decisions answer themselves.
A motivating failure
A ride-hailing company shards its trips table by city_id. It is a tidy, intuitive choice — trips belong to a city, dispatch queries are scoped to a city, and there are a few thousand cities, which feels like plenty of partitions. For three years it works. Each city’s trips live on one shard, dispatch reads are single-shard and fast, the on-call rotation is quiet.
Then the company runs a national New Year’s Eve promotion. At 23:00 local time, in each timezone, trip volume in the largest metros spikes 40×. The problem is that city_id for New York, London, and São Paulo each maps to exactly one shard, and a single shard’s primary cannot absorb 40× the write rate of a normal Friday. The shard hosting New York hits 100% CPU, write latency climbs from 4ms to 2.5s, and the dispatch service — which writes a row per driver-location update — starts timing out. Drivers in Manhattan go invisible on the map.
Meanwhile, the shard hosting the bucket of two thousand small towns sits at 6% CPU. The cluster as a whole is at 19% utilization. Every dashboard that averages across shards says the system is healthy. The pager says it is not. The engineer who responds adds three more shards, which does nothing, because city_id for New York still hashes to one of them — you cannot spread one key across nodes by adding nodes.
The fix shipped two weeks later was to re-key high-volume cities to city_id#bucket, splitting each hot city across 16 synthetic sub-partitions and summing on read. Nothing in the original design was a bug. The partition key was reasonable for the average case and catastrophic for the peak case, and “even on average” turned out to be a lie that only the tail told the truth about. That gap — between average distribution and worst-case distribution — is where most sharding incidents live.
The one-sentence mental model
Sharding is deterministically mapping each row to exactly one node by a key, so any client can find data without consulting a central directory — and the entire difficulty is that the map must stay balanced and stable while the data, the traffic, and the node count all change underneath it.
Every clause is an operational constraint you will meet:
- Deterministically mapping by a key → the key is load-bearing for both correctness and performance. A query that knows the key is a one-node lookup; a query that does not is a fan-out to all of them.
- Exactly one node → no row lives in two places (replication is a separate concern), so anything spanning rows on different shards loses single-node transactions and joins.
- Without consulting a central directory → the mapping function is the contract every client computes locally, which is why changing it is so expensive — you have to change everyone’s mental map at once.
- Stay balanced and stable while everything changes → this is the whole game. A perfect map today is a hot mess after the dataset doubles or you add a node, unless you designed the remapping cost in from the start.
flowchart TD
W[Write\nkey=user:1042] --> R{Router\nshard = f key}
R -->|shard 0| S0[(Shard 0)]
R -->|shard 1| S1[(Shard 1)]
R -->|shard 2| S2[(Shard 2)]
Q[Query by\nnon-key field] -.scatter.-> S0
Q -.gather.-> S1
Q -.-> S2
Two truths live in that diagram. A query that includes the partition key is a single-node lookup — cheap, parallelizable, scalable. A query that does not include it — filter by a different field, range over the whole dataset, aggregate everything — becomes a scatter-gather that fans out to every shard and waits for the slowest one to answer. Your partition key is therefore a bet on your dominant access pattern. Bets on the wrong access pattern are not refundable; you pay for them in scatter-gather latency on every request for the life of the system, or you pay to re-key.
A note on vocabulary, because the two words in the title are often used loosely. Partitioning is the general act of splitting one logical dataset into pieces (partitions); it can happen within a single machine — PostgreSQL declarative partitioning splits one table into child tables on one node. Sharding is partitioning where the pieces live on different machines for the explicit purpose of scaling out. All sharding is partitioning; not all partitioning is sharding. This article is mostly about the cross-machine case, because that is where the hard problems are.
How it actually works
Partitioning means choosing a function f(key) → partition, then a second mapping partition → node. The three families differ in what that first function is and what it costs you. Keeping the two mappings separate — key to partition, partition to node — is the single most important structural decision, and the one most naive implementations get wrong by collapsing them.
flowchart LR
K[Partition key] --> H[Hash\nspread evenly]
K --> R[Range\nordered buckets]
K --> D[Directory\nlookup table]
H --> HE[No range scans\nno hotspots]
R --> RE[Range scans cheap\nhotspots likely]
D --> DE[Fully flexible\nlookup is a dep]
Hash partitioning
Apply a hash to the key and use it to assign a partition — classically hash(key) % N, in practice a hash onto a ring or onto a fixed set of buckets. Hashing spreads load evenly and destroys the correlation between adjacent keys, which kills the hotspot you would otherwise get from sequential keys like timestamps or auto-increment IDs.
The cost is that it also destroys ordering. Once hash(key) decides placement, a range query like WHERE created_at BETWEEN '2026-01-01' AND '2026-01-07' has no locality — the matching rows are smeared across every shard, so the query becomes a scatter-gather. Hash partitioning is the default in DynamoDB (the partition-key hash) and Cassandra (the partitioner over the token ring). Reach for it when access is key-value-shaped and even distribution matters more than ordered scans.
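A minimal sketch of hash placement, assuming a fixed partition count and SHA-256 as the hash (both are illustrative choices, not what any particular database uses). It shows the two effects described above: sequential keys spread out roughly evenly, and a small range of keys ends up scattered over many partitions.

```python
import hashlib
from collections import Counter

NUM_PARTITIONS = 8  # illustrative, fixed partition count


def stable_hash(key: str) -> int:
    """Hash that is identical in every process (unlike Python's built-in hash() for str)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def partition_for(key: str) -> int:
    return stable_hash(key) % NUM_PARTITIONS


# Sequential keys no longer land next to each other, so load spreads out.
sequential = [f"event:{i}" for i in range(10_000)]
spread = Counter(partition_for(k) for k in sequential)

# The flip side: a contiguous range of keys has to be looked up on many partitions.
partitions_for_range = {partition_for(f"event:{i}") for i in range(100, 200)}

print(sorted(spread.items()))
print(len(partitions_for_range), "partitions touched by a 100-key range")
```

The modulo here is over a partition count that never changes; the trouble described later comes from taking the modulo over a node count that does.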
Range partitioning
Assign contiguous key ranges to shards: a–f here, g–m there, and so on; or by time, 2026-W01 here, 2026-W02 there. Range scans are now cheap and stay on one or two shards, which is exactly what you want for time-series reads or “give me this user’s last 100 events.”
The cost is the mirror image of hashing: sequential keys all land on the same range. A monotonic key — a timestamp, a Snowflake ID, an auto-increment — funnels every new write to whichever shard owns the newest range, creating a hot partition at the write frontier while every older shard goes read-only-ish and idle. HBase and Bigtable partition by range, and “avoid monotonically increasing row keys” is the first line of every HBase schema guide for exactly this reason.
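A small sketch of range placement, assuming a sorted list of split points held in memory (the split values and shard numbering are made up for illustration). It shows why a range scan touches few shards, and why time-ordered keys all pile onto the last one.

```python
import bisect

# Split points, sorted. Shard i owns keys k with SPLITS[i-1] <= k < SPLITS[i].
SPLITS = ["g", "n", "t"]  # 4 shards: [..g) [g..n) [n..t) [t..]


def shard_for(key: str) -> int:
    return bisect.bisect_right(SPLITS, key)


def shards_for_range(low: str, high: str) -> list[int]:
    """Shards that a scan over [low, high] has to touch."""
    return list(range(shard_for(low), shard_for(high) + 1))


# The same lookup with time-ordered keys: every new write sorts after all
# earlier ones, so it lands on the last shard, the write frontier.
TIME_SPLITS = ["2026-01-05", "2026-01-12"]
new_writes = ["2026-01-12T09:00", "2026-01-12T09:01", "2026-01-13T17:30"]
frontier = {bisect.bisect_right(TIME_SPLITS, ts) for ts in new_writes}

print(shards_for_range("h", "k"))  # [1]
print(frontier)                    # {2}
```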
Directory (lookup) partitioning
Keep an explicit table that maps key → partition → node. Maximum flexibility: you can place any key anywhere and move a single hot key off a struggling shard without touching anyone else. This is how you handle the celebrity-user problem surgically.
The cost is that the directory is now a dependency on the request path and a single point of failure if you do not replicate it. Every lookup consults it (usually cached aggressively), and the directory’s own consistency becomes a thing you operate. Most real systems are hash-by-default with a thin directory layer bolted on for the small set of keys that need special placement — the best of both, at the price of running both.
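The hash-by-default, directory-for-exceptions arrangement can be sketched as follows. The `overrides` table and the `pin` helper are hypothetical names for illustration; a production directory would be a replicated, cached service rather than a process-local dict.

```python
import hashlib

NUM_PARTITIONS = 64  # illustrative

# Hypothetical override table: only keys that need special placement appear here.
overrides: dict[str, int] = {}


def default_partition(key: str) -> int:
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_PARTITIONS


def locate(key: str) -> int:
    """Directory entry wins if present; everything else falls through to the hash."""
    return overrides.get(key, default_partition(key))


def pin(key: str, partition: int) -> None:
    """Relocate one key without touching the placement of any other key."""
    overrides[key] = partition
```

Because the override table only holds the exceptions, it stays small enough to cache everywhere, which is what keeps the lookup cheap on the request path.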
The resharding problem
Here is the trap that sinks naive implementations. Plain hash(key) % N has a fatal property: change N and almost every key remaps. Going from 4 shards to 5 does not move 1/5 of the data; it moves roughly 80% of it, because h % 4 and h % 5 agree for only about one hash value in five. You are reshuffling nearly the entire dataset across the network while serving live traffic.
flowchart TD
A[Add 5th shard] --> M{Mapping\nmethod?}
M -->|hash mod N| BAD[~80% keys\nremap]
M -->|consistent hash| OK1[~1/N keys\nmove]
M -->|fixed vpartitions| OK2[whole partitions\nmove, no resplit]
BAD --> X[network storm\nlive reshuffle]
OK1 --> Y[bounded move\nneighbors only]
OK2 --> Y
style BAD fill:#e11d48,color:#fff
style X fill:#171717,color:#fff
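The remap fraction under hash mod N is easy to check directly. This sketch uses consecutive integers as a stand-in for uniformly distributed hash values, which is an assumption made to keep the arithmetic exact.

```python
def remapped_fraction(old_n: int, new_n: int, hashes) -> float:
    """Fraction of hash values whose shard changes when the modulus changes."""
    moved = sum(1 for h in hashes if h % old_n != h % new_n)
    return moved / len(hashes)


hashes = range(180_000)  # stand-in for uniformly distributed hash values

print(remapped_fraction(4, 5, hashes))   # 0.8
print(remapped_fraction(9, 10, hashes))  # 0.9
```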
There are two standard escapes, and you must pick one before the first shard goes live:
- Consistent hashing. Keys and nodes both map onto a ring; a key belongs to the next node clockwise. Adding a node only steals the keys between it and its predecessor: roughly 1/N of the data moves, not 80%. Virtual nodes smooth out the lumpiness. The full mechanics are in Consistent Hashing; this is what Cassandra and DynamoDB use under the hood.
- Fixed virtual partitions. Pre-split the keyspace into a large, fixed number of logical partitions (say 4096) at day one, and assign whole partitions to nodes. The key→partition map never changes; only the partition→node assignment does. Adding a node moves a batch of whole partitions to it and never re-splits a single key. This is how Kafka (topic partitions), Elasticsearch (shards), and DynamoDB (internal partitions) all dodge the % N trap.
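A toy ring makes the consistent-hashing claim concrete. This is a sketch under stated assumptions (SHA-256 positions, 128 virtual nodes per physical node, node names invented for the example), not how any particular database implements its ring.

```python
import bisect
import hashlib


def ring_hash(value: str) -> int:
    return int.from_bytes(hashlib.sha256(value.encode("utf-8")).digest()[:8], "big")


class Ring:
    def __init__(self, nodes, vnodes=128):
        self.vnodes = vnodes
        self._points = []  # sorted (position, node) pairs
        for node in nodes:
            self.add(node)

    def add(self, node):
        self._points.extend((ring_hash(f"{node}#{i}"), node) for i in range(self.vnodes))
        self._points.sort()
        self._positions = [pos for pos, _ in self._points]

    def owner(self, key):
        # First ring point at or after the key's position, wrapping at the end.
        i = bisect.bisect_left(self._positions, ring_hash(key)) % len(self._positions)
        return self._points[i][1]


keys = [f"user:{i}" for i in range(20_000)]
ring = Ring(["node-a", "node-b", "node-c", "node-d"])
before = {k: ring.owner(k) for k in keys}

ring.add("node-e")
moved = [k for k in keys if ring.owner(k) != before[k]]
moved_fraction = len(moved) / len(keys)
print(f"{moved_fraction:.1%} of keys moved, all of them to node-e")
```

The moved fraction should come out near one fifth, and every key that moves lands on the new node; no key shuffles between the existing four.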
A concrete number makes the case. Take 100M keys on 4 shards. Move to 5 shards under % N and roughly 100M × (1 − 1/5) = 80M keys remap: 80% of the dataset copied across the network, live. Under 4096 fixed virtual partitions assigned ~1024 per node, adding a fifth node moves about 4096/5 ≈ 819 partitions, close to 20% of the data, and crucially with zero key-level churn, because keys never change which partition they belong to. Same end state (a rebalanced cluster), about a quarter as much data in flight, and no risk of a key being briefly homeless mid-move. The choice of 4096 is itself a one-way door: too few and you cannot spread across many nodes later; too many and per-partition overhead adds up. Pick a partition count that comfortably exceeds the largest node count you will ever plausibly run.
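The fixed-virtual-partition arithmetic can be sketched in a few lines. The round-robin initial assignment and the "take from the fullest node" policy are assumptions chosen for simplicity; real rebalancers also weigh partition size and load.

```python
import hashlib
from collections import Counter

P = 4096  # fixed at day one; never changes


def partition_of(key: str) -> int:
    """Frozen key -> partition map."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % P


def initial_assignment(nodes):
    """Partition -> node map, dealt round-robin."""
    return {p: nodes[p % len(nodes)] for p in range(P)}


def add_node(assignment, new_node):
    """Move just enough whole partitions to new_node, taking from the fullest node each time."""
    owned = {}
    for p, node in assignment.items():
        owned.setdefault(node, []).append(p)
    target = P // (len(owned) + 1)
    result = dict(assignment)
    for _ in range(target):
        donor = max(owned, key=lambda n: len(owned[n]))
        result[owned[donor].pop()] = new_node
    return result


before = initial_assignment(["n1", "n2", "n3", "n4"])
after = add_node(before, "n5")
moved = [p for p in range(P) if before[p] != after[p]]
print(len(moved), "partitions moved")  # 819 partitions moved
```

Only the second map changed. `partition_of` was never consulted during the rebalance, which is the whole point: routing a key after the move uses the same partition number and a different node lookup.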
The tradeoffs that bite
These are the decisions that look free on the whiteboard and bill you in production.
Even distribution vs query locality. Hash gives you balance and steals your range scans. Range gives you cheap ordered scans and hands you a write-frontier hotspot. You cannot have perfectly even load and cheap ordered queries on the same key — the two are in direct tension, and you must pick the one your dominant access pattern needs. If you need both, you need two copies of the data keyed differently, which is a real cost you should name out loud.
Single-shard speed vs cross-shard cost. Anything within one partition stays transactional, joinable, and fast. The instant an operation spans shards — a join across two keys, a multi-row transaction, a global aggregate — you pay fan-out latency and lose single-node ACID. The cliff is sharp: same-shard and cross-shard are not “a bit slower,” they are different categories of operation with different failure modes.
| Operation | Same shard | Cross shard |
|---|---|---|
| Point lookup by key | O(1), one node | n/a |
| Range scan | One/few nodes | Scatter-gather, all nodes |
| Join | Local, cheap | Application-side or denormalize |
| Transaction | Native ACID | 2PC / saga, much harder |
| Unique constraint | Enforced locally | Needs a global index |
| COUNT(*) / aggregate | One node | Fan-out + merge |
Flexibility vs operational weight. A directory layer lets you place and move any key — exactly the flexibility you want for hot keys — but now every read consults it, and you have added a stateful service to run, replicate, cache, and keep consistent. Free flexibility is never free; it is a service on your org chart.
Re-keying is a one-way door. The partition key is the hardest thing to change after launch. Unlike a missing index, which you add online, changing the partition key means re-reading and re-writing every row to its new home while the old layout still serves traffic, usually via a dual-write-and-backfill migration that runs for weeks. Treat the partition-key choice with the seriousness of a database vendor choice, because it is about as reversible.
Performance: what gets faster, what gets slower
The honest framing: sharding improves aggregate throughput and leaves single-request latency roughly where it was — for the requests that hit one shard. Everything else trades latency for the privilege of scaling.
What gets faster (in aggregate):
- Writes, because they distribute across N primaries instead of contending on one. A well-keyed write workload scales close to linearly with shard count; this is the whole reason to shard.
- Key-scoped reads, because they parallelize: N shards each serving 1/N of the point-lookup traffic at the same per-request latency the single node had.
- Per-shard maintenance (vacuum, compaction, index rebuilds, backups), because each shard holds 1/N of the data, so the work-per-node shrinks even as total data grows.
What gets slower or more fragile:
- Scatter-gather queries. A query without the partition key fans out to all N shards, and its latency is governed by the slowest shard, not the average. This is the tail-latency amplification that surprises everyone, and it has hard numbers behind it (next section).
- Cross-shard transactions. A two-phase commit adds at least one extra network round trip and a coordinator that can stall; a saga adds compensating logic and a window of inconsistency. Either way, the transaction that was O(one node) is now O(coordination).
- Global aggregates and counts. SELECT COUNT(*) that was an index lookup is now a fan-out, a partial count per shard, and a merge. Teams cache these or maintain them incrementally precisely because the exact answer got expensive.
The tail-latency math of scatter-gather
This is the number worth memorizing. If a single shard answers a query in p99 = 10ms, and your scatter-gather query must wait for all N shards, then the query’s p99 is not 10ms; it is the p99 of the maximum of N samples. Treating shard latencies as independent, the typical (median) slowest response sits near the 1 − 1/N quantile of a single shard’s distribution, and the query’s own p99 sits near the 1 − 0.01/N quantile.
Concretely: with N = 100 shards, about 63% of queries (1 − 0.99^100) wait on at least one response slower than a shard’s p99, and about 10% (1 − 0.999^100) wait on one slower than its p99.9. So a fleet where each node has a perfectly respectable p99 of 10ms but a p99.9 of 200ms will see scatter-gather p90 around 200ms, with the scatter-gather p99 set by each shard’s p99.99. Adding shards makes scatter-gather queries worse, not better: every shard you add is another lottery ticket for the slow tail. The levers are: avoid scatter-gather by keying for single-shard access, cap fan-out width, hedge requests (send to a replica after a short delay and take the first answer), and keep per-shard tail latency tight because the fleet inherits the worst of it.
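The fan-out arithmetic can be checked in a couple of lines, assuming shard latencies are independent (real fleets are often correlated, so treat the results as a guide rather than a guarantee). The function names are invented for this sketch.

```python
def fraction_hitting_tail(n_shards: int, per_shard_quantile: float) -> float:
    """P(at least one of n independent shard responses is slower than that quantile)."""
    return 1.0 - per_shard_quantile ** n_shards


def per_shard_quantile_for(n_shards: int, fanout_quantile: float) -> float:
    """Per-shard quantile that sets the given quantile of the whole fan-out query."""
    return fanout_quantile ** (1.0 / n_shards)


print(round(fraction_hitting_tail(100, 0.99), 3))    # 0.634
print(round(fraction_hitting_tail(100, 0.999), 3))   # 0.095
print(round(per_shard_quantile_for(100, 0.99), 5))   # 0.9999
```

Read the last line as: with 100 shards, the fan-out query's p99 is governed by roughly the p99.99 of each individual shard.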
Failure modes
How sharding breaks in production, as symptom → root cause → prevention. These are the ones that actually page people.
Hot partitions. Symptom: one node pinned near 100% CPU and throttling while cluster-wide utilization reads 19% (the opening story). Root cause: a partition key with low cardinality or skewed access, such as a celebrity user, a status='pending' value that 90% of rows share, or a monotonic timestamp funneling all writes to the newest range. In DynamoDB this surfaces as throttling on one partition despite spare table capacity; in Cassandra as one replica set timing out under a wide partition. Prevention: choose a high-cardinality key, add entropy to known-hot keys (userId#bucket, sum on read), and model the skew of your top keys before committing, not after the promo.
A partition key with low cardinality or skewed distribution is a hot partition waiting for traffic. “Even on average” is not “even” — a single viral key or one dominant enum value routes a firehose at one node, and the other shards cannot help, because the data simply is not theirs to serve. You cannot scale your way out of a hot key by adding shards; the key still maps to one of them.
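The key#bucket technique looks roughly like this. It is a sketch with an in-memory dict standing in for the sharded table; `record_trip` and `trip_count` are hypothetical names, and the bucket count of 16 mirrors the fix in the opening story.

```python
import random
from collections import defaultdict

BUCKETS = 16  # sub-partitions per hot key

# Stand-in for a sharded table: each distinct key string could live on a different shard.
store = defaultdict(int)


def salted_key(key: str, bucket: int) -> str:
    return f"{key}#{bucket}"


def record_trip(city_id: str) -> None:
    """Write path: pick a random bucket so one hot city spreads over BUCKETS keys."""
    store[salted_key(city_id, random.randrange(BUCKETS))] += 1


def trip_count(city_id: str) -> int:
    """Read path: the price of salting is a BUCKETS-way read that sums the parts."""
    return sum(store.get(salted_key(city_id, b), 0) for b in range(BUCKETS))


for _ in range(10_000):
    record_trip("nyc")

print(trip_count("nyc"))  # 10000
```

The trade is explicit: writes to the hot key now spread over up to 16 partitions, and every read of that key becomes a small, bounded fan-out.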
Scatter-gather latency cliffs. Symptom: a query’s p99 is far worse than any single shard’s p99, and gets worse every time you add a node. Root cause: the query lacks the partition key, fans out to all N shards, and inherits the slowest one’s tail (the math above). Prevention: design the dominant query to be single-shard; for the unavoidable cross-cutting ones, maintain a global index keyed on the query field, cap fan-out, and consider request hedging.
Rebalancing storms. Symptom: timeouts and elevated latency cluster-wide right after adding or replacing a node. Root cause: moving partitions to the new node copies data over the network while the cluster still serves traffic; an unthrottled rebalance saturates disk and network I/O and steals it from live queries. Prevention: throttle the move rate explicitly, move whole virtual partitions (never re-split keys mid-flight), rebalance off-peak, and never let an automatic rebalancer run unbounded on a hot cluster — the relief it brings can arrive after it has caused the outage.
Cross-shard transaction gaps. Symptom: a “transfer” debits one account and never credits the other; reconciliation finds money created or destroyed weeks later. Root cause: code written for a single-node transaction silently stopped being atomic when the two rows landed on different shards. Prevention: identify every multi-row write at design time, replace it with an idempotent saga plus compensating actions, and reserve true two-phase commit for the few cases that genuinely need atomic multi-shard writes.
Secondary-index drift. Symptom: a query by a non-key field misses a row that demonstrably exists. Root cause: a global secondary index is itself partitioned on the index key, lives on different nodes than the base data, and is updated asynchronously; read it right after a write and it has not caught up. Prevention: treat global indexes as eventually consistent, design read-after-write paths to hit the base table by primary key, and document the index lag so nobody assumes freshness.
Resharding-by-accident. Symptom: a routine capacity bump triggers a multi-day data migration nobody scheduled. Root cause: the cluster was built on hash(key) % N, so changing N reshuffled the dataset. Prevention: never ship % N on a cluster you intend to grow; use consistent hashing or fixed virtual partitions from day one. This one is pure prevention; there is no graceful in-flight fix.
Scaling it
The mechanism that lets sharding scale is also where it breaks, so the levers matter more than the theory.
Secondary indexes — local vs global. A local secondary index lives on the same shard as its base rows: cheap to keep consistent (same node, same write transaction), but querying it without the partition key is still a scatter-gather, because you do not know which shard to ask. A global secondary index is partitioned independently on the index field: a query on that field hits one index shard directly, but the index updates asynchronously, so it is eventually consistent with the base table. The rule of thumb: local index when you always have the partition key in hand and just want a secondary access path within the shard; global index when you must query by a different field at scale and can tolerate index lag. DynamoDB exposes both as LSIs and GSIs precisely because the tradeoff is fundamental, not vendor-specific.
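A toy model of the two lookup paths, assuming four shards held as in-memory dicts and a users table partitioned by `user_id` (all names are illustrative). The global index write is done inline here for simplicity; in a real system it is typically asynchronous, which is where the index lag comes from.

```python
import hashlib

NUM_SHARDS = 4  # toy cluster; a fixed count keeps the example short


def shard_of(value: str) -> int:
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS


base = [dict() for _ in range(NUM_SHARDS)]          # base[shard][user_id] = row
global_index = [dict() for _ in range(NUM_SHARDS)]  # global_index[shard][email] = user_id


def put(user_id: str, email: str) -> None:
    base[shard_of(user_id)][user_id] = {"user_id": user_id, "email": email}
    # Partitioned by the indexed field, not by the base table's key.
    global_index[shard_of(email)][email] = user_id


def find_by_email_scatter(email: str):
    """No global index: every shard must be asked. Returns (row, shards_touched)."""
    hit = None
    for shard in base:
        for row in shard.values():
            if row["email"] == email:
                hit = row
    return hit, len(base)


def find_by_email_global(email: str):
    """Global index: one index shard, then one base shard. Returns (row, shards_touched)."""
    user_id = global_index[shard_of(email)].get(email)
    if user_id is None:
        return None, 1
    return base[shard_of(user_id)][user_id], 2


for i in range(100):
    put(f"user:{i}", f"user{i}@example.com")
```

The scatter path touches every shard regardless of cluster size; the global-index path touches two, at the cost of maintaining a second, independently partitioned copy of the indexed field.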
Rebalancing. Use fixed virtual partitions so rebalancing moves whole partitions and never re-splits keys. Throttle the move rate, watch per-shard I/O during moves, and prefer moving cold partitions before hot ones. The goal is for a new node to fill up over minutes-to-hours of controlled copying, not to trigger a thundering reshuffle.
Cross-shard queries. At scale you stop fighting fan-out and design around it. Denormalize so the common query is single-shard — store the data the way you read it, even if that means duplicating it. Maintain a global index for the few genuinely cross-cutting lookups. And push heavy aggregates and ad-hoc analytics to a separate store (Elasticsearch for search-shaped queries, a columnar warehouse for analytics) rather than scatter-gathering your OLTP shards, which were never meant for it.
Cross-shard transactions. Replace distributed ACID with sagas — a sequence of local transactions, each with a compensating action that undoes it if a later step fails — wherever the business logic tolerates a brief inconsistency window. Reserve two-phase commit for the rare operations that truly require atomic multi-shard writes, because 2PC’s coordinator is a synchronous dependency and a stall point. Most “we need a distributed transaction” turns out, on inspection, to be “we need an idempotent saga with good retries,” which is far cheaper to operate. The deeper consistency tradeoffs are in Consistency & Consensus and CAP Theorem & Tradeoffs.
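The saga shape described above can be sketched as a list of (action, compensation) pairs. This is a deliberately minimal version: two dicts stand in for accounts on different shards, and the durable saga log, retries, and idempotency keys that a production saga needs are left out.

```python
from typing import Callable

Step = tuple[Callable[[], None], Callable[[], None]]  # (action, compensation)


def run_saga(steps: list[Step]) -> bool:
    """Run local transactions in order; on failure, undo the completed ones in reverse."""
    compensations = []
    for action, compensate in steps:
        try:
            action()
        except Exception:
            for undo in reversed(compensations):
                undo()
            return False
        compensations.append(compensate)
    return True


# Two accounts that live on different shards (plain dicts stand in for them).
shard_a = {"alice": 100}
shard_b = {"bob": 20}


def debit(amount: int) -> None:
    if shard_a["alice"] < amount:
        raise ValueError("insufficient funds")
    shard_a["alice"] -= amount


def refund(amount: int) -> None:
    shard_a["alice"] += amount


def credit(account: str, amount: int) -> None:
    if account not in shard_b:
        raise KeyError(account)
    shard_b[account] += amount


def transfer(to_account: str, amount: int) -> bool:
    return run_saga([
        (lambda: debit(amount), lambda: refund(amount)),
        (lambda: credit(to_account, amount), lambda: None),
    ])
```

Between the debit and the refund there is a window in which the money is visibly missing; that window is the inconsistency the business logic has to tolerate.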
The routing tier. As shard count grows, where the f(key) → node decision is made becomes its own scaling question. Client-side routing (the app computes the shard) is fastest but couples every client to the topology. A proxy tier (a load balancer or a dedicated router) decouples clients from topology at the cost of an extra hop. Topology changes — added nodes, moved partitions — must propagate to whoever routes, often via a coordination service like ZooKeeper holding the authoritative map.
When to reach for it (and when not to)
Reach for sharding when a single node can no longer hold the dataset or absorb the write throughput, and you have already exhausted read replicas and vertical scaling. This is the critical ordering. Sharding is the answer to a write or capacity ceiling, not a read ceiling — if your problem is read load, read replicas and caching solve it with a fraction of the operational cost and none of the cross-shard pain.
Reach for hash partitioning when access is key-value-shaped and even load matters more than ordered scans — the common case for user-keyed OLTP data. Reach for range partitioning when ordered range scans dominate (time-series, “last N events for an entity”) and you can engineer around the write-frontier hotspot, usually by combining a high-cardinality prefix with the time component.
Don’t shard prematurely. A well-indexed PostgreSQL on a large instance, with read replicas for fan-out and a cache in front, handles far more than teams assume — millions of rows and thousands of writes per second is comfortably single-node territory in 2026. Sharding adds routing, rebalancing, cross-shard query rewrites, and a permanent resharding liability. That is real, ongoing operational cost you carry forever. Exhaust single-node tuning, indexing, read replicas, and partitioning-within-one-node first.
Don’t shard on a low-cardinality or skewed key to save yourself a migration. The hot partition you create will cost more, sooner, than the resharding you were trying to avoid. A bad partition key is worse than no sharding at all, because it gives you all of sharding’s complexity and none of its load-spreading benefit.
When to consider alternatives
- Read load, not write load → read replicas plus a cache. Most “we need to scale” is read-bound and solvable without distributing writes.
- A managed key-value store that shards for you → DynamoDB or Cassandra; you still choose the partition key, but the platform owns rebalancing and the % N problem.
- Search and relevance over a different field than your key → Elasticsearch as a secondary index, instead of scatter-gathering your OLTP shards.
- A durable, partitioned event log → Kafka, which is partitioning applied to a log and a good model for how fixed partitions behave.
- Sub-millisecond ephemeral state at scale → Redis Cluster, with the same hot-key caveat — sharding distributes keys, not load within a key.
- One huge table on one node → PostgreSQL declarative partitioning first; it buys you per-partition vacuum and pruning without leaving a single machine.
The pattern: sharding is the heavyweight answer for a genuine write/capacity ceiling. For almost everything else, a lighter tool spreads the load without making you the owner of a distributed routing problem.
Operational checklist
- Choose a partition key with high cardinality and even access; model the skew of your top keys (and your peak, not your average) before committing.
- Use consistent hashing or fixed virtual partitions from day one; never ship hash(key) % N on a cluster you intend to grow.
- Pick a virtual-partition count that comfortably exceeds your largest plausible node count; it is hard to change later.
- Alert on per-shard skew (max-shard load vs cluster average), not just cluster-wide averages — hotspots hide in the mean.
- Add entropy (suffix/bucket sharding) for known-hot keys before they go viral, and keep a directory layer to relocate a single hot key surgically.
- Throttle rebalancing and monitor per-shard I/O during partition moves; never run an unbounded auto-rebalance on a hot cluster.
- Decide local vs global secondary indexes per access pattern, and document that global indexes are eventually consistent.
- Replace cross-shard transactions with idempotent sagas where possible; reserve 2PC for the genuinely atomic cases.
- Track scatter-gather p99 separately from single-shard p99; the fleet inherits the slowest shard’s tail.
- Keep the topology map authoritative and replicated (e.g. in ZooKeeper) so routing never disagrees with reality.
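For the per-shard skew alert in the checklist, one simple signal is the hottest shard's load divided by the cluster mean. The sketch below assumes you already collect a per-shard load number (CPU here); the threshold value is hypothetical and should be tuned against your own baseline.

```python
def skew_ratio(load_by_shard: dict[str, float]) -> float:
    """Hottest shard's load divided by the mean load; 1.0 means perfectly even."""
    loads = list(load_by_shard.values())
    if not loads:
        return 1.0
    mean = sum(loads) / len(loads)
    return max(loads) / mean if mean > 0 else 1.0


SKEW_ALERT_THRESHOLD = 2.0  # hypothetical; tune against your own baseline

cpu = {"shard-0": 100.0, "shard-1": 6.0, "shard-2": 9.0, "shard-3": 7.0}
mean_cpu = sum(cpu.values()) / len(cpu)  # 30.5: looks healthy on an averaged dashboard
should_alert = skew_ratio(cpu) > SKEW_ALERT_THRESHOLD

print(mean_cpu, round(skew_ratio(cpu), 2), should_alert)  # 30.5 3.28 True
```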
Summary
Sharding is the tool you reach for when one machine can no longer hold the data or take the writes — and the moment you reach for it, you accept a permanent new job: keeping a key-to-node map balanced and stable while the data, traffic, and node count all change. The partition key is the load-bearing decision; choose it for high cardinality and even peak access, because “even on average” is the lie that hot partitions tell. Hash for balance, range for ordered scans, a directory for surgical placement — and never hash(key) % N if you ever intend to add a node, because consistent hashing and fixed virtual partitions are the only things that make resharding survivable. Single-shard operations stay fast and transactional; everything cross-shard — joins, global counts, multi-row transactions, queries that lack the key — gets slower and more fragile, and scatter-gather queries inherit the slowest shard’s tail latency, so adding nodes makes them worse. Exhaust replicas, caching, and single-node tuning first. Shard only when you must, key it carefully, design the resharding strategy before the first shard, and replace distributed transactions with sagas. Do that and sharding scales you for years. Skip a step and it scales your incident count instead.
Appendix: the rebalancing math, worked
If you want the resharding numbers to stick, work them once by hand.
Under hash(key) % N, a key with hash value k lives on shard k mod N. When N changes to N+1, the key stays put only if k mod N == k mod (N+1). For uniformly distributed hash values that holds for just 1/(N+1) of keys (those with k mod N(N+1) less than N), so moving from N to N+1 shards remaps about N/(N+1) of all keys: 80% going from 4 to 5, 90% going from 9 to 10. That is why % N is a trap: the bigger your cluster, the larger the fraction that reshuffles on each addition.
Under consistent hashing, nodes and keys both hash onto a ring of fixed size. A key belongs to the first node clockwise from it. Adding a node inserts one point on the ring and steals only the arc between it and its clockwise predecessor — on average 1/(N+1) of the keyspace, and only from one neighbor. Virtual nodes (each physical node owning many ring points) make the stolen fraction closer to the ideal 1/(N+1) and keep it evenly sourced rather than dumping it all on one neighbor.
Under fixed virtual partitions (say P = 4096), the key→partition map is frozen forever: partition = hash(key) mod P. Only the partition→node assignment moves. Adding the (N+1)-th node reassigns about P/(N+1) partitions to it, pulled proportionally from the existing nodes, and no key ever changes partitions. This is why it is the most operationally calm option: rebalancing is “copy these whole partitions,” a bounded, resumable, throttleable unit of work, with zero risk of a key being temporarily unaddressable. The one constraint is that P caps your maximum node count and sets a per-partition overhead floor, so you choose it once, generously, and live with it.