Apache Kafka

Kafka in production: a partitioned, replicated commit log where durability, ordering, and throughput are knobs you trade against each other — and the failure modes that bite when you guess wrong.

18 min readupdated 2026-06-28

On this page

Kafka gets introduced as “a message queue.” It is not a queue, and the day a team learns the difference is usually the day of an incident. A queue deletes a message once someone reads it. Kafka does not. Kafka is a partitioned, replicated, append-only commit log: messages sit on disk for a fixed retention window whether or not anyone consumed them, and each consumer tracks its own position with an offset. Reading does not consume. Offsets advance.

That one difference reshapes everything downstream — how you reason about delivery guarantees, how you recover a stuck consumer, why “the queue is backed up” is the wrong mental model, and why a slow consumer can silently lose data without a single error in the logs. If you carry queue intuitions into Kafka, you will eventually be paged for something the system did exactly as designed.

This is the long-form context article — the thing I wish someone had handed me before my first Kafka outage. It assumes you have written a producer and a consumer and now have to keep them alive under load. It leans on Database Replication for the leader/follower mechanics Kafka borrows, Consistency & Consensus for the quorum story behind leader election, and Message Queues & Event Streaming for where Kafka fits among the alternatives. If you want the broader “log vs queue vs RPC” framing first, start there and come back.

A motivating failure

A logistics company runs Kafka as the spine between their order service and a dozen downstream consumers — billing, fulfillment, analytics, fraud, notifications. It is the boring, reliable middle of the system for two years. Then Black Friday traffic arrives and the fraud consumer, which calls a slow external scoring API, starts falling behind. Lag climbs from seconds to hours.

Nobody worries, because “Kafka buffers, that’s the whole point.” Except the orders topic was provisioned with retention.ms=86400000 — 24 hours — back when daily volume was a tenth of this. Under peak load the log now fills 24 hours of retention in about nine hours. The oldest segments start getting deleted while the fraud consumer is still 11 hours behind.

When the consumer’s committed offset points at a segment that no longer exists, the broker doesn’t error. The consumer’s auto.offset.reset=latest kicks in, it jumps to the head of the log, and it resumes — having silently skipped roughly two hours of orders. No exception. No alert. The lag metric even improves, because the consumer is now caught up to a head that moved past the data it never saw.

Fraud checks for those two hours never ran. The gap surfaces three days later in a chargeback report, by which point the fraudulent orders have shipped. The post-mortem’s root cause is not a bug — it is a retention window measured against the wrong thing. Lag was monitored against the log head. It was never monitored against retention. That distinction is the difference between “we’re a bit behind” and “we lost data and didn’t know.”

The one-sentence mental model

A Kafka topic is an ordered, immutable log, split into partitions, each partition replicated across brokers to survive failure — and consumers read by advancing an offset they own, not by draining a queue.

Unpack each clause as an operational constraint:

Ordered, immutable log → you never delete or update a record; you append, and you reason about position, not presence. Compaction and retention remove old records, but a consumer’s correctness depends on reaching them in time.
Split into partitions → the partition is the unit of parallelism and the unit of ordering. Ordering holds within a partition, never across them. More partitions means more throughput and a narrower ordering guarantee.
Replicated across brokers → durability comes from copies, and the copies that count are the ones currently caught up (the ISR). The size of that set, not the replication factor on paper, determines whether an acks=all write is actually safe.
Consumers own their offset → Kafka does not track who has read what. A consumer that commits the wrong offset, or falls behind retention, breaks in ways the broker cannot see.

flowchart LR
  P[Producer] -->|hash key| MOD[partition =\nhash % count]
  MOD --> T0
  MOD --> T1
  MOD --> T2
  subgraph Topic
    T0[Partition 0\nlog]
    T1[Partition 1\nlog]
    T2[Partition 2\nlog]
  end
  T0 --> G1[Consumer A]
  T1 --> G1
  T2 --> G2[Consumer B]

A record’s partition is chosen by hash(key) % numPartitions (the default partitioner uses murmur2 over the key bytes). Same key → same partition → ordered relative to itself. No key → records spread round-robin across partitions → no ordering guarantee at all. Within a consumer group, each partition is owned by exactly one consumer at a time; that ownership is what makes parallel consumption safe without locks.

How it actually works

The log on disk

Each partition is a directory of segment files. Writes append to the active segment; once it hits segment.bytes (default 1 GB) or segment.ms, the segment is closed and a new one opens. Retention and compaction operate on whole closed segments — Kafka deletes or compacts a segment, never an individual record. This is why retention is coarse and why a slow consumer falls off a cliff rather than a slope: data is fine, fine, fine, then a whole segment vanishes at once.

flowchart LR
  W[append writes] --> S0[Segment old\nclosed]
  W --> S1[Segment active]
  S0 --> RET{age > retention?}
  RET -->|yes| DEL[delete whole\nsegment]
  RET -->|no| KEEP[keep]
  DEL --> CON[lagging consumer\noffset now gone]
  CON --> RESET[auto.offset.reset\njumps to head]

Reads are sequential scans from an offset, which is why Kafka is fast. There is no random-access index per record beyond a sparse offset index; consumers stream forward. The OS page cache does most of the heavy lifting — recently written records are usually still in memory when a healthy consumer reads them, so a well-behaved pipeline rarely touches disk on the read path. Lagging consumers, by contrast, read cold segments off disk and compete for I/O, which is its own source of broker stress.

Replication and the ISR

Each partition has one leader and some followers. Producers and consumers talk only to the leader; followers replicate by fetching from it, exactly like a consumer would. The set of replicas currently caught up to the leader is the in-sync replica set (ISR). A follower stays in the ISR as long as it has fetched up to the leader’s log end within replica.lag.time.max.ms (default 30s). Fall behind that, and the leader evicts it from the ISR.

Producer durability hinges on two settings that only work as a pair:

acks=all — the leader does not acknowledge the write until every member of the current ISR has it.
min.insync.replicas — the minimum ISR size for an acks=all write to be accepted at all. If the ISR shrinks below this, the leader rejects the write with NotEnoughReplicas rather than accept something under-replicated.

The trap is configuring one without the other. acks=all with min.insync.replicas=1 is durability theater: a single surviving replica can acknowledge a write that then disappears when that broker dies. The combination that actually holds is replication factor 3, min.insync.replicas=2, acks=all — you can lose one broker and still accept writes safely, lose two and you correctly stop accepting writes rather than risk loss.

sequenceDiagram
  participant Pr as Producer
  participant L as Leader
  participant F1 as Follower ISR
  participant F2 as Follower lagging
  Pr->>L: produce acks=all
  L->>F1: replicate
  F1-->>L: fetch ack
  Note over L,F2: F2 exceeded lag window\nevicted from ISR
  L->>L: check ISR >= min.insync
  L-->>Pr: ack

The write path and batching

Producers do not send one record per request. They accumulate records into per-partition batches, controlled by batch.size (bytes) and linger.ms (time to wait for a fuller batch). A linger.ms=0 producer optimizes for latency and sends sparse batches; raising it to 10–50 ms trades a little latency for dramatically better throughput and compression ratios. Compression (compression.type=lz4 or zstd) happens per batch, so larger batches compress better — another reason linger.ms matters.

Idempotence (enable.idempotence=true, the default in modern Kafka) makes the producer attach a sequence number per partition so the broker can dedupe retries. Without it, a producer retry after a network blip can write the same record twice. With it, retries are safe — and it is a prerequisite for transactions.

Consumers, offsets, and delivery guarantees

A consumer reads from its assigned partitions and periodically commits an offset — its bookmark for “I have processed up to here.” Where you commit, relative to where you process, is your delivery guarantee:

Commit before processing → at-most-once. A crash after commit, before processing, loses the record. No duplicates, possible loss.
Commit after processing → at-least-once. A crash after processing, before commit, reprocesses the record on restart. No loss, possible duplicates. This is the sane default.
“Exactly once” is not free. It requires the transactional producer plus isolation.level=read_committed on consumers, and it only covers Kafka-to-Kafka flows. The moment you write to an external system (a database, an email API), you are back to needing idempotent processing downstream. Make the effect idempotent and at-least-once becomes operationally equivalent to exactly-once, without the complexity.

sequenceDiagram
  participant C as Consumer
  participant B as Broker
  participant DB as Downstream
  C->>B: poll records
  B-->>C: batch at offset N..N+9
  C->>DB: process records
  DB-->>C: ok
  C->>B: commit offset N+10
  Note over C,B: crash before commit\nreplays N..N+9 = duplicates

The tradeoffs that bite

These are the decisions that look free at design time and bill you later.

Decision	Cheap option	What it actually costs
Partition count	Few partitions	Caps consumer parallelism; raising it later rehashes keys and breaks per-key ordering for existing data
`acks`	`acks=1`	Fast, loses data when a leader fails before followers replicate
`min.insync.replicas`	`1`	Makes `acks=all` meaningless — a single replica can ack a doomed write
Retention	Short window	Slow consumers silently skip deleted segments with no error
`unclean.leader.election`	`true`	Trades correctness for availability — an out-of-sync replica becomes leader and truncates committed data
Consumer `max.poll.interval.ms`	Default 5 min	Slow processing exceeds it → consumer ejected → rebalance storm

Two of these surprise people most. The partition count is effectively permanent for any keyed topic: you can add partitions, but hash(key) % N changes for every key when N changes, so a key that lived on partition 3 now maps to partition 7, and its ordering relative to its own history is gone. Size partitions for peak throughput and tolerable ordering scope on day one. The other is the realization that you cannot have all three of high throughput, strict ordering, and strong durability at full strength simultaneously — every Kafka design is a point on that triangle.

Throughput and latency: the levers

Kafka’s reputation for speed is earned, but the defaults are tuned for safety, not raw numbers. The levers, roughly in order of impact:

Partition count — the ceiling on consumer parallelism. A topic with 6 partitions can be consumed by at most 6 active consumers in a group; a 7th sits idle. Provision partitions at 2–3× your expected peak consumer count to leave room to scale out.
Producer batching — linger.ms=10–50 plus a generous batch.size (e.g. 131072) turns thousands of tiny requests into a handful of fat, compressed ones. This is often a 3–5× throughput swing with negligible latency cost.
Compression — zstd or lz4 cuts network and disk by half or more on text-heavy payloads (JSON events compress beautifully). It costs producer CPU; measure both.
fetch.min.bytes / fetch.max.wait.ms on the consumer — let the consumer wait for a fuller fetch instead of hammering the broker with tiny requests under light load.
Replication factor and acks — acks=all adds a round trip to the slowest ISR follower. This is the durability tax, and it is worth paying for a system of record; just know it is the floor on your produce latency.

The thing to internalize: Kafka latency is dominated by the slowest required acknowledger. With acks=all, a single GC-pausing follower drags every produce p99 up with it. Throughput, by contrast, is dominated by batching and partition count. They are tuned with different knobs.

Failure modes

This is the part that pages you. Symptoms → root cause → prevention.

1. ISR shrink death spiral. Symptom: acks=all producers start failing with NotEnoughReplicas, throughput collapses, back-pressure stalls the whole pipeline. Root cause: one follower falls behind (GC pause, slow disk, network), gets evicted from the ISR; the ISR drops below min.insync.replicas; the leader correctly refuses writes. Prevention: size replication factor 3 and min.insync.replicas=2 so you tolerate one slow replica; alarm on UnderReplicatedPartitions > 0; chase the root cause (broker disk, GC) rather than lowering min.insync.replicas in a panic, which just trades a stall for silent data loss.

2. Rebalance storm. Symptom: a consumer group spends all its time reassigning partitions and almost none consuming; lag climbs steadily; logs are full of “rebalancing.” Root cause: a consumer’s processing exceeds max.poll.interval.ms, so the group coordinator assumes it died and triggers a rebalance, which pauses everyone, which makes the next poll late too — a feedback loop. Prevention: keep per-poll work bounded (max.poll.records), make processing fast or asynchronous, raise max.poll.interval.ms deliberately if work is genuinely long, and use cooperative sticky assignment (cooperative-sticky) so rebalances don’t stop the world.

3. Silent data loss via retention. Symptom: a consumer “catches up” but downstream data has gaps; no errors anywhere. Root cause: the consumer fell behind retention.ms/retention.bytes; the segments it needed were deleted; auto.offset.reset jumped it to the head, skipping data. This is the opening story. Prevention: monitor consumer lag against retention time, not just against head; size retention to comfortably exceed worst-case consumer downtime plus a margin; alarm when lag exceeds, say, 50% of the retention window. Consider auto.offset.reset=none for critical consumers so they fail loudly instead of skipping.

4. Unclean leader election. Symptom: after a broker failure, records that producers saw acknowledged are gone. Root cause: unclean.leader.election.enable=true let an out-of-sync replica become leader to restore availability; that replica never had the latest writes, so the log was truncated to its shorter length. Prevention: keep unclean.leader.election.enable=false (the modern default) unless you have explicitly chosen availability over correctness and documented that the topic can lose data.

5. Hot partition. Symptom: one partition’s consumer is pinned at 100% while others idle; lag is wildly uneven. Root cause: a skewed key (one tenant, one viral entity) routes a disproportionate share of traffic to a single partition. Sharding distributes keys, not load within a key — the same wall described in Sharding & Partitioning and Consistent Hashing. Prevention: choose a higher-cardinality key, add a salt to the key for hot entities (accepting weaker ordering for them), or split the hot tenant onto its own topic.

Retention is not a queue drain. A record persists for exactly retention.ms/retention.bytes regardless of whether anyone read it, and a slow consumer gets no backpressure — it gets silently skipped data once its offset falls off the oldest segment. Monitor lag against retention, not against the log head. The day those two numbers diverge is the day you lose data without an alert.

The sharpest incident walk-throughs — ISR shrink, rebalance storms, unclean leader election — are on the roadmap as deep-dives; the prevention summaries above are the operational core.

Scaling it

What changes as volume grows by 10× and 100×:

Partitions are the unit of parallelism, and the first wall. Consumer throughput caps at the partition count. When you need more parallel consumers than partitions, you needed more partitions — and adding them to a keyed topic is a migration (rehashed keys, broken ordering), not a config flip. Over-provision early; the cost of spare partitions is controller metadata and a little end-to-end latency, far cheaper than a repartition under fire.
Leadership balance. Each partition leader lives on one broker. If leaders cluster on a few brokers, those brokers become hotspots while others idle. Run periodic leader rebalancing (kafka-leader-election / auto leader rebalancing) and watch per-broker leader counts, not just aggregate throughput.
Tiered storage. Modern Kafka can offload closed segments to object storage (Object Storage & Large Blobs) while keeping recent data local. This decouples retention from local disk, letting Kafka act as a long-lived, replayable source of truth without provisioning petabytes of broker SSD. The tradeoff is slower reads for cold offsets and longer recovery scans.
KRaft replaces ZooKeeper. Older clusters stored cluster metadata in an external ZooKeeper quorum; modern Kafka uses KRaft, an internal Raft-based metadata quorum, removing a whole external dependency and its operational burden. If you are still on ZooKeeper-backed Kafka, plan the migration — it removes a class of “ZooKeeper session expired” incidents entirely.
Multi-cluster and geo. Beyond a single cluster, MirrorMaker 2 or a managed equivalent replicates topics across regions. This is its own database-replication-style consistency problem: cross-region is asynchronous, offsets don’t map one-to-one, and you design for eventual convergence, not a global order.

When to reach for it (and when not to)

Reach for Kafka when you need a durable, replayable event log; high fan-out to many independent consumers reading at their own pace; a stream-processing substrate; or a decoupling backbone between services where producers and consumers evolve independently. The replay capability is the quiet superpower — being able to reset a consumer group to an old offset and reprocess a week of events has saved more bad deploys than I can count.

Don’t reach for it as a synchronous RPC bus — request/response over a log is painful and slow; use API design patterns and direct calls. Don’t use it as a low-latency task queue for a handful of jobs — Redis or a broker behind Celery is simpler and faster at small scale. Don’t treat it as a database you query by key — it has no random access by content, only by offset. And don’t reach for it when your whole system is one producer and one consumer at modest volume; you are paying a distributed-systems tax for capabilities you aren’t using.

When to consider alternatives

Low-latency task queue, modest volume → Celery on a broker, or Redis streams/lists.
Simple point-to-point or pub/sub without replay → a traditional broker; see Message Queues & Event Streaming for the comparison.
Request/response between services → direct calls behind an API Gateway, not an event log.
Durable system of record you query by key → PostgreSQL, Cassandra, or DynamoDB.
Cluster metadata / coordination primitives → ZooKeeper & etcd directly, rather than bending Kafka into a config store.

Operational checklist

Set replication factor >= 3, min.insync.replicas=2, and producer acks=all together — none of them is safe alone.
Keep unclean.leader.election.enable=false unless you have explicitly accepted data loss for that topic.
Monitor consumer lag against retention time, not just against the log head; alarm at ~50% of the retention window.
Alarm on UnderReplicatedPartitions > 0 and on ISR shrink events; chase the slow broker, don’t lower min.insync.replicas.
Set producer enable.idempotence=true and make consumers idempotent so at-least-once is operationally safe.
Use cooperative-sticky partition assignment and bound max.poll.records to avoid rebalance storms.
Size partitions for peak throughput and ordering scope before launch; repartitioning a keyed topic is a migration.
Tune linger.ms (10–50ms) and batch.size with compression (lz4/zstd) for throughput; measure produce p99 separately.
Watch per-broker leader balance and controller health, not just aggregate throughput.
Set auto.offset.reset=none for critical consumers so falling off retention fails loudly instead of skipping data.

Summary

Kafka is a partitioned, replicated, append-only log — not a queue — and almost every production scar traces back to forgetting that. Reading doesn’t consume; offsets advance; retention deletes data on a timer whether or not anyone read it. Durability is the joint behavior of acks=all, min.insync.replicas, and replication factor, and getting one of the three wrong quietly defeats the other two. Ordering lives inside a partition and nowhere else, and the partition count you pick on day one is effectively permanent for keyed data. Learn the ISR, monitor lag against retention rather than head, keep unclean leader election off, make consumers idempotent, and size partitions with room to grow. Do that and Kafka becomes the dependable middle of your system — which is exactly where you want it to be invisible.

Appendix: log, offset, and consumer-group refresher

If the body assumed terms worth restating:

Log — an ordered, append-only sequence of records. New records go to the end; existing records are never modified. Position is everything.
Offset — a monotonically increasing integer identifying a record’s position within a partition. Consumers store their offset; the broker does not track per-consumer progress beyond the committed offset they report.
Partition — one physical log. A topic is a collection of partitions. Ordering is guaranteed within a partition only.
Consumer group — a set of consumers that cooperatively read a topic, with each partition assigned to exactly one member. Add members up to the partition count to scale; beyond that, they idle.
ISR (in-sync replicas) — the replicas currently caught up to the leader within replica.lag.time.max.ms. Only ISR members can be safely elected leader (with unclean election off), and only they count toward min.insync.replicas.

Incidents & deep-dives

Where this system breaks in production — and how it comes back.

Documenting next

🔒 ISR Shrink: The Replication Death Spiralroadmap →
🔒 Rebalance Storm: Consumers Stuck In A Looproadmap →
🔒 Unclean Leader Election & Silent Data Lossroadmap →