
Kubernetes

Kubernetes in production as a reconciliation engine: control loops, etcd, the scheduler's lies, probes, QoS and OOMKill, cascading eviction, autoscaling lag, and where it earns its complexity.

24 min read · updated 2026-06-28

Kubernetes gets sold as “just hand us your container, we’ll handle the rest.” That framing survives exactly until the first night a single node hits memory pressure, the kubelet evicts a pod to save itself, the rescheduled pod lands on a second node and tips it into pressure, and ninety seconds later a third of your fleet is Evicted while the control plane calmly reports that everything is doing precisely what you asked. Nothing is broken. Every component is behaving correctly. That is the part that takes people years to accept: Kubernetes is a set of controllers that continuously diff desired state against observed reality and act to close the gap — and “act” includes killing your workload when a node is in trouble.

It is not a deploy tool. It is a reconciliation engine that happens to be very good at deploys. Everything you touch — Deployments, Services, HorizontalPodAutoscalers — is a desired-state record sitting in etcd that some controller is busy trying to make true. There is no central brain issuing orders. There is a swarm of small loops, each watching one slice of the world and nudging it toward what you declared. Internalize that and most of the “why did Kubernetes do that” mysteries dissolve into “a controller saw a diff and closed it.”

This is the long-form context article — the thing I wish someone had handed me before my first cluster-wide eviction storm. It assumes you have shipped something real and been surprised by how the cluster behaved under load. It leans on Observability (you cannot operate what you cannot see), Load Balancing (how Services and Ingress actually move traffic), and the broader roadmap for where this fits in a system. The deep failure mode that defines Kubernetes at scale — the cascading eviction — gets its own treatment below, because it is the one that pages you.

The single biggest mistake teams make is treating Kubernetes like a smarter VM scheduler that “figures out” resources for you. It does not figure them out. It does exactly what your manifests say, and if your resource specs lie to the scheduler, the scheduler will pack nodes that have no real headroom and then dismantle your fleet to fix the math you broke.

A motivating failure

A mid-size SaaS team runs about sixty services on a managed cluster. For a year it is uneventful. Then a marketing push triples traffic on a Tuesday afternoon. One service — an image-processing worker — has no memory limit set, because “it’s hard to predict and Kubernetes handles it.” Under the spike it starts buffering large uploads in memory and climbs from its usual 300 MB to 4 GB.

That worker is BestEffort QoS: no requests, no limits. When its node crosses the memory-pressure eviction threshold, the kubelet does what it is designed to do — it ranks pods by QoS and evicts the cheapest first to reclaim memory. But the kubelet evicts several pods, not just the offender, because it reclaims in bulk until it’s safely under threshold. Those evicted pods are still desired, so their controllers immediately reschedule them. The scheduler looks for nodes with free requests, finds a neighbor that looks half-empty on paper — because half its pods also under-set their requests — and packs the refugees there.

That neighbor’s real usage was already near capacity; the scheduler just couldn’t see it. It tips into pressure. Its kubelet starts evicting. The wave moves node to node faster than the cluster autoscaler can add capacity, which takes three to four minutes to boot a node anyway. Within ten minutes, kubectl get pods is a wall of Evicted and Pending, p99 across unrelated services has gone to timeouts, and the on-call engineer is staring at green control-plane dashboards wondering what’s actually broken.

Nothing was broken. One service lied to the scheduler about how much memory it needed, and the scheduler’s bin-packing math was wrong cluster-wide as a result. The outage lived entirely in the gap between reserved capacity and real capacity. That gap is the thing this article is about.

The one-sentence mental model

You declare what you want (replicas: 3, memory: 512Mi), Kubernetes persists it to etcd, and a swarm of independent controllers each watch their slice of that state and take one action at a time until reality matches — forever, in a loop.

Every clause is an operational constraint:

  • You declare what you want → Kubernetes is declarative, not imperative. You don’t tell it to start a pod; you tell it three should exist, and a controller makes that true. kubectl delete pod rarely does what beginners expect, because the desired state still says three.
  • Persists to etcd → there is exactly one source of truth, a consistent replicated key-value store, and if it is slow or unhealthy the entire cluster is slow or frozen.
  • A swarm of independent controllers → no central orchestrator. Each loop is dumb and narrow on purpose; the intelligence is emergent. This is why it self-heals and also why it will evict, throttle, and reschedule on its own judgment.
  • Until reality matches, forever → the loop never stops. The same mechanism that recreates a crashed pod will, given bad inputs, recreate a crashing pod into a CrashLoopBackOff you can’t stop by deleting it.
```mermaid
flowchart TB
  U[kubectl apply\ndesired state] --> API[kube-apiserver]
  API --> E[(etcd\nsource of truth)]
  API --> SCH[scheduler\nassigns pods]
  API --> CM[controller-manager\nreconcile loops]
  SCH --> API
  CM --> API
  API --> KL[kubelet\non each node]
  KL --> CRI[container runtime\nruns pods]
  KL -->|status| API
```

The control plane is four things worth naming precisely. kube-apiserver is the only component that talks to etcd: it is the front door, and everything else (kubelets, controllers, your kubectl) goes through it. etcd is the consistent, replicated key-value store that is the cluster state; lose it without a backup and the cluster’s memory is gone. The scheduler decides which node a pending pod lands on, based on requests, affinity, taints, and topology. The controller-manager is the bundle of reconciliation loops (Deployment, ReplicaSet, Node, endpoint, and namespace controllers), each running the same watch-diff-act cycle. On every worker node, the kubelet is the agent that pulls pod specs from the API server, drives the container runtime to run them, and reports status back. Nothing bypasses the API server. The consistency it relies on underneath is the same family of problem covered in Consistency & Consensus: etcd is a Raft cluster, and its quorum math is your control plane’s availability ceiling.

How it actually works

The reconciliation loop

There is no master process handing out commands. Each controller runs the identical loop: watch desired state via the API server, observe actual state, compute the diff, take one action to shrink it, repeat. You set replicas: 3; the ReplicaSet controller sees two pods running and creates one. You delete a pod by hand; the controller observes a shortfall against the still-desired three and makes another within seconds.

This is the mental flip that unlocks Kubernetes. You are not operating on pods. You are editing the desired-state record, and the controllers do the rest. Want a pod gone for good? Scale the Deployment or delete it, not the pod. Fighting a controller by hand always loses, because the loop never tires and you do.

A useful corollary: the loops are level-triggered, not edge-triggered. They don’t react to events so much as continuously converge on a target. Miss an event during a network blip and it doesn’t matter — the next reconciliation reads current state and corrects. This is why Kubernetes is robust to its own components restarting, and why “did the event fire?” is rarely the right debugging question. “What does the desired state say versus what exists?” is.
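The watch-diff-act cycle can be sketched in a few lines. This is a toy model, not the ReplicaSet controller: `ClusterState`, `reconcile_once`, `converge`, and the `web-N` pod names are all hypothetical, and the "cluster" is just a list in memory. What it tries to show is the level-triggered property: the loop never needs to hear about the deletion, it only compares the current count against the desired count.

```python
from dataclasses import dataclass, field


@dataclass
class ClusterState:
    """Toy stand-in for observed reality: the names of running pods."""
    pods: list = field(default_factory=list)
    next_id: int = 0


def reconcile_once(desired_replicas: int, state: ClusterState) -> str:
    """Compare desired vs. observed and take at most one corrective action."""
    running = len(state.pods)
    if running < desired_replicas:
        name = f"web-{state.next_id}"
        state.next_id += 1
        state.pods.append(name)
        return f"create {name}"
    if running > desired_replicas:
        return f"delete {state.pods.pop()}"
    return "noop"


def converge(desired_replicas: int, state: ClusterState, max_steps: int = 100) -> list:
    """Repeat single-step reconciliation until there is no diff left."""
    actions = []
    for _ in range(max_steps):
        action = reconcile_once(desired_replicas, state)
        if action == "noop":
            break
        actions.append(action)
    return actions


state = ClusterState()
print(converge(3, state))    # three create actions
state.pods.remove("web-1")   # a pod is deleted by hand; no event reaches the loop
print(converge(3, state))    # the next pass sees 2 of 3 and creates a replacement
```

Deleting the pod changes nothing about the desired count, so the next pass simply recreates it. Changing the first argument to `converge` is the only edit that sticks.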

Pod lifecycle and the three probes

A pod moves through phases: Pending → Running → Succeeded or Failed. Pending means the scheduler hasn’t placed it yet, or it’s placed but images are still pulling. Once Running, the kubelet runs up to three probes against each container, and the most common production mistake is not understanding that they do genuinely different jobs.

  • livenessProbe — “is this container wedged and unrecoverable?” Failing it makes the kubelet kill and restart the container. Set it too aggressive and you restart healthy-but-slow pods straight into CrashLoopBackOff.
  • readinessProbe — “can this pod serve traffic right now?” Failing it removes the pod from its Service’s endpoint list but does not kill it. This is the probe that controls whether traffic reaches you. A pod can be alive and not ready (warming a cache, draining for shutdown).
  • startupProbe — “has the app finished booting?” While it’s failing, the liveness probe is suppressed. This exists so a JVM that takes ninety seconds to warm doesn’t get liveness-killed at second thirty and loop forever.
```mermaid
sequenceDiagram
  participant S as Scheduler
  participant K as kubelet
  participant C as Container
  participant EP as Endpoints
  S->>K: pod assigned
  K->>C: start container
  loop startupProbe
    K->>C: booted yet?
  end
  C-->>K: startup OK
  K->>EP: add pod (ready)
  Note over EP: traffic flows
  K->>C: livenessProbe periodic
  C--xK: liveness fails
  K->>EP: remove pod
  K->>C: kill and restart
```

The classic self-inflicted outage lives in the readiness probe. A team writes a /health endpoint that checks the database, then points the readiness probe at it. The database has a two-second blip. Every pod’s readiness fails simultaneously, every pod leaves its Service endpoints at once, and the cluster has taken itself fully offline over a transient hiccup that the app could have ridden out. Readiness should test the pod’s own ability to serve, not the health of its dependencies. Liveness even more so — a liveness probe coupled to a downstream dependency turns a blip into a restart storm.
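The correlated-failure effect is easy to see in a toy model. The sketch below is illustrative only: `ready_endpoints` and its parameters are hypothetical names, and a "probe round" is reduced to a boolean per pod. It compares a readiness check that also requires the database with one that tests only the pod itself.

```python
def ready_endpoints(pods_can_serve: list, db_up: bool, gate_on_db: bool) -> int:
    """Count the pods a Service would still route to after one probe round.

    pods_can_serve[i] is whether pod i can answer requests on its own.
    gate_on_db models a readiness endpoint that also checks the database.
    """
    ready = 0
    for can_serve in pods_can_serve:
        passes = can_serve and (db_up or not gate_on_db)
        if passes:
            ready += 1
    return ready


pods = [True] * 10  # ten healthy pods

# Two-second database blip, readiness coupled to the database:
print(ready_endpoints(pods, db_up=False, gate_on_db=True))

# Same blip, readiness tests only the pod's own ability to serve:
print(ready_endpoints(pods, db_up=False, gate_on_db=False))
```

With the coupled probe every pod fails the same round, so the endpoint list empties at once. With the local probe the pods stay in rotation and the application decides how to handle the missing database per request.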

Requests, limits, and QoS classes

Two numbers per resource decide your pod’s fate, and they are the most consequential lines in any manifest. requests is what the scheduler reserves — it will only place a pod on a node with that much allocatable capacity free, and it’s the floor you’re guaranteed. limits is the ceiling — exceed your CPU limit and you get throttled (slower but alive); exceed your memory limit and you get OOMKilled with exit code 137. The relationship between the two assigns a QoS class, and QoS decides eviction order when a node runs short.

| QoS class | Condition | Eviction priority |
| --- | --- | --- |
| Guaranteed | requests == limits for every resource, every container | Evicted last |
| Burstable | requests set, limits higher or unset | Evicted after BestEffort |
| BestEffort | no requests or limits at all | Evicted first |
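The classification rule can be written down as a small function. This is a sketch under assumptions, not the kubelet's code: `qos_class` is a hypothetical helper, containers are plain dicts, only `cpu` and `memory` are considered, and quantities are compared as literal strings (so `"500m"` and `"0.5"` would count as different). It does model one detail worth knowing: a container that sets only limits has its requests defaulted to those limits.

```python
def qos_class(containers: list) -> str:
    """Classify a pod from its containers' cpu/memory requests and limits.

    Each container is {"requests": {...}, "limits": {...}}; both keys optional.
    """
    any_set = False
    all_guaranteed = True
    for container in containers:
        limits = container.get("limits", {})
        requests = container.get("requests", {})
        for resource in ("cpu", "memory"):
            limit = limits.get(resource)
            # A missing request falls back to the limit, if there is one.
            request = requests.get(resource, limit)
            if limit is not None or request is not None:
                any_set = True
            if limit is None or request != limit:
                all_guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if all_guaranteed else "Burstable"


worker = [{}]  # the image-processing worker: nothing set
api = [{"requests": {"cpu": "500m", "memory": "512Mi"},
        "limits": {"cpu": "500m", "memory": "512Mi"}}]
batch = [{"requests": {"memory": "256Mi"}}]

print(qos_class(worker), qos_class(api), qos_class(batch))
```

Note that a single container with a missing limit is enough to move the whole pod out of Guaranteed.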

The crucial asymmetry: CPU is compressible, memory is not. Over your CPU limit, the kernel’s CFS throttler just gives you fewer cycles — you slow down, you don’t die. There is no equivalent for memory. You cannot run “a little over” your memory limit; the kernel’s OOM killer terminates the container the instant it crosses, and the kubelet records exit 137. A pod with no memory limit at all can balloon and drag the entire node into pressure, which is the trigger for the eviction logic that defines the failure mode below.

```mermaid
flowchart TD
  N[node memory\nclimbing] --> T{crosses eviction\nthreshold?}
  T -->|no| OK[healthy\nscheduler math holds]
  T -->|yes| R[kubelet ranks pods\nby QoS]
  R --> BE[evict BestEffort\nfirst]
  BE --> BU[then Burstable\nover requests]
  BU --> RS[pods still desired\ncontrollers reschedule]
  RS --> NX[land on neighbor\nnode]
  NX --> N
  style RS fill:#e11d48,color:#fff
  style N fill:#171717,color:#fff
```

The tradeoffs that bite

These are the decisions that look free at manifest-writing time and bill you during an incident.

| Decision | The free-looking choice | What it actually costs |
| --- | --- | --- |
| No requests/limits | “Kubernetes figures it out” | BestEffort QoS: first evicted; can starve a node and lie to the scheduler |
| requests == limits everywhere | “Safest from eviction” | Lower bin-packing density; you pay for reserved-but-idle capacity |
| Aggressive livenessProbe | “Fast self-healing” | Restarts slow-but-healthy pods into CrashLoopBackOff |
| Readiness gated on a dependency | “Don’t serve if backend is down” | Correlated failure: every pod leaves endpoints at once |
| Tight memory limit | “Dense packing” | OOMKill loops under legitimate spikes |
| No PodDisruptionBudget | “Simpler manifests” | One node drain can take every replica down together |
| Big single namespace, no quotas | “Less overhead” | One team’s runaway job evicts another team’s pods |

The pattern mirrors most distributed systems: the setting that maximizes density and simplicity — no limits, tightly packed nodes, no disruption budget — is the same setting that maximizes correlated failure. You buy resilience with reserved headroom and explicit budgets, and that headroom costs money you can see on the bill, while the outage it prevents is invisible right up until the afternoon it isn’t. The honest engineering move is to make that trade deliberately per workload, not to default into BestEffort because setting numbers is annoying.

Two rows deserve emphasis. The requests == limits choice (Guaranteed QoS) is genuinely the safest from eviction, but if you apply it blanket-wide you destroy the economic argument for Kubernetes — you’re now paying for every pod’s peak as if it were constant, which is just expensive VMs with extra steps. Reserve Guaranteed for the workloads that must not be evicted. And the readiness-gated-on-dependency mistake is so common it deserves a rule: a probe failure should be something the pod can fix by restarting or waiting. If the fix is “the database comes back,” the probe is testing the wrong thing.

Scheduling and resource performance

“Performance” in Kubernetes is rarely about the orchestrator’s own speed — the control plane adds milliseconds, not seconds, to steady-state operation. It’s about whether your capacity math is honest, because that determines packing density, eviction risk, and how fast you can absorb a spike.

What the scheduler is fast at: placing pods when requests reflect reality. The scheduler filters nodes that can’t fit the request, scores the survivors (spread, affinity, least-allocated), and binds. On a healthy cluster this is sub-second per pod. It scales to thousands of nodes because each decision is local and cheap.

Where it gets slow and dangerous: when requests are wrong. If requests are set too low, the scheduler over-packs — nodes look empty on paper while real usage is at the ceiling, and you get the eviction cascade. If requests are set too high, the scheduler under-packs — pods sit Pending because no node has the reserved room, even though every node is 70% idle. The single highest-impact tuning activity in Kubernetes is getting requests close to real p95 usage. Use the Vertical Pod Autoscaler in recommendation mode, or just read actual usage from your observability stack over a representative week.
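If you go the "read actual usage" route, the calculation is short. The sketch below assumes you have already exported per-container memory samples in MiB from your observability stack; `percentile`, `suggest_memory`, and the 1.2 headroom factor are illustrative choices, not anything Kubernetes or the Vertical Pod Autoscaler prescribes. It sets the request at p95 and the limit at p99 plus headroom.

```python
import math


def percentile(samples: list, pct: int) -> float:
    """Nearest-rank percentile: the smallest sample with pct% of samples at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, (pct * len(ordered) + 99) // 100)  # integer ceiling of pct*n/100
    return ordered[rank - 1]


def suggest_memory(samples_mib: list, headroom: float = 1.2) -> dict:
    """Request near p95 so scheduler math is honest; limit above p99 with headroom."""
    return {
        "request_mib": math.ceil(percentile(samples_mib, 95)),
        "limit_mib": math.ceil(percentile(samples_mib, 99) * headroom),
    }


# Synthetic week of samples: 1 MiB through 100 MiB, one of each.
samples = list(range(1, 101))
print(suggest_memory(samples))
```

The sample window matters more than the formula: a week that misses your monthly batch job or a traffic peak will produce a request that looks honest and is not.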

The levers that actually move outcomes, in rough order of impact:

  1. Accurate requests. Everything downstream — scheduling, bin-packing, HPA utilization math — is computed against requests. Wrong requests poison all of it.
  2. Memory limits on every container. This is the blast-radius cap. One missing memory limit is how a single pod takes a node hostage.
  3. Node --system-reserved / --kube-reserved. Carve out capacity for the kubelet, container runtime, and OS so the system daemons don’t get squeezed into the eviction path themselves. Allocatable capacity should never be 100% of the node.
  4. Pod topology spread + anti-affinity. Spread replicas across nodes and zones so one node or zone failure doesn’t take a whole service. Cheap insurance, frequently skipped.
  5. Right-sized nodes. A few large nodes pack densely but have a big blast radius and slow autoscaling granularity; many small nodes scale smoothly but waste more on per-node reserved overhead. Match node size to your pod size distribution.

A concrete number to anchor on: if your eviction threshold is memory.available<500Mi (the upstream kubelet default is 100Mi; some distributions set it higher) and your node has 16 GB, the kubelet starts evicting when real usage crosses ~15.5 GB, regardless of what requests summed to. So if your pods’ requests summed to 14 GB but their real usage is 15.5 GB, the scheduler thought there was 2 GB free and there was actually nothing left before eviction. That 1.5 GB lie, multiplied across a fleet, is the cascade.
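The same arithmetic as a worked example, using the figures from the paragraph above converted to MiB. `headroom_mib` is a hypothetical helper, and like the paragraph it ignores system-reserved and kube-reserved capacity to keep the numbers simple.

```python
def headroom_mib(capacity: int, requested: int, used: int, eviction_threshold: int) -> dict:
    """Compare the scheduler's view of free memory with the kubelet's."""
    return {
        # What the scheduler believes is free: capacity minus summed requests.
        "free_on_paper": capacity - requested,
        # What is really left before the kubelet starts evicting.
        "free_before_eviction": capacity - used - eviction_threshold,
        # Usage nobody told the scheduler about.
        "unreserved_usage": used - requested,
    }


GIB = 1024  # MiB per GiB

result = headroom_mib(
    capacity=16 * GIB,          # 16 GB node
    requested=14 * GIB,         # what the pods' requests sum to
    used=int(15.5 * GIB),       # what they actually consume
    eviction_threshold=500,     # memory.available<500Mi
)
print(result)
```

The scheduler sees 2048 MiB free; the kubelet is 12 MiB away from evicting. The 1536 MiB difference between requests and usage is the gap the article is about.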

Failure modes

The recurring ones, in rough order of how often they page people. Each is symptom → root cause → prevention.

The cascading eviction. Symptom: Evicted pods piling up across many nodes, climbing latency on unrelated services, green control-plane dashboards. Root cause: under-set or missing requests let real node usage drift far above what the scheduler reserved; one node tips into memory pressure, the kubelet evicts in bulk, controllers reschedule the refugees onto neighbors that were also secretly full, and the wave propagates faster than the autoscaler can add nodes. Prevention: honest requests, memory limits everywhere, reserved node capacity.

OOMKill loops. Symptom: restart count climbing, exit code 137 in kubectl describe pod, CrashLoopBackOff with exponential backoff between restarts. Root cause: memory limit set below real peak usage; the container hits the ceiling, gets killed, restarts, hits it again. Prevention: set the limit above observed p99 memory with headroom; fix actual leaks rather than papering over them with a higher limit.
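Two details in that symptom are worth decoding. Exit code 137 follows the usual shell convention of 128 plus the signal number, and signal 9 is SIGKILL, which is what the OOM killer sends. The restart delay in CrashLoopBackOff doubles from 10 seconds up to a five-minute cap. The helpers below are hypothetical names for illustration, and the signal lookup assumes a POSIX system.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Exit codes above 128 conventionally mean 'killed by signal (code - 128)'."""
    if code > 128:
        try:
            return f"killed by {signal.Signals(code - 128).name}"
        except ValueError:
            # Not a signal number this platform knows about.
            return f"killed by signal {code - 128}"
    return f"exited with status {code}"


def crashloop_delays(restarts: int, base: int = 10, cap: int = 300) -> list:
    """Seconds waited before each restart: doubling from base, capped at cap."""
    return [min(base * 2 ** i, cap) for i in range(restarts)]


print(describe_exit_code(137))
print(crashloop_delays(7))
```

The cap is why a pod stuck in an OOMKill loop appears to "settle" at one restart every five minutes: the loop has not stopped, it has only slowed down.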

Readiness-probe blackout. Symptom: sudden full outage of a service with no crashes — every pod alive but Service endpoints empty, clients getting connection refused or 503. Root cause: readiness probe coupled to a shared downstream dependency that blipped. Prevention: readiness tests the pod’s own serving ability, never a dependency’s health.

Liveness restart storm. Symptom: healthy pods getting killed mid-work, restart counts climbing during traffic peaks or GC pauses. Root cause: liveness timeout shorter than a legitimate slow path (long GC, cold cache, slow startup). Prevention: generous liveness timeouts plus a startupProbe to cover boot time.

PDB deadlock during maintenance. Symptom: a node drain or cluster upgrade stalls indefinitely, never completing. Root cause: a PodDisruptionBudget with minAvailable set so that the drain can never satisfy it (e.g. minAvailable: 3 on a 3-replica Deployment means zero pods can ever be voluntarily evicted). Prevention: set PDBs as a percentage or leave headroom (maxUnavailable: 1); the opposite mistake — no PDB — lets a drain evict every replica at once.
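The deadlock is visible in the budget arithmetic before you ever run a drain. The sketch below is a simplified model with a hypothetical `disruptions_allowed` helper: it handles integer values only (percentages involve rounding rules not modelled here) and expects exactly one of `min_available` or `max_unavailable`.

```python
from typing import Optional


def disruptions_allowed(
    healthy: int,
    expected: int,
    min_available: Optional[int] = None,
    max_unavailable: Optional[int] = None,
) -> int:
    """How many pods a drain may voluntarily evict right now."""
    if (min_available is None) == (max_unavailable is None):
        raise ValueError("set exactly one of min_available or max_unavailable")
    if min_available is not None:
        must_stay_healthy = min_available
    else:
        must_stay_healthy = expected - max_unavailable
    return max(0, healthy - must_stay_healthy)


# minAvailable: 3 on a 3-replica Deployment: a drain can never make progress.
print(disruptions_allowed(healthy=3, expected=3, min_available=3))

# maxUnavailable: 1 leaves room for one eviction at a time.
print(disruptions_allowed(healthy=3, expected=3, max_unavailable=1))
```

A result of zero with every replica healthy is the signature of a budget that will stall maintenance. A zero while one replica is already down is the budget doing its job.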

The cascading eviction is almost always one root cause wearing a costume: you lied to the scheduler about how much you need. BestEffort and loosely-Burstable pods let a node’s real usage drift far above what the scheduler reserved, so it keeps packing nodes that have no actual headroom, and the first spike unwinds the whole thing. Set requests close to real usage so the scheduler’s arithmetic reflects the world, set memory limits so one pod can’t hold a node hostage, and reserve node capacity with --system-reserved and --kube-reserved. Eviction is not Kubernetes being hostile. It is Kubernetes acting on numbers you gave it.

Scaling it

The honest progression. Each layer adds capability and a new failure surface.

Horizontal Pod Autoscaler (HPA). The HPA watches a metric and adjusts a Deployment’s replicas to hit a target. The default is CPU utilization as a percentage of requests — which is exactly why wrong requests break HPA: if your request is half of real usage, HPA thinks you’re at 200% and scales wildly; if it’s double, HPA thinks you’re idle and never scales. The deeper caveat is that CPU is a lagging signal. By the time CPU climbs, the latency damage is already happening, and the HPA reacts in tens of seconds to a minute. For spiky traffic, scale on a leading indicator instead — request rate, or queue depth from your message queue — using custom or external metrics, and keep warm headroom so you’re not always one step behind the spike.
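The dependence on requests falls straight out of the scaling formula: desired replicas is the current count multiplied by the ratio of observed to target utilization, rounded up, with no change when the ratio is within a small tolerance of 1. The sketch below follows that documented shape with a 0.1 tolerance; the function names and the 400m usage figure are illustrative, and it leaves out stabilization windows, min/max replica bounds, and missing-metric handling.

```python
import math


def utilization_pct(usage_millicores: float, request_millicores: float) -> float:
    """CPU utilization as the HPA sees it: usage relative to the request."""
    return 100 * usage_millicores / request_millicores


def desired_replicas(current: int, current_util: float, target_util: float,
                     tolerance: float = 0.1) -> int:
    ratio = current_util / target_util
    if abs(ratio - 1.0) <= tolerance:
        return current  # close enough to target: do nothing
    return math.ceil(current * ratio)


REAL_USAGE = 400   # millicores each pod actually burns
TARGET = 50        # HPA target utilization, percent
REPLICAS = 4

for label, request in [("request = real", 400), ("request = half", 200), ("request = double", 800)]:
    util = utilization_pct(REAL_USAGE, request)
    print(label, util, desired_replicas(REPLICAS, util, TARGET))
```

Identical pods doing identical work produce three different decisions: scale to 8, scale to 16, or stay at 4. The only input that changed is the request.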

Cluster Autoscaler. HPA adds pods; if no existing node has room for them, those pods sit Pending until the Cluster Autoscaler provisions a new node. That’s a minutes-long cold path — node boot, image pull, kubelet registration. If your traffic can’t wait three minutes, you cannot rely on reactive node scaling. Pre-provision headroom, or run low-priority “overprovisioning” placeholder pods that get evicted to make room for real workloads, giving you instant capacity while a real node boots in the background.

```mermaid
flowchart LR
  M[metric climbs\nCPU or queue] --> HPA[HPA raises\nreplicas]
  HPA --> SCH{node has\nroom?}
  SCH -->|yes| RUN[pods scheduled\nseconds]
  SCH -->|no| PEND[pods Pending]
  PEND --> CA[Cluster Autoscaler\nadds node]
  CA --> BOOT[boot + image pull\nminutes]
  BOOT --> RUN
```

Rolling updates with guardrails. A Deployment rollout is governed by maxSurge (extra pods allowed above desired during the roll) and maxUnavailable (how many can be down at once). maxUnavailable: 0 with maxSurge: 1 is the safe, slow roll: always at full capacity, one extra pod at a time. Higher values roll faster but cut into serving capacity mid-deploy. Pair every Deployment with a PodDisruptionBudget as well. The rollout itself is bounded by maxUnavailable, not by the PDB; the PDB covers voluntary evictions such as node drains for upgrades, so a drain that lands mid-rollout can’t drop you below a safe replica count.
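The two rollout knobs translate into a floor on available pods and a ceiling on total pods. A minimal sketch, with a hypothetical `rollout_bounds` helper and integer values only (percentage values are rounded by Kubernetes and are not modelled here):

```python
def rollout_bounds(desired: int, max_surge: int, max_unavailable: int) -> dict:
    """Capacity envelope a rolling update must stay inside."""
    if max_surge == 0 and max_unavailable == 0:
        # With neither slack the rollout could never replace a pod,
        # so Kubernetes rejects this combination.
        raise ValueError("maxSurge and maxUnavailable cannot both be 0")
    return {
        "min_available": desired - max_unavailable,
        "max_total": desired + max_surge,
    }


# The safe, slow roll: never below full capacity, one extra pod at a time.
print(rollout_bounds(desired=10, max_surge=1, max_unavailable=0))

# A faster roll that gives up 20% of serving capacity mid-deploy.
print(rollout_bounds(desired=10, max_surge=2, max_unavailable=2))
```

`max_total` is also a scheduling requirement: the surge pods need real room on some node, or the rollout waits on them while they sit Pending.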

Services, Ingress, and traffic. A Service gives a stable virtual IP and load-balances across the current set of ready pod endpoints — the ones passing readiness. Ingress (or the newer Gateway API) sits in front for L7 HTTP routing, TLS termination, and host/path rules. This is the integration seam with load balancing: the cloud LB targets the ingress controller, the ingress routes to Services, Services spread across pods. The metric that silently kills you here is endpoint count — a Service with zero ready endpoints is a 503 factory that emits no pod crashes and no obvious alarm unless you’re watching for it.

Multi-tenancy and etcd limits. At fleet scale, the wall is often the control plane, not the workers. etcd performance degrades with object count and churn; tens of thousands of objects, high-frequency updates (chatty controllers, huge ConfigMaps, per-request CRDs) and large resource bodies all push it. Use ResourceQuota and LimitRange per namespace so one team can’t exhaust the cluster, keep custom controllers from hammering the API server, and back up etcd relentlessly — it is your only source of truth, and a Raft quorum loss without a backup is an unrecoverable cluster.

When to reach for it (and when not to)

Reach for Kubernetes when you’re running many services that each need independent scaling, rolling deploys, self-healing, and a common substrate that several teams share — and you have the operational maturity, or a managed control plane (EKS, GKE, AKS) that supplies it. It earns its considerable complexity at fleet scale, across many teams, where the alternative is a zoo of bespoke deploy scripts and snowflake VMs. The reconciliation model genuinely is better than humans at keeping a large fleet in its desired state.

Don’t reach for it for a single app or a small team that just needs to run a container reliably. A managed platform — Cloud Run, ECS Fargate, Fly, a plain PaaS — gives you 90% of the benefit (rolling deploys, autoscaling, health checks) with a fraction of the operational surface, and none of the foot-guns this article catalogs. Don’t adopt it as a substitute for understanding your app’s resource profile; Kubernetes amplifies bad capacity planning into cluster-wide eviction rather than absorbing it. And don’t casually run stateful databases on it — it’s possible with StatefulSets, persistent volumes, and operators, but the storage and failover failure modes are sharp, and a managed PostgreSQL or DynamoDB is almost always the better trade than reinventing database failover inside your cluster.

The blunt heuristic: if you can’t name three services that need to scale independently and a person who’ll own the cluster, you don’t need Kubernetes yet.

When to consider alternatives

  • A single service or small fleet that just needs to run → a managed container platform (Cloud Run, ECS, Fly) — orchestration without the control plane to operate.
  • Durable stateful data with real failover → PostgreSQL, DynamoDB, or Cassandra as a managed service, not a StatefulSet you babysit.
  • Async work distribution and retries → a message queue or task system like Celery; Kubernetes Jobs are fine for batch, not for high-rate task fan-out.
  • Strong distributed coordination, leader election, locks → ZooKeeper or etcd directly; don’t hand-roll it on top of pods.
  • Edge L7 routing and API concerns → an API gateway in front of, or instead of, raw Ingress when you need auth, rate limiting, and request shaping.
  • Caching and ephemeral hot state → Redis; don’t store it in pod memory you expect to survive a reschedule.

The pattern: Kubernetes is the substrate that runs stateless and carefully-managed-stateful workloads at fleet scale. The moment a requirement is “and it must be durable / strongly coordinated / a managed data plane,” reach for the purpose-built system and let Kubernetes run the stateless tier in front of it.

Operational checklist

  • Set requests and limits on every container: requests near real p95 usage so the scheduler’s math is honest, memory limits so one pod can’t take a node hostage.
  • Reserve node capacity with --system-reserved and --kube-reserved so the kubelet and system daemons never get squeezed into the eviction path.
  • Keep readinessProbe checks local to the pod — never gate readiness on a shared downstream dependency.
  • Use a startupProbe for slow-booting apps so the liveness probe doesn’t kill them during warmup; keep liveness timeouts generous.
  • Attach a PodDisruptionBudget to every Deployment so node drains can’t drop you below a safe replica count (rollouts are bounded by maxUnavailable instead), and verify it doesn’t deadlock drains.
  • Spread replicas with topology spread / anti-affinity across nodes and zones so one node or zone loss isn’t a full outage.
  • Configure HPA on a leading metric (request rate or queue depth), not just CPU, and keep warm headroom for the autoscaler’s cold path.
  • Alert on node memory pressure, Evicted pod count, OOMKilled (exit 137) restarts, and Service endpoint count hitting zero — the silent 503 factory.
  • Set ResourceQuota and LimitRange per namespace so one team’s runaway job can’t evict another’s pods.
  • Keep etcd healthy and backed up — it is the single source of truth; a slow or lost etcd is a cluster outage, and quorum loss without a backup is unrecoverable.

Summary

Kubernetes is the best fleet-scale orchestrator there is, and almost every sharp edge traces back to one fact: it is a reconciliation engine acting on the numbers you give it, not a magic resource manager that figures things out. It keeps desired state in etcd and runs a swarm of dumb, level-triggered loops to make reality match — which is why it self-heals, and why it will evict, throttle, and reschedule on its own judgment. The defining outage, the cascading eviction, is what happens when your requests lie to the scheduler: it packs nodes that have no real headroom, the first spike tips one node, and the wave propagates faster than capacity can grow. Set honest requests, put memory limits on everything, keep probes local to the pod, attach disruption budgets, scale on leading indicators, and back up etcd. Do that and Kubernetes is a calm, boring substrate. Lie to the scheduler and it will dismantle your fleet with perfect, infuriating correctness.

Appendix: containers, pods, and controllers refresher

If the body assumed fundamentals you’d like restated:

  • Container — a process with its own filesystem, namespaces, and cgroup-enforced resource limits. Lightweight isolation, sharing the host kernel. The cgroup is what makes requests/limits enforceable at the kernel level.
  • Pod — the smallest schedulable unit: one or more containers that share a network namespace (same IP, same localhost) and can share volumes. You almost never create pods directly; a controller does.
  • ReplicaSet — ensures N copies of a pod template exist. Rarely managed directly.
  • Deployment — manages ReplicaSets to give you rolling updates and rollbacks for stateless workloads. The workhorse object.
  • StatefulSet — like a Deployment but with stable network identity and stable per-pod storage, for workloads that need it (databases, clustered apps). Sharper failure modes; use sparingly.
  • DaemonSet — runs one pod per node (log shippers, node agents).
  • Service — stable virtual IP load-balancing across a pod’s ready endpoints. Ingress / Gateway — L7 HTTP routing in front of Services.
  • Namespace — a soft tenancy boundary for naming, quotas, and access control.

The unifying idea: you describe the what (a Deployment desiring three replicas of an image, with these resources), and the controllers continuously work out the how (which nodes, which order, replacing failures). That separation of declared intent from imperative action is exactly why the inputs — your resource specs and probes — matter so much, because the loops will faithfully execute whatever those inputs imply.
