Observability

Observability past the dashboard: metrics vs logs vs traces, RED/USE, the cardinality cliff, SLOs and error budgets, symptom-based alerting, trace propagation, and what it costs at scale.

23 min read · updated 2026-06-28

The pitch for observability is always “three pillars: metrics, logs, and traces.” That framing is exactly how teams end up paying a vendor six figures a year and still sitting in an incident bridge at 3am, staring at a wall of green dashboards, unable to answer the only question that matters: why is checkout failing right now? Pillars are data types. They are not understanding.

Here is the distinction that took me a decade and several bad nights to internalize. Monitoring answers the questions you already knew to ask — CPU, error rate, queue depth, the known unknowns you predicted when you built the dashboard. Observability is whether, when the system breaks in a way nobody anticipated, the telemetry you already emit can explain it without shipping new code. One is a checklist. The other is a property. Most teams over-invest in the first and call it the second.

This article is the long-form context piece I wish someone had handed me before I owned a pager. It covers what each signal type is actually for, the RED and USE methods for deciding what to measure, why cardinality is the cost that ambushes you on a Tuesday, how SLOs and error budgets turn “are we reliable enough” from an argument into arithmetic, why you alert on symptoms and never on causes, and how a trace stitches itself together across a dozen services. The cost model section is the one most “intro to observability” posts skip, and it is the one that decides whether your telemetry survives contact with scale.

These signals are emitted all along a distributed request path, so this leans on Load Balancing for where requests fan out, API Gateway for where a request first gets a trace id, and Kafka for how high-volume telemetry actually moves. The reliability math borrows from Consistency & Consensus only loosely — observability is about seeing the system, not agreeing on its state. If you want the broader map of where this fits, the roadmap has it.

A motivating failure

A retail platform I worked near ran a healthy-looking observability stack: Prometheus for metrics, a hosted log service, traces sampled at 10%. Dashboards were green. Then Black Friday traffic arrived and the metrics backend fell over — not the application, the monitoring. Prometheus OOM-killed itself, restarted, OOM-killed again, and for the ninety minutes that mattered most the team was flying blind.

The root cause had shipped three weeks earlier and nobody noticed. An engineer, debugging a slow-customer complaint, had added customer_id as a label on the checkout_duration_seconds histogram. In staging, with forty test customers, it was invisible. In production it multiplied every existing label combination by the number of active customers. A metric that was ~3,000 time series became ~14 million. Prometheus holds active series in memory. Under normal load it limped along with the bloat; under Black Friday concurrency, the active-series count crossed the memory ceiling and the whole thing collapsed.

So during the single highest-revenue window of the year, the symptom dashboard, the SLO burn alerts, and the on-call team’s entire view of reality all went dark — because of a debugging convenience added by someone who had long since moved on to another ticket. The application was fine. The blindness was self-inflicted, and it landed at the worst possible moment.

Nothing here was a bug. The label did exactly what labels do. The failure lived entirely in not understanding that a metric label is a multiplier, not a free annotation — and that your telemetry system is a production system with its own capacity limits, one that tends to break precisely when load is highest. That is the kind of failure this article exists to prevent.

The one-sentence mental model

Observability is the ability to explain any behavior of your system from the telemetry you already emit — where metrics tell you that something is wrong, traces tell you where, and logs tell you why — bought at a cost that scales with cardinality, not with traffic.

Every clause is an operational constraint:

Explain any behavior you didn’t predict → the test of observability is novel failures, not the dashboard you already built. If you can only answer questions you anticipated, you have monitoring.
Metrics tell you that → cheap, aggregated, the thing you alert on. Bounded dimensions only.
Traces tell you where → per-request causal path across services; sampled, because keeping all of them is ruinous.
Logs tell you why → the high-cardinality detail (the exact order_id, the stack trace) that metrics structurally cannot hold.
Cost scales with cardinality → your bill and your blast radius are driven by the number of unique label combinations and the volume of detail, not by request count. This is the clause people forget until it bankrupts them.

flowchart LR
  M[Metrics\ncheap aggregate\nWHAT WHEN] -->|alert fires| O{On-call}
  T[Traces\nrequest path\nWHERE] --> O
  L[Logs\nevent detail\nWHY] --> O
  O -->|error rate up| T
  O -->|span slow| L
  O -->|exception found| RC[Root cause]

The workflow is directional and you should spend money in proportion to how often you walk each step. A metric alert fires — error rate is up. You pivot to traces to find which service and which span carries the latency or the errors. You drill into the logs tied to that trace to read the actual exception. Metrics are the smoke alarm, traces are the map, logs are the confession. Confusing their jobs is where both the bill and the blind spots come from.

How it actually works

Metrics, logs, traces — what each is for

A metric is a numeric measurement aggregated over a time window: a counter (http_requests_total), a gauge (queue_depth), or a histogram (request_duration_seconds), tagged with a handful of dimensions. It is cheap to store and query because it is pre-aggregated into time series. It is what you alert on and what drives trend dashboards. Its cost driver is dimensionality: every unique combination of label values is a separate, independently-stored time series.

A log is a timestamped event, and it should be structured — {"level":"error","trace_id":"4bf9...","order_id":"8412","latency_ms":812,"msg":"charge declined"}, not a free-text string you later regret trying to parse. Logs carry the high-cardinality detail metrics cannot: the specific id, the full stack trace, the exact query. They are expensive at volume and you should sample them deliberately rather than keep every INFO line forever.
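As a minimal sketch of the structured-log shape described above, the following uses only Python's standard library to emit one JSON object per line. The `JsonFormatter` class, the `make_logger` helper, and the convention of passing fields under `extra={"fields": ...}` are illustrative assumptions, not a standard API.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as a single JSON object (hypothetical helper)."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname.lower(),
            "msg": record.getMessage(),
        }
        # Anything passed as extra={"fields": {...}} lands on record.fields.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload, sort_keys=True)


def make_logger(name: str = "checkout") -> logging.Logger:
    """Build a logger that writes JSON lines to stderr."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


log = make_logger()
log.error(
    "charge declined",
    extra={"fields": {"trace_id": "4bf9...", "order_id": "8412", "latency_ms": 812}},
)
```

Because every field is a key rather than a substring, the log store can filter on `order_id` or join on `trace_id` without regex parsing.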

A trace is the causal path of one request as it moves across services, assembled from spans. Each span is a unit of work — an HTTP call, a database query, a cache lookup — recording a start time, a duration, and a parent span. Stitched together, the spans form a tree that answers “where did the 800ms actually go,” which neither metrics nor logs can tell you because only the trace preserves the call structure.
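To make "where did the 800ms actually go" concrete, here is a small sketch that computes each span's self time, meaning its duration minus the time spent in its direct children. The `Span` type, the span names, and the durations are invented for illustration, and the subtraction assumes children run sequentially inside their parent.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    span_id: str
    parent_id: Optional[str]
    name: str
    duration_ms: float


def self_times(spans: list[Span]) -> dict[str, float]:
    """Time spent in each span excluding its direct children.

    Assumes sequential children; concurrent children would overcount.
    """
    child_total: dict[str, float] = {}
    for s in spans:
        if s.parent_id is not None:
            child_total[s.parent_id] = child_total.get(s.parent_id, 0.0) + s.duration_ms
    return {s.name: s.duration_ms - child_total.get(s.span_id, 0.0) for s in spans}


# Hypothetical trace for one checkout request.
spans = [
    Span("A", None, "gateway POST /checkout", 800.0),
    Span("B", "A", "order-svc create", 760.0),
    Span("C", "B", "payment-svc charge", 600.0),
    Span("D", "B", "db insert order", 120.0),
]
breakdown = self_times(spans)
slowest = max(breakdown, key=breakdown.get)
print(slowest, breakdown[slowest])
```

The parent links are what make this possible: the same four durations as flat log lines or as a latency metric would not say which call contained which.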

The recurring mistake is using one signal to do another’s job. Grep-counting errors across terabytes of logs is using logs as a metric — slow and expensive. Adding user_id to a metric label to find one slow customer is using metrics as a trace — and it detonates cardinality, as the opening story showed. Match the question to the signal designed for it.

RED and USE — deciding what to measure

Two complementary recipes keep you from drowning in the infinite things you could measure.

RED is for request-driven services — the things users or other services call:

Rate — requests per second.
Errors — failed requests per second, and as a fraction of rate.
Duration — the latency distribution, watched at percentiles, never the mean.

USE is for resources — CPU, memory, disk, connection pools, queues:

Utilization — the fraction of time the resource was busy.
Saturation — how much work is queued and waiting (run-queue length, pool wait time, queue depth).
Errors — error events on the resource itself.

RED catches “users are seeing failures.” USE catches “this resource is the bottleneck causing it.” They compose: a saturated database connection pool (USE: saturation climbing) explains the latency spike on the /checkout endpoint (RED: duration p99 up). Point RED at your service edges and USE at every shared resource underneath, and you have covered the two questions that start most investigations — who is hurting and what is starved.
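A rough sketch of the RED recipe computed over one window of raw request records. In practice these numbers come from counters and histograms in a metrics backend; the `Request` type, the nearest-rank percentile, the choice to count 5xx as errors, and the sample data are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Request:
    status: int
    duration_ms: float


def percentile(sorted_values: list[float], pct: int) -> float:
    """Nearest-rank percentile over an already sorted list."""
    if not sorted_values:
        raise ValueError("no samples")
    rank = max(1, -(-pct * len(sorted_values) // 100))  # integer ceiling
    return sorted_values[rank - 1]


def red(requests: list[Request], window_s: float) -> dict[str, float]:
    """Rate, Errors, Duration for one window of requests."""
    durations = sorted(r.duration_ms for r in requests)
    errors = sum(1 for r in requests if r.status >= 500)
    return {
        "rate_rps": len(requests) / window_s,
        "error_rps": errors / window_s,
        "error_ratio": errors / len(requests),
        "p50_ms": percentile(durations, 50),
        "p99_ms": percentile(durations, 99),
        "mean_ms": sum(durations) / len(durations),
    }


# 98 fast successes and 2 slow failures in a 10 second window.
window = [Request(200, 20.0)] * 98 + [Request(500, 2000.0)] * 2
summary = red(window, window_s=10.0)
print(summary)
```

With this sample the mean is about 60ms while p99 is 2000ms, which is the reason Duration is watched at percentiles.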

The write and read path of telemetry

It helps to see telemetry as its own data pipeline, because that is exactly what it is.

flowchart LR
  APP[App + SDK\nemit signals] --> COL[OTel Collector\nbatch sample drop]
  COL --> MET[(Metrics TSDB)]
  COL --> LOG[(Log store)]
  COL --> TRC[(Trace store)]
  MET --> Q[Query / alert]
  LOG --> Q
  TRC --> Q
  Q --> DASH[Dashboards\nand pages]

On the write path, the application’s instrumentation SDK emits signals to a local agent or an OpenTelemetry Collector, which batches, samples, drops unwanted labels, and routes each signal to its backend. On the read path, dashboards and alert rules query those backends. The Collector is the single most useful piece of this picture: it lets you change sampling, drop a runaway label, or re-route telemetry without redeploying every service. If you take one architectural decision from this article, put a Collector between your apps and your backends before you need it.

Distributed trace context propagation

A trace only works if every service in the path participates. The first service to touch a request — usually the gateway — mints a trace_id and a root span_id. Every downstream call must pass them along in headers so the next service attaches its spans to the same tree. The W3C standard header is traceparent:

traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
              │  └ trace-id (16 bytes) ──────┘ └ parent span ─┘ └flags
              version

sequenceDiagram
  participant U as User
  participant GW as Gateway
  participant Ord as Order svc
  participant Pay as Payment svc
  U->>GW: POST /checkout
  Note over GW: mint trace_id span A
  GW->>Ord: traceparent A
  Note over Ord: span B parent A
  Ord->>Pay: traceparent B
  Note over Pay: span C 600ms here
  Pay-->>Ord: ok
  Ord-->>GW: ok
  GW-->>U: 200

The failure here is quiet and vicious: one service in the path that fails to forward traceparent breaks the chain. The trace shows a gap, and everything downstream of the gap becomes an orphan — the slow span you were hunting for is now unattributable. This is why propagation has to be enforced in shared middleware and SDK auto-instrumentation, never left to each team to remember. The whole point of OpenTelemetry’s auto-instrumentation is that the header gets forwarded without anyone having to think about it. A trace is only as complete as its least-instrumented hop.
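The forwarding step can be sketched in a few lines. This is an illustration of the version 00 header layout shown earlier, not a replacement for an OpenTelemetry SDK; the function names are hypothetical, and starting a fresh trace with the sampled flag set to 01 is an assumption of this example.

```python
import re
import secrets
from typing import NamedTuple, Optional

# version (2 hex) - trace-id (32 hex) - parent-id (16 hex) - flags (2 hex)
_TRACEPARENT = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


class TraceContext(NamedTuple):
    version: str
    trace_id: str
    parent_id: str
    flags: str


def parse_traceparent(header: str) -> Optional[TraceContext]:
    """Return the parsed context, or None if the header is malformed."""
    m = _TRACEPARENT.match(header.strip())
    if m is None:
        return None
    ctx = TraceContext(*m.groups())
    # Version ff and all-zero ids are invalid in W3C Trace Context.
    if ctx.version == "ff" or set(ctx.trace_id) == {"0"} or set(ctx.parent_id) == {"0"}:
        return None
    return ctx


def outgoing_traceparent(incoming: Optional[str]) -> str:
    """Header to send downstream: same trace id, this service's span as parent."""
    ctx = parse_traceparent(incoming) if incoming else None
    span_id = secrets.token_hex(8)  # 8 random bytes -> 16 hex chars
    if ctx is None:
        # No usable upstream context, so this hop becomes the root of a new trace.
        return f"00-{secrets.token_hex(16)}-{span_id}-01"
    return f"00-{ctx.trace_id}-{span_id}-{ctx.flags}"


incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
print(outgoing_traceparent(incoming))
```

The `ctx is None` branch is the quiet failure in miniature: a hop that receives no header, or drops it, silently starts a new trace and orphans everything downstream.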

The tradeoffs that bite

These are the decisions that look free when you make them and bill you later — sometimes literally.

Tradeoff	The free-looking choice	What it actually costs
Metric labels	Add `user_id` to find one slow user	Cardinality explosion; OOM’d backend, 10x bill
Sampling	Keep 100% of traces “to be safe”	Storage and ingest cost that scales with traffic
Log verbosity	`INFO`-log every request	Terabytes/day; queries time out mid-incident
Alert target	Alert on CPU, restarts, disk	Pages that self-heal; silence for novel failures
Latency stat	Watch the mean	`p99` of 2s hidden behind a `40ms` average
Vendor lock	Native agent everywhere	Re-instrumenting the fleet to ever change vendors

Two of these deserve to be spelled out, because they cause the majority of self-inflicted observability pain.

Cardinality is the number of unique time series, and it is the product of every label’s distinct values, not the sum. A metric http_requests_total{method, status, endpoint} with 5 methods, 10 statuses, and 50 endpoints is at most 5 × 10 × 50 = 2,500 series, which is trivial. Add user_id with a million users and the worst case becomes 2,500 × 1,000,000 = 2.5 billion series. Only the combinations that actually occur are stored, but even a small fraction of that will OOM your TSDB and multiply your bill.
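The multiplication is easy to check. This sketch computes the worst-case series count from per-label distinct-value counts. It is an upper bound, since a backend only stores the combinations that actually occur, and `series_upper_bound` is a name made up for the example.

```python
from math import prod


def series_upper_bound(label_values: dict[str, int]) -> int:
    """Worst-case number of time series: the product of distinct values per label."""
    return prod(label_values.values())


base = {"method": 5, "status": 10, "endpoint": 50}
with_user = {**base, "user_id": 1_000_000}

print(series_upper_bound(base))       # 2500
print(series_upper_bound(with_user))  # 2500000000
```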

flowchart LR
  B[method x status\nx endpoint] --> N[2,500 series\nbounded fine]
  B --> ADD[add user_id\n1M values]
  ADD --> X[multiply not add]
  X --> BOOM[2.5B series\nOOM and bill]
  style BOOM fill:#e11d48,color:#fff
  style X fill:#171717,color:#fff

The rule is absolute: never put unbounded-cardinality values — user_id, order_id, raw URLs with ids, full email addresses — into metric labels. Those belong in traces and logs, which are built for high cardinality and are sampled. Templatize routes (/orders/{id}, never /orders/8412) before they ever become a label.
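One way to templatize routes, sketched under the assumption that ids are either all digits or UUIDs. Where the web framework exposes the matched route pattern, using that is more reliable than this heuristic; `templatize` is a hypothetical helper.

```python
import re

_UUID = re.compile(
    r"^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$", re.IGNORECASE
)


def templatize(path: str) -> str:
    """Replace id-like path segments with {id} and drop the query string."""
    segments = []
    for seg in path.split("?", 1)[0].split("/"):
        if seg.isdigit() or _UUID.match(seg):
            segments.append("{id}")
        else:
            segments.append(seg)
    return "/".join(segments)


print(templatize("/orders/8412"))             # /orders/{id}
print(templatize("/orders/8412/items/3?x=1")) # /orders/{id}/items/{id}
```

The label value set is then bounded by the number of routes in the codebase, not by the number of orders ever placed.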

Sampling strategy is the other one. Head-based sampling decides at request start whether to keep the trace — cheap and simple, but it will happily discard the one slow, erroring request you actually needed. Tail-based sampling buffers the spans of a trace and decides after it completes, keeping it if it errored or breached a latency threshold and dropping the boring 99%. Tail-based needs a Collector that holds spans in memory for a few seconds and costs more to run, but for latency and error debugging it keeps the interesting needles instead of a random sample of hay. For anything past a handful of services, tail-based is worth it.
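The difference between the two decisions can be sketched as two small functions. The bucketing on the last 8 hex characters of the trace id is an illustrative choice, not the algorithm any particular SDK or Collector uses, and the threshold and rate values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class FinishedTrace:
    trace_id: str      # 32 hex chars
    duration_ms: float
    had_error: bool


def head_keep(trace_id: str, rate: float) -> bool:
    """Head-based: decide at request start, using only the trace id.

    Deterministic, so every service that sees the same id agrees.
    """
    bucket = int(trace_id[-8:], 16) / 16**8  # uniform-ish value in [0, 1)
    return bucket < rate


def tail_keep(trace: FinishedTrace, latency_threshold_ms: float, baseline_rate: float) -> bool:
    """Tail-based: decide after completion; always keep errors and slow traces."""
    if trace.had_error or trace.duration_ms >= latency_threshold_ms:
        return True
    # Keep a small deterministic sample of the boring majority for baselines.
    return head_keep(trace.trace_id, baseline_rate)


tid = "4bf92f3577b34da6a3ce929d0e0e4736"
slow = FinishedTrace(tid, duration_ms=2000.0, had_error=False)
print(head_keep(tid, 0.01), tail_keep(slow, latency_threshold_ms=1000.0, baseline_rate=0.01))
```

For this trace id the head sampler at 1% discards the request, while the tail sampler keeps it because it breached the latency threshold. That is the tradeoff in one line of output.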

The cost model: where the money and the limits live

Observability is the rare infrastructure domain where the monitoring can cost more than the thing it monitors, and where the failure of the monitoring is correlated with the failure of the system. So treat cost as an engineering constraint, not a finance afterthought.

The three signals have fundamentally different cost curves, and matching workload to the cheapest sufficient signal is most of the job:

Metrics cost by active series (cardinality), and that cost is roughly flat with traffic. A million requests across 2,500 series costs the same as a thousand requests across 2,500 series. This is why metrics are the right home for anything you alert on continuously — they are cheap per query and cheap to retain, as long as you keep cardinality bounded.
Traces cost by spans retained = sample rate × spans per request × traffic. At 100% sampling this scales linearly and brutally with traffic. At 1–10% head sampling or tail-based “keep the interesting ones,” it becomes affordable. The cost lever is the sample rate, and it is the first knob to reach for.
Logs cost by volume × retention. Logging every request body at INFO and keeping it 90 days is how a startup wakes up to a log bill larger than its compute bill. The levers are level discipline (don’t INFO what you can DEBUG), structured fields (so you sample and query cheaply), and tiered retention.

A concrete sizing intuition I keep in my head: a metrics backend like Prometheus needs on the order of 1–4 KB of RAM per active series. So 1 million active series is 1–4 GB of memory just for the series index, before any query load. That is why the opening incident’s jump to 14 million series was fatal — it was tens of gigabytes of working set that simply was not provisioned. When you reason about “can we add this label,” do the multiplication and then the memory math. The label is free to type and expensive to keep.
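A back-of-envelope version of that memory math, together with the span-volume formula from the cost curves above. The 1–4 KB per series range is the rule of thumb quoted in the text, not a measured figure, and the traffic numbers in the usage lines are hypothetical.

```python
def series_memory_gb(active_series: int, kb_per_series: float) -> float:
    """Approximate RAM for active series, in decimal GB."""
    return active_series * kb_per_series / 1_000_000


def spans_retained_per_day(rps: float, spans_per_request: int, sample_rate: float) -> float:
    """Spans retained = sample rate x spans per request x traffic."""
    return rps * 86_400 * spans_per_request * sample_rate


# The opening incident: roughly 14 million active series.
low, high = series_memory_gb(14_000_000, 1), series_memory_gb(14_000_000, 4)
print(f"{low:.0f} to {high:.0f} GB")

# Hypothetical service: 1,000 rps, 12 spans per request.
print(round(spans_retained_per_day(1000, 12, 1.0)))   # full capture
print(round(spans_retained_per_day(1000, 12, 0.1)))   # 10% sampling
```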

The numbers that tell you the truth about your telemetry’s health are not how many dashboards or panels you have. Watch active series count and its growth rate (a step change means someone added a label), ingestion lag (samples arriving slower than they’re produced means you’re falling behind during exactly the load you needed to see), dropped spans / dropped samples at the Collector, and query latency on your alert rules (if an alert query takes longer than its evaluation interval, the alert is effectively broken). A creeping active-series count is the single best early warning of an impending cardinality wall, and it is trivial to alarm on.

Failure modes

The defining observability failure is the alert that tells you nothing — it either stays silent while users suffer, or it fires so constantly that on-call mutes the entire channel and then misses the real one. Both come from alerting on the wrong layer.

Alert on symptoms users feel — error rate, latency SLO burn — not on causes like CPU at 80% or a pod that restarted. High CPU is not an incident; a slow checkout is. Cause-based alerts page you at 3am for conditions that self-heal before you open the laptop, and they stay completely silent for the novel failure mode you never thought to threshold. Page a human only when a human is needed, and a human is needed when users are hurting.

The other recurring ones, each as symptom → root cause → prevention:

Cardinality explosion. Symptom: the metrics backend’s memory climbs, ingestion lags, dashboards go blank during the incident you needed them for. Root cause: someone added a high-cardinality label (user_id, an un-templatized URL). Prevention: enforce a cardinality budget in CI against metric definitions; alarm on active-series growth rate; templatize routes before they become labels.
Broken trace propagation. Symptom: traces fragment, the slow downstream span is an orphan, you can see the gateway is slow but not why. Root cause: one service in the path drops traceparent. Prevention: propagation in shared middleware / auto-instrumentation, not per-team discipline; a synthetic check that asserts an end-to-end trace stays intact.
Averages hiding pain. Symptom: the latency dashboard reads a calm 40ms while support tickets pile up. Root cause: alerting and dashboarding on the mean. The mean is 40ms while p99 is 2s and 1% of users are timing out. Prevention: alert and chart on percentiles (p50, p95, p99), and remember percentiles don’t average across instances — aggregate the histogram, not the quantiles.
Log volume bankruptcy. Symptom: log bill balloons, and queries time out exactly when you’re mid-incident. Root cause: INFO-logging every request at scale, sampling only at query time over terabytes. Prevention: level discipline, structured logs, ingest-time sampling, tiered retention.
The monitoring dies with the system. Symptom: during a major incident the dashboards and alerts go dark. Root cause: the telemetry pipeline shares fate with the application (same cluster, same network, same load spike). Prevention: give telemetry its own capacity and ideally its own failure domain; it must be more available than what it watches.
Dashboard sprawl. Symptom: 200 panels, none of which answer “are users okay right now.” Root cause: every investigation left a graph behind and nothing was ever deleted. Prevention: a small number of curated service dashboards keyed to RED/USE and SLOs; everything else is exploratory and disposable.

Scaling it

At 10x traffic, raw ingestion stops being free and you start making the cheap reductions. Pre-aggregate metrics at the agent so the backend stores rollups instead of raw points. Drop unused labels at the Collector. Move logs to structured-and-sampled rather than capture-everything. Introduce the OpenTelemetry Collector as a mandatory hop between apps and backends, so you can change sampling, drop a runaway label, or switch a destination centrally without a fleet-wide redeploy. This is the single highest-leverage architectural move, and it is much easier to add before you’re in trouble than during.

At 100x, you make the hard reductions. Move traces to tail-based sampling — keep every error and the slow tail, drop the boring 99% — which needs Collectors holding spans in memory, fronted by the same load balancing concerns as any stateful tier. Add metric recording rules that precompute expensive aggregations so a dashboard load doesn’t recompute a million-series query every time it renders. Adopt tiered retention: high-resolution metrics for 15d, downsampled rollups for a year, raw logs for 7d and sampled summaries beyond. And start shipping high-volume telemetry through a durable buffer like Kafka so a backend hiccup queues rather than drops, and so the Collector fleet can be scaled independently of the producers.

The wall you actually hit is almost never raw request volume — it is cardinality. Volume scales predictably and you can throw money or sampling at it. Cardinality detonates discontinuously: one merged pull request adds one label and the active-series count jumps an order of magnitude overnight. A cardinality budget per team, enforced in CI against metric definitions, prevents the 3am explosion far better than any after-the-fact cleanup. The same discipline that makes database indexing tractable — bound what you index, measure what it costs — applies directly to what you label.
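A cardinality budget check in CI could look something like the sketch below. The input format (a dict of metric name to declared distinct values per label), the forbidden-label list, and the budget number are all assumptions for illustration; a real check would read whatever metric registry or definitions file the team maintains.

```python
from math import prod

# Labels that are unbounded by nature and never allowed on a metric (example list).
FORBIDDEN_LABELS = {"user_id", "customer_id", "order_id", "email", "url"}


def check_metric(name: str, labels: dict[str, int], budget: int) -> list[str]:
    """Return a list of human-readable violations for one metric definition."""
    problems = []
    for label in labels:
        if label in FORBIDDEN_LABELS:
            problems.append(f"{name}: label '{label}' is unbounded and not allowed")
    worst_case = prod(labels.values())
    if worst_case > budget:
        problems.append(f"{name}: worst case {worst_case} series exceeds budget {budget}")
    return problems


# Hypothetical declared metrics: label -> expected number of distinct values.
METRICS = {
    "http_requests_total": {"method": 5, "status": 10, "endpoint": 50},
    "checkout_duration_seconds": {"endpoint": 50, "customer_id": 5000},
}

violations = [p for name, labels in METRICS.items() for p in check_metric(name, labels, 10_000)]
for v in violations:
    print(v)
# A CI job would exit non-zero when `violations` is non-empty.
```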

There is one more scaling truth that is easy to miss: your telemetry system needs its own SLOs. If your monitoring is less available than the services it watches, it goes blind exactly when you need it, which is the worst possible time. Provision it as a tier-zero dependency, separate its failure domain from the application where you can, and alert on its health (ingestion lag, dropped samples, query latency) as seriously as you alert on the product.

When to reach for it (and when not to)

Observability investment should track operational risk and architectural complexity, not fashion or what the conference talk said.

Reach for full distributed tracing when you have more than a handful of services and “which hop is slow” is a genuine, recurring question. Distributed latency is structurally unsolvable from metrics alone — only a trace preserves the call tree. Reach for SLOs and error budgets when you need a shared, numeric definition of “good enough” to arbitrate between shipping features and paying down reliability; the budget converts a recurring argument into arithmetic. Reach for tail-based sampling once trace volume makes 100% capture expensive but you still need to catch the rare slow request.

Don’t stand up the full three-pillar stack on a single monolith on day one. With one process, RED metrics plus structured logs answer the overwhelming majority of questions, and traces add little when there is no network hop to attribute. Don’t put high-cardinality identifiers into metrics because traces felt harder to set up — that is borrowing against next quarter’s bill and next quarter’s outage. Don’t chase 100% trace coverage or infinite retention; sampled traces answer latency questions fine, and most logs are worthless a week after they’re written.

A precise word on SLOs, because the terms get muddled. An SLI is the measured number: the fraction of requests served under 300ms. The SLO is the target: 99.9% of requests under 300ms over 30 days. The error budget is the allowed failure: 0.1% of requests, or, for a time-based availability SLO at the same target, about 43 minutes of badness over thirty days that you’re permitted to spend. When the budget is healthy you ship fast and take risks; when it’s spent you freeze risky changes and pay down reliability. That is the entire mechanism, and it is the most useful single idea in modern operations: it turns “are we reliable enough” from a vibe into a number everyone can see.

flowchart TD
  SLI[measure SLI\ngood / total] --> BR{burn rate}
  BR -->|fast burn\n2% in 1h| PAGE[page human]
  BR -->|slow burn\nbudget trickling| TICK[open ticket]
  BR -->|budget spent| FREEZE[freeze risky\nchanges]
  BR -->|budget healthy| SHIP[ship faster]
  style PAGE fill:#e11d48,color:#fff

When to consider alternatives

Observability isn’t a tool you swap out; it’s a discipline. But several adjacent jobs are not observability and reaching for a tracing stack to do them is a mistake. Map the job to the right tool:

Durable buffering of high-volume telemetry → Kafka or Message Queues, in front of your backends, not the trace store itself.
Long-term archival of raw events / logs → Object Storage with lifecycle tiering, not hot log retention.
Full-text search and ad-hoc log exploration → Elasticsearch, which is a log query engine, not a metrics or alerting system.
Distributing Collector and backend load → Load Balancing and API Gateway patterns, the same as any stateful service tier.
Running the whole stack as a platform → Kubernetes operators for Prometheus/OTel, where the orchestration concerns are the same as any other workload.
Real-time counters and rate windows feeding dashboards → Redis for ephemeral aggregates, with the durable copy elsewhere.

The pattern: observability is the seeing. Storage, transport, search, and orchestration are separate jobs with purpose-built tools, and bolting them onto your tracing vendor because it was already there is how the bill and the brittleness grow together.

Operational checklist

Alert on user-facing symptoms — SLO burn rate, error rate, p99 latency — never on raw CPU, memory, or restart counts.
Define an SLI/SLO/error budget for every user-facing service; page on fast burn (e.g. 2% of the monthly budget in 1h), open a ticket on slow burn.
Enforce a cardinality budget in CI: no unbounded labels (user_id, order_id, un-templatized URLs) on any metric.
Alarm on active-series growth rate and ingestion lag — a step change means a new label exploded.
Propagate traceparent (W3C / OpenTelemetry) in shared middleware so no service breaks the trace chain; add a synthetic end-to-end trace check.
Chart and alert on percentiles, never means; aggregate histograms across instances, not pre-computed quantiles.
Run an OpenTelemetry Collector between apps and backends so sampling, label drops, and routing change without redeploys.
Use tail-based trace sampling and sampled structured logs; apply tiered retention (high-res short, rollups long).
Give the telemetry pipeline its own SLO and failure domain — it must outlive the systems it watches.
Curate a small set of RED/USE + SLO dashboards; treat everything else as disposable exploration.

Summary

Observability is not three pillars; it is the property that you can explain a failure you never predicted from telemetry you already emit. Metrics tell you that something is wrong and are cheap as long as cardinality stays bounded. Traces tell you where and must be sampled because keeping them all is ruinous. Logs tell you why and bankrupt you if you keep everything. Alert on the symptoms users feel, never on causes that self-heal. Watch percentiles, not means. Treat a metric label as a multiplier and guard cardinality with a budget enforced in CI, because the wall you hit at scale is cardinality, not volume — and it detonates discontinuously, usually during your highest-traffic hour. Put a Collector in the middle early, give your telemetry its own availability target, and turn “reliable enough” into the arithmetic of an error budget. Do that and your observability stack explains outages instead of becoming one.

Appendix: percentiles and the error-budget math

If the body assumed a couple of fundamentals, here they are restated.

Percentiles. p99 latency means 99% of requests were faster than this value and 1% were slower. It matters more than the mean because users experience the tail: at a million requests an hour, p99 = 2s means ten thousand requests per hour took two seconds or longer, even if the average reads a comfortable 40ms. A critical gotcha: you cannot average percentiles across instances or time buckets to get a meaningful aggregate. The p99 of host A and the p99 of host B do not average to the fleet p99. Aggregate the underlying histogram buckets and compute the quantile from the merged distribution, which is exactly why histogram metrics (not gauges of pre-computed quantiles) are the right instrument.
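A small worked example of why quantiles must come from merged buckets. The bucket bounds and per-host counts are invented, and the estimator simply returns the upper bound of the bucket containing the target rank, which is coarser than the interpolation a real TSDB would apply.

```python
def quantile_upper_bound(bounds: list[float], counts: list[int], pct: int) -> float:
    """Upper bound of the histogram bucket that contains the pct-th percentile."""
    total = sum(counts)
    if total == 0:
        raise ValueError("empty histogram")
    rank = max(1, -(-pct * total // 100))  # integer ceiling of pct% of total
    seen = 0
    for bound, count in zip(bounds, counts):
        seen += count
        if seen >= rank:
            return bound
    return bounds[-1]


bounds = [50, 100, 250, 500, 1000, 2500]  # bucket upper bounds in ms
host_a = [9950, 50, 0, 0, 0, 0]           # busy, healthy host
host_b = [0, 0, 0, 0, 0, 10]              # nearly idle, very slow host

p99_a = quantile_upper_bound(bounds, host_a, 99)
p99_b = quantile_upper_bound(bounds, host_b, 99)
averaged = (p99_a + p99_b) / 2

merged = [a + b for a, b in zip(host_a, host_b)]
fleet_p99 = quantile_upper_bound(bounds, merged, 99)

print(averaged, fleet_p99)
```

Averaging the two per-host p99 values gives 1275ms, while the merged histogram puts the fleet p99 in the 50ms bucket, because host B served only 10 of the 10,010 requests.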

Error budget arithmetic. An SLO of 99.9% availability over 30 days permits 0.1% failure. Thirty days is 43,200 minutes, so the budget is 43,200 × 0.001 = 43.2 minutes of badness per month. 99.95% halves that to about 21.6 minutes; 99.99% cuts it to roughly 4.3 minutes. Each extra nine shrinks the budget tenfold and costs disproportionately more engineering, which is why “how many nines” is a business decision about how much you’ll spend, not a default you crank to maximum. Burn rate is how fast you’re spending relative to plan: a burn rate of 1 exhausts the whole budget exactly at the end of the window; a burn rate of 14.4 sustained for an hour spends 14.4 × 1/720 = 2% of a 30-day budget in that hour, a sensible fast-burn paging threshold.
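The same arithmetic as a short sketch, so the numbers can be recomputed for other targets and windows. The function names are made up for the example, and the formulas assume the 30-day window used above.

```python
def budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of allowed failure for a time-based SLO over the window."""
    return window_days * 24 * 60 * (1 - slo)


def burn_rate(observed_error_ratio: float, slo: float) -> float:
    """How many times faster than 'exactly on budget' errors are arriving."""
    return observed_error_ratio / (1 - slo)


def budget_spent(rate: float, hours: float, window_days: int = 30) -> float:
    """Fraction of the whole budget consumed by burning at `rate` for `hours`."""
    return rate * hours / (window_days * 24)


for slo in (0.999, 0.9995, 0.9999):
    print(f"{slo:.2%}: {budget_minutes(slo):.1f} min per 30 days")

# A 1.44% error ratio against a 99.9% SLO burns at 14.4x.
rate = burn_rate(0.0144, 0.999)
print(f"burn rate {rate:.1f} spends {budget_spent(rate, 1):.1%} of the budget in one hour")
```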
