Database Indexing
B-tree internals, clustered vs secondary, composite column order and leftmost-prefix, covering index-only scans, GIN/BRIN/hash/partial, the write tax, and reading EXPLAIN.
On this page
An index is a deal, and most teams only read half the contract. You pay on every single write — extra disk, extra I/O, extra latency on the hot path — so that some reads get dramatically faster. People remember the read half. They forget the bill. Then a table accretes eleven indexes over three years, inserts that used to take 2ms take 14ms, the table’s on-disk footprint is twice the size of the actual data, and nobody can say which indexes are even being used.
The other half of the misery is the opposite shape: you add the “obvious” index, you run the query, and EXPLAIN flatly ignores it. The index exists. The query looks like it should match. And the planner does a sequential scan anyway. That gap — between “we have an index on that column” and “the query actually uses an index” — is where most indexing time gets burned, and it’s almost always one of three or four mechanical rules that nobody internalized.
This article is the operating model for indexes: the structures the engine actually maintains, the column-order rules that decide whether your index is even eligible to be used, the difference between an index that answers a query and one that just narrows it, and how to read the planner’s reasoning out of EXPLAIN. It pairs tightly with PostgreSQL, which is the concrete engine most of the examples lean on, and with Sharding & Partitioning for what indexing turns into once one machine is no longer enough. The write-amplification story here is the same one that makes Cassandra’s LSM design make sense, so I’ll draw that contrast directly. Upcoming work is on the roadmap.
The one sentence to carry through everything below: an index is a second, sorted, redundant data structure the database keeps in lockstep with your table, and every property you love or hate about it falls out of that one fact.
A motivating failure
An e-commerce orders service runs on Postgres. The hot query is the customer order history page: WHERE customer_id = ? ORDER BY created_at DESC LIMIT 20. There’s an index on customer_id, added years ago, and for a long time it’s fine — most customers have a handful of orders, the index narrows to a few rows, the sort is trivial.
Then the business signs a handful of wholesale accounts. One of them places, over eighteen months, roughly 400,000 orders under a single customer_id. The history page for that account starts timing out. An engineer pulls EXPLAIN (ANALYZE, BUFFERS) and sees it: Index Scan using orders_customer_id_idx, then a Sort node that pulls all 400,000 matching rows into memory, sorts them by created_at, and throws away all but 20. The index found the right rows. It did nothing for the ORDER BY. Every page load read 400k rows and sorted them to return 20.
The “fix” someone reaches for first is a bigger box, then a work_mem bump so the sort stops spilling to disk. Both are treating a symptom. The actual fix is a one-line composite index — (customer_id, created_at DESC) — so the rows for that customer are already stored in created_at order in the index leaf. The query becomes an index range scan that reads exactly 20 leaf entries and stops. No sort node. p99 for that account drops from 9 seconds to under 5ms.
Nothing here was a bug. The single-column index was a reasonable choice on day one and silently became wrong as one customer’s data distribution drifted. That’s the recurring shape of index failures: the index is correct and useless at the same time, and only a careful read of the plan tells you which half you’re living in.
The one-sentence mental model
An index is a separate, sorted copy of some columns plus a pointer back to the full row, kept in sync on every write, so that reads can binary-search and range-scan instead of scanning the whole table.
Every clause is a constraint you’ll meet in production:
- A separate copy → it costs disk, it competes with the table for buffer cache, and it has to be maintained. This is the entire write tax.
- Sorted → this is the magic. Sorted data gives you
O(log n)lookups, free range scans, and freeORDER BY— but only along the sort order you actually built. Sort by(a, b)and you have no help for queries onbalone. - Of some columns → if the index doesn’t contain a column the query needs, the engine has to go get it from the table. That second trip is the difference between a fast index and a very fast one.
- Plus a pointer back to the row → that pointer is what makes a “secondary” index secondary. Following it is a random I/O, and random I/O is where read performance quietly dies.
- Kept in sync on every write → the index is part of the write path, not a background nicety. Every
INSERT/UPDATE/DELETEthat touches an indexed column pays to maintain it, synchronously, inside the transaction.
flowchart LR Q[WHERE email\n= x] --> Idx[Index on email\nsorted B-tree] Idx -->|O log n\nbinary search| Ptr[row pointer\nor PK] Ptr --> Heap[(Table rows\nheap)] Heap --> Row[full row\nreturned] NoIdx[No index] -.->|O n\nseq scan| Heap
Hold those five clauses in your head and almost every indexing decision answers itself. Why does column order matter? Because it’s sorted. Why does an index slow writes? Because it’s a synced copy. Why is a covering index faster? Because it skips the pointer back. The structure is simple; the consequences are not.
How it actually works
B-tree internals
The default index everywhere — Postgres, MySQL/InnoDB, SQL Server, Oracle — is the B-tree, specifically a B+tree. It’s a balanced tree where internal nodes hold separator keys that route you down, and all the actual keys live in the leaf level. Leaves are linked into a doubly-linked list, which is the trick that makes range scans cheap: find the start of a range, then walk the leaf chain in order without ever going back up the tree.
The property that makes B-trees feel instant is fanout. Each node is one page (8 KB in Postgres), and a page holds hundreds of keys, so the tree is wide and shallow. With a fanout around 400, depth 2 holds ~160,000 entries and depth 3 holds ~64 million. That means even a large table is 3–4 page reads from any row — and the upper levels are almost always resident in buffer cache, so in practice a lookup is one or two actual disk reads. That shallow depth is why a B-tree lookup feels O(1) even though it’s formally O(log n): the log base is enormous and the top of the tree never leaves memory.
B-trees handle the widest range of query shapes, which is why they’re the default:
- Equality:
WHERE email = 'x' - Range:
WHERE created_at >= '2026-01-01' - Prefix:
WHERE name LIKE 'And%'(a leading-anchored prefix only —'%son'cannot use it) - Sorting:
ORDER BY created_atreads the leaf chain in order, no sort step MIN/MAX: jump to the first or last leaf
Inserts and deletes keep the tree balanced by splitting and merging pages, which is where the write cost lives — more on that below.
flowchart TD Root[Root\nseparators] --> I1[Internal\nA - M] Root --> I2[Internal\nN - Z] I1 --> L1[Leaf\nAda Ben] I1 --> L2[Leaf\nCal Mae] I2 --> L3[Leaf\nNed Sam] I2 --> L4[Leaf\nTom Zoe] L1 -.next.-> L2 L2 -.next.-> L3 L3 -.next.-> L4
Clustered vs secondary indexes
This distinction decides how many hops a read takes, and it differs by engine.
A clustered index is the table. The rows are physically stored in the index’s key order, in the leaf level itself. InnoDB (MySQL) clusters every table on its primary key, so a PK lookup descends the tree and the row is right there in the leaf — one structure, no second trip. There can be only one clustered index per table, because data can only be laid out in one physical order.
A secondary index is a separate structure. Its leaves store the indexed columns plus a reference back to the full row. What that reference is matters a lot:
- In InnoDB, a secondary index stores the primary key as the row reference. So a secondary lookup is two B-tree descents: search the secondary index to get the PK, then search the clustered PK index to get the row. This is also why a fat primary key (a UUID string, say) bloats every secondary index on the table — each one carries a copy of it.
- In Postgres, the table is a heap (unordered), and a secondary index stores a
ctid— a direct physical pointer to the tuple’s page and offset. One index descent, then a direct heap fetch. But MVCC means anUPDATEwrites a new tuple version with a newctid, so the index may point at a dead tuple and the read has to check visibility (more in PostgreSQL).
sequenceDiagram participant Q as Query (InnoDB) participant Sec as Secondary idx email participant Clu as Clustered idx PK Q->>Sec: find email = x Sec-->>Q: returns PK 42 Q->>Clu: lookup PK 42 Clu-->>Q: full row Note over Q,Clu: two descents per matched row
The practical upshot: in InnoDB, choose a small, monotonic primary key (an auto-increment BIGINT beats a random UUID) because it’s embedded in every secondary index and a random PK scatters inserts across the clustered tree. In Postgres, watch for the heap-fetch cost on secondary indexes and reach for index-only scans (below) on hot paths.
Composite indexes and the leftmost-prefix rule
A composite index on (a, b, c) is sorted by a first, then by b within each equal a, then by c within each equal (a, b). Picture a phone book sorted by last name, then first name. This single mechanical fact is the most common source of “why isn’t my index being used?”
The rule: a composite index can be used only for a leftmost prefix of its columns.
(a, b, c) can serve:
WHERE a = ?WHERE a = ? AND b = ?WHERE a = ? AND b = ? AND c = ?- A range on the last column you’re still using:
WHERE a = ? AND b > ?
It cannot efficiently serve:
WHERE b = ?alone — that skips the leading column, so the global sort order is meaningless. (The phone book is no help finding everyone named “James” regardless of surname.)WHERE c = ?alone, same reason.
And here’s the subtle one that bites experienced people: a range on an early column stops the index from using later columns for equality. WHERE a > 10 AND b = 5 uses the index to satisfy a’s range, but once a varies across that range, b is no longer in any consistent order, so b = 5 becomes a row-by-row filter, not an index seek. The index helped with a and quit.
This makes column order a deliberate design choice, not an afterthought. The rules of thumb:
- Equality columns first, the range/sort column last. This is exactly the bug in the motivating failure —
(customer_id, created_at)puts the equality column (customer_id) first and the sort column (created_at) last, so the matched rows come out pre-sorted. - Among equality columns, higher selectivity earlier is the classic advice, though it matters less than people think once the leading column is selective enough to narrow the search.
- Match the index to the query’s actual shape, not to the schema’s prettiness. One well-ordered composite index often replaces three single-column ones.
Covering indexes and index-only scans
Normally a secondary index narrows the search and then the engine fetches the full row to get the columns you SELECTed. But if the index already contains every column the query touches — both the filtered columns and the returned columns — the engine can answer from the index alone and never touch the table. That’s a covering index, and the plan reports it as an Index Only Scan (Postgres) or “Using index” (MySQL).
You build one by adding the extra columns as payload. In Postgres:
-- query: SELECT total, status FROM orders WHERE customer_id = ?
CREATE INDEX idx_orders_cust_covering
ON orders (customer_id) INCLUDE (total, status);
The INCLUDE columns aren’t part of the sort key (so they don’t change ordering or selectivity), they just ride along in the leaf so the read is satisfied without a heap trip. This is the single cheapest big win in query tuning when a hot query reads a small, stable set of columns: you turn a two-hop secondary lookup into one.
The Postgres caveat worth knowing: index-only scans still consult the visibility map to confirm a tuple is visible without checking the heap. If the table has had heavy recent writes and isn’t well-vacuumed, the visibility map is stale and the “index-only” scan quietly falls back to heap fetches. So an index-only scan’s speed depends on VACUUM keeping up — another reason vacuum is on the correctness/performance path, not a chore.
GIN, BRIN, hash, and partial indexes
B-trees are the default, not the only tool. Most teams use exactly one index type and then wonder why some queries stay slow. The menu:
- GIN (Generalized Inverted Index) — for values that contain many searchable elements: array membership (
tags @> '{sale}'),jsonbcontainment (doc @> '{"k":"v"}'), and full-text search. A GIN index maps each element to the rows containing it, like a book index mapping each word to its pages. Fast lookups, but slower writes and larger size; tune withfastupdateand a pending-list size. When you need real search, you’re often better off with Elasticsearch as a secondary index rather than pushing GIN too far. - BRIN (Block Range Index) — for huge, naturally ordered, append-only tables, classically time-series on
created_at. Instead of indexing every row, BRIN stores the min/max value per block range (e.g. per 128 pages). A query for a time window skips block ranges that can’t match. The payoff is dramatic: on a 1 TB append-only table, a B-tree oncreated_atmight be 30 GB and fight the heap for cache, while the BRIN equivalent is a few megabytes. The hard catch: it only works if rows are physically clustered by that column. Random insert order makes BRIN useless. - Hash — maps a key to a bucket for
O(1)equality. It cannot do ranges, ordering, or prefixes at all;WHERE x > 5is invisible to it. Niche — reach for it only on pure-equality lookups where you’ve measured a win over B-tree. For most equality cases a B-tree is just as good and far more flexible. - Partial index — an index with a
WHEREclause, so it indexes only the rows you actually query:CREATE INDEX ... ON orders (created_at) WHERE status = 'pending'. If 99% of orders arecompleteand you only ever scanpendingones, the partial index is tiny, fits in cache, and skips maintenance on the rows you don’t care about. The query’s predicate must match (or imply) the index’s predicate for the planner to use it.
flowchart TD
Need{What shape\nof query?} -->|equality + range\n+ sort| BT[B-tree\ndefault]
Need -->|contains element\narray jsonb FTS| GIN[GIN\ninverted]
Need -->|huge append-only\ntime ordered| BRIN[BRIN\nblock ranges]
Need -->|pure equality\nmeasured win| HASH[Hash]
Need -->|subset of rows\nby predicate| PART[Partial\nWHERE clause]
For completeness: GiST covers geometric, range, and nearest-neighbor queries (PostGIS, pgvector-style problems live near here), and LSM trees — used by Cassandra and RocksDB — are a different storage philosophy entirely. LSM writes land in an in-memory memtable, flush to immutable sorted files (SSTables), and background compaction merges them. Writes are cheap sequential appends with no in-place update; reads may check several SSTables (mitigated by bloom filters). LSM trades read amplification and compaction CPU for write throughput — which is exactly the opposite bias from a B-tree, and why write-heavy stores reach for it.
| Structure | Equality | Range / sort | Write cost | Used by |
|---|---|---|---|---|
| B-tree | O(log n) | yes | in-place, moderate | Postgres, InnoDB |
| Hash | O(1) | no | moderate | niche / in-memory |
| GIN | per-element | no (membership) | high | Postgres jsonb/FTS |
| BRIN | range skip | yes (if clustered) | very low | time-series |
| LSM | O(log n) + SSTable checks | yes (sorted SSTables) | low append + compaction later | Cassandra, RocksDB |
The tradeoffs that bite
These are the decisions that look free when you type CREATE INDEX and send you a bill later.
| Tradeoff | The free-looking choice | What it actually costs |
|---|---|---|
| Read speed vs write tax | ”Add an index, reads get faster” | Every write maintains it; N indexes ≈ N× index maintenance per row |
| Selectivity vs uselessness | Index a status column with 3 values | Planner ignores it; a scan beats random fetches for 33% of rows |
| Covering vs bloat | INCLUDE more columns to skip heap | Wider index, more disk, evicts hotter pages from cache |
| Composite vs flexibility | One index per column | Wrong-order composites don’t compose; right-order ones replace several |
| Index count vs ingest rate | One index per ad-hoc report | Insert latency climbs linearly; the write path pays for read convenience |
| Bloat vs steady state | ”Indexes are self-maintaining” | Updates/deletes leave dead entries; size drifts past live data until REINDEX |
Two of these deserve emphasis. Selectivity vs uselessness is the one that makes people distrust the planner: you index status, the query filters on status = 'active', and the planner still does a sequential scan. It’s right to. If a third of the table matches, fetching those rows by following a third of the index’s pointers means a third of the table’s worth of random I/O, which is slower than reading the whole table sequentially. Indexes win on selective predicates — roughly when they narrow to a small single-digit percentage of rows — and lose on broad ones. A partial index can rescue the case where the useful subset is small even though the column itself isn’t selective.
Index count vs ingest rate is the slow-motion version. No single index hurts much. But a write-hot table that’s accumulated a dozen indexes for various reporting needs does a dozen B-tree maintenance operations on every insert, inside the transaction, on the latency-critical path. The reports got faster; the ingestion pipeline got slower, and the connection between the two is invisible unless you go looking. Write throughput drops roughly linearly with index count on the columns being written.
Read vs write performance: the index tax
The core asymmetry of indexing: indexes make a specific set of reads faster and make essentially all writes slower. You’re always trading write throughput for read latency. Knowing the exchange rate is the whole job.
What indexes make fast (reads):
- Selective equality and range filters — turn an
O(n)table scan into anO(log n)descent plus a short range walk. ORDER BY/LIMIT— a B-tree on the sort column returns rows in order with no sort step, soLIMIT 20reads ~20 entries and stops instead of sorting the whole result.- Joins — an index on the foreign key turns a nested-loop join from
O(n×m)intoO(n log m). - Uniqueness checks — a unique index makes “does this email already exist?” a single descent instead of a scan.
- Covering reads — an index-only scan skips the row fetch entirely.
What indexes make slow (writes), and why:
Every INSERT must add an entry to every index on the table, placed in sorted position — which can trigger a page split when the target leaf is full (the page is divided, half its entries copied to a new page, parent pointers updated). Random-key inserts (UUIDs) cause far more splits and fragmentation than monotonic keys (auto-increment), because they land all over the tree instead of always appending to the rightmost leaf.
An UPDATE is worse than it looks. If it changes an indexed column, the engine must delete the old index entry and insert a new one — on every index covering that column. In Postgres, because MVCC writes a whole new tuple version, even an update to a non-indexed column historically meant new index entries pointing at the new tuple; the HOT (Heap-Only Tuple) optimization avoids this only when no indexed column changed and the new version fits on the same page. So the number and width of your indexes directly shape how often updates stay cheap.
A DELETE marks index entries dead but doesn’t always reclaim the space immediately; that’s the source of index bloat over time.
Concrete shape of the tax: on a write-heavy table, going from 2 indexes to 8 can easily double or triple insert latency and inflate WAL volume proportionally (more index changes = more to log = more replication traffic to your read replicas). The reads those six extra indexes serve might be worth it. The point is that it’s a decision with a cost, not a free addition, and you should be able to name the query each index pays for.
The levers when writes are hurting:
- Drop indexes nothing reads. The highest-value move; see the operational checklist.
- Prefer monotonic primary keys (
BIGINTidentity over random UUID) to keep inserts appending to the right edge and minimize page splits — especially under InnoDB clustering. - Use partial indexes so writes to rows outside the predicate skip maintenance entirely.
- Batch and bulk-load with indexes dropped, then rebuild — building an index once over sorted data is far cheaper than maintaining it across millions of individual inserts.
Failure modes
How indexing actually goes wrong in production. Each is symptom → root cause → prevention.
Leftmost-prefix miss. Symptom: a query is slow, the relevant index “exists,” everyone is confused. Root cause: the query filters on a non-leading column of a composite index (WHERE b = ? against an index on (a, b)), so the index is ineligible and the planner falls back to a scan. Prevention: build the index to match the query’s filter shape; verify with EXPLAIN that the index is chosen, not just present.
The unused index. Symptom: writes are slow, disk is bloated, but read performance is fine. Root cause: indexes added “just in case,” for a report that was run once, or duplicating the prefix of an existing composite. They cost every write and serve no read. Prevention: audit pg_stat_user_indexes for idx_scan = 0 over a representative window and drop the dead weight.
Write collapse from over-indexing. Symptom: ingestion latency doubles after a release; reads are unaffected. Root cause: someone added indexes for ad-hoc analytics on a write-hot table, and the per-insert maintenance now dominates. Prevention: treat index count on write-hot tables as a budget; serve analytics from a replica or warehouse, not the ingestion primary.
The right index, useless for the sort. Symptom: a paginated, ordered query is slow for high-volume keys (the motivating failure). Root cause: the index covers the filter but not the ORDER BY, so the engine fetches all matches and sorts them. Prevention: put the sort column last in a composite, matching the ORDER BY direction.
Function/type mismatch defeats the index. Symptom: a textbook-looking query ignores its index. Root cause: WHERE lower(email) = 'x' can’t use a plain index on email (it needs an expression index, CREATE INDEX ... ON t (lower(email))); WHERE id = '42' with id an integer forces an implicit cast that can skip the index; a text column compared against a non-matching collation does the same. Prevention: index the exact expression you filter on, and keep types aligned between column and literal.
Bloat and stale statistics. Symptom: a query that was fast for a year suddenly picks a bad plan after a bulk load. Root cause: indexes have grown well past live data from churn, and the planner’s row-count estimates are stale, so it mis-costs the plan and flips from Index Scan to Seq Scan (or a bad join). Prevention: keep autovacuum/ANALYZE current, REINDEX CONCURRENTLY when size drifts past ~2× live data, and ANALYZE after big data movements.
An index the planner doesn’t use is pure liability: it slows every write, consumes disk, and evicts useful pages from cache, all for zero read benefit. “We have an index on that column” is not the same statement as “the query uses an index on that column.” Before you add one, confirm with
EXPLAINthat a real query will use it; after you add it, confirm it’s actually chosen — and re-confirm after your data distribution shifts, because the index that was right at 1,000 rows per key can be useless at 400,000.
Scaling it
Indexing strategy changes shape as data grows. What works at a million rows is the wrong default at a billion.
- First, index the workload, not the schema. Pull the actually-slow queries from
pg_stat_statements(or the slow query log) and design composite indexes around their real filter-plus-sort shape. Most over-indexed tables got that way by indexing columns that “seemed important” rather than queries that were measured slow. - Watch the working set, not just the data set. A B-tree only stays fast while its upper levels (and ideally the leaves you hit) live in buffer cache. Once total index size dwarfs RAM and access is random, every lookup becomes a disk seek and your
O(log n)turns intoO(log n)disk reads. This is the wall that pushes you toward partitioning. - Partition before the tree gets too deep and the working set too cold. Range-partition a huge table by a natural key — time, tenant — so each partition’s index is smaller, hotter, and prunable: the planner skips partitions that can’t match (
WHERE created_at >= '2026-06'touches one monthly partition’s index, not the whole table’s). See Sharding & Partitioning. The interaction to watch is that your index choice and partition key now have to agree — a query that doesn’t include the partition key can’t prune and scans every partition’s index. - Match the index type to the data’s physical order at scale. This is where BRIN earns its keep: on a multi-terabyte append-only table, a BRIN on the time column is megabytes where a B-tree is tens of gigabytes, because the data is physically time-ordered. The B-tree’s per-row precision is wasted when you always query time ranges.
- At extreme write scale, reconsider the structure entirely. If your write rate is the wall and your reads are mostly key/range, an LSM-based store like Cassandra is built for the bias you have — cheap appends, compaction later — where a B-tree’s in-place maintenance is exactly the cost you can’t afford. That’s not an index tuning decision; it’s an engine decision, and it’s the honest one when the write tax dominates.
When to reach for it (and when not to)
Add an index when you have:
- A selective filter on a large table (narrows to a small fraction of rows).
- A foreign key you join on — index it; unindexed FK joins are a classic silent
O(n×m). - A column (or column set) you
ORDER BYand paginate, especially withLIMIT— build it as a composite with the sort column last. - A uniqueness constraint (the unique index enforces it and speeds the existence check).
- A hot query reading a small, stable column set — make it covering with
INCLUDEfor an index-only scan.
Don’t add an index for:
- Low-selectivity columns (a
statuswith three values) — use a partial index if the useful subset is small, otherwise let it scan. - Small tables — a sequential scan of a few thousand rows beats the index descent plus heap hop; the planner knows this and will ignore your index anyway.
- Write-hot tables, beyond what reads genuinely justify — every index is a permanent tax on the ingestion path.
- Columns you only ever query wrapped in a function, unless you build the matching expression index.
- Speculation. “We might query this someday” is how tables end up with eleven indexes and crawling writes. Add the index when the query exists and
EXPLAINproves it helps.
When to consider alternatives
When indexing the relational table is the wrong tool for the job:
- Full-text search, relevance ranking, fuzzy matching → Elasticsearch as a secondary index, not GIN stretched past its comfort zone.
- Write throughput is the wall, reads are key/range → an LSM store like Cassandra, whose write-cheap design is the opposite bias from a B-tree.
- Single-digit-millisecond hot lookups on a small key set → put Redis or another cache in front; the fastest index lookup still loses to an in-memory hash hit.
- Single-table, predictable access at massive scale → DynamoDB, where the partition/sort key is the access pattern and you design the table around the query rather than bolting indexes on after.
- One node’s index working set no longer fits in RAM → Sharding & Partitioning to shrink each shard’s index back into cache.
- Reads are saturating the primary → fan them out to read replicas and accept replication lag, rather than over-indexing the primary to squeeze it.
Operational checklist
- Run
EXPLAIN (ANALYZE, BUFFERS)on every candidate query before and after adding an index, and confirm the index is actually chosen — not merely present. - Order composite index columns equality-first, range/sort column last; verify the leftmost prefix matches your
WHEREandORDER BYshape (including sort direction). - Make hot, narrow queries covering with
INCLUDEpayload to earn index-only scans; in Postgres, keepVACUUMcurrent so the visibility map doesn’t force heap fetches. - Treat index count on write-hot tables as a budget; for each index, be able to name the query it pays for.
- Find and drop zero-scan indexes (
pg_stat_user_indexes.idx_scan = 0over a representative window) to recover write throughput and disk. - Prefer monotonic primary keys (
BIGINTidentity) over random UUIDs to minimize page splits and, under InnoDB, secondary-index bloat. - Keep
autovacuum/ANALYZEcurrent;REINDEX CONCURRENTLYwhen index size drifts past ~2× live data, andANALYZEafter bulk loads. - Alert on plan regressions: a query flipping from
Index ScantoSeq Scanafter a data-volume or statistics change is an early warning, not a curiosity. - For huge append-only tables, evaluate BRIN over B-tree on the naturally-ordered column before paying for a giant B-tree.
- When a query filters on a function or a non-matching type, build the matching expression index or fix the type — don’t assume the column index covers it.
Summary
An index is a separate sorted copy of some columns, synced on every write, that lets reads binary-search instead of scan — and every property worth knowing falls out of that one sentence. B-trees are the default because their high fanout makes them shallow and flexible, serving equality, ranges, and sorts. The rules that decide whether your index is even usable are mechanical: a secondary index costs an extra hop unless it’s covering; a composite index serves only a leftmost prefix and wants equality columns first, the sort column last; low-selectivity filters lose to a scan; and a function or type mismatch silently defeats the whole thing. Against all of that sits the write tax — every index is maintained synchronously on every write, so an unused index is pure liability. The discipline is small and repeatable: index the measured-slow workload, prove each index with EXPLAIN, build covering composites for hot ordered queries, drop the dead weight, and keep statistics and bloat in check. Do that and the planner becomes predictable. Skip it and you get the worst of both halves of the deal — slow writes and slow reads.
Appendix: reading EXPLAIN
EXPLAIN shows the plan; EXPLAIN (ANALYZE, BUFFERS) runs the query and shows what actually happened plus the I/O. Read it bottom-up — the deepest, most-indented node runs first, feeding its parent. What to look for:
Seq Scanon a large table inside a selective query is the classic smell: either the index is missing, ineligible (leftmost-prefix miss, function mismatch), or the planner judged the predicate too broad to bother.Index Scan— descends the index and fetches matching rows from the heap. Good for selective filters.Index Only Scan— answered from the index alone, no heap trip. This is the goal for covering indexes; if you expected it but seeIndex Scan, your index isn’t covering or vacuum is behind.Bitmap Heap Scan(preceded by aBitmap Index Scan) — the planner expects many matching rows, so it gathers their locations into a bitmap and reads the heap in physical order to avoid random I/O. Normal for medium-selectivity predicates; suspicious if you expected a tight index scan.Sortsitting above an index scan, especially with aLIMITabove that — the index found rows but didn’t provide order, so the engine is sorting. This is the motivating-failure signature; a composite index with the sort column can erase theSortnode.- Estimated vs actual rows is the single most important comparison. A node estimating
rows=1that actually returnsrows=2000000means the planner was lied to by stale statistics and likely picked a bad join (a nested loop where a hash join belonged). The fix is usually a freshANALYZE, sometimes a higherdefault_statistics_targeton a skewed column. Buffers: shared hit=... read=...—hitis cache,readis disk. A plan that’s “fast” incostbut shows hugereadcounts is I/O-bound and will fall apart when the cache is cold.
EXPLAIN (ANALYZE, BUFFERS)
SELECT total, status FROM orders
WHERE customer_id = 8842
ORDER BY created_at DESC
LIMIT 20;
If that returns Index Only Scan using idx_orders_cust_created with no Sort node and a low read count, your composite-plus-covering index is doing exactly its job.
Further reading
Incidents & deep-dives
Where this system breaks in production — and how it comes back.
No incident deep-dives yet. See the roadmap for what's coming.