Index Design Under Load: Partial Indexes, Covering Indexes, and Index Bloat Over Time
Index Design Under Load: Partial Indexes, Covering Indexes, and Index Bloat Over Time
The content platform’s articles table has 500,000 rows and 12 indexes. Every INSERT writes to all 12 indexes. Every UPDATE that modifies an indexed column creates a new index entry and marks the old one as dead. Over time, dead entries accumulate. The indexes grow. Scans slow down. The team adds more indexes to compensate. The writes slow further.
Index design is a read/write trade-off with a time dimension. An index accelerates reads the moment it is created. It penalizes writes from the same moment. And it degrades over months as dead tuples accumulate. This chapter covers the mechanics of B-tree indexes, the advanced index types that shift the trade-off, and the maintenance required to keep indexes effective.
B-tree Internals: Pages, Levels, and Selectivity
Every PostgreSQL B-tree index is a balanced tree of 8 KB pages. Understanding the physical structure explains why some queries touch 3 pages and others touch 3,000.
A B-tree has three types of pages:
- Root page: The single entry point. Contains pointers to internal pages (or leaf pages for small indexes).
- Internal pages: Contain keys and pointers to child pages. Each internal page holds approximately 300-400 entries for a bigint key.
- Leaf pages: Contain the actual index entries (key value + heap TID pointing to the table row). Each leaf page holds approximately 200-350 entries depending on key size.
For the content platform’s articles table with 500,000 rows:
-- Index size and page count
SELECT
indexrelname,
pg_relation_size(indexrelid) AS size_bytes,
pg_relation_size(indexrelid) / 8192 AS pages,
idx_scan,
idx_tup_read
FROM pg_stat_user_indexes
WHERE relname = 'articles'
ORDER BY pg_relation_size(indexrelid) DESC;
indexrelname | size_bytes | pages | idx_scan | idx_tup_read
---------------------------+------------+--------+----------+-------------
idx_articles_search_vec | 41943040 | 5120 | 842000 | 3535400
idx_articles_published_at | 9035776 | 1103 | 1248000 | 24960000
articles_pkey | 11272192 | 1376 | 984000 | 984000
idx_articles_slug | 22413312 | 2736 | 124000 | 124000
idx_articles_status | 3604480 | 440 | 12400 | 12400000
idx_articles_category_id | 3604480 | 440 | 98400 | 98400000
idx_articles_author_id | 3604480 | 440 | 42000 | 1680000
idx_articles_view_count | 3604480 | 440 | 8200 | 164000
The idx_articles_published_at index has 1,103 pages. With 500,000 rows and approximately 450 entries per leaf page, that is about 1,111 leaf pages. The tree has 3 levels:
- Level 0 (root): 1 page
- Level 1 (internal): 3 pages (each internal page points to ~370 leaf pages)
- Level 2 (leaves): 1,099 pages
An index lookup for a single value traverses root (1) + internal (1) + leaf (1) + heap (1) = 4 page reads. This is why single-row index lookups consistently show Buffers: shared hit=4.
Selectivity
Selectivity determines whether PostgreSQL uses an index. Selectivity is the fraction of rows an index condition matches:
-- Column selectivity
SELECT
attname,
n_distinct,
most_common_vals,
most_common_freqs
FROM pg_stats
WHERE tablename = 'articles' AND attname = 'status';
attname | n_distinct | most_common_vals | most_common_freqs
---------+------------+------------------------------+---------------------
status | 3 | {published,draft,archived} | {0.974,0.024,0.002}
The status column has 3 distinct values. published matches 97.4% of rows. An index scan for status = 'published' would read 97.4% of the index’s leaf pages, then fetch 97.4% of the heap pages in random order. A sequential scan reads all heap pages in sequential order. Random I/O for 97.4% of the table is slower than sequential I/O for 100%. PostgreSQL correctly chooses Seq Scan.
But status = 'draft' (2.4%) and status = 'archived' (0.2%) are highly selective. An index scan is fast for these values. The problem: the index on status occupies 440 pages (3.4 MB) and is maintained on every INSERT and UPDATE, but it is only useful for 2.6% of queries. This is where partial indexes help.
Partial Indexes: Indexing Only What Matters
A partial index includes only rows matching a WHERE clause:
-- Full index: 440 pages, covers all 500,000 rows
CREATE INDEX idx_articles_status ON articles (status);
-- Partial index: only non-published articles (13,000 rows)
CREATE INDEX idx_articles_draft ON articles (status)
WHERE status != 'published';
The partial index covers 13,000 rows instead of 500,000. It occupies approximately 12 pages instead of 440. It is maintained only when rows transition to/from draft/archived status, not for every INSERT (which defaults to draft) or every UPDATE.
Query for draft articles:
EXPLAIN (ANALYZE, BUFFERS)
SELECT id, title, slug, updated_at
FROM articles
WHERE status = 'draft'
ORDER BY updated_at DESC;
Before (full index):
Sort (cost=842.33..872.33 rows=12000 width=82)
(actual time=4.221..4.842 rows=12000 loops=1)
Sort Key: updated_at DESC
Sort Method: quicksort Memory: 1842kB
Buffers: shared hit=1248
-> Bitmap Heap Scan on articles (cost=52.18..542.33 rows=12000 width=82)
(actual time=0.842..2.442 rows=12000 loops=1)
Recheck Cond: (status = 'draft')
Heap Blocks: exact=842
Buffers: shared hit=1248
-> Bitmap Index Scan on idx_articles_status
(cost=0.00..49.18 rows=12000 width=0)
(actual time=0.612..0.612 rows=12000 loops=1)
Buffers: shared hit=406
Planning Time: 0.112 ms
Execution Time: 5.112 ms
After (partial index with sort column):
CREATE INDEX idx_articles_draft_updated
ON articles (updated_at DESC)
WHERE status = 'draft';
Index Scan using idx_articles_draft_updated on articles
(cost=0.29..842.33 rows=12000 width=82)
(actual time=0.031..1.842 rows=12000 loops=1)
Buffers: shared hit=842
Planning Time: 0.089 ms
Execution Time: 2.112 ms
The partial index eliminates the sort (the index is already ordered by updated_at DESC) and reduces buffer hits from 1,248 to 842 (no separate index scan for the status filter). From 5.1 ms to 2.1 ms.
For the content platform’s admin interface, which queries draft and archived articles frequently but published articles rarely (the public API uses the search index), partial indexes cover the admin workload without bloating the write path for the bulk of the data.
Covering Indexes: Eliminating Heap Fetches
A covering index contains all columns needed by a query, enabling an Index Only Scan that never touches the heap. PostgreSQL’s INCLUDE clause adds non-key columns to the index leaf pages:
-- Standard index: requires heap fetch for title, slug, excerpt
CREATE INDEX idx_articles_cat_published
ON articles (category_id, published_at DESC);
-- Covering index: all needed columns in the index
CREATE INDEX idx_articles_cat_published_covering
ON articles (category_id, published_at DESC)
INCLUDE (title, slug, excerpt, view_count);
The content platform’s category page loads the 20 most recent articles in a category:
EXPLAIN (ANALYZE, BUFFERS)
SELECT title, slug, excerpt, published_at, view_count
FROM articles
WHERE category_id = 5
AND status = 'published'
ORDER BY published_at DESC
LIMIT 20;
Without covering index:
Limit (cost=0.43..24.87 rows=20 width=292)
(actual time=0.052..0.312 rows=20 loops=1)
Buffers: shared hit=64
-> Index Scan Backward using idx_articles_cat_published on articles
(cost=0.43..4521.33 rows=7284 width=292)
(actual time=0.051..0.301 rows=20 loops=1)
Index Cond: (category_id = 5)
Filter: (status = 'published')
Rows Removed by Filter: 2
Buffers: shared hit=64
64 buffer hits: 4 for the index traversal plus approximately 20 heap page fetches (one per row, assuming rows are spread across different pages) plus overhead.
With covering index:
Limit (cost=0.43..4.87 rows=20 width=292)
(actual time=0.042..0.089 rows=20 loops=1)
Buffers: shared hit=4
-> Index Only Scan Backward using idx_articles_cat_published_covering on articles
(cost=0.43..2842.33 rows=7284 width=292)
(actual time=0.041..0.084 rows=20 loops=1)
Index Cond: (category_id = 5)
Filter: (status = 'published')
Rows Removed by Filter: 2
Heap Fetches: 0
Buffers: shared hit=4
4 buffer hits instead of 64. Heap Fetches: 0. From 0.312 ms to 0.089 ms. The index alone provides all requested data.
The trade-off: the covering index is larger. The standard composite index on (category_id, published_at) occupies approximately 1,400 pages. The covering index with INCLUDE (title, slug, excerpt, view_count) occupies approximately 4,200 pages because each leaf entry is wider. Every INSERT and every UPDATE to the included columns must update this larger index.
Multi-Column Index Ordering: The Left-Prefix Rule
A multi-column B-tree index on (a, b, c) supports queries that filter on:
aalonea AND ba AND b AND c
It does not efficiently support queries that filter on:
balonecaloneb AND c
This is the left-prefix rule. The B-tree is sorted first by a, then by b within each a value, then by c within each (a, b) pair.
Content platform example:
CREATE INDEX idx_articles_status_cat_published
ON articles (status, category_id, published_at DESC);
This index serves three query patterns:
-- Pattern 1: Uses columns (status) -> left prefix match
SELECT * FROM articles WHERE status = 'published';
-- Pattern 2: Uses columns (status, category_id) -> left prefix match
SELECT * FROM articles WHERE status = 'published' AND category_id = 5;
-- Pattern 3: Uses columns (status, category_id, published_at) -> full match
SELECT * FROM articles WHERE status = 'published' AND category_id = 5
ORDER BY published_at DESC LIMIT 20;
But not:
-- SLOW: Does not use the index (skips status)
SELECT * FROM articles WHERE category_id = 5;
Seq Scan on articles (cost=0.00..19924.00 rows=50000 width=15420)
(actual time=0.012..82.112 rows=49872 loops=1)
Filter: (category_id = 5)
Rows Removed by Filter: 450128
Buffers: shared hit=8924
PostgreSQL cannot skip the first column of a multi-column index. It must start at the leftmost column. The column order should follow the selectivity of your common query patterns: the most filtered column first, the sort column last.
Index Bloat: The Write-Side Time Bomb
PostgreSQL uses MVCC (Multi-Version Concurrency Control). When an UPDATE modifies a row, the old version remains in the heap (marked as dead) and a new version is created. If the updated column is indexed, a new index entry is created pointing to the new heap tuple. The old index entry pointing to the dead heap tuple becomes dead. VACUUM removes dead heap tuples and marks index entries pointing to them as available for reuse.
The content platform’s articles table updates view_count on every page view:
UPDATE articles SET view_count = view_count + 1 WHERE id = $1;
At 10,000 views per second, this creates 10,000 dead index entries per second in every index that includes view_count. If VACUUM cannot keep up, dead entries accumulate.
Measure index bloat:
-- Index bloat estimation
SELECT
schemaname || '.' || relname AS table,
indexrelname AS index,
pg_relation_size(indexrelid) AS index_size,
pg_stat_get_dead_tuples(indexrelid) AS dead_tuples,
round(100.0 * pg_stat_get_dead_tuples(indexrelid) /
NULLIF(pg_stat_get_live_tuples(i.indrelid) +
pg_stat_get_dead_tuples(indexrelid), 0), 1) AS dead_pct
FROM pg_stat_user_indexes sui
JOIN pg_index i ON sui.indexrelid = i.indexrelid
WHERE relname = 'articles'
ORDER BY pg_relation_size(indexrelid) DESC;
After one month without REINDEX:
table | index | index_size | dead_tuples | dead_pct
-----------------+----------------------------+------------+-------------+---------
public.articles | idx_articles_search_vec | 41943040 | 0 | 0.0
public.articles | idx_articles_view_count | 18874368 | 8420000 | 94.4
public.articles | idx_articles_published_at | 12058624 | 2480000 | 83.2
public.articles | articles_pkey | 11272192 | 0 | 0.0
The view_count index has 94.4% dead tuples. Its size grew from 3.6 MB (440 pages) to 18 MB (2,304 pages). Index scans now read 5x more pages because they must skip dead entries.
The fix has two parts: prevent unnecessary index updates and maintain existing indexes.
HOT Updates (Heap-Only Tuples)
When an UPDATE does not modify any indexed column and the new tuple fits on the same heap page, PostgreSQL can perform a HOT update. HOT updates do not create new index entries. They chain the new heap tuple to the old one on the same page.
The view_count column is in idx_articles_view_count. Every view_count update must create a new index entry. Removing this index (or redesigning it) enables HOT updates for view count increments:
-- Check HOT update ratio
SELECT relname,
n_tup_upd,
n_tup_hot_upd,
round(100.0 * n_tup_hot_upd / NULLIF(n_tup_upd, 0), 1) AS hot_pct
FROM pg_stat_user_tables
WHERE relname = 'articles';
relname | n_tup_upd | n_tup_hot_upd | hot_pct
----------+-----------+---------------+--------
articles | 26400000 | 1320000 | 5.0
Only 5% of updates are HOT because most updates modify view_count, which is indexed. After dropping idx_articles_view_count:
relname | n_tup_upd | n_tup_hot_upd | hot_pct
----------+-----------+---------------+--------
articles | 8800000 | 8360000 | 95.0
HOT updates jump from 5% to 95%. Index bloat drops proportionally. If the application needs to sort by view_count, pre-compute and cache popular article rankings instead of maintaining a live index on a column that updates 10,000 times per second.
Write-Side Cost: Measuring Insert Throughput per Index
Every index slows down every INSERT. Measure the impact:
// Benchmark: INSERT throughput with varying index count
@BenchmarkMode(Mode.Throughput)
@OutputTimeUnit(TimeUnit.SECONDS)
@State(Scope.Thread)
public class IndexInsertBenchmark {
private DataSource dataSource;
@Setup
public void setup() {
// Configure HikariCP pool: 10 connections
dataSource = createDataSource();
}
@Benchmark
public void insertWithZeroIndexes(Blackhole bh) throws SQLException {
try (Connection conn = dataSource.getConnection();
PreparedStatement ps = conn.prepareStatement(
"INSERT INTO articles_no_idx (title, slug, content, status) " +
"VALUES (?, ?, ?, ?)")) {
ps.setString(1, "Benchmark Article");
ps.setString(2, "benchmark-" + ThreadLocalRandom.current().nextLong());
ps.setString(3, "Content body text for benchmarking");
ps.setString(4, "draft");
bh.consume(ps.executeUpdate());
}
}
@Benchmark
public void insertWithSixIndexes(Blackhole bh) throws SQLException {
// Same insert, table has 6 B-tree indexes
try (Connection conn = dataSource.getConnection();
PreparedStatement ps = conn.prepareStatement(
"INSERT INTO articles_6idx (title, slug, content, status, " +
"category_id, published_at) VALUES (?, ?, ?, ?, ?, ?)")) {
ps.setString(1, "Benchmark Article");
ps.setString(2, "benchmark-" + ThreadLocalRandom.current().nextLong());
ps.setString(3, "Content body text for benchmarking");
ps.setString(4, "draft");
ps.setInt(5, ThreadLocalRandom.current().nextInt(1, 100));
ps.setTimestamp(6, Timestamp.from(Instant.now()));
bh.consume(ps.executeUpdate());
}
}
@Benchmark
public void insertWithTwelveIndexes(Blackhole bh) throws SQLException {
// Same insert, table has 12 B-tree indexes
// ...
}
}
Results (PostgreSQL 16, 16 cores, SSD, 10 concurrent threads):
Benchmark Mode Cnt Score Error Units
insertWithZeroIndexes thrpt 10 28421.3 +/- 842.1 ops/s
insertWithSixIndexes thrpt 10 14210.7 +/- 421.3 ops/s
insertWithTwelveIndexes thrpt 10 7842.1 +/- 312.4 ops/s
Each index costs approximately 50% of the remaining throughput. The relationship is roughly linear: going from 0 to 6 indexes cuts throughput by 50%, and from 6 to 12 cuts it by another 45%. This is because each index requires a separate B-tree insertion with its own page split potential, WAL logging, and buffer pool contention.
The content platform’s view_events table receives 10,000 INSERTs per second. With 6 indexes, maximum throughput is 14,210 ops/s, leaving only 42% headroom. Reducing to 2 essential indexes (primary key + article_id) increases throughput to approximately 22,000 ops/s, providing 120% headroom.
Trade-offs
Every index is a bet that reads will benefit more than writes will suffer. The bet has a time component: index bloat accumulates, write cost compounds, and VACUUM requires resources.
Partial indexes reduce the write-side cost by excluding the majority of rows. Covering indexes increase read performance at the cost of larger indexes that are more expensive to maintain. Multi-column indexes serve multiple query patterns with a single structure but only when queries match the left-prefix.
The decision framework:
- Identify hot queries via
pg_stat_statements(Chapter 17) - Check if existing indexes cover them:
EXPLAIN ANALYZE - Design the narrowest index that serves the query: partial if possible, covering if the query is on the hot path
- Measure write-side impact: benchmark INSERTs with and without the index
- Monitor bloat monthly:
pg_stat_user_indexesdead tuple ratio