Index Bloat, Maintenance, and Write-Side Costs
Index Bloat, Maintenance, and Write-Side Costs
The content platform’s articles table started with 4 indexes and 28,000 INSERT ops/s throughput. Over 18 months, the team added 8 more indexes to serve new query patterns. INSERT throughput dropped to 7,800 ops/s. Index sizes grew 3-5x beyond their theoretical minimum. The team’s response to slow reads was more indexes. The result was slow writes.
This section covers the three maintenance concerns: detecting bloat, remediating bloat without downtime, and measuring the ongoing write-side cost of each index.
Understanding Index Bloat
Index bloat has two causes:
1. Dead tuples from UPDATEs: When a row is updated and the updated column is indexed, PostgreSQL inserts a new index entry and the old entry becomes dead. VACUUM marks dead entries as reusable, but it does not compact the index. The pages containing dead entries remain allocated.
2. Page splits leaving half-full pages: When an INSERT fills a B-tree leaf page, PostgreSQL splits it into two pages, each approximately 50% full. If subsequent inserts go to other pages (as happens with random-order inserts on UUID or timestamp columns), these half-full pages remain, wasting space.
Over time, both effects cause the index to occupy more pages than necessary. Scans read more pages, the buffer pool caches fewer useful pages, and I/O increases.
Detecting Index Bloat
Method 1: pgstattuple Extension
The most accurate bloat measurement uses pgstattuple:
CREATE EXTENSION IF NOT EXISTS pgstattuple;
SELECT
indexrelname,
pg_relation_size(indexrelid) AS size_bytes,
(pgstatindex(indexrelid)).tree_level AS levels,
(pgstatindex(indexrelid)).leaf_pages,
(pgstatindex(indexrelid)).internal_pages,
(pgstatindex(indexrelid)).dead_items,
(pgstatindex(indexrelid)).avg_leaf_density,
round(100 - (pgstatindex(indexrelid)).avg_leaf_density, 1) AS bloat_pct
FROM pg_stat_user_indexes
WHERE relname = 'articles'
ORDER BY pg_relation_size(indexrelid) DESC;
indexrelname | size_bytes | levels | leaf_pages | internal | dead_items | avg_leaf_density | bloat_pct
------------------------------+------------+--------+------------+----------+------------+------------------+----------
idx_articles_published_at | 12058624 | 3 | 1342 | 4 | 248000 | 52.3 | 47.7
idx_articles_view_count | 18874368 | 3 | 2148 | 6 | 842000 | 28.4 | 71.6
articles_pkey | 11272192 | 3 | 1298 | 4 | 0 | 89.2 | 10.8
idx_articles_slug | 22413312 | 3 | 2612 | 7 | 1200 | 84.1 | 15.9
idx_articles_category_id | 5242880 | 3 | 594 | 2 | 42000 | 61.2 | 38.8
Key metrics:
- avg_leaf_density: Percentage of leaf page space used by live entries. 90% is healthy. Below 60% indicates significant bloat.
- dead_items: Index entries pointing to dead heap tuples. High counts mean VACUUM is not keeping up.
- bloat_pct:
100 - avg_leaf_density. Above 40% warrants remediation.
The view_count index has 71.6% bloat (28.4% leaf density). It occupies 18 MB but should occupy approximately 5 MB. The extra 13 MB is wasted pages that every index scan must skip.
Method 2: Size vs Expected Size
Compare actual index size against the expected size based on row count and key width:
-- Estimated minimum index size
-- Formula: (rows * avg_entry_width) / (page_size * fill_factor)
-- Default fill_factor: 90% for B-tree, page_size: 8192 bytes
-- Bigint key (8 bytes) + TID (6 bytes) + overhead (8 bytes) = 22 bytes per entry
SELECT
indexrelname,
pg_relation_size(indexrelid) AS actual_bytes,
(reltuples * 22 / (8192 * 0.9))::bigint * 8192 AS expected_bytes,
round(100.0 * (pg_relation_size(indexrelid) -
(reltuples * 22 / (8192 * 0.9))::bigint * 8192) /
NULLIF(pg_relation_size(indexrelid), 0), 1) AS overhead_pct
FROM pg_stat_user_indexes sui
JOIN pg_class c ON sui.indexrelid = c.oid
WHERE relname = 'articles'
ORDER BY overhead_pct DESC;
indexrelname | actual_bytes | expected_bytes | overhead_pct
------------------------------+--------------+----------------+-------------
idx_articles_view_count | 18874368 | 5242880 | 72.2
idx_articles_published_at | 12058624 | 5242880 | 56.5
idx_articles_category_id | 5242880 | 3145728 | 40.0
articles_pkey | 11272192 | 9437184 | 16.3
idx_articles_slug | 22413312 | 19922944 | 11.1
Method 3: Unused Index Detection
Before remediating bloat, check if the index is even used:
SELECT
indexrelname,
idx_scan,
idx_tup_read,
idx_tup_fetch,
pg_relation_size(indexrelid) AS size_bytes
FROM pg_stat_user_indexes
WHERE relname = 'articles'
AND idx_scan = 0
ORDER BY pg_relation_size(indexrelid) DESC;
indexrelname | idx_scan | idx_tup_read | idx_tup_fetch | size_bytes
----------------------------------+----------+--------------+---------------+-----------
idx_articles_created_updated | 0 | 0 | 0 | 9437184
idx_articles_excerpt_length | 0 | 0 | 0 | 3604480
Two indexes with zero scans since the last statistics reset. They cost write performance and storage for no read benefit. Drop them:
-- Verify the reset time to ensure stats cover a representative period
SELECT stats_reset FROM pg_stat_bgwriter;
-- 2026-04-01 00:00:00+00 (58 days of statistics)
-- Safe to drop: no scans in 58 days
DROP INDEX idx_articles_created_updated;
DROP INDEX idx_articles_excerpt_length;
Always check stats_reset to ensure the statistics cover a representative period including all query patterns (weekday peaks, monthly reports, batch jobs).
Remediating Bloat: REINDEX vs pg_repack
REINDEX
REINDEX rebuilds an index from scratch. It acquires an ACCESS EXCLUSIVE lock on the table, blocking all reads and writes:
-- Blocks all queries on the table for the duration
REINDEX INDEX idx_articles_view_count;
For the content platform’s view_count index (18 MB, 71.6% bloat):
REINDEX
Time: 842.112 ms
After REINDEX:
SELECT
pg_relation_size('idx_articles_view_count') AS size_bytes,
(pgstatindex('idx_articles_view_count')).avg_leaf_density,
(pgstatindex('idx_articles_view_count')).dead_items
FROM pg_stat_user_indexes
WHERE indexrelname = 'idx_articles_view_count';
size_bytes | avg_leaf_density | dead_items
------------+------------------+-----------
3604480 | 89.8 | 0
From 18 MB to 3.6 MB. Leaf density from 28.4% to 89.8%. But the 842 ms lock blocked all queries on the articles table. At 12,000 requests/second, that is 10,000+ failed requests.
REINDEX CONCURRENTLY
PostgreSQL 12+ supports concurrent reindexing:
-- Does not block reads or writes (PostgreSQL 12+)
REINDEX INDEX CONCURRENTLY idx_articles_view_count;
REINDEX CONCURRENTLY builds a new index alongside the old one, then swaps them atomically. It requires approximately 2x the disk space during the rebuild and takes longer (because it must track changes made during the rebuild):
REINDEX
Time: 2841.442 ms
3.4x slower than the blocking REINDEX, but zero downtime. The trade-off is temporary disk space: during the rebuild, both the old index (18 MB) and the new index (3.6 MB) exist simultaneously.
pg_repack
pg_repack is an extension that repacks tables and indexes without locks. It works by creating a trigger on the target table that captures changes during the repack, then applying them to the new table/index:
# Install pg_repack
apt-get install postgresql-16-repack
# Repack a specific index
pg_repack --no-superuser-check -d content_platform \
--index idx_articles_view_count
INFO: repacking index "idx_articles_view_count"
INFO: index "idx_articles_view_count" was repacked: 18874368 -> 3604480 bytes
Time: 1842.112 ms
pg_repack advantages over REINDEX CONCURRENTLY:
- Can repack the table heap as well (not just indexes), eliminating heap bloat
- Handles all indexes on a table in a single pass
- Works on PostgreSQL versions before 12
pg_repack disadvantages:
- Requires the
pg_repackextension installed on the server - Creates a trigger during the repack, adding minor write overhead
- Requires more temporary disk space
Repack All Bloated Indexes
Automate bloat remediation with a maintenance script:
-- Find all indexes with >40% bloat
SELECT format('REINDEX INDEX CONCURRENTLY %I;', indexrelname) AS reindex_cmd
FROM pg_stat_user_indexes sui
JOIN pg_class c ON sui.indexrelid = c.oid
WHERE round(100 - (pgstatindex(sui.indexrelid)).avg_leaf_density, 1) > 40
ORDER BY pg_relation_size(indexrelid) DESC;
reindex_cmd
--------------------------------------------------
REINDEX INDEX CONCURRENTLY idx_articles_view_count;
REINDEX INDEX CONCURRENTLY idx_articles_published_at;
REINDEX INDEX CONCURRENTLY idx_articles_category_id;
Schedule this during low-traffic hours. Each REINDEX CONCURRENTLY can run in parallel with normal operations but consumes CPU and I/O.
Write-Side Cost: Detailed Measurement
The main chapter showed that each index reduces INSERT throughput by approximately 50% of the remaining capacity. This section provides detailed measurements for different index types.
Benchmark Setup
@BenchmarkMode(Mode.Throughput)
@OutputTimeUnit(TimeUnit.SECONDS)
@State(Scope.Benchmark)
@Warmup(iterations = 5, time = 10)
@Measurement(iterations = 10, time = 10)
@Fork(2)
@Threads(10)
public class IndexWriteCostBenchmark {
private DataSource dataSource;
private AtomicLong counter = new AtomicLong();
@Param({"0", "1", "2", "4", "6", "8", "12"})
private int indexCount;
@Setup(Level.Trial)
public void setup() throws SQLException {
dataSource = createHikariDataSource(10);
createTable(indexCount);
}
@Benchmark
public int insertArticle() throws SQLException {
long id = counter.incrementAndGet();
try (Connection conn = dataSource.getConnection();
PreparedStatement ps = conn.prepareStatement(
"INSERT INTO bench_articles (title, slug, content, status, " +
"category_id, author_id, view_count, published_at) " +
"VALUES (?, ?, ?, ?, ?, ?, ?, ?)")) {
ps.setString(1, "Article " + id);
ps.setString(2, "article-" + id);
ps.setString(3, "Content body for article " + id);
ps.setString(4, "draft");
ps.setInt(5, ThreadLocalRandom.current().nextInt(1, 100));
ps.setInt(6, ThreadLocalRandom.current().nextInt(1, 50));
ps.setInt(7, 0);
ps.setTimestamp(8, Timestamp.from(Instant.now()));
return ps.executeUpdate();
}
}
}
Results (PostgreSQL 16, 16 cores, NVMe SSD, 10 threads, HikariCP pool=10):
Indexes | Throughput (ops/s) | Latency P50 (ms) | P99 (ms) | WAL (MB/s)
--------+--------------------+-------------------+----------+-----------
0 | 28,421 | 0.31 | 1.12 | 8.4
1 | 23,842 | 0.38 | 1.42 | 10.2
2 | 19,847 | 0.45 | 1.84 | 12.8
4 | 14,210 | 0.62 | 2.84 | 18.4
6 | 10,842 | 0.82 | 4.21 | 24.2
8 | 8,421 | 1.04 | 5.84 | 31.8
12 | 5,124 | 1.72 | 9.42 | 48.2
Observations:
- Throughput: Each additional index reduces throughput by 12-18%. The relationship is sublinear because B-tree insertions share the buffer pool and WAL writer.
- P99 latency: Grows faster than P50 because page splits are infrequent but expensive. At 12 indexes, a page split on one index can cascade to splits on others.
- WAL volume: Each index insertion generates WAL records. At 12 indexes, WAL volume is 5.7x higher than with no indexes. This impacts replication lag and checkpoint duration.
UPDATE Cost with Index Interaction
UPDATEs that modify indexed columns are more expensive than UPDATEs to non-indexed columns because they must delete the old index entry and insert a new one:
Operation | Throughput (ops/s) | vs Baseline
---------------------------------+--------------------+-----------
INSERT (6 indexes) | 10,842 | 1.00x
UPDATE non-indexed column | 18,421 | 1.70x
UPDATE 1 indexed column | 12,842 | 1.18x
UPDATE 3 indexed columns | 8,124 | 0.75x
UPDATE all 6 indexed columns | 5,421 | 0.50x
Updating a non-indexed column is faster than INSERT because it does not write to any index (and may qualify for a HOT update). Updating all 6 indexed columns is slower than INSERT because each old index entry must be marked dead.
HOT Update Effectiveness
HOT (Heap-Only Tuple) updates bypass index maintenance entirely. They occur when:
- The updated column is not part of any index
- The new tuple fits on the same heap page as the old tuple
-- Check HOT update eligibility for each column
SELECT
attname,
EXISTS (
SELECT 1 FROM pg_index
WHERE indrelid = 'articles'::regclass
AND attnum = ANY(indkey)
) AS is_indexed
FROM pg_attribute
WHERE attrelid = 'articles'::regclass
AND attnum > 0
AND NOT attisdropped
ORDER BY attnum;
attname | is_indexed
-----------------+-----------
id | true -- PK
title | false
slug | true -- unique index
content | false
excerpt | false
status | true -- partial indexes
view_count | true -- view_count index
published_at | true -- composite indexes
created_at | false
updated_at | false
author_id | true -- FK index
category_id | true -- composite indexes
featured_image | false
meta_description| false
Columns title, content, excerpt, created_at, updated_at, featured_image, and meta_description are not indexed. Updates to these columns are HOT-eligible. The view_count column is indexed and updated 10,000 times per second. Dropping the view_count index makes these updates HOT-eligible:
-- Before dropping view_count index
SELECT n_tup_upd, n_tup_hot_upd,
round(100.0 * n_tup_hot_upd / NULLIF(n_tup_upd, 0), 1) AS hot_pct
FROM pg_stat_user_tables WHERE relname = 'articles';
n_tup_upd | n_tup_hot_upd | hot_pct
-----------+---------------+--------
26400000 | 1320000 | 5.0
DROP INDEX idx_articles_view_count;
-- Reset stats and wait for representative sample
SELECT pg_stat_reset_single_table_counters('articles'::regclass);
-- ... wait for traffic ...
n_tup_upd | n_tup_hot_upd | hot_pct
-----------+---------------+--------
8800000 | 8360000 | 95.0
HOT updates go from 5% to 95%. This eliminates index bloat for the most frequent UPDATE operation.
Autovacuum Tuning for Index-Heavy Tables
Autovacuum removes dead heap tuples and marks dead index entries as reusable. Its default settings are conservative. For the content platform’s write-heavy tables, aggressive settings are necessary:
-- Default autovacuum settings (too conservative for write-heavy tables)
-- autovacuum_vacuum_threshold = 50
-- autovacuum_vacuum_scale_factor = 0.20 (20% of table must be dead)
-- autovacuum_vacuum_cost_limit = 200
-- For articles table (500,000 rows, UPDATE-heavy)
ALTER TABLE articles SET (
autovacuum_vacuum_threshold = 500,
autovacuum_vacuum_scale_factor = 0.02, -- 2% instead of 20%
autovacuum_analyze_threshold = 500,
autovacuum_analyze_scale_factor = 0.02,
autovacuum_vacuum_cost_delay = 2 -- 2ms instead of 20ms
);
-- For view_events table (INSERT-only, high volume)
ALTER TABLE view_events SET (
autovacuum_vacuum_threshold = 10000,
autovacuum_vacuum_scale_factor = 0.01, -- 1%
autovacuum_vacuum_cost_delay = 0 -- No delay: fastest possible
);
The autovacuum_vacuum_cost_delay setting controls how aggressively VACUUM runs. The default 20 ms delay means VACUUM sleeps 20 ms for every 200 cost units of work. Setting it to 2 ms (or 0 for INSERT-only tables) allows VACUUM to complete faster, keeping dead tuples low and preventing index bloat.
Monitor autovacuum effectiveness:
SELECT
relname,
n_dead_tup,
n_live_tup,
round(100.0 * n_dead_tup / NULLIF(n_live_tup + n_dead_tup, 0), 1) AS dead_pct,
last_vacuum,
last_autovacuum,
autovacuum_count
FROM pg_stat_user_tables
WHERE relname IN ('articles', 'view_events', 'article_tags')
ORDER BY n_dead_tup DESC;
relname | n_dead_tup | n_live_tup | dead_pct | last_autovacuum | autovacuum_count
--------------+------------+------------+----------+---------------------+-----------------
view_events | 84200 | 28400000 | 0.3 | 2026-05-29 14:32:11 | 4821
articles | 2400 | 500000 | 0.5 | 2026-05-29 14:28:42 | 1842
article_tags | 120 | 1360000 | 0.0 | 2026-05-29 14:30:18 | 842
Dead tuple ratios below 1% indicate healthy autovacuum performance. If dead_pct exceeds 5% consistently, autovacuum is falling behind. Increase autovacuum_max_workers (default 3) or reduce the cost delay further.
Index Maintenance Schedule
Combine bloat detection, unused index removal, and REINDEX into a maintenance routine:
-- Weekly maintenance script
-- 1. Identify unused indexes (candidates for removal)
SELECT indexrelname, idx_scan, pg_relation_size(indexrelid) AS size
FROM pg_stat_user_indexes
WHERE idx_scan = 0 AND NOT indisunique
ORDER BY pg_relation_size(indexrelid) DESC;
-- 2. Identify bloated indexes (candidates for REINDEX)
SELECT indexrelname,
pg_relation_size(indexrelid) AS actual_size,
round(100 - (pgstatindex(indexrelid)).avg_leaf_density, 1) AS bloat_pct
FROM pg_stat_user_indexes
WHERE relname IN ('articles', 'view_events')
AND round(100 - (pgstatindex(indexrelid)).avg_leaf_density, 1) > 40
ORDER BY pg_relation_size(indexrelid) DESC;
-- 3. REINDEX bloated indexes concurrently
-- (Run each one; they are safe to execute during production traffic)
-- REINDEX INDEX CONCURRENTLY <index_name>;
-- 4. Update table statistics
ANALYZE articles;
ANALYZE view_events;
ANALYZE article_tags;
The cost of not maintaining indexes is gradual: queries slow by 2-5% per month as bloat grows, then suddenly slow by 50%+ when the working set exceeds the buffer pool. Regular maintenance prevents this cliff.
Summary: The Index Cost Budget
Think of indexes as a budget. Each index costs:
| Cost Type | Per Index | Cumulative (12 indexes) |
|---|---|---|
| INSERT throughput | -12-18% | -82% |
| UPDATE (indexed col) | -15-20% | depends on columns |
| WAL volume | +2 MB/s | +40 MB/s |
| Storage | 3-40 MB | 36-480 MB |
| VACUUM work | Proportional | 12x baseline |
| Replication lag | +WAL volume | Significant |
For the content platform, the budget allows 6-8 indexes on the articles table. Each index must justify its existence through measured read-path improvement identified via pg_stat_statements (Chapter 17). Unused indexes are waste. Bloated indexes are debt. Both compound over time.