B-Tree Internals: Page Splits, Fill Factors, and Write Amplification
B-Tree Internals: Page Splits, Fill Factors, and Write Amplification
The Black Box
You create an index on a column. PostgreSQL builds a B-Tree. Queries on that column become fast. You add more indexes to speed up other queries. Writes slow down. The DBA mentions “write amplification” and “page splits.” These are not abstract concepts. They are physical events happening on disk, and they determine whether your write-heavy ingestion pipeline meets its throughput target.
The Mechanism
A PostgreSQL B-Tree stores data in 8KB pages. Each page has a fixed layout:
- Page header (24 bytes): page LSN, checksum, flags, free space pointer
- Item pointer array: fixed-size entries pointing to the actual tuples within the page
- Tuples: the indexed key values and their pointers to the heap (table) row
A leaf page in a B-Tree index on package_id (a VARCHAR(12)) can hold approximately 300-400 entries, depending on alignment and padding. An internal page holds slightly more, because it stores only keys and child page pointers, not heap pointers.
Page Split Mechanics
When an INSERT adds a new key to a leaf page that is already full:
- PostgreSQL allocates a new page.
- It moves approximately half the entries from the full page to the new page.
- It inserts a pointer to the new page in the parent internal page.
- If the parent is also full, the split cascades upward.
- If the root page splits, a new root is created, increasing the tree depth by one.
Each of these operations generates WAL records. A single page split that cascades to the parent produces 4-6 WAL writes: the original leaf modification, the new leaf page, the parent modification, and possibly more if the cascade continues.
-- Concept: observing page splits via pg_stat_user_indexes
-- Monitor split activity during a bulk insert
SELECT indexrelname, idx_scan, idx_tup_read, idx_tup_fetch, pg_relation_size(indexrelid) as index_size
FROM pg_stat_user_indexes
WHERE schemaname = 'public' AND relname = 'package_events';
-- Before bulk insert:
-- indexrelname | index_size
-- idx_package_events_package_id | 224698368 (214 MB)
-- After inserting 1 million new events:
-- indexrelname | index_size
-- idx_package_events_package_id | 247463936 (236 MB)
-- The index grew 22 MB for 1 million new rows.
-- Each row's package_id is ~12 bytes. 1M * 12 bytes = 12 MB of raw key data.
-- The additional 10 MB is overhead: page headers, item pointers, free space
-- from page splits (pages are split at roughly 50% full).
Fill Factor
The fillfactor storage parameter controls how full PostgreSQL fills each leaf page during index creation and bulk inserts. The default for B-Tree indexes is 90%.
-- Concept: fill factor tradeoff
-- Lower fill factor = more free space per page = fewer splits during inserts
-- Higher fill factor = denser pages = fewer pages to read during scans
-- For a write-heavy table (package events arriving constantly):
CREATE INDEX idx_package_events_package_id
ON package_events (package_id)
WITH (fillfactor = 70);
-- For a read-heavy, rarely-updated lookup table (warehouse locations):
CREATE INDEX idx_warehouses_code
ON warehouses (warehouse_code)
WITH (fillfactor = 100);
A fillfactor of 70 leaves 30% free space in each leaf page. This space absorbs new inserts without triggering page splits, at the cost of reading more pages for range scans (because fewer keys fit per page). A fillfactor of 100 packs pages tight, minimizing the number of pages read during scans, but every insert into a full page triggers a split.
The Observable Consequence
Write amplification is the ratio of bytes written to persistent media versus bytes of logical data inserted. For a table with $k$ indexes, each INSERT generates:
- 1 heap page write (the table row)
- $k$ index page writes (one per index)
- 1 WAL record per heap write
- 1 WAL record per index write
- Additional WAL records for any page splits
A logistics platform’s package_events table with 3 indexes (on package_id, timestamp, and warehouse_id) has a write amplification of at least 4x for the data pages alone, plus the WAL records that double it to roughly 8x. This means a logical write of 200 bytes generates approximately 1,600 bytes of I/O.
At 2,000 events per minute, this is negligible. At 200,000 events per minute, the disk I/O from write amplification becomes the bottleneck before the CPU does.
The Decision Rule
Keep fillfactor at 90 (the default) for tables with a mix of reads and writes. Lower it to 70-80 for tables with heavy insert activity on indexed columns where the keys are not sequential (UUIDs, hash values). Raise it to 100 for lookup tables that are loaded once and queried many times.
If write amplification from indexes is your bottleneck, reduce the number of indexes rather than tuning fillfactor. Every index removed eliminates one page write and one WAL record per insert. This is a larger gain than any fillfactor adjustment.