Skip to main content
data systems from the ground up

Relational Databases Under the Hood

7 min read Chapter 13 of 36

Relational Databases Under the Hood

PostgreSQL is the logistics platform’s primary database. It stores package metadata, warehouse inventory, delivery assignments, and audit history. The team uses it through an ORM. The ORM generates SQL. The SQL hits PostgreSQL’s query planner, which chooses an execution strategy based on table statistics, available indexes, and cost estimates. The query runs against a B-Tree index (Chapter 2), reads data pages managed by the buffer pool (Chapter 3), and returns results filtered by MVCC visibility rules.

Each of these layers is a black box that opens during production incidents. This chapter opens them.

MVCC: Why Your SELECT Sees Stale Data

PostgreSQL does not lock rows for reading. Instead, it uses Multi-Version Concurrency Control (MVCC): each row can have multiple versions, and each transaction sees a consistent snapshot of the database as of its start time.

Every row (tuple) in PostgreSQL has two hidden columns:

  • xmin: the transaction ID that created this tuple version
  • xmax: the transaction ID that deleted or updated this tuple version (0 if still live)

When a transaction starts, it takes a snapshot: the set of all currently committed transaction IDs. A tuple is visible to the transaction if xmin is committed and in the snapshot, and xmax is either 0 (not deleted) or not yet committed at snapshot time.

-- Concept: observing MVCC tuple versions
-- The pageinspect extension reveals the physical structure

CREATE EXTENSION IF NOT EXISTS pageinspect;

-- Insert a package record
INSERT INTO packages (package_id, status) VALUES ('PKG-50001', 'CREATED');

-- View the tuple's hidden columns
SELECT xmin, xmax, package_id, status FROM packages WHERE package_id = 'PKG-50001';
-- xmin: 1001  xmax: 0  package_id: PKG-50001  status: CREATED

-- Update the package status
UPDATE packages SET status = 'IN_TRANSIT' WHERE package_id = 'PKG-50001';

-- The old tuple is not deleted. It is marked with xmax.
-- A new tuple is created with the new value.
SELECT ctid, xmin, xmax, package_id, status
FROM packages WHERE package_id = 'PKG-50001';
-- ctid: (0,2)  xmin: 1002  xmax: 0  package_id: PKG-50001  status: IN_TRANSIT

-- The old tuple (ctid 0,1) still exists on disk with xmax: 1002
-- It is invisible to new transactions but occupies space until vacuum removes it.

The consequence: every UPDATE creates a new tuple and marks the old one as dead. UPDATEs in PostgreSQL are not cheap. They are INSERT + soft-DELETE. A table with 1 million rows that receives 10,000 updates per hour accumulates 10,000 dead tuples per hour. After 24 hours: 240,000 dead tuples occupying space, bloating the table, and slowing sequential scans.

Vacuum: The Garbage Collector You Cannot Ignore

Dead tuples accumulate from UPDATE and DELETE operations. Vacuum reclaims the space occupied by dead tuples, making it available for reuse by future INSERTs.

Autovacuum runs automatically based on thresholds:

-- Concept: autovacuum trigger conditions
SHOW autovacuum_vacuum_threshold;       -- Default: 50
SHOW autovacuum_vacuum_scale_factor;    -- Default: 0.2

-- Autovacuum triggers when:
-- dead_tuples > autovacuum_vacuum_threshold + (autovacuum_vacuum_scale_factor * total_tuples)
-- For a table with 1,000,000 rows: trigger at 50 + 0.2 * 1,000,000 = 200,050 dead tuples
-- That is 20% of the table must be dead before autovacuum runs.

-- For the logistics platform's package_events table (10 million rows):
-- Trigger: 50 + 0.2 * 10,000,000 = 2,000,050 dead tuples
-- At 10,000 updates/hour, autovacuum will not run for 200 hours.
-- The table bloats significantly before cleanup.

-- Fix: lower the scale factor for high-update tables
ALTER TABLE package_events SET (autovacuum_vacuum_scale_factor = 0.01);
-- New trigger: 50 + 0.01 * 10,000,000 = 100,050 dead tuples
-- Autovacuum runs after ~10 hours of updates instead of 200.

The Vacuum Storm

When autovacuum falls behind, the table bloats. Queries slow down because sequential scans read dead tuples. Index scans slow down because the index points to pages filled with dead tuples that must be skipped. Eventually, a manual VACUUM FULL is required, which rewrites the entire table with an exclusive lock, blocking all reads and writes.

PostgreSQL MVCC and vacuum cycle showing live tuples, dead tuples, and vacuum reclamation

The diagram shows the MVCC lifecycle. Inserts create live tuples. Updates create new tuple versions and mark old ones as dead (xmax set). Vacuum scans the table, identifies dead tuples invisible to all active transactions, and marks their space as reusable. If vacuum cannot keep pace with updates, dead tuples accumulate, table size grows, and query performance degrades.

TOAST: Large Values and Their Hidden Cost

PostgreSQL stores rows in 8KB pages. When a row exceeds approximately 2KB (the TOAST threshold), large column values are compressed and/or stored in a separate TOAST table.

-- Concept: TOAST behavior for the audit_history table
-- The audit_details column contains JSON payloads averaging 8KB

SELECT
    pg_column_size(audit_details) as compressed_size,
    octet_length(audit_details::text) as raw_size
FROM audit_history
LIMIT 5;

-- compressed_size | raw_size
-- 2048            | 8192
-- 1856            | 7680
-- 2304            | 9216

-- PostgreSQL compressed the 8KB JSON to ~2KB using pglz.
-- The compressed value is stored in a TOAST table, not inline in the main table page.
-- Reading this row requires two I/O operations: one for the main table page,
-- one for the TOAST table page.

The hidden cost: queries that SELECT the large column trigger TOAST table reads. Queries that do not reference the column avoid the TOAST table entirely. An SELECT * FROM audit_history is significantly slower than SELECT audit_id, timestamp FROM audit_history because the former decompresses and fetches every TOAST value.

The Query Planner

PostgreSQL’s query planner chooses an execution strategy based on cost estimation. The planner estimates the cost (in arbitrary “cost units”) of each possible plan and chooses the cheapest one.

-- Concept: why the planner ignores your index
EXPLAIN ANALYZE
SELECT * FROM package_events
WHERE status = 'DELIVERED'
AND timestamp > '2024-11-01';

-- If 60% of rows have status='DELIVERED', the planner chooses a sequential scan
-- because reading 60% of the table via index is more expensive than reading
-- the whole table sequentially. The index introduces random I/O (jumping between
-- index pages and heap pages), while the sequential scan reads pages in order.

-- Seq Scan on package_events (cost=0.00..325891.00 rows=6200000 width=82)
--   Filter: (status = 'DELIVERED' AND timestamp > '2024-11-01')
-- Execution Time: 4821.15 ms

The planner’s decision depends on random_page_cost (default: 4.0) and seq_page_cost (default: 1.0). These defaults assume HDDs where random reads are 4x slower than sequential reads. On SSDs, the ratio is closer to 1.1:1.

-- Concept: adjusting cost parameters for SSD storage
ALTER SYSTEM SET random_page_cost = 1.1;
SELECT pg_reload_conf();

-- After reload, re-run the query:
EXPLAIN ANALYZE
SELECT * FROM package_events
WHERE status = 'DELIVERED'
AND timestamp > '2024-11-01';

-- The planner now prefers the index scan because random I/O
-- is no longer penalized 4x compared to sequential I/O.

The Decision Rule

Tune autovacuum_vacuum_scale_factor per table based on the update rate. High-update tables (10,000+ updates/hour on a million-row table) need a scale factor of 0.01-0.05. Low-update lookup tables can use the default 0.2.

Set random_page_cost to 1.1 on SSD-backed databases. The default of 4.0 causes the planner to avoid index scans on SSDs when they would be faster.

Avoid SELECT * on tables with TOAST columns unless you need every column. The TOAST decompression and extra I/O add measurable latency per row.

Every mechanism in this chapter connects back to the storage layer from Part I. MVCC uses the heap tuple format inside 8KB pages. Vacuum reclaims space from dead tuples, a form of compaction analogous to the LSM-tree compaction in Chapter 3. The query planner’s choice between sequential scan and index scan is the same tradeoff from Chapter 2: linear scan vs indexed lookup, now with a cost model that accounts for page cache behavior and I/O characteristics.