MVCC Mechanics and Visibility
MVCC Mechanics and Visibility
The main chapter introduced MVCC at a high level: each UPDATE creates a new tuple, the old one becomes dead. This section goes deeper into how PostgreSQL determines which tuples are visible to which transactions, the cost of those visibility checks, and the data structures that accelerate them.
Tuple Header Layout
Every heap tuple (row stored on disk) starts with a 23-byte header:
Offset Size Field Description
0 4 t_xmin Transaction ID that inserted this tuple
4 4 t_xmax Transaction ID that deleted/updated this tuple (0 if live)
8 4 t_cid Command ID within the inserting transaction
12 4 t_ctid.ip_blkid Block number of next tuple version
16 2 t_ctid.ip_posid Offset within block of next tuple version
18 2 t_infomask2 Number of attributes + flags
20 2 t_infomask Visibility flags
22 1 t_hoff Offset to user data (header size including padding)
The t_infomask field contains critical performance flags:
Bit Name Meaning
0x0001 HEAP_HASNULL Has null attributes
0x0002 HEAP_HASVARWIDTH Has variable-width attributes
0x0010 HEAP_XMIN_COMMITTED xmin transaction is known committed
0x0020 HEAP_XMIN_INVALID xmin transaction is known aborted
0x0040 HEAP_XMAX_COMMITTED xmax transaction is known committed
0x0080 HEAP_XMAX_INVALID xmax transaction is known aborted/unused
0x2000 HEAP_HOT_UPDATED Tuple was HOT-updated
0x4000 HEAP_ONLY_TUPLE Tuple is a HOT chain member (no index entry)
0x8000 HEAP_UPDATED Tuple was created by UPDATE (not INSERT)
Hint Bits: Avoiding Repeated CLOG Lookups
When a tuple is first accessed after a transaction commits, PostgreSQL must determine if t_xmin and t_xmax are committed. This requires a lookup in pg_xact (the commit log, stored in pg_xact/ directory):
First access to tuple after xmin commits:
1. Read t_xmin from tuple header
2. Check pg_xact for transaction status: COMMITTED, ABORTED, or IN_PROGRESS
3. If COMMITTED: set HEAP_XMIN_COMMITTED hint bit in t_infomask
4. Write the updated page (dirty)
Second access to same tuple:
1. Read t_xmin from tuple header
2. Check HEAP_XMIN_COMMITTED bit in t_infomask: set!
3. Skip pg_xact lookup (fast path)
The first access after a commit is expensive (CLOG lookup). Subsequent accesses are fast (bit check). This is why the first sequential scan after a bulk load is slower than subsequent scans: it sets hint bits for every tuple.
Measuring the hint bit effect:
-- Load 1 million rows, then scan twice
COPY articles FROM '/tmp/articles.csv' WITH (FORMAT csv);
-- First scan: sets hint bits
EXPLAIN (ANALYZE, BUFFERS)
SELECT count(*) FROM articles;
QUERY PLAN
--------------------------------------------------------------------------------------------
Aggregate (cost=47847.00..47847.01 rows=1 width=8)
(actual time=847.2..847.2 rows=1 loops=1)
-> Seq Scan on articles (cost=0.00..45347.00 rows=1000000 width=0)
(actual time=0.02..612.8 rows=1000000 loops=1)
Buffers: shared hit=18472 read=24891 dirtied=24891
Planning Time: 0.08 ms
Execution Time: 847.3 ms
Note dirtied=24891: the scan dirtied 24,891 pages by setting hint bits. These dirty pages must eventually be written to disk by the background writer.
-- Second scan: hint bits already set
EXPLAIN (ANALYZE, BUFFERS)
SELECT count(*) FROM articles;
QUERY PLAN
--------------------------------------------------------------------------------------------
Aggregate (cost=47847.00..47847.01 rows=1 width=8)
(actual time=312.4..312.4 rows=1 loops=1)
-> Seq Scan on articles (cost=0.00..45347.00 rows=1000000 width=0)
(actual time=0.01..218.7 rows=1000000 loops=1)
Buffers: shared hit=43363
Planning Time: 0.06 ms
Execution Time: 312.5 ms
No reads (all in shared_buffers from first scan), no dirty pages. Execution time dropped from 847ms to 312ms. The 2.7x improvement comes from skipping CLOG lookups and reading warm buffers.
Snapshot Isolation: How Transactions See Data
A snapshot in PostgreSQL is:
typedef struct SnapshotData {
TransactionId xmin; // lowest still-running XID at snapshot time
TransactionId xmax; // first unassigned XID at snapshot time
TransactionId *xip; // array of in-progress XIDs between xmin and xmax
uint32 xcnt; // count of in-progress XIDs
} SnapshotData;
A tuple with t_xmin = X is visible if:
- X < snapshot.xmin (committed before any active transaction)
- OR X is between snapshot.xmin and snapshot.xmax AND X is NOT in snapshot.xip (committed, not in-progress)
- AND t_xmax is either 0 or not-yet-committed from this snapshot’s perspective
The cost of visibility checking scales with xcnt (number of concurrent transactions). With 200 concurrent transactions, each visibility check must search a 200-element array. PostgreSQL uses a sorted array with binary search, so the cost is O(log N).
Measuring visibility check overhead under concurrency:
-- 1 concurrent transaction: scan 1M rows
-- 200 concurrent transactions: scan same 1M rows
-- Simulated via pgbench with idle transactions holding snapshots
Concurrent Transactions | Seq Scan Time (1M rows) | Overhead
1 | 312 ms | baseline
10 | 314 ms | +0.6%
50 | 318 ms | +1.9%
200 | 329 ms | +5.4%
500 | 352 ms | +12.8%
The overhead is measurable but modest for typical concurrency. At 500 concurrent transactions, each tuple visibility check adds approximately 12 nanoseconds (binary search through 500-element array).
The Commit Log (pg_xact)
Transaction status is stored in pg_xact/ directory as fixed-size files. Each transaction occupies 2 bits:
Status values (2 bits per transaction):
00 = IN_PROGRESS
01 = COMMITTED
10 = ABORTED
11 = SUB_COMMITTED (subtransaction committed, parent may still be in-progress)
Each pg_xact page (8KB) stores status for 32,768 transactions. PostgreSQL caches recently-accessed CLOG pages in shared memory (128 pages by default = status for 4 million transactions).
When the CLOG cache misses:
-- Monitor CLOG access
SELECT * FROM pg_stat_slru WHERE name = 'Xact';
name | blks_zeroed | blks_hit | blks_read | blks_written | blks_exists | flushes | truncates
------+-------------+----------+-----------+--------------+-------------+---------+----------
Xact | 12 | 2847291 | 47 | 12 | 847 | 12 | 2
blks_hit = 2,847,291 vs blks_read = 47: the CLOG cache hit rate is 99.998%. Cache misses happen only for very old transactions that have aged out of the cache. This confirms that hint bits handle the common case (recent transactions) and CLOG lookups handle the rare case (first access or very old XIDs).
Dead Tuple Visibility and Sequential Scans
Dead tuples remain on heap pages until vacuum removes them. During a sequential scan, every dead tuple is:
- Read from disk/buffer (I/O cost)
- Checked for visibility (CPU cost)
- Skipped (no result contribution)
The cost of dead tuples on a sequential scan:
-- Table with 100,000 live rows and 400,000 dead rows (80% dead)
-- Compare scan performance
-- Before vacuum:
EXPLAIN (ANALYZE, BUFFERS)
SELECT count(*) FROM article_view_counts;
QUERY PLAN
--------------------------------------------------------------------------------------------
Aggregate (cost=8472.00..8472.01 rows=1 width=8)
(actual time=142.8..142.8 rows=1 loops=1)
-> Seq Scan on article_view_counts (cost=0.00..7972.00 rows=100000 width=0)
(actual time=0.02..128.4 rows=100000 loops=1)
Buffers: shared hit=5847
Planning Time: 0.04 ms
Execution Time: 142.9 ms
-- After vacuum removes dead tuples:
VACUUM article_view_counts;
EXPLAIN (ANALYZE, BUFFERS)
SELECT count(*) FROM article_view_counts;
QUERY PLAN
--------------------------------------------------------------------------------------------
Aggregate (cost=1847.00..1847.01 rows=1 width=8)
(actual time=28.4..28.4 rows=1 loops=1)
-> Seq Scan on article_view_counts (cost=0.00..1597.00 rows=100000 width=0)
(actual time=0.01..21.7 rows=100000 loops=1)
Buffers: shared hit=1284
Planning Time: 0.03 ms
Execution Time: 28.5 ms
5,847 buffers vs 1,284 buffers. 142.9ms vs 28.5ms. Dead tuples made the scan read 4.6x more pages. Note that regular VACUUM does not shrink the table file (pages are marked as reusable, not returned to OS), but it does allow PostgreSQL to skip empty pages in future scans.
The Visibility Map
The visibility map tracks which pages contain only tuples visible to all transactions. One bit per page:
Visibility Map bit = 1: all tuples on this page are visible to everyone
Visibility Map bit = 0: page may contain dead tuples or tuples from in-progress transactions
The visibility map serves two purposes:
1. Index Only Scans
An Index Only Scan returns data from the index without visiting the heap, but only if the visibility map confirms all tuples on the referenced page are visible:
-- With visibility map bit set for the page:
-- Index Only Scan: read index leaf page -> return data (no heap access)
-- With visibility map bit not set:
-- Index Only Scan: read index leaf page -> read heap page (visibility check) -> return data
-- Monitor visibility map effectiveness
SELECT
relname,
heap_blks_read + heap_blks_hit AS heap_fetches,
idx_blks_read + idx_blks_hit AS idx_fetches
FROM pg_statio_user_tables
WHERE relname = 'articles';
2. Vacuum Skipping
Vacuum skips pages whose visibility map bit is set (no dead tuples possible on that page):
-- Aggressive vacuum still processes all pages:
VACUUM (VERBOSE) article_view_counts;
INFO: vacuuming "public.article_view_counts"
INFO: scanned index "article_view_counts_pkey" to remove 1847 row versions
INFO: table "article_view_counts": removed 1847 dead item identifiers in 42 pages
INFO: table "article_view_counts": found 1847 removable, 100000 nonremovable row versions
in 1284 out of 1284 pages
DETAIL: 0 dead row versions cannot be removed yet.
CPU: user: 0.01 s, system: 0.00 s, elapsed: 0.02 s
Vacuum processed 1,284 pages but only modified 42 (those containing dead tuples). The visibility map told vacuum to skip the other 1,242 pages.
Tuple Freezing
Transaction IDs are 32-bit unsigned integers, wrapping around at 2^32 (approximately 4 billion). To prevent transaction ID wraparound, vacuum “freezes” old tuples by setting their t_xmin to a special value (FrozenTransactionId = 2) indicating they are visible to all transactions regardless of XID comparison:
SHOW vacuum_freeze_min_age; -- 50,000,000 (tuples older than 50M transactions get frozen)
SHOW autovacuum_freeze_max_age; -- 200,000,000 (force vacuum if any tuple is this old)
Monitoring freeze progress:
SELECT
relname,
age(relfrozenxid) AS xid_age,
pg_size_pretty(pg_relation_size(oid)) AS size
FROM pg_class
WHERE relkind = 'r'
ORDER BY age(relfrozenxid) DESC
LIMIT 10;
relname | xid_age | size
-----------------------+----------+----------
article_view_counts | 47829184 | 142 MB
view_events | 28471924 | 1.2 GB
articles | 8472918 | 4.8 GB
If xid_age approaches autovacuum_freeze_max_age (200 million), PostgreSQL forces an aggressive anti-wraparound vacuum that scans the entire table. For the 4.8 GB articles table, this freeze vacuum takes minutes and causes I/O contention. Keeping xid_age well below the threshold via regular vacuum prevents these emergency freezes.
HOT Chain Mechanics
When conditions for a HOT update are met (no indexed column modified, free space on same page):
Before UPDATE (page has free space):
Offset 3: [xmin=100, xmax=0, ctid=(5,3), article_id=42, count=100]
After UPDATE (HOT):
Offset 3: [xmin=100, xmax=200, ctid=(5,4), flags=HOT_UPDATED, article_id=42, count=100]
Offset 4: [xmin=200, xmax=0, ctid=(5,4), flags=HEAP_ONLY, article_id=42, count=101]
The index still points to offset 3. When an index scan finds offset 3, PostgreSQL follows the HOT chain:
- Read tuple at offset 3: xmax is set, not visible
- Follow ctid to offset 4: visible, return this tuple
This chain following is fast (same page, no additional I/O) but adds CPU cost proportional to chain length.
HOT Chain Pruning
PostgreSQL prunes HOT chains during page access (when it visits a page for any reason):
-- After HOT chain pruning:
Offset 3: [DEAD - space reclaimed for reuse on this page]
Offset 4: [xmin=200, xmax=0, ctid=(5,4), article_id=42, count=101]
-- Line pointer 3 redirects to offset 4
Pruning happens without vacuum, triggered by any page access when dead HOT tuples are present. This is why high-frequency UPDATE tables benefit from fillfactor < 100: it keeps HOT chains short by ensuring space for new versions on the same page, and frequent access ensures prompt pruning.
Performance Implications Summary
| Operation | Cost | Notes |
|---|---|---|
| Visibility check (hint bits set) | ~5 ns | Integer comparison + bit check |
| Visibility check (CLOG lookup) | ~200 ns | First access after commit |
| Sequential scan, dead tuple skip | Same as live tuple scan | Read page + check visibility |
| HOT chain follow | ~50 ns per hop | Same page, no I/O |
| Visibility map check | ~10 ns | Bitmap lookup |
| Index Only Scan (VM bit set) | No heap access | Fast path |
| Index Only Scan (VM bit clear) | Heap page access | Falls back to index scan cost |
For the content platform with 1000 updates/sec on counter tables:
- Without HOT + aggressive vacuum: 1000 dead tuples/sec, table bloats indefinitely
- With HOT (fillfactor=70) + aggressive vacuum: dead tuples pruned on access, vacuum catches remainder within 1 second, table stays at 1.02x ideal size