Storage Selection for Database Workloads
Storage Selection for Database Workloads
The main chapter showed that moving PostgreSQL from gp3 to local NVMe dropped single-row INSERT latency from 2.4ms to 0.18ms. This section covers the implementation: separating WAL onto its own disk, placing tablespaces by access pattern, configuring cloud storage volumes for specific latency targets, and measuring every change to verify it delivers the expected improvement.
WAL on a Separate Disk
PostgreSQL’s WAL writer and the query executor compete for the same disk when they share a volume. The WAL writer needs sequential write bandwidth with minimal fsync latency. The query executor needs random read IOPS for index lookups. On a single volume, these two patterns interfere.
The WAL writer holds the WAL insertion lock while writing. If the underlying disk is busy servicing a random read for a query, the WAL write stalls. Every transaction waiting to commit stalls behind it. A queue forms. Under load, this mutual interference adds 200-400 microseconds to every commit.
Shared volume baseline (PostgreSQL 16, local NVMe, 16 concurrent clients):
pgbench -c 16 -j 4 -T 300 -P 5
tps = 42,800
avg latency = 0.37ms
P99 latency = 1.24ms
WAL write latency (pg_stat_wal):
avg = 0.041ms
max = 0.892ms
wal_sync_time = 14.2s over 300s run
iostat (nvme0n1, 1-second samples):
avg read IOPS = 128,000
avg write IOPS = 38,000
await (avg) = 0.082ms
await (max) = 1.240ms
Separating WAL onto its own NVMe device eliminates the interference:
# Create WAL directory on dedicated NVMe
mkfs.xfs /dev/nvme1n1
mount -o noatime,discard /dev/nvme1n1 /mnt/wal
# Move WAL to separate volume (requires restart)
pg_ctl stop -D /var/lib/postgresql/16/main
mv /var/lib/postgresql/16/main/pg_wal /mnt/wal/pg_wal
ln -s /mnt/wal/pg_wal /var/lib/postgresql/16/main/pg_wal
pg_ctl start -D /var/lib/postgresql/16/main
Separate WAL volume (same test, WAL on nvme1n1, data on nvme0n1):
pgbench -c 16 -j 4 -T 300 -P 5
tps = 58,200 (+36%)
avg latency = 0.27ms (-27%)
P99 latency = 0.68ms (-45%)
WAL write latency (pg_stat_wal):
avg = 0.028ms (-32%)
max = 0.184ms (-79%)
wal_sync_time = 8.4s over 300s run (-41%)
iostat (nvme1n1, WAL volume):
avg write IOPS = 38,000
await (avg) = 0.026ms
await (max) = 0.142ms
r/s = 0 (no reads, WAL is write-only in normal operation)
The WAL volume shows zero read IOPS. WAL is append-only during normal operation. Reads from WAL happen only during crash recovery and replication. A dedicated WAL volume processes only sequential writes with fsync, the exact workload that NVMe excels at.
The P99 latency dropped from 1.24ms to 0.68ms. The improvement comes from eliminating the worst case: a WAL fsync waiting behind a burst of random data reads. With separate volumes, the WAL fsync never waits behind anything.
XFS vs ext4 for WAL
The WAL volume filesystem matters because the filesystem journal adds its own fsync overhead on top of the database’s fsync.
WAL volume filesystem comparison (same NVMe, pgbench -c 16):
Filesystem Mount options TPS WAL fsync avg
---------- -------------------------------- ------ -------------
XFS noatime,discard 58,200 0.028ms
XFS noatime,discard,nobarrier 61,400 0.022ms *unsafe*
ext4 noatime,discard 54,800 0.034ms
ext4 noatime,discard,data=writeback 57,100 0.029ms
ext4 noatime,discard,nobarrier 59,800 0.024ms *unsafe*
*nobarrier disables filesystem write barriers.
This is unsafe: a power loss can corrupt the filesystem journal.
Only use with battery-backed write cache (BBWC) or when the
database's own WAL provides sufficient recovery guarantees.
XFS with default mount options outperforms ext4 by 6.2%. The difference comes from XFS’s metadata allocation strategy. XFS uses delayed allocation and extent-based addressing, requiring fewer metadata I/O operations per data write. For a WAL volume with purely sequential appends, this advantage is small but consistent.
Tablespace Placement
PostgreSQL tablespaces allow placing individual tables and indexes on specific storage volumes. The strategy: put hot, frequently-accessed data on fast storage. Put cold, rarely-accessed data on cheap storage.
The content platform’s PostgreSQL database has clear hot and cold zones:
Table/Index Size Reads/sec Write/sec Temperature
------------------------ ------- --------- --------- -----------
articles (table) 12 GB 4,200 120 Hot
articles_pkey (index) 1.2 GB 4,200 0 Hot
articles_cat_pub (index) 800 MB 3,100 0 Hot
analytics_events (table) 48 GB 20 4,200 Write-hot
analytics_events_ts (idx) 6 GB 20 4,200 Write-hot
analytics_daily (table) 2 GB 800 24 Warm
subscriptions (table) 50 MB 10 2 Cold
old_articles (table) 28 GB 3 0 Cold
-- Create tablespaces on different volumes
CREATE TABLESPACE fast_data LOCATION '/mnt/nvme-data/pg_tblspc';
CREATE TABLESPACE fast_wal LOCATION '/mnt/nvme-wal/pg_tblspc';
CREATE TABLESPACE bulk_data LOCATION '/mnt/gp3/pg_tblspc';
-- Move hot tables and indexes to fast NVMe
ALTER TABLE articles SET TABLESPACE fast_data;
ALTER INDEX articles_pkey SET TABLESPACE fast_data;
ALTER INDEX articles_cat_pub SET TABLESPACE fast_data;
-- Keep write-heavy analytics on fast storage (needs fsync performance)
ALTER TABLE analytics_events SET TABLESPACE fast_data;
ALTER INDEX analytics_events_ts SET TABLESPACE fast_data;
-- Move cold data to cheap gp3 storage
ALTER TABLE old_articles SET TABLESPACE bulk_data;
ALTER TABLE subscriptions SET TABLESPACE bulk_data;
Moving a table to a different tablespace requires an ACCESS EXCLUSIVE lock. The table is inaccessible during the move. For the 28GB old_articles table on NVMe, the move takes about 15 seconds. Schedule this during low-traffic windows.
Measured impact of tablespace placement:
Before (everything on one NVMe volume):
Article query avg latency: 0.14ms
Analytics INSERT avg latency: 0.18ms
After (hot on NVMe, cold on gp3):
Article query avg latency: 0.14ms (unchanged, still on NVMe)
Analytics INSERT avg latency: 0.18ms (unchanged, still on NVMe)
Cold query (old_articles): 2.8ms (was 0.22ms, now on gp3)
NVMe volume space freed: 28 GB
gp3 cost for old_articles: $2.24/month (28GB * $0.08/GB)
Trade-off: 12.7x slower cold queries, $2.24/month savings, 28GB NVMe freed
The trade-off is acceptable because cold queries on old_articles happen 3 times per day. The freed NVMe space extends the runway before the next storage upgrade.
EBS Volume Types: A Cost-Performance Map
AWS EBS offers five volume types. Each occupies a different point on the cost-performance curve:
EBS volume type comparison (500GB, us-east-1, May 2026):
Type Baseline Max Baseline Max fsync $/month
IOPS IOPS Throughput Tput latency
-------- -------- ------ ---------- ------ ------- -------
gp3 3,000 16,000 125 MB/s 1,000 1.8ms $40
gp2 1,500* 3,000* 128 MB/s 250 2.1ms $50
io1 -- 64,000 -- 1,000 0.5ms $665
io2 -- 64,000 -- 1,000 0.45ms $665
io2 BE** -- 256,000 -- 4,000 0.2ms $1,280
st1 -- 500 -- 500 N/A*** $23
sc1 -- 250 -- 250 N/A*** $13
* gp2 IOPS scale with volume size: 3 IOPS/GB, burst to 3,000
** Block Express, requires Nitro-based R5b/R6i instances
*** HDD types not suitable for fsync-dependent workloads
gp3 Tuning
gp3 is the default choice and the wrong choice for database WAL. But it works for read-heavy workloads where the page cache absorbs most reads and the remaining I/O is sporadic.
gp3 allows independent IOPS and throughput provisioning above the baseline:
# Provision additional IOPS for gp3 (up to 16,000)
aws ec2 modify-volume --volume-id vol-0123456789abcdef0 \
--iops 6000 --throughput 250
# Cost: base $40 + ($0.005 * 3000 additional IOPS) + ($0.04 * 125 additional MB/s)
# = $40 + $15 + $5 = $60/month
The provisioned IOPS on gp3 do not change fsync latency. They change the maximum sustained IOPS rate. Each individual fsync still takes 1.8ms because the latency is dominated by the network round trip to the EBS backend. Provisioning more IOPS allows more of those 1.8ms operations to happen in parallel.
gp3 IOPS provisioning effect on PostgreSQL (pgbench -c 16):
Provisioned IOPS Measured TPS Avg Commit Latency
---------------- ------------ ------------------
3,000 (baseline) 4,200 3.2ms
6,000 7,800 2.8ms
10,000 11,400 2.4ms
16,000 14,200 2.1ms
Note: commit latency floor is ~1.8ms regardless of IOPS.
TPS scales because group commit amortizes fsync across
more concurrent transactions at higher IOPS.
io2 Block Express for Low-Latency Requirements
io2 Block Express volumes provide sub-millisecond fsync latency with high IOPS capacity. They require Nitro-based instances (R5b, R6i, or newer) that support the EBS Block Express architecture.
io2 Block Express configuration for the content platform:
Volume: 500GB, 64,000 IOPS provisioned
Instance: r6i.xlarge (4 vCPU, 32GB RAM)
Cost: $1,280/month (volume) + $0 (IOPS included up to 64K for io2)
Measured with pgbench -c 16:
TPS: 38,400
Avg commit lat: 0.38ms
P99 commit lat: 0.92ms
fsync avg: 0.20ms
Comparison:
vs gp3: 9.1x TPS, 8.4x lower commit latency, 32x cost
vs NVMe: 0.66x TPS, 2.1x higher commit latency, 14x cost
io2 Block Express sits between gp3 and local NVMe. It costs more than both but provides EBS durability (3-AZ replication) with acceptable latency. This is the right choice when the operational complexity of local instance stores (streaming replication, WAL archiving, automated snapshots) outweighs the cost premium.
Cloud Storage Latency Anatomy
EBS latency has three components that do not exist in local storage:
EBS I/O path (gp3):
Application
| write() syscall ~1 us
v
Kernel block layer
| NVMe-over-Fabrics submit ~5 us
v
EC2 Nitro Card (hardware)
| Encrypt (AES-256-XTS) ~3 us
| Network packet assembly ~2 us
v
Network (to EBS backend)
| Round trip ~200-800 us
v
EBS Storage Server
| Flash write ~30 us
| Replication (2 AZs) ~500-1500 us
| Commit ack ~5 us
v
Network (return)
| Round trip ~200-800 us
v
Nitro Card
| Completion interrupt ~3 us
v
Kernel
| Completion processing ~2 us
v
Application
| fsync returns
Total: 950-3120 us (avg ~1800 us for gp3)
The dominant cost is network round trip plus cross-AZ replication. The actual flash write on the EBS server takes 30 microseconds, nearly identical to a local NVMe. The remaining 1,770 microseconds is infrastructure overhead for durability.
EBS Burst Credits
gp3 volumes have a baseline IOPS of 3,000. Unlike gp2, gp3 does not have a burst credit mechanism for IOPS. The 3,000 IOPS is a hard floor and ceiling unless you provision additional IOPS.
However, gp3 does have burst behavior for throughput. The baseline is 125 MB/s, and short bursts can reach 250 MB/s. This matters for sequential scans during backups and VACUUM.
gp3 throughput behavior during pg_dump:
Time (seconds) Throughput Note
-------------- ---------- ----
0-30 248 MB/s Burst (above baseline)
30-60 246 MB/s Burst continuing
60-90 125 MB/s Burst exhausted, baseline
90+ 125 MB/s Sustained baseline
pg_dump of 12GB articles table:
With burst: ~50 seconds (first 7.5GB at 248, rest at 125)
At baseline: ~96 seconds (12GB / 125 MB/s)
Actual: ~62 seconds (mixed)
Monitoring EBS Performance
CloudWatch metrics reveal when EBS is the bottleneck:
Key CloudWatch metrics for EBS performance diagnosis:
VolumeReadOps / VolumeWriteOps
IOPS consumed. Compare to provisioned IOPS.
If consistently at provisioned limit: provision more or upgrade volume type.
VolumeQueueLength
Average number of I/O operations waiting.
Target: < 1 for gp3, < 4 for io2.
> 4 means the volume is saturated.
VolumeThroughputPercentage (io2 only)
Percentage of provisioned throughput used.
> 90% sustained: provision more throughput.
BurstBalance (gp2 only, not gp3)
Remaining burst credits as percentage.
If drops to 0: volume throttles to baseline.
VolumeReadLatency / VolumeWriteLatency
Average per-operation latency.
gp3 expected: 1-3ms.
io2 expected: 0.2-1ms.
> 5ms: network congestion or EBS service degradation.
-- PostgreSQL-side monitoring: check for I/O wait
SELECT
datname,
blk_read_time,
blk_write_time,
blks_read,
blks_hit,
round(100.0 * blks_hit / nullif(blks_hit + blks_read, 0), 1) AS cache_hit_pct
FROM pg_stat_database
WHERE datname = 'content_platform';
-- If blk_read_time is high relative to query count, storage is the bottleneck
-- If cache_hit_pct < 99%, shared_buffers is undersized or working set exceeds RAM
Filesystem Configuration for Database Volumes
The filesystem between PostgreSQL and the block device adds measurable overhead. The configuration choices:
Mount Options
# Database data volume
mount -o noatime,nodiratime,discard /dev/nvme0n1p1 /mnt/pgdata
# noatime: Skip updating access time on every read. Without this,
# every SELECT that reads a page triggers a metadata write.
# Measured impact: 3-5% IOPS improvement on read-heavy workloads.
# nodiratime: Same as noatime but for directory access times.
# discard: Enable online TRIM for SSD garbage collection.
# Alternative: scheduled fstrim via cron (lower runtime overhead).
ext4 Tuning
# Disable ext4 journal for WAL volume (PostgreSQL's WAL provides recovery)
# WARNING: Only safe for dedicated WAL volumes, not data volumes
mkfs.ext4 -O ^has_journal /dev/nvme1n1
# Set stripe width to match RAID or NVMe internal parallelism
mkfs.ext4 -E stride=16,stripe_width=64 /dev/nvme0n1p1
# Measured improvement from journal-less ext4 on WAL volume:
# fsync latency: 34us -> 26us (-24%)
# pgbench TPS: 58,200 -> 62,800 (+7.9%)
XFS Tuning
# XFS with allocation group sizing for parallel writers
mkfs.xfs -d agcount=16 /dev/nvme0n1p1
# Align to NVMe sector size (most modern drives use 4KB)
mkfs.xfs -s size=4096 /dev/nvme0n1p1
# XFS on WAL volume with optimized log
mkfs.xfs -l size=128m,lazy-count=1 /dev/nvme1n1
Putting It Together: The Content Platform Storage Architecture
The final storage layout for the content platform, balancing performance, durability, and cost:
Volume Layout:
/dev/nvme0n1 (local NVMe, 1.9TB, included with i3en.xlarge)
Partition 1: /mnt/pgdata (500GB, XFS, noatime,discard)
- PostgreSQL data directory
- Tablespace: fast_data (articles, analytics_events, indexes)
- effective_io_concurrency = 200
- random_page_cost = 1.1
Partition 2: /mnt/qdrant (200GB, XFS, noatime,discard)
- Qdrant vector database data
- HNSW index files (memory-mapped)
/dev/nvme1n1 (local NVMe, 1.9TB, second drive on i3en.xlarge)
Partition 1: /mnt/pgwal (50GB, XFS, noatime,discard)
- PostgreSQL WAL (symlinked from pg_wal)
- Dedicated to sequential writes + fsync
- No other workload shares this partition
Partition 2: /mnt/temp (200GB, XFS, noatime,discard)
- PostgreSQL temp_tablespaces
- Sort spill files, hash join overflow
- temp_file_limit = 10GB per query
/dev/ebs-gp3 (EBS gp3, 500GB, 3000 IOPS baseline)
/mnt/cold (500GB, ext4, noatime)
- Tablespace: bulk_data (old_articles, subscriptions)
- PostgreSQL base backups (retained 7 days)
- Application logs
Durability:
- Streaming replication to standby (also i3en.xlarge, separate AZ)
- WAL archiving to S3 every 60 seconds
- pg_basebackup to S3 every 6 hours
- Qdrant snapshot to S3 every 12 hours (index is rebuildable)
Performance summary (pgbench -c 16 -j 4 -T 300):
Metric gp3 (before) Optimized layout Improvement
--------------------- ------------ ---------------- -----------
TPS 4,200 62,800 15.0x
Avg commit latency 3.2ms 0.22ms 14.5x
P99 commit latency 7.2ms 0.58ms 12.4x
WAL fsync latency 1.8ms 0.026ms 69.2x
Article query P99 3.4ms 0.32ms 10.6x
Analytics INSERT P99 5.8ms 0.48ms 12.1x
Monthly storage cost $40 $94 2.4x
Cost per 1000 TPS: $9.52/mo $1.50/mo 6.4x better
The 15x throughput improvement and 14.5x latency reduction cost $54/month more. The cost per transaction dropped by 6.4x. The storage is no longer the ceiling. CPU at 78% utilization during the benchmark is now the next bottleneck, which is the subject of earlier chapters in this book.
The storage under your database is a multiplier on every other optimization. Query tuning, connection pooling, caching, and indexing all reduce the number of I/O operations. But each remaining operation takes as long as the storage device allows. A 10x reduction in query I/O on gp3 storage (1.8ms fsync) may be slower than the unoptimized query on local NVMe (0.035ms fsync). Fix the storage first. Then optimize the queries. The results compound.