Skip to main content
data systems from the ground up

Workload Analysis and the Decision Framework

4 min read Chapter 20 of 36

Workload Analysis and the Decision Framework

The Black Box

The team debates adding MongoDB for “flexible data.” The debate is based on feature lists and blog posts. Nobody has characterized the actual workload: how many reads per second, how many writes, what access patterns, what consistency requirements. Without numbers, the debate is opinion, and the loudest voice wins.

The Mechanism

Workload characterization starts with measuring the existing system. PostgreSQL’s pg_stat_statements extension captures query statistics.

-- Concept: quantitative workload profiling with pg_stat_statements
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

-- Top 10 queries by total execution time
SELECT
    queryid,
    calls,
    total_exec_time / 1000 AS total_seconds,
    mean_exec_time AS avg_ms,
    rows,
    query
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;

-- queryid | calls   | total_seconds | avg_ms | rows      | query
-- 389201  | 4200000 | 840           | 0.2    | 4200000   | SELECT status FROM packages WHERE package_id = $1
-- 102938  | 120000  | 720           | 6.0    | 2400000   | SELECT * FROM package_events WHERE package_id = $1
-- 884721  | 48      | 480           | 10000  | 270000000 | SELECT warehouse_id, count(*) FROM package_events...

-- Interpretation:
-- Query 389201: 4.2M calls, 0.2ms each. Point lookup. Perfect for PostgreSQL + index.
-- Query 102938: 120K calls, 6ms each. Range scan by package_id. Fine for PostgreSQL.
-- Query 884721: 48 calls, 10 seconds each. Full table aggregation.
--   This is the analytics query that should move to a column store.

The workload profile reveals that 99.99% of queries are transactional (point lookups and range scans) and 0.01% are analytical (aggregations). The analytical queries consume 36% of total database CPU time. Moving 48 queries to ClickHouse eliminates 36% of the database load.

The Decision Tree

  1. Is the primary access pattern point lookups by key?

    • Yes, with sub-millisecond requirement: Redis (in-memory)
    • Yes, with ACID requirement: PostgreSQL (B-Tree index)
    • Yes, with very high write throughput: RocksDB or Cassandra (LSM-tree)
  2. Is the primary access pattern aggregation over large datasets?

    • Yes: Column store (ClickHouse, Parquet)
  3. Is the primary access pattern append-only event streaming?

    • Yes: Kafka
  4. Is the data schema highly variable across records?

    • Yes, and each record is self-contained: Document store or PostgreSQL JSONB
    • Yes, but records relate to each other: PostgreSQL with JSONB columns
  5. Is the primary access pattern multi-hop graph traversal?

    • Fewer than 5 hops: PostgreSQL recursive CTE
    • More than 5 hops with millions of edges: Graph database (Neo4j)

The Observable Consequence

The logistics platform’s workload, mapped through the decision tree:

Data entityAccess patternVolumeChosen store
Package metadataPoint lookup, join5M rows, 100 reads/sPostgreSQL
Package eventsAppend, range scan50M rows, 2K writes/minPostgreSQL + Kafka
Delivery analyticsAggregation270M rows, 48 queries/dayClickHouse
Package status cachePoint lookup500K keys, 5K reads/sRedis
Scanner payloadsStore/retrieve10M docs, 200 writes/minPostgreSQL JSONB

The total data infrastructure: 3 stores (PostgreSQL, Kafka, ClickHouse) plus Redis for caching. Each serves a workload that the others serve poorly.

The Decision Rule

Profile before deciding. Use pg_stat_statements to measure actual query patterns. If 99% of your queries are point lookups and range scans, PostgreSQL with proper indexing is sufficient. Do not add a new database for the 1% until the 1% causes measurable problems.

When the analytical 1% consumes 30%+ of database resources, move it to a column store. When sub-millisecond latency is required for hot data, add Redis as a cache. When event streaming with multiple consumers is needed, add Kafka. Each addition is justified by a specific metric, not a general feeling that the architecture should be more complex.