The Right Database for the Job
The Right Database for the Job
Chapters 5 and 6 opened the internals of relational, document, and column stores. Each has mechanical advantages for specific access patterns and mechanical costs for others. This chapter provides the decision framework.
Workload Characterization
Before choosing a database, characterize the workload along five axes:
Read/write ratio. A 100:1 read-heavy workload (product catalog) has different storage requirements than a 1:100 write-heavy workload (event ingestion). B-Trees handle reads well. LSM-trees handle writes well. This was established in Chapter 2.
Access pattern. Point lookups by primary key. Range scans over a time window. Aggregations across millions of rows. Graph traversals. Each access pattern maps to a storage layout. Point lookups favor B-Trees. Aggregations favor column stores. Traversals favor adjacency lists.
Consistency requirement. Does every read need to see the latest write? ACID transactions with serializable isolation? Eventual consistency acceptable? PostgreSQL with synchronous replication provides strong consistency. Kafka provides eventual consistency for consumers. The durability and replication tradeoffs from Chapters 3 and 4 determine what is possible.
Data shape. Fixed schema with relationships (relational). Nested documents with varying structure (document). Wide tables with few columns queried at a time (columnar). Nodes and edges (graph).
Scale dimension. Total data volume. Concurrent connections. Write throughput. Query complexity. A 10GB database with 100 concurrent connections has different constraints than a 10TB database with 10 concurrent connections.
The Decision Matrix
| Workload | Primary access | Best fit | Why |
|---|---|---|---|
| Package metadata CRUD | Point lookup, FK joins | PostgreSQL | ACID, indexes, joins, mature tooling |
| Package event stream | Append-only, sequential read | Kafka | Append-only log, partitioned, replicated |
| Delivery analytics | Aggregation over time ranges | ClickHouse | Columnar compression, vectorized execution |
| Scanner API payloads | Store/retrieve full documents | PostgreSQL JSONB | Flexible schema within relational guarantees |
| Route optimization | Multi-hop graph traversal | PostgreSQL recursive CTE | Sufficient for 3-4 hop traversals |
| Session/cache data | Point lookup, TTL expiration | Redis | In-memory, sub-ms latency, built-in TTL |
| Audit trail | Append-only, immutable, range query | Kafka + ClickHouse | Kafka for durability, ClickHouse for querying |
The logistics platform uses four data stores: PostgreSQL (transactional), Kafka (event streaming), ClickHouse (analytics), and Redis (caching). This is not polyglot persistence for its own sake. Each store handles a specific access pattern that the others handle poorly.
Polyglot Persistence Without the Polyglot Disaster
The logistics platform’s data architecture flows from left to right. PostgreSQL is the system of record for transactional data. Debezium captures changes from PostgreSQL’s WAL and publishes them to Kafka. Kafka feeds ClickHouse for analytics and downstream consumers for event processing. Redis caches hot data (package status, warehouse inventory counts) with TTLs. The arrows represent data flow, not request flow. The application writes to PostgreSQL. Everything else is derived.
Using multiple databases creates a consistency problem: when the source of truth is PostgreSQL and a copy of the data lives in ClickHouse and Redis, the copies can be stale. Chapter 9 covers Change Data Capture with Debezium as the mechanism for keeping derived stores consistent with the source of truth. The key architectural rule:
One source of truth per data entity. Package metadata lives in PostgreSQL. Delivery events live in Kafka. Analytics aggregates live in ClickHouse. Redis caches are derived and can be rebuilt from the source at any time. If the Redis cache disagrees with PostgreSQL, PostgreSQL wins.
// Concept: read-through cache with PostgreSQL as source of truth
// Redis is the fast path. PostgreSQL is the fallback.
// If Redis does not have the data, read from PostgreSQL and populate Redis.
String getPackageStatus(String packageId) {
// Check Redis first
String cached = redis.get("pkg:status:" + packageId);
if (cached != null) return cached;
// Cache miss: read from PostgreSQL
String status = jdbc.queryForObject(
"SELECT status FROM packages WHERE package_id = ?",
String.class, packageId);
// Populate cache with 60-second TTL
redis.setex("pkg:status:" + packageId, 60, status);
return status;
}
// When PostgreSQL is updated, invalidate the cache:
void updatePackageStatus(String packageId, String newStatus) {
jdbc.update("UPDATE packages SET status = ? WHERE package_id = ?",
newStatus, packageId);
redis.del("pkg:status:" + packageId); // Invalidate, not update
}
// Invalidate, not update. If the delete fails (Redis unavailable),
// the stale cache entry expires after 60 seconds.
// If we tried to update Redis and it failed, the cache would be
// permanently stale until the next TTL expiry or manual intervention.
Anti-Patterns
Choosing MongoDB because “schema migrations are hard.” Schema migrations in PostgreSQL are a deployment concern. Schema inconsistencies in a document store are a runtime data quality concern. The first is solved with tooling (Flyway, Liquibase). The second is solved with defensive application code in every query, forever.
Choosing Elasticsearch as a primary database. Elasticsearch is a search engine built on Lucene. It has no transactions, no referential integrity, and its durability guarantees are weaker than PostgreSQL’s. Use it as a secondary index populated from a primary database via CDC.
Using Redis as a primary database. Redis stores data in memory. A dataset that exceeds available RAM requires sharding, eviction policies, and acceptance of data loss during restarts (unless AOF persistence is configured, which reduces Redis’s performance to near-disk speeds). Redis is a cache and a data structure server. PostgreSQL is a database.
Adding a new database to solve a query performance problem. Before adding ClickHouse, check whether PostgreSQL can serve the query with proper indexing, partitioning, and materialized views. Adding a database adds operational complexity: deployment, monitoring, backup, security, and the consistency mechanisms between stores. This complexity is justified only when PostgreSQL mechanically cannot serve the workload.
The Decision Rule
Start with PostgreSQL. Add Kafka when you need durable event streaming with multiple consumers. Add Redis when you need sub-millisecond reads for hot data. Add a column store when analytical queries degrade transactional performance. Add each store only when the mechanical limitation of the existing architecture is measurable and the new store directly addresses it.
Do not add a database because a conference talk recommended it. Add a database because your production metrics show a specific bottleneck that the new store’s mechanical properties address.