Beyond Relational: Document, Column, and Graph Stores
Beyond Relational: Document, Column, and Graph Stores
PostgreSQL serves the logistics platform’s transactional workloads. But the platform has data access patterns that do not fit the relational model well. Package metadata arrives as nested JSON from scanner APIs with schemas that change monthly. Analytical queries aggregate millions of delivery records by region, carrier, and time period. Route optimization requires traversing a graph of warehouses, hubs, and delivery zones.
This chapter examines three non-relational storage models through the same lens as Part I: what is the storage layout, what are its mechanical advantages, and what does it cost.
Document Stores: Schemaless Is Not Free
A document store (MongoDB, CouchDB) stores data as self-contained documents, typically JSON or BSON. Each document contains all the data for one entity, including nested objects and arrays.
The Storage Mechanism
MongoDB’s WiredTiger engine stores each document as a single BSON blob in an LSM-tree-like structure (B-Tree by default since WiredTiger, with an LSM option). The document is the unit of storage and the unit of locking. There is no separate TOAST table as in PostgreSQL. The entire document, including nested arrays, lives in one contiguous allocation.
// Concept: a package document with embedded tracking history
// All related data in one document. No joins needed.
{
"_id": "PKG-40291",
"origin": {"warehouse": "WH-042", "city": "Rotterdam"},
"destination": {"address": "123 Main St", "city": "Amsterdam"},
"weight_kg": 2.4,
"tracking": [
{"status": "CREATED", "ts": "2024-11-15T08:00:00Z", "location": "WH-042"},
{"status": "PICKED", "ts": "2024-11-15T09:15:00Z", "location": "WH-042"},
{"status": "IN_TRANSIT", "ts": "2024-11-15T10:30:00Z", "location": "HUB-NL-01"},
{"status": "OUT_FOR_DELIVERY", "ts": "2024-11-15T14:00:00Z", "location": "AMS-DEPOT"}
]
}
The advantage: reading one package’s complete history is a single document fetch. No joins. No multiple index lookups. The B-Tree index on _id returns a pointer to the document, and the document contains everything.
The cost: updating one field in a large document rewrites the entire document. Adding a tracking event to a package with 200 existing events requires reading 200 events, appending one, and writing 201 events back. If the document exceeds the allocated space on the page, it is moved, and the index pointer is updated.
When Embedding Breaks
The logistics platform’s audit history starts small: 5-10 events per package. After 6 months, packages in return loops have 50-100 events. The document grows. Updates slow down. The working set (the portion of data that must be in RAM for acceptable performance) bloats because reading any field of the package loads the entire document, including 100 tracking events, into memory.
// Concept: the embedded array growth problem
// MongoDB's $push appends to the array. If the document exceeds 16MB, the write fails.
db.packages.updateOne(
{ _id: "PKG-40291" },
{ $push: { tracking: {
status: "DELIVERY_ATTEMPT_FAILED",
ts: new Date(),
location: "AMS-DEPOT",
note: "Customer not home"
}}}
);
// Document sizes to monitor:
db.packages.aggregate([
{ $project: { size: { $bsonSize: "$$ROOT" }, tracking_count: { $size: "$tracking" } } },
{ $sort: { size: -1 } },
{ $limit: 5 }
]);
// { _id: "PKG-12001", size: 142891, tracking_count: 312 }
// 140KB per document. 312 tracking events.
// Reading this document to check current status loads 140KB.
// A query returning 100 packages loads 14MB.
The fix is to reference instead of embed: store tracking events in a separate collection with a foreign key. This is a join. The document model’s primary advantage, join avoidance, is surrendered. At that point, PostgreSQL with a package_events table and an index on package_id is mechanically equivalent and has 50 years of tooling around it.
Column Stores: Compression and Analytical Queries
A column store (ClickHouse, Apache Parquet on S3, PostgreSQL with the cstore_fdw extension) stores data by column instead of by row. All values of one column are stored contiguously.
The Storage Mechanism
In a row-oriented store (PostgreSQL), a page contains complete rows: [row1: id, status, timestamp, warehouse_id], [row2: id, status, timestamp, warehouse_id], …
In a column-oriented store, a page (or segment) contains values from a single column: [id: PKG-001, PKG-002, PKG-003, ...], [status: DELIVERED, IN_TRANSIT, DELIVERED, ...]
The consequence for analytical queries:
-- Concept: why column stores are faster for aggregations
-- "How many packages were delivered per warehouse last month?"
SELECT warehouse_id, COUNT(*) as delivered_count
FROM package_events
WHERE status = 'DELIVERED'
AND timestamp >= '2024-10-01'
AND timestamp < '2024-11-01'
GROUP BY warehouse_id;
-- Row store (PostgreSQL): reads every column of every row, then discards
-- all columns except warehouse_id, status, and timestamp.
-- 10 million rows * 82 bytes/row = 820 MB read from disk.
-- Column store: reads only the three columns needed.
-- status column: 10M * 12 bytes = 120 MB (before compression)
-- timestamp column: 10M * 8 bytes = 80 MB (before compression)
-- warehouse_id column: 10M * 8 bytes = 80 MB (before compression)
-- Total: 280 MB. With compression (status has ~6 distinct values): ~40 MB.
-- 20x less I/O than the row store.
Column stores compress aggressively because values in a single column tend to have low cardinality (few distinct values) and similar data types. Run-length encoding turns [DELIVERED, DELIVERED, DELIVERED, DELIVERED] into [DELIVERED x 4]. Dictionary encoding maps each distinct value to a small integer.
The cost: inserting a single row requires writing to every column file. A single INSERT touches as many files as there are columns. This is why column stores are batch-oriented: they buffer writes and flush entire column segments at once.
The Decision Rule
Use a document store when your data is genuinely document-shaped: self-contained entities that are read and written as a whole, with nested structure that varies between instances. API response caching, user profiles with varying attribute sets, configuration documents. Do not use a document store because “schema migrations are hard.” Schema changes in a document store surface as runtime data inconsistencies instead of migration errors, which is worse.
Use a column store when your primary workload is analytical aggregation over large datasets with filters on a few columns. Delivery analytics, warehouse throughput reporting, time-series metrics. Do not use a column store for transactional OLTP workloads. The per-row insert cost is prohibitive.
Use a graph database when your queries are dominated by multi-hop traversals: “find all packages that passed through the same hub as PKG-40291 in the last 48 hours.” If your queries are one-hop lookups (find packages by warehouse), a relational JOIN is faster and simpler.
For the logistics platform: PostgreSQL for transactional data (package metadata, inventory, assignments). ClickHouse or Parquet on S3 for delivery analytics and reporting. A document store for caching upstream API responses whose schema is not controlled by the platform. A graph database is not justified unless route optimization requires multi-hop traversals that cannot be expressed efficiently as recursive SQL CTEs.