Document Stores and When Denormalization Pays Off
Document Stores and When Denormalization Pays Off
The Black Box
The logistics platform receives package metadata from scanner APIs. Each scanner vendor sends a different JSON structure. Some include weight in grams, others in kilograms. Some nest the address inside a recipient object, others put it at the top level. The team stores these payloads as documents because “the schema keeps changing.” Six months later, every query that aggregates across packages requires defensive null checks, type coercion, and fallback logic for every field. The schema did not disappear. It moved from the database to the application code, where it is harder to enforce and impossible to index efficiently.
The Mechanism
Denormalization in a document store means storing redundant copies of data to avoid lookups. The package document embeds the warehouse name instead of referencing a warehouse collection:
// Concept: denormalization tradeoff
// Denormalized: warehouse name is stored in every package document
{
"_id": "PKG-40291",
"warehouse": {"id": "WH-042", "name": "Rotterdam Central", "capacity": 50000},
"status": "IN_TRANSIT"
}
// If the warehouse name changes (Rotterdam Central -> Rotterdam Main),
// every package document that references WH-042 must be updated.
// With 2 million packages from WH-042, that is 2 million document updates.
Denormalization pays off when:
- The denormalized data rarely changes (warehouse names change yearly, not daily).
- The read pattern always needs the denormalized data (every package display shows the warehouse name).
- The document count is bounded (packages are archived after delivery, not accumulated forever).
Denormalization costs too much when:
- The denormalized data changes frequently (driver phone numbers, route assignments).
- Consistency across documents is required (inventory counts that must agree with a central ledger).
PostgreSQL JSONB: The Middle Ground
PostgreSQL’s JSONB column type stores JSON documents inside a relational table. It combines relational guarantees (transactions, constraints, indexes) with document flexibility.
-- Concept: JSONB for semi-structured data within a relational model
CREATE TABLE scanner_payloads (
payload_id BIGSERIAL PRIMARY KEY,
package_id VARCHAR(12) NOT NULL REFERENCES packages(package_id),
scanner_vendor VARCHAR(50) NOT NULL,
received_at TIMESTAMPTZ NOT NULL DEFAULT now(),
raw_payload JSONB NOT NULL
);
-- Index a specific JSON field for querying
CREATE INDEX idx_scanner_payload_weight
ON scanner_payloads ((raw_payload->>'weight_grams')::numeric)
WHERE raw_payload ? 'weight_grams';
-- Query across varying schemas
SELECT
package_id,
COALESCE(
(raw_payload->>'weight_grams')::numeric,
(raw_payload->>'weight_kg')::numeric * 1000
) AS weight_grams
FROM scanner_payloads
WHERE scanner_vendor = 'ScanCorp'
AND received_at > '2024-11-01';
The JSONB column is stored using PostgreSQL’s TOAST mechanism (Chapter 5). Large JSON payloads are compressed and stored out-of-line. Queries that filter on JSONB fields use GIN indexes, which support containment (@>) and existence (?) operators.
The Observable Consequence
The storage cost of JSONB vs a normalized relational schema:
| Storage model | Row size | 10M rows | Index overhead |
|---|---|---|---|
| Normalized (4 tables, foreign keys) | 120 bytes/row total | 1.2 GB | 400 MB (B-Tree per FK) |
| JSONB (1 table, embedded document) | 800 bytes/row | 8 GB | 1.2 GB (GIN on JSONB) |
The JSONB model uses 6.7x more storage. Each query on a JSONB field is slower than a query on a native column because the GIN index is larger and the JSON parsing adds CPU overhead. The tradeoff is development speed: adding a new field to the JSON requires no migration, no ALTER TABLE, no deployment.
The Decision Rule
Use a document store (or JSONB column) when the data arrives in document form from external systems, the schema varies by source, and you need to query individual fields occasionally but primarily store and retrieve the whole document. Scanner API payloads in the logistics platform fit this pattern.
Do not use a document store for your core transactional data. Packages, inventory, assignments: these have stable schemas, require referential integrity, and benefit from relational joins. The effort saved by avoiding schema migrations is spent tenfold on application-level consistency checks, missing index support, and defensive data access code.
If your document is growing unbounded (tracking history appended over months), split it. Store the current state as a document and the history as a separate collection or table. The current state document stays small. The history is append-only and can be stored in a column store for efficient aggregation.