Lessons in Data Normalization: Avoiding Over-Abstraction in Production Migrations

What Phone Sanitization Revealed About Our Data

Engineer Omar Essaouaf set out to normalize phone numbers across a sharded production system. The work was estimated at one day but grew into a week-long project because of unforeseen data identity conflicts.

Why This Matters

Production data rarely matches the idealized model, and migrations lack an ‘off switch’: once a run has rewritten records, the change cannot simply be toggled back. Developers often reach for generic frameworks for the sake of reusability, but a wrong abstraction is costly in a migration, because a bug can cause permanent data loss or irreversible corruption of production records.

Key Insights

The Maintenance Command Corollary: Tools that run infrequently and touch sensitive data should optimize for auditability over extensibility (Essaouaf, 2026).
Normalization as Identity Discovery: Applying E.164 normalization can reveal duplicates that previously went unnoticed across shards, such as ‘+1 (800) 555-0123’ and ‘18005550123’, which both normalize to the same number.
Database-Level Aggregation: Using temporary SQL tables with ‘GROUP BY’ and ‘HAVING COUNT(*) > 1’ is more memory-efficient than building in-process hash maps for multi-million-record datasets.
Semantic vs. Type Distinction: Columns that share a data type do not necessarily share business rules; identity fields require different migration logic than communication logs, even though both are stored as the same type.
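
The second and third insights can be sketched together. The example below is a minimal illustration, not the author's code: `normalize_us_phone` is a hypothetical helper that handles only US/Canada numbers (real E.164 normalization needs country-aware parsing), the rows and the `normalized_phones` table are invented for the example, and an in-memory SQLite database stands in for whichever database the production system actually uses.

```python
import re
import sqlite3

def normalize_us_phone(raw):
    """Return an E.164 string for a US/Canada number, or None if it does not fit.

    Simplified assumption: only the +1 country code is handled.
    """
    digits = re.sub(r"\D", "", raw)          # drop spaces, dashes, parentheses, '+'
    if len(digits) == 10:                    # national format without country code
        digits = "1" + digits
    if len(digits) == 11 and digits.startswith("1"):
        return "+" + digits
    return None

# Hypothetical rows collected from two shards: (shard, account_id, raw phone).
rows = [
    ("shard_a", 1, "+1 (800) 555-0123"),
    ("shard_b", 7, "18005550123"),
    ("shard_b", 9, "800-555-0199"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TEMP TABLE normalized_phones (shard TEXT, account_id INTEGER, e164 TEXT)"
)
conn.executemany(
    "INSERT INTO normalized_phones VALUES (?, ?, ?)",
    [(shard, account_id, normalize_us_phone(phone)) for shard, account_id, phone in rows],
)

# Let the database find collisions instead of holding every number in a Python dict.
duplicates = conn.execute(
    "SELECT e164, COUNT(*) FROM normalized_phones "
    "WHERE e164 IS NOT NULL GROUP BY e164 HAVING COUNT(*) > 1"
).fetchall()
print(duplicates)  # [('+18005550123', 2)]
```

The two differently formatted inputs from the article collapse to one normalized value, so the aggregate query reports them as a single number held by two accounts on different shards.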

Practical Applications

Use case: Sharded production systems implementing E.164 normalization to identify duplicate account holders.
Pitfall: Over-engineering migrations with dynamic schema discovery, which increases the surface area for subtle bugs in production data.
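
One way to avoid this pitfall, sketched under assumptions rather than taken from the original article, is to name each target column explicitly and produce a reviewable plan before anything is written. The `accounts.phone` column, the `IDENTITY_COLUMNS` list, and the `plan_changes` helper are all hypothetical names used for illustration.

```python
import re

# Hypothetical allowlist: each (table, column) pair is named on purpose,
# rather than discovered by scanning the schema for phone-like columns.
IDENTITY_COLUMNS = [("accounts", "phone")]

def normalize_us_phone(raw):
    """Simplified US/Canada-only normalizer; returns None when unparseable."""
    digits = re.sub(r"\D", "", raw)
    if len(digits) == 10:
        digits = "1" + digits
    return "+" + digits if len(digits) == 11 and digits.startswith("1") else None

def plan_changes(table, column, records):
    """Build an audit trail of proposed changes without writing anything."""
    plan = []
    for record_id, old in records:
        new = normalize_us_phone(old)
        if new != old:
            # new is None for values that need manual review.
            plan.append((table, column, record_id, old, new))
    return plan

# Hypothetical sample of (id, current value) pairs read from one shard.
sample = [(1, "18005550123"), (2, "+18005550123"), (3, "n/a")]
for table, column in IDENTITY_COLUMNS:
    for entry in plan_changes(table, column, sample):
        print(entry)
```

The printed plan lists record 1 with its proposed new value and record 3 with `None` for manual review, while record 2 is omitted because it is already normalized. A plan like this can be inspected and archived before the write step runs, which favors auditability over extensibility.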

References:

https://dev.to/omaressaouaf/what-phone-sanitization-revealed-about-our-data-11da
