Beyond the Warehouse: Architecting Data Lineage and Source of Truth

Architecting Data Lineage and Source of Truth

In her presentation at QCon London 2025, Sarah Usher argues that understanding data lineage and source of truth is central to architecting data systems. A key example from her experience involves a product engineering team that hit a 5-minute query latency when it used a warehouse to back a simple product feature, which illustrates the limits of relying solely on a data warehouse such as BigQuery.

Why This Matters

In practice, data processing often diverges from the ideal model: warehouses struggle with latency and cost as they scale. A data warehouse bill of $100,000 per month, for instance, is a significant cost and highlights the need for efficient data architecture. Without a robust design for data lineage and source of truth, teams face disorganized data, confusion, and difficulty innovating, all of which ultimately affect the bottom line.

Key Insights

The medallion model categorizes data into bronze, silver, and gold layers based on quality, but this approach can be limiting when applied rigidly.
Data lineage is crucial for understanding how data flows and transforms across systems, enabling better data management and decision-making.
Distributed batch systems like Spark, or streaming technologies, can curate data more efficiently than a traditional warehouse.
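
To make the lineage idea concrete, here is a minimal sketch of recording which datasets feed which, and tracing a curated dataset back to its raw sources. Everything in it is an assumption for illustration: the `LineageGraph` class, its `record` and `sources` methods, and the dataset names are hypothetical and do not come from the talk or from any particular lineage tool.

```python
from dataclasses import dataclass, field


@dataclass
class LineageGraph:
    """Hypothetical in-memory registry: dataset name -> its direct inputs."""

    parents: dict[str, list[str]] = field(default_factory=dict)

    def record(self, output: str, inputs: list[str]) -> None:
        # Register that `output` was derived from `inputs`.
        self.parents[output] = list(inputs)
        for name in inputs:
            # Make sure every input is known, even if nothing feeds it yet.
            self.parents.setdefault(name, [])

    def sources(self, dataset: str) -> set[str]:
        # Walk upstream until reaching datasets with no recorded inputs.
        seen: set[str] = set()
        roots: set[str] = set()
        stack = [dataset]
        while stack:
            current = stack.pop()
            if current in seen:
                continue
            seen.add(current)
            inputs = self.parents.get(current, [])
            if not inputs:
                roots.add(current)
            stack.extend(inputs)
        return roots


# Illustrative dataset names using medallion-style prefixes (assumed).
graph = LineageGraph()
graph.record("silver.users_clean", ["bronze.users_raw"])
graph.record("gold.users_by_country", ["silver.users_clean", "bronze.countries_raw"])
print(sorted(graph.sources("gold.users_by_country")))
```

A real system would persist this graph and capture it automatically from job metadata, but even this small structure answers the question lineage exists to answer: which raw inputs does a given curated dataset depend on?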

Working Example

# Example: curating raw data with Apache Spark (PySpark)
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Initialize the Spark session
spark = SparkSession.builder.appName("DataProcessing").getOrCreate()

# Load raw data into a DataFrame; inferSchema makes an extra pass over the file
raw_data = spark.read.csv("raw_data.csv", header=True, inferSchema=True)

# Curate: keep rows where age is over 18, then count rows per country
curated_data = raw_data.filter(F.col("age") > 18).groupBy("country").count()

# Write the curated result as Parquet; "overwrite" lets the job be rerun
curated_data.write.mode("overwrite").parquet("curated_data.parquet")

# Release Spark resources once the job is done
spark.stop()

Practical Applications

Use Case: Implementing a data catalog to manage metadata and provide a single source of truth for data discovery and access.
Pitfall: Failing to store raw data in its original form, leading to loss of information and flexibility in data processing and analysis.
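
The use case and the pitfall above can be sketched together: land each raw payload byte-for-byte, and append a small catalog entry that says where it came from and where it lives. This is a rough illustration only. The `land_raw` function, the JSON Lines catalog file, the content-hash file naming, and the `crm_export` source name are all assumptions made for the example, not a description of any specific catalog product.

```python
import hashlib
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def land_raw(payload: bytes, source: str, raw_dir: Path, catalog_path: Path) -> dict:
    """Store the payload unmodified and append a catalog entry describing it."""
    digest = hashlib.sha256(payload).hexdigest()
    raw_dir.mkdir(parents=True, exist_ok=True)
    target = raw_dir / f"{digest}.raw"
    if not target.exists():
        # Raw files are treated as immutable: never overwrite an existing one.
        target.write_bytes(payload)
    entry = {
        "source": source,
        "path": str(target),
        "sha256": digest,
        "landed_at": datetime.now(timezone.utc).isoformat(),
    }
    # Hypothetical catalog format: one JSON object per line.
    with catalog_path.open("a", encoding="utf-8") as catalog:
        catalog.write(json.dumps(entry) + "\n")
    return entry


# Demo in a temporary directory so the sketch has no side effects elsewhere.
workdir = Path(tempfile.mkdtemp())
payload = b"name,age,country\nAda,36,UK\n"
entry = land_raw(payload, "crm_export", workdir / "raw", workdir / "catalog.jsonl")
print(entry["path"])
```

Because the raw bytes are kept as received, any curated dataset can be rebuilt later with different rules, and the stored hash lets a reader check that the raw file has not changed since it was landed.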
