Skip to main content

On This Page

Beyond the Warehouse: Architecting Data Lineage and Source of Truth

2 min read
Share

These articles are AI-generated summaries. Please check the original sources for full details.

Architecting Data Lineage and Source of Truth

Sarah Usher’s presentation at QCon London 2025 emphasizes the importance of understanding data lineage and source of truth in architecting data systems. A key example from her experience involves a product engineering team that encountered a 5-minute query latency issue when using a warehouse for a simple product feature, illustrating the limitations of relying solely on data warehouses like BigQuery.

Why This Matters

The technical reality of data processing often diverges from ideal models, with warehouses struggling with latency and cost as they scale. For instance, a $100,000 monthly bill for a data warehouse can be a significant cost, highlighting the need for efficient data architecture. The failure to design a robust data lineage and source of truth can lead to data disorganization, confusion, and struggles to innovate, ultimately affecting the bottom line.

Key Insights

  • The medallion model categorizes data into bronze, silver, and gold layers based on quality, but this approach can be limiting when applied rigidly.
  • Data lineage is crucial for understanding how data flows and transforms across systems, enabling better data management and decision-making.
  • Using distributed batch systems like Spark or streaming technologies can help in curating data more efficiently than traditional warehouses.

Working Example

# Example of data processing using Apache Spark
from pyspark.sql import SparkSession

# Initialize Spark Session
spark = SparkSession.builder.appName("DataProcessing").getOrCreate()

# Load raw data into a DataFrame
raw_data = spark.read.csv("raw_data.csv", header=True, inferSchema=True)

# Curate data by applying transformations (e.g., filtering, grouping)
curated_data = raw_data.filter(raw_data["age"] > 18).groupBy("country").count()

# Output curated data to a file
curated_data.write.parquet("curated_data.parquet")

Practical Applications

  • Use Case: Implementing a data catalog to manage metadata and provide a single source of truth for data discovery and access.
  • Pitfall: Failing to store raw data in its original form, leading to loss of information and flexibility in data processing and analysis.

References:

Continue reading

Next article

The First 90 Seconds of Incident Response

Related Content