Agoda Unifies Data Pipelines with Apache Spark to Achieve 95.6% Uptime
These articles are AI-generated summaries. Please check the original sources for full details.
Agoda recently consolidated multiple independent financial data pipelines into a centralized Apache Spark-based platform, improving data consistency and achieving 95.6% uptime. The Financial Unified Data Pipeline (FINUDP) processes millions of daily booking transactions, providing hourly updates to downstream teams.
The move addresses a common enterprise issue: siloed data pipelines leading to inconsistent metrics and potential financial reporting errors. Without a unified system, discrepancies can impact critical business decisions and regulatory compliance, costing organizations significant time and resources to reconcile.
Key Insights
- In 2023, 64% of organizations cited poor data quality as their biggest challenge.
- Data contracts define the schema and quality expectations agreed between data producers and consumers (Gartner).
- Apache Spark is used by companies like Netflix and Databricks for large-scale data processing.
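
The data-contract idea above can be sketched in plain Python. This is a minimal illustration, not Agoda's implementation: the `ColumnRule` and `DataContract` classes, the `bookings_v1` contract, and its column names are all hypothetical. The schema is assumed to arrive as a `{column: dtype}` mapping, which in PySpark could be obtained with `dict(df.dtypes)`.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class ColumnRule:
    """Expectation for one column (hypothetical structure)."""
    dtype: str                         # type name, e.g. "double" or "string"
    nullable: bool = False
    min_value: Optional[float] = None
    max_value: Optional[float] = None


@dataclass(frozen=True)
class DataContract:
    """Schema and quality expectations agreed by producer and consumers."""
    name: str
    columns: Dict[str, ColumnRule]


def check_schema(contract: DataContract, observed: Dict[str, str]) -> List[str]:
    """Return schema violations for an observed {column: dtype} mapping."""
    violations = []
    for col_name, rule in contract.columns.items():
        if col_name not in observed:
            violations.append(f"missing column: {col_name}")
        elif observed[col_name] != rule.dtype:
            violations.append(
                f"type mismatch on {col_name}: "
                f"expected {rule.dtype}, got {observed[col_name]}"
            )
    return violations


def check_row(contract: DataContract, row: dict) -> List[str]:
    """Return quality violations (nulls, out-of-range values) for one record."""
    violations = []
    for col_name, rule in contract.columns.items():
        value = row.get(col_name)
        if value is None:
            if not rule.nullable:
                violations.append(f"null in non-nullable column: {col_name}")
            continue
        if rule.min_value is not None and value < rule.min_value:
            violations.append(f"{col_name} below minimum {rule.min_value}: {value}")
        if rule.max_value is not None and value > rule.max_value:
            violations.append(f"{col_name} above maximum {rule.max_value}: {value}")
    return violations


# Hypothetical contract for a bookings feed
bookings_contract = DataContract(
    name="bookings_v1",
    columns={
        "booking_id": ColumnRule(dtype="string"),
        "amount": ColumnRule(dtype="double", min_value=0.0),
        "currency": ColumnRule(dtype="string"),
    },
)

# The producer changed 'amount' to a string and dropped 'currency'
schema_violations = check_schema(
    bookings_contract, {"booking_id": "string", "amount": "string"}
)
row_violations = check_row(
    bookings_contract, {"booking_id": "b1", "amount": -5.0, "currency": None}
)
print(schema_violations)
print(row_violations)
```

Keeping the contract as data rather than scattering checks through pipeline code means the producer and the consumer can review the same definition, and a failed check can name the clause that was broken.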
Working Example
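# Example of a basic data validation check in PySpark
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

def validate_data(df, column_name, min_value, max_value):
    """
    Keeps only the rows whose value in the given column falls within
    [min_value, max_value]. Rows with a null in that column are dropped too.
    """
    in_range = (col(column_name) >= min_value) & (col(column_name) <= max_value)
    return df.filter(in_range)

# Illustrative stand-in for 'sales_df', a Spark DataFrame with an 'amount' column
spark = SparkSession.builder.appName("validation-example").getOrCreate()
sales_df = spark.createDataFrame([(1, 250.0), (2, -10.0), (3, 5000.0)], ["id", "amount"])

validated_df = validate_data(sales_df, "amount", 0, 1000)
validated_df.show()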
Practical Applications
- Financial Institutions: Implementing a unified data pipeline for accurate regulatory reporting and risk management.
- Pitfall: Over-reliance on automated validations without data contracts can lead to undetected schema drift and data quality issues.
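
The schema-drift pitfall can be made concrete with a small check that compares the schema seen on the previous run with the one seen now. This is a sketch under stated assumptions: the function names, the column names, and the rule that only removed or retyped columns count as breaking are all illustrative choices, not something the source describes. Each snapshot is a `{column: dtype}` mapping, which in PySpark could come from `dict(df.dtypes)`.

```python
from typing import Dict, List


def detect_schema_drift(
    previous: Dict[str, str], current: Dict[str, str]
) -> Dict[str, List[str]]:
    """Compare two {column: dtype} snapshots and classify the differences."""
    added = sorted(set(current) - set(previous))
    removed = sorted(set(previous) - set(current))
    retyped = sorted(
        name
        for name in set(previous) & set(current)
        if previous[name] != current[name]
    )
    return {"added": added, "removed": removed, "retyped": retyped}


def has_breaking_drift(drift: Dict[str, List[str]]) -> bool:
    """Assumption: removed or retyped columns break consumers; additions do not."""
    return bool(drift["removed"] or drift["retyped"])


# Hypothetical snapshots from two consecutive pipeline runs
yesterday = {"booking_id": "string", "amount": "double", "currency": "string"}
today = {"booking_id": "string", "amount": "string", "fx_rate": "double"}

drift = detect_schema_drift(yesterday, today)
print(drift)
print("breaking:", has_breaking_drift(drift))
```

A range check on `amount` alone would not catch the dropped `currency` column in this example, which is why value-level validation and schema-level checks are usually run together.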