
Data Engineering

53 articles in this category (Page 2 of 3)

AI News, Artificial Intelligence, Data Engineering

Beyond the Vector Store: Why Production AI Requires a Relational Data Layer

Production AI applications require a hybrid data layer combining vector databases for semantic retrieval with relational databases to manage permissions, billing, and state with ACID guarantees.

AI News, Data Engineering, System Design

Scalable Event Streaming: Understanding Kafka Architecture for High-Volume Data

Apache Kafka is a distributed event streaming platform that relieves database read/write bottlenecks by decoupling producers from consumers across partitioned topics.

AI News, Data Engineering, DevOps

Eliminate Environment Inconsistency: Deploy Data Pipelines in 10 Minutes with Dataflow

Dataflow enables data teams to transition from setup to production pipelines in under 10 minutes by unifying dependencies and cloud-agnostic infrastructure.

AI News, Data Engineering, DevOps

Orchestrating Healthcare Data: The PECOS AWS Glue and Step Functions Pipeline

The PECOS Pipeline uses AWS Step Functions and Glue to ingest four healthcare datasets in parallel, retrying failed steps up to three times.

AI News, Machine Learning, Data Engineering

Building Scalable ML Data Pipelines for Image and Structured Data with Daft

Learn how to build an end-to-end ML pipeline using Daft, a Python-native data engine that handles MNIST image reshaping, feature engineering via batch UDFs, and Parquet persistence for high-performance processing.

AI News, Data Engineering, Artificial Intelligence

Beyond Block or Allow: The Shift to Pay-Per-Crawl Data Monetization

Stack Overflow and Cloudflare launch a pay-per-crawl model using HTTP 402 to monetize AI bot traffic directly.

AI News, Data Engineering, Software Development

Semantic Layer vs. Metrics Layer: A Technical Distinction

Distinguishing the metrics layer from the semantic layer helps prevent AI hallucinations and security leaks in modern data architecture by centralizing logic and governance.

AI News, Software Development, Data Engineering

Why Your AI Initiatives Fail Without a Semantic Layer

AI-driven natural language analytics often fail due to a lack of business context, leading to metric hallucinations that can result in 15% revenue discrepancies.

AI News, Data Engineering, Cloud Computing

Redesigning a Failing Data Pipeline to Eliminate Cascading Failures

A data pipeline redesigned with AWS managed services and Terraform achieved a 99.7% ingestion success rate and zero cascading failures during traffic spikes.

AI News, Data Engineering, Cloud Architecture

Beyond the Warehouse: Architecting Data Lineage and Source of Truth

Sarah Usher discusses the limitations of relying solely on data warehouses like BigQuery, highlighting a 5-minute query latency issue in a real-world example.

AI News, DevOps, Data Engineering

Rapid API-Driven Data Cleanup for DevOps under Pressure

Dirty data leads to operational inefficiencies: an estimated 80% of data scientists' time goes to data cleaning, which underscores the need for rapid API-driven solutions.

AI News, Elasticsearch, Data Engineering

Rename Existing Field With Elasticsearch Mapping

Learn why renaming a field in Elasticsearch typically requires creating a new index and reindexing the data, and how to do so while maintaining data integrity.

AI News, Data Engineering, Apache Spark

Agoda Unifies Data Pipelines with Apache Spark to Achieve 95.6% Uptime

Agoda consolidated independent financial data pipelines into a centralized Apache Spark platform, reducing inconsistencies and achieving 95.6% uptime while processing millions of daily transactions.

AI News, Data Engineering, Databases

GCAIDB Certification: Bridging AI and Database Expertise

The GCAIDB certification validates skills needed to manage databases supporting AI workloads, addressing a key failure point in AI initiatives.

AI News, Data Engineering, DevOps

Solved: Canceled my $15K/year ZoomInfo subscription. Built my own for $50/month.

A Reddit user reduced annual data costs from $15,000 to $600 by building a custom data solution using open-source tools and APIs.

AI News, Data Engineering, WebAssembly

DuckDB Enables Browser-Based Queries of Iceberg Datasets

DuckDB's new WebAssembly client allows querying Iceberg datasets directly in the browser, eliminating infrastructure setup.

AI News, Data Engineering, Machine Learning

Swiggy’s Hermes V3 Achieves 93% SQL Accuracy with GenAI

Swiggy’s Hermes V3, a GenAI-powered text-to-SQL assistant, improved SQL generation accuracy from 54% to 93% by leveraging vector retrieval and conversational memory.

AI News, Data Engineering, Data Science

Decathlon Switches to Polars to Optimize Data Pipelines and Infrastructure Costs

Decathlon reduced compute launch time from 8 to 2 minutes by migrating from Apache Spark to Polars for datasets under 50GB.

AI News, Data Engineering, Business Intelligence

Data Mashup vs. Data Stack Assumptions: Choosing the Right BI Architecture

Modern BI discussions often center on tools, but the key differentiator is the set of assumptions each architecture makes about data preparation, which affects both cost and agility.

AI News, MLOps, Data Engineering

Powering Enterprise AI Applications with Data and Open Source Software

Feast, an open-source feature store, addresses challenges in the AI/ML lifecycle, where a reported 87% of data science projects fail due to productionization issues.

AI News, Software Engineering, Data Engineering

Continuous Journey through Dagster - bugs and testing

Recent contributions to Dagster highlight the challenges of debugging race conditions and CI pipeline failures in open-source projects.

AI News, PostgreSQL, Data Engineering

Dynamic SQL in PostgreSQL for Payroll Data Retrieval

Dynamic SQL in PostgreSQL processes payroll data with parameterized queries for secure, scalable HR systems.

AI News, Data Engineering, Platform Architecture

Data Contracts: Bridging the Gap Between Data Producers and Consumers

Data contracts reduce misalignment by 80% in FinTech through explicit schema and SLA definitions.

AI News, NLP, Data Engineering

Preparing Data for BERT Training

BERT training requires specialized data preparation, including masked language modeling and next sentence prediction, to achieve optimal performance.
