Building Scalable ML Data Pipelines for Image and Structured Data with Daft

These articles are AI-generated summaries. Please check the original sources for full details.

A Coding Guide to Build a Scalable End-to-End Machine Learning Data Pipeline Using Daft for High-Performance Structured and Image Data Processing

Michal Sutter demonstrates Daft, a high-performance Python-native data engine designed for complex analytical pipelines. The tutorial processes the MNIST dataset, converting flat pixel arrays loaded from JSON into 28x28 image matrices with User-Defined Functions (UDFs).

Why This Matters

Traditional Python data tools often struggle with modern ML workloads that mix structured tabular data with unstructured image tensors. Engineers frequently hit bottlenecks when switching between pandas for metadata and specialized libraries for image processing, which leads to fragmented code that is hard to scale. Daft addresses this with a unified, lazy execution engine that handles both data types in a single, memory-efficient framework. This reduces the performance limits that standard Python libraries impose while keeping a familiar API for engineers.

Key Insights

  • Installing daft, pyarrow, and numpy together in a clean environment such as Google Colab keeps library versions consistent and the pipeline reproducible.
  • Row-wise UDFs reshape each flat 784-element pixel array into a 28x28 matrix with NumPy's reshape, returning None for malformed rows, so that inputs are model-ready.
  • Batch UDFs with batch_size=512 hand the function up to 512 rows per call, which spreads the per-call overhead of Python-native execution across many images.
  • Integration of statistical features like pixel_mean and pixel_std (float32) allows for immediate data validation and enrichment before model training.
  • Persistence to Parquet format via df.write_parquet() ensures that processed features are stored in a compressed, industry-standard format for production reuse.
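The pixel_mean and pixel_std features mentioned above can be sketched in plain NumPy. This is a minimal illustration rather than the tutorial's code: the helper name pixel_stats and the toy image are assumptions made for this example, and the Daft wiring is left out.

```python
import numpy as np

def pixel_stats(pixels):
    """Return (mean, std) of a flat pixel list as float32, or None if it is empty.

    pixel_stats is a hypothetical helper name used only for this sketch.
    """
    arr = np.asarray(pixels, dtype=np.float32)
    if arr.size == 0:
        return None
    return np.float32(arr.mean()), np.float32(arr.std())

# Toy 784-pixel image: first half black (0), second half white (255).
toy = [0] * 392 + [255] * 392
mean, std = pixel_stats(toy)
print(mean, std)
```

Values like these make quick sanity checks possible before training, for example flagging all-black images whose mean and standard deviation are both zero.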

Working Examples

Reshaping raw pixel data into 28x28 images using Daft’s row-wise column application.

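import numpy as np
import daft
from daft import col

def to_28x28(pixels):
    # Convert the flat pixel list to float32; reject rows that do not hold 784 values.
    arr = np.array(pixels, dtype=np.float32)
    if arr.size != 784:
        return None
    return arr.reshape(28, 28)

# df is assumed to be the MNIST DataFrame loaded earlier; "image" holds the flat pixel list.
df2 = (
    df
    .with_column(
        "img_28x28",
        # Applied row by row; the result is stored as a generic Python object.
        col("image").apply(to_28x28, return_dtype=daft.DataType.python())
    )
)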
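Because to_28x28 is an ordinary Python function, it can be checked on its own before it is attached to a Daft column. The sketch below repeats the function so that it runs standalone; the two sample inputs are made up for illustration.

```python
import numpy as np

def to_28x28(pixels):
    # Same logic as the row-wise function above, repeated so this snippet runs standalone.
    arr = np.array(pixels, dtype=np.float32)
    if arr.size != 784:
        return None
    return arr.reshape(28, 28)

# A well-formed input: 784 values, filled row by row (NumPy's default C order).
good = to_28x28(list(range(784)))
# A malformed input: too short, so the function signals it with None.
bad = to_28x28([0, 1, 2])

print(good.shape, good[1, 0], bad)
```

The reshape fills the matrix row by row, so element 28 of the flat list lands at position [1, 0]. Malformed rows come back as None instead of raising, which lets downstream steps skip them.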

Implementing a batch UDF for optimized feature engineering with a specified batch size of 512.

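import numpy as np
import daft

@daft.udf(return_dtype=daft.DataType.list(daft.DataType.float32()), batch_size=512)
def featurize(images_28x28):
    # Called once per batch of up to 512 rows; images_28x28 is a Daft Series.
    out = []
    for img in images_28x28.to_pylist():
        if img is None:
            out.append(None)  # propagate rows that failed reshaping
            continue
        img = np.asarray(img, dtype=np.float32)
        # Row and column intensity profiles (28 values each), scaled by 255.
        row_sums = img.sum(axis=1) / 255.0
        col_sums = img.sum(axis=0) / 255.0
        # Intensity-weighted centroid, normalized by image size; 1e-6 guards against blank images.
        total = img.sum() + 1e-6
        ys, xs = np.indices(img.shape)
        cy = float((ys * img).sum() / total) / 28.0
        cx = float((xs * img).sum() / total) / 28.0
        # 28 + 28 + 4 = 60 features per image.
        stats = np.array([cy, cx, img.mean() / 255.0, img.std() / 255.0], dtype=np.float32)
        vec = np.concatenate([row_sums, col_sums, stats])
        out.append(vec.astype(np.float32).tolist())
    return out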
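To see what a single feature vector contains, the per-image arithmetic can be pulled out of the UDF and run on a toy image. This is a sketch under stated assumptions: featurize_one is a hypothetical name introduced here, and the test image with one bright pixel is invented for illustration.

```python
import numpy as np

def featurize_one(img):
    """Per-image arithmetic from the batch UDF above, without the Daft wrapper.

    featurize_one is a hypothetical helper name used only in this sketch.
    """
    img = np.asarray(img, dtype=np.float32)
    row_sums = img.sum(axis=1) / 255.0
    col_sums = img.sum(axis=0) / 255.0
    total = img.sum() + 1e-6
    ys, xs = np.indices(img.shape)
    cy = float((ys * img).sum() / total) / 28.0
    cx = float((xs * img).sum() / total) / 28.0
    stats = np.array([cy, cx, img.mean() / 255.0, img.std() / 255.0], dtype=np.float32)
    return np.concatenate([row_sums, col_sums, stats]).astype(np.float32)

# Toy image: a single fully bright pixel at row 7, column 14.
toy = np.zeros((28, 28), dtype=np.float32)
toy[7, 14] = 255.0
vec = featurize_one(toy)

print(len(vec))          # 28 row sums + 28 column sums + 4 statistics
print(vec[56], vec[57])  # normalized centroid (cy, cx)
```

With one bright pixel, the centroid sits on that pixel, so the normalized coordinates come out close to 7/28 and 14/28. The only nonzero row and column sums are at indices 7 and 14.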

Practical Applications

  • Use Case: MNIST Classification Pipeline: Transforming raw JSON image data into engineered features for Scikit-learn Logistic Regression models.
  • Pitfall: Row-wise vs. Batch UDFs: Using row-wise operations for heavy feature extraction can lead to significant performance degradation compared to Daft’s vectorized batch UDFs.
  • Use Case: Feature Persistence: Storing large-scale image metadata and engineered vectors in Parquet to avoid re-computing complex transformations in downstream training loops.
  • Pitfall: Schema Mismatch: Daft plans queries lazily, so it needs each UDF's output type before any data is processed. A missing or incorrect return_dtype can cause failures when the engine builds or runs the physical plan.
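The row-wise versus batch trade-off can be illustrated in plain NumPy, independent of Daft. The sketch below is not from the original tutorial. It compares one Python-level call per image against a single reduction over a stacked batch, using random stand-in images and hypothetical function names. The batch UDF shown earlier still loops over images inside each batch, so vectorizing the body this way would be a further optimization, not something the source does.

```python
import numpy as np

def stats_rowwise(images):
    """One Python-level computation per image: returns [(mean, std), ...]."""
    return [(float(img.mean()), float(img.std())) for img in images]

def stats_batched(images):
    """One vectorized computation for the whole batch."""
    batch = np.stack(images)           # shape (N, 28, 28)
    means = batch.mean(axis=(1, 2))    # reduce over both pixel axes
    stds = batch.std(axis=(1, 2))
    return list(zip(means.tolist(), stds.tolist()))

# Random stand-in data: 512 images, matching the batch size used above.
rng = np.random.default_rng(0)
images = [rng.integers(0, 256, size=(28, 28)).astype(np.float32) for _ in range(512)]

rowwise = stats_rowwise(images)
batched = stats_batched(images)
print(np.allclose(rowwise, batched, rtol=1e-3))
```

Both paths produce the same statistics up to floating-point rounding. The batched version replaces 512 small NumPy calls with two large ones, which is the kind of saving batching is meant to enable.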
