pragmatic data science with python

Repository Structure and Version Control for Data

11 min read Chapter 3 of 33
Summary

This section dismantles the monolithic notebook anti-pattern and replaces it with a modular repository structure that separates data loading, feature engineering, training, and serving into testable Python modules. We then introduce DVC (Data Version Control) to solve the problem git cannot: tracking large datasets and model artifacts with the same rigor as source code. By the end, the reader has a complete, reproducible ML project where every model can be traced to the exact data, code, and hyperparameters that produced it.

The Notebook Monolith

Open any data science team’s shared drive and you will find a file named something like model_v2_final_actually_final_johns_fixes.ipynb. This notebook contains 147 cells. Cells 1–30 load and clean data. Cells 31–80 engineer features. Cells 81–120 train three different models. Cells 121–140 evaluate them. Cells 141–147 serialize the winner and make predictions.

This notebook has four fatal properties:

It is unreviewable. GitHub diffs on .ipynb files show JSON noise — cell metadata changes, output blobs, execution counters. A reviewer cannot tell what code changed without reconstructing the entire notebook in their head.

It is untestable. You cannot import cell [47] into a test file. You cannot mock the database connection in cell [12] without executing all preceding cells. The only way to test the notebook is to run it end-to-end, which takes 45 minutes and requires access to production data.

It is stateful. Cell execution order matters. Run cell 50 before cell 30 and you get a NameError. Run cell 80 twice and you double your feature matrix. The notebook’s correctness depends on the sequence of human clicks, which is not reproducible.

It is undeployable. No production system executes Jupyter notebooks. Eventually, someone must extract the “important” cells into a .py file. This extraction is manual, error-prone, and invalidates all previous testing (which was already inadequate).

The solution is to never put production logic in notebooks in the first place. Notebooks are for exploration — visualizing distributions, testing hypotheses, prototyping transformations. Once an approach proves viable, the code moves to a Python module with type hints, tests, and a clear interface.
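To make the statefulness problem concrete, here is a minimal sketch contrasting notebook-style shared state with a pure function. The names (`features`, `cell_80`, `add_lag_features`) are hypothetical and chosen only for illustration.

```python
# Hypothetical illustration: notebook-style shared state vs. a pure function.

features: list[str] = ["price"]  # global state, as it would live in a notebook kernel


def cell_80() -> None:
    """Mimics a notebook cell: mutates the global list on every run."""
    features.extend(["price_lag_1", "price_lag_7"])


def add_lag_features(base: list[str]) -> list[str]:
    """Pure alternative: returns a new list and leaves the input untouched."""
    return base + ["price_lag_1", "price_lag_7"]


cell_80()
cell_80()  # running the "cell" twice appends the lag columns twice
stateful_count = len(features)

base = ["price"]
pure_once = add_lag_features(base)
pure_twice = add_lag_features(base)  # same input, same output, no hidden state
```

The stateful version ends up with duplicated columns after a second run, while the pure function gives the same answer no matter how often or in what order it is called.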

A Production ML Repository Structure

Here is the layout this book uses for every ML project. Each directory has a single responsibility.

ml-forecasting/
├── pyproject.toml          # Dependencies and tool configuration
├── uv.lock                 # Deterministic lockfile
├── dvc.yaml                # Pipeline stages
├── dvc.lock                # Data/model version hashes
├── params.yaml             # Hyperparameters (single source of truth)
├── .gitignore
├── .dvcignore

├── src/
│   └── forecasting/
│       ├── __init__.py
│       ├── data/
│       │   ├── __init__.py
│       │   ├── loader.py       # Raw data ingestion + Pydantic validation
│       │   ├── transforms.py   # Feature engineering (pure functions)
│       │   └── splitter.py     # Train/val/test splitting logic
│       ├── models/
│       │   ├── __init__.py
│       │   ├── train.py        # Training entrypoint
│       │   ├── evaluate.py     # Metric computation + reporting
│       │   └── registry.py     # Model serialization/loading
│       └── serving/
│           ├── __init__.py
│           ├── predict.py      # Inference logic
│           └── schemas.py      # API request/response models

├── configs/
│   ├── train.yaml          # Model hyperparameters
│   └── features.yaml       # Feature definitions and encodings

├── data/
│   ├── raw/                # Immutable source data (DVC-tracked)
│   └── processed/          # Transformed features (DVC-tracked)

├── models/                 # Serialized model artifacts (DVC-tracked)

├── notebooks/              # Exploration ONLY
│   └── eda/
│       └── 01_distributions.ipynb

├── tests/
│   ├── conftest.py         # Shared fixtures
│   ├── test_loader.py
│   ├── test_transforms.py
│   ├── test_train.py
│   └── fixtures/
│       └── sample_data.csv # Small, deterministic test data

└── scripts/
    └── run_pipeline.py     # CLI entry point for full pipeline

Why This Layout Works

src/ layout with an installable package. Placing code under src/forecasting/ prevents accidental imports from the project root. (Because forecasting/ contains an __init__.py, it is a regular package rather than a namespace package.) When you run uv run python -m forecasting.models.train, Python resolves the import from the installed package, not from a loose file in the working directory. This eliminates an entire class of “it works locally but not in Docker” bugs.
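One way to check where an import actually resolves from is to ask the import system directly. The sketch below uses only the standard library. The helper names `module_origin` and `resolves_under` are assumptions made for this example, not part of any tool used in this book.

```python
import importlib.util
from pathlib import Path
from typing import Optional


def module_origin(name: str) -> Optional[Path]:
    """Return the file a top-level module would be loaded from, or None if not found."""
    spec = importlib.util.find_spec(name)
    if spec is None or spec.origin is None:
        return None
    return Path(spec.origin)


def resolves_under(name: str, directory: Path) -> bool:
    """True if the module's source file lives somewhere below `directory`."""
    origin = module_origin(name)
    return origin is not None and directory.resolve() in origin.resolve().parents
```

In a project using this layout, `resolves_under("forecasting", Path("src"))` would be expected to return True after an editable install, assuming the package is installed from the local checkout.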

Pure functions in transforms.py. Feature engineering functions take a DataFrame and return a DataFrame. No side effects, no database calls, no file I/O. This makes them trivially testable:

# src/forecasting/data/transforms.py
import polars as pl


def add_rolling_features(
    df: pl.DataFrame,
    value_col: str,
    windows: list[int],
) -> pl.DataFrame:
    """Add rolling mean and std features for specified windows.
    
    Pure function: no side effects, deterministic output
    for the same input.
    """
    expressions = []
    for w in windows:
        expressions.extend([
            pl.col(value_col)
            .rolling_mean(window_size=w)
            .alias(f"{value_col}_rolling_mean_{w}"),
            pl.col(value_col)
            .rolling_std(window_size=w)
            .alias(f"{value_col}_rolling_std_{w}"),
        ])
    return df.with_columns(expressions)

# tests/test_transforms.py
import polars as pl
from forecasting.data.transforms import add_rolling_features


def test_rolling_features_column_names() -> None:
    df = pl.DataFrame({
        "date": ["2024-01-01", "2024-01-02", "2024-01-03",
                 "2024-01-04", "2024-01-05"],
        "price": [100.0, 102.0, 99.0, 105.0, 103.0],
    })
    result = add_rolling_features(df, "price", windows=[3])
    
    assert "price_rolling_mean_3" in result.columns
    assert "price_rolling_std_3" in result.columns
    assert result.shape[0] == 5


def test_rolling_features_values() -> None:
    df = pl.DataFrame({"value": [1.0, 2.0, 3.0, 4.0, 5.0]})
    result = add_rolling_features(df, "value", windows=[3])
    
    # First two rows should be null (insufficient window)
    assert result["value_rolling_mean_3"][0] is None
    assert result["value_rolling_mean_3"][1] is None
    # Third row: mean of [1, 2, 3] = 2.0
    assert result["value_rolling_mean_3"][2] == 2.0

These tests run in milliseconds. No GPU, no database, no 10 GB dataset. Because the function is pure, it is fast to test and cannot be broken by execution order.

Configs separated from code. Hyperparameters live in params.yaml, not hardcoded in training scripts. This enables two things: DVC can track parameter changes as part of the pipeline, and you can swap configurations without modifying Python files.

# params.yaml
train:
  n_estimators: 500
  max_depth: 8
  learning_rate: 0.05
  early_stopping_rounds: 20
  
features:
  rolling_windows: [7, 14, 30]
  target_col: "price"
  categorical_cols: ["sector", "region"]

# src/forecasting/models/train.py
from pathlib import Path

import yaml
import polars as pl
from sklearn.ensemble import GradientBoostingRegressor
import joblib


def load_params(path: Path = Path("params.yaml")) -> dict:
    """Load hyperparameters from YAML config."""
    with open(path) as f:
        return yaml.safe_load(f)


def train_model(
    train_df: pl.DataFrame,
    target_col: str,
    params: dict,
) -> GradientBoostingRegressor:
    """Train a gradient boosting model with specified parameters."""
    feature_cols = [c for c in train_df.columns if c != target_col]
    
    X = train_df.select(feature_cols).to_numpy()
    y = train_df[target_col].to_numpy()
    
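    # Note: early_stopping_rounds from params.yaml is not consumed here;
    # only the three parameters below are passed to the estimator.
    model = GradientBoostingRegressor(
        n_estimators=params["n_estimators"],
        max_depth=params["max_depth"],
        learning_rate=params["learning_rate"],
    )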
    model.fit(X, y)
    return model


if __name__ == "__main__":
    params = load_params()
    train_data = pl.read_parquet("data/processed/train.parquet")
    
    model = train_model(
        train_data,
        target_col=params["features"]["target_col"],
        params=params["train"],
    )
    
    output_path = Path("models/model.joblib")
    output_path.parent.mkdir(parents=True, exist_ok=True)
    joblib.dump(model, output_path)
    print(f"Model saved to {output_path}")
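The load_params function above returns an untyped dict, so a misspelled or missing key only surfaces when training starts. As a hedged sketch of one alternative, the train section could be parsed into a frozen dataclass that fails fast. The class name `TrainParams` and its validation rules are assumptions for illustration, not part of the project code shown here.

```python
from dataclasses import dataclass, fields
from typing import Any


@dataclass(frozen=True)
class TrainParams:
    """Typed view of the `train:` section of params.yaml (illustrative)."""

    n_estimators: int
    max_depth: int
    learning_rate: float
    early_stopping_rounds: int

    @classmethod
    def from_dict(cls, raw: dict[str, Any]) -> "TrainParams":
        expected = {f.name for f in fields(cls)}
        missing = expected - set(raw)
        unknown = set(raw) - expected
        if missing or unknown:
            raise ValueError(
                f"missing={sorted(missing)}, unknown={sorted(unknown)}"
            )
        params = cls(**raw)
        # Illustrative sanity checks; adjust to the model actually in use.
        if params.n_estimators <= 0 or params.max_depth <= 0:
            raise ValueError("n_estimators and max_depth must be positive")
        if params.learning_rate <= 0:
            raise ValueError("learning_rate must be positive")
        return params


train_params = TrainParams.from_dict(
    {
        "n_estimators": 500,
        "max_depth": 8,
        "learning_rate": 0.05,
        "early_stopping_rounds": 20,
    }
)
```

The dict passed in would normally be `load_params()["train"]`. A Pydantic model would serve the same purpose if the project already depends on it.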

Starting from scratch every time is wasteful. The cookiecutter-data-science template provides a reasonable starting point:

uv tool install cookiecutter
cookiecutter https://github.com/drivendataorg/cookiecutter-data-science

Use it as a starting point, then deviate deliberately. Common deviations for production ML:

  • Remove Makefile in favor of DVC pipelines or uv run scripts. Make decides what to rebuild from file timestamps; it does not track content hashes or parameter values.
  • Add src/ layout — the default template puts code at the project root, which causes import ambiguity.
  • Add serving/ directory — most templates assume offline batch processing, not real-time inference.
  • Replace setup.py with pyproject.toml — the Python packaging ecosystem has moved on.

DVC: Git for Data and Models

Git tracks code well. It handles large binary data poorly. A 2 GB training dataset committed to git bloats the repository permanently: even after deletion, the blob lives on in git history. git clone becomes a 20-minute ordeal, and large model files (500 MB+) make it worse.

DVC (Data Version Control) solves this by storing pointer files in git and the actual data in remote storage (S3, GCS, Azure Blob, or even a local directory).

The DVC Mental Model

Think of DVC as a layer on top of git that extends version control to large files.

What                                  | Tracked by           | Stored in
--------------------------------------|----------------------|--------------------
Python source code                    | git                  | GitHub/GitLab
pyproject.toml, params.yaml           | git                  | GitHub/GitLab
Raw data (2 GB CSV)                   | DVC (pointer in git) | S3/GCS/local remote
Processed features (500 MB parquet)   | DVC (pointer in git) | S3/GCS/local remote
Trained model (200 MB joblib)         | DVC (pointer in git) | S3/GCS/local remote
Pipeline definition (dvc.yaml)        | git                  | GitHub/GitLab
Pipeline state (dvc.lock)             | git                  | GitHub/GitLab

When you run git log, you see when code and parameters changed. When you inspect dvc.lock, you see the exact hash of the data and model for that commit. Together, they answer: “What code, data, and parameters produced this model?”

Setting Up DVC

# Install DVC (with S3 support)
uv add dvc dvc-s3

# Initialize DVC in an existing git repo
uv run dvc init

# Configure remote storage
uv run dvc remote add -d myremote s3://my-ml-bucket/dvc-store

# Or use a local directory for learning
uv run dvc remote add -d localremote /tmp/dvc-store

DVC creates a .dvc/ directory (analogous to .git/) and a .dvcignore file. Both should be committed to git; the .dvc/.gitignore that DVC generates keeps the local cache out of the repository.

Tracking Data Files

# Add a large dataset to DVC tracking
uv run dvc add data/raw/transactions.csv

This command does three things:

  1. Computes the MD5 hash of transactions.csv
  2. Creates data/raw/transactions.csv.dvc — a small pointer file containing the hash
  3. Adds transactions.csv to data/raw/.gitignore so git never touches the actual data

The pointer file looks like this:

# data/raw/transactions.csv.dvc
outs:
- md5: a1b2c3d4e5f6789012345678abcdef01
  size: 2147483648
  hash: md5
  path: transactions.csv
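The hashing step can be sketched in a few lines of standard-library Python. This is an illustration of the idea, not DVC's implementation: the helper names `md5_of_file` and `make_pointer` are hypothetical, and the sample file is a tiny stand-in for the real dataset.

```python
import hashlib
import tempfile
from pathlib import Path


def md5_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in fixed-size chunks so large files never sit fully in memory."""
    digest = hashlib.md5(usedforsecurity=False)
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def make_pointer(path: Path) -> dict:
    """Build a dict mirroring the fields of the pointer file shown above."""
    return {
        "outs": [
            {
                "md5": md5_of_file(path),
                "size": path.stat().st_size,
                "hash": "md5",
                "path": path.name,
            }
        ]
    }


# Demo on a throwaway file (hypothetical contents).
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "transactions.csv"
    sample.write_bytes(b"id,amount\n1,9.99\n")
    pointer = make_pointer(sample)
```

Because the pointer depends only on the file's bytes, two copies of the same dataset produce the same hash, and any change to the data produces a different one.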

Now commit the pointer to git and push the data to remote storage:

git add data/raw/transactions.csv.dvc data/raw/.gitignore
git commit -m "Track transactions dataset v1"

# Push actual data to remote storage
uv run dvc push

When a colleague clones the repo, they get the pointer file from git and pull the actual data from remote storage:

git clone https://github.com/team/ml-forecasting.git
cd ml-forecasting
uv sync          # Install Python dependencies
uv run dvc pull  # Download data from remote storage

DVC Pipelines: Reproducible End-to-End

Individual file tracking is useful, but the real power of DVC is pipelines. A dvc.yaml file defines the stages of your ML workflow — their commands, dependencies, and outputs. DVC tracks the entire graph and only re-executes stages whose inputs have changed.

# dvc.yaml
stages:
  prepare:
    cmd: uv run python src/forecasting/data/loader.py
    deps:
      - src/forecasting/data/loader.py
      - data/raw/transactions.csv
    params:
      - features.target_col
      - features.categorical_cols
    outs:
      - data/processed/train.parquet
      - data/processed/test.parquet

  train:
    cmd: uv run python src/forecasting/models/train.py
    deps:
      - src/forecasting/models/train.py
      - data/processed/train.parquet
    params:
      - train.n_estimators
      - train.max_depth
      - train.learning_rate
    outs:
      - models/model.joblib

  evaluate:
    cmd: uv run python src/forecasting/models/evaluate.py
    deps:
      - src/forecasting/models/evaluate.py
      - models/model.joblib
      - data/processed/test.parquet
    metrics:
      - metrics.json:
          cache: false

[Figure: DVC Pipeline]

Run the entire pipeline:

uv run dvc repro

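DVC computes the dependency graph (prepare → train → evaluate), checks which stages have stale inputs, and re-executes only what’s necessary. If you change a training hyperparameter such as train.max_depth in params.yaml, DVC skips prepare (neither its dependencies nor its tracked parameters have changed) and re-runs train and evaluate. Changing a features.* parameter that prepare tracks invalidates prepare as well.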
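The staleness logic can be approximated with a toy model. The sketch below is a simplification under stated assumptions: the stage table is a hand-written, shortened version of the dvc.yaml above, the function name `stale_stages` is hypothetical, and any rerun stage is treated as invalidating its downstream stages, whereas DVC compares the actual content hashes of outputs.

```python
# Hypothetical, simplified stage graph: stage -> (inputs, outputs).
# Parameter groups are modeled as pseudo-inputs such as "params:train".
STAGES: dict[str, tuple[list[str], list[str]]] = {
    "prepare": (
        ["loader.py", "transactions.csv", "params:features"],
        ["train.parquet", "test.parquet"],
    ),
    "train": (
        ["train.py", "train.parquet", "params:train"],
        ["model.joblib"],
    ),
    "evaluate": (
        ["evaluate.py", "model.joblib", "test.parquet"],
        ["metrics.json"],
    ),
}


def stale_stages(changed: set[str]) -> list[str]:
    """Return the stages to re-run, in order, given a set of changed inputs.

    A stage is stale if any of its inputs changed or was produced by a stale
    upstream stage. Assumes STAGES is listed in topological order.
    """
    dirty = set(changed)
    to_run: list[str] = []
    for name, (inputs, outputs) in STAGES.items():
        if dirty.intersection(inputs):
            to_run.append(name)
            dirty.update(outputs)  # outputs of a rerun stage count as changed
    return to_run
```

For example, `stale_stages({"params:train"})` yields train and evaluate but not prepare, matching the behavior described above.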

After a successful run, DVC updates dvc.lock with the hashes of all inputs and outputs. Commit this file to git:

git add dvc.lock metrics.json
git commit -m "Train model: RMSE=0.043, n_estimators=500"

# Push updated data/model artifacts to remote
uv run dvc push
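The metrics.json file committed above is produced by the evaluate stage, whose script is not shown in this section. As a hedged sketch of what its core might look like, the snippet below computes RMSE and serializes it in a stable format. The function names and the choice of RMSE as the only metric are assumptions.

```python
import json
import math
from pathlib import Path


def rmse(y_true: list[float], y_pred: list[float]) -> float:
    """Root mean squared error over two equal-length sequences."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_error = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(squared_error / len(y_true))


def format_metrics(metrics: dict[str, float]) -> str:
    """Serialize with sorted keys so diffs between commits stay minimal."""
    return json.dumps(metrics, indent=2, sort_keys=True) + "\n"


def write_metrics(metrics: dict[str, float], path: Path = Path("metrics.json")) -> None:
    """Write the metrics file that the evaluate stage declares in dvc.yaml."""
    path.write_text(format_metrics(metrics))
```

Sorting keys and fixing the indentation keeps the file byte-stable for identical results, which matters because git tracks it directly.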

Reproducing Any Historical Model

This is where the investment pays off. Suppose you need to reproduce the model from three months ago — the one that performed better on a specific customer segment. With DVC + git:

# Check out the code and DVC pointers from that date
git checkout abc123f

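# Restore the environment and the exact data/model files recorded at that commit
uv sync
uv run dvc checkout   # restores files from the local DVC cache
# If the cache no longer holds them, fetch from the remote instead:
# uv run dvc pull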

# Reproduce the pipeline (should be a no-op if nothing changed)
uv run dvc repro

You now have the identical model, trained on the identical data, with the identical hyperparameters. No guessing which CSV was on whose laptop. No “I think we used the March version of the dataset.” The answer is in dvc.lock.
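To show how that answer can be read programmatically, here is a minimal sketch that summarizes one stage's lineage from an already-parsed dvc.lock. The dictionary below is an assumed, abbreviated shape of the lock file (as it might look after yaml.safe_load), the hash values are placeholders, and the function name `lineage` is hypothetical.

```python
from typing import Any

# Assumed shape of a parsed dvc.lock; hash values are placeholders, not real hashes.
LOCK: dict[str, Any] = {
    "schema": "2.0",
    "stages": {
        "train": {
            "cmd": "uv run python src/forecasting/models/train.py",
            "deps": [
                {"path": "data/processed/train.parquet", "hash": "md5", "md5": "<train-data-hash>"},
                {"path": "src/forecasting/models/train.py", "hash": "md5", "md5": "<train-code-hash>"},
            ],
            "params": {
                "params.yaml": {
                    "train.learning_rate": 0.05,
                    "train.max_depth": 8,
                    "train.n_estimators": 500,
                }
            },
            "outs": [
                {"path": "models/model.joblib", "hash": "md5", "md5": "<model-hash>"},
            ],
        }
    },
}


def lineage(lock: dict[str, Any], stage: str) -> dict[str, dict[str, Any]]:
    """Collect the input hashes, parameters, and output hashes for one stage."""
    entry = lock["stages"][stage]
    return {
        "inputs": {d["path"]: d["md5"] for d in entry.get("deps", [])},
        "params": dict(entry.get("params", {}).get("params.yaml", {})),
        "outputs": {o["path"]: o["md5"] for o in entry.get("outs", [])},
    }


train_lineage = lineage(LOCK, "train")
```

Running this against the lock file at two different commits and comparing the results is one way to see exactly which input or parameter changed between two models.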

The Complete Workflow in Practice

Here’s what a typical development day looks like with this setup:

# Morning: pull latest code and data
git pull
uv sync
uv run dvc pull

# Develop: modify feature engineering
# Edit src/forecasting/data/transforms.py

# Test your changes
uv run pytest tests/test_transforms.py -v

# Run the pipeline (only changed stages re-execute)
uv run dvc repro

# Check metrics
cat metrics.json

# Commit everything
git add -A
git commit -m "Add 30-day rolling volatility feature, RMSE improved 0.043→0.038"
uv run dvc push
git push

Every commit in your git history now tells a complete story: what code changed, what data was used, what parameters were set, and what metrics resulted. You’ve built an audit trail that would take weeks to reconstruct manually — and it cost you two extra commands (dvc repro and dvc push) added to your existing workflow.

Bringing It All Together

The four pillars from Chapter 1 are now in place:

  1. uv locks your dependencies so environments are identical across machines and time.
  2. Pydantic validates data at ingestion so corrupted rows never reach your models.
  3. The src/ layout separates concerns so code is testable, reviewable, and importable.
  4. DVC versions data and models alongside code so any result can be reproduced exactly.

None of these tools are complex. Each adds minutes to your setup and seconds to your daily workflow. What they eliminate — the 2 AM debugging sessions, the irreproducible results, the “it works on my machine” conversations — saves weeks per quarter.

In Chapter 2, we turn to the data itself: loading, cleaning, and transforming real-world datasets with Polars — and understanding why it’s replacing Pandas in performance-critical pipelines.