PyTorch Fundamentals and Tabular Deep Learning

6.1 — PyTorch Fundamentals

PyTorch is not the only deep learning framework, but it is the one that won. TensorFlow still has production deployments, JAX has a following in research, but PyTorch dominates new projects in both academia and industry. The reason is not technical superiority in any single dimension — it is the imperative programming model. You write Python. The code executes line by line. You can set breakpoints, print intermediate tensors, and debug with standard tools. There is no compilation step, no session object, no graph-building ceremony.

What follows is not a comprehensive PyTorch tutorial. It is the 20% of the framework you will use 80% of the time.

Tensors: The Fundamental Data Type

A tensor is a multi-dimensional array with a data type and a device. That is it. If you understand NumPy arrays, you understand 90% of tensors. The remaining 10% is hardware placement and automatic differentiation.

import torch
import numpy as np

# Creation — from Python lists, NumPy arrays, or factory functions
x_list: torch.Tensor = torch.tensor([[1.0, 2.0], [3.0, 4.0]])
x_numpy: torch.Tensor = torch.from_numpy(np.array([1.0, 2.0, 3.0]))  # dtype float64, inherited from NumPy
x_zeros: torch.Tensor = torch.zeros(3, 4, dtype=torch.float32)
x_randn: torch.Tensor = torch.randn(64, 128)  # Standard normal

# dtype matters: float32 is the default for training; float16/bfloat16 serve mixed-precision training and inference
x_half: torch.Tensor = x_randn.to(dtype=torch.float16)
x_bf16: torch.Tensor = x_randn.to(dtype=torch.bfloat16)  # Preferred on Ampere+ GPUs

# Device movement — CPU ↔ GPU
device: torch.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
x_gpu: torch.Tensor = x_randn.to(device)

# Broadcasting works like NumPy
a: torch.Tensor = torch.randn(64, 1)
b: torch.Tensor = torch.randn(1, 128)
c: torch.Tensor = a + b  # Shape: (64, 128)

Two rules you must internalize: (1) all tensors in an operation must be on the same device — mixing CPU and GPU tensors raises a RuntimeError, not a silent conversion; (2) torch.from_numpy shares memory with the source array, so modifying one modifies the other.
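The memory-sharing rule is easy to verify. The following is a minimal sketch with illustrative variable names: it contrasts torch.from_numpy, which wraps the existing NumPy buffer, with torch.tensor, which copies the data.

```python
import numpy as np
import torch

source = np.array([1.0, 2.0, 3.0])

shared = torch.from_numpy(source)  # Wraps the NumPy buffer; dtype stays float64
copied = torch.tensor(source)      # Copies the data into new storage

source[0] = 99.0                   # Mutate the NumPy array in place

print(shared[0].item())  # 99.0, the tensor sees the change
print(copied[0].item())  # 1.0, the copy is unaffected

# Casting to another dtype allocates new storage, which also breaks the link
as_float32 = torch.from_numpy(source).to(torch.float32)
print(as_float32.dtype)  # torch.float32
```

If you need an independent tensor, copy explicitly with torch.tensor(...) or .clone() rather than relying on a later cast to do it for you.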

Autograd: Automatic Differentiation

Autograd is the engine that makes neural network training possible. When you set requires_grad=True on a tensor, PyTorch records every operation applied to it in a computational graph. Calling .backward() on a scalar output traverses that graph in reverse and accumulates gradients in the .grad attribute of each leaf tensor that requires gradients (intermediate results do not keep a .grad by default).

# Manual gradient computation — illustrating the mechanics
w: torch.Tensor = torch.tensor([2.0, -1.0], requires_grad=True)
x: torch.Tensor = torch.tensor([3.0, 4.0])

# Forward pass: compute prediction and loss
prediction: torch.Tensor = (w * x).sum()        # 2*3 + (-1)*4 = 2
loss: torch.Tensor = (prediction - 5.0) ** 2     # (2 - 5)^2 = 9

# Backward pass: compute gradients
loss.backward()

# w.grad now contains d(loss)/d(w)
# d(loss)/d(w) = 2 * (prediction - 5) * x = 2 * (-3) * [3, 4] = [-18, -24]
print(w.grad)  # tensor([-18., -24.])

The critical detail: gradients accumulate. Calling .backward() a second time adds to .grad rather than replacing it. This is why every training loop calls optimizer.zero_grad() — without it, gradients from previous batches contaminate the current update.
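Accumulation can be observed directly by reusing the small example above. This sketch wraps the forward pass in a helper (loss_fn is an illustrative name, not a PyTorch API) and calls .backward() twice before clearing the gradient.

```python
import torch

w = torch.tensor([2.0, -1.0], requires_grad=True)
x = torch.tensor([3.0, 4.0])


def loss_fn(weights: torch.Tensor) -> torch.Tensor:
    """Same squared-error loss as in the example above."""
    return ((weights * x).sum() - 5.0) ** 2


loss_fn(w).backward()
first = w.grad.clone()
print(first)   # tensor([-18., -24.])

loss_fn(w).backward()          # No reset in between: gradients add up
second = w.grad.clone()
print(second)  # tensor([-36., -48.])

w.grad = None                  # Clear the stored gradient by hand
loss_fn(w).backward()
third = w.grad.clone()
print(third)   # tensor([-18., -24.])
```

Accumulation is occasionally useful on purpose: summing gradients over several small batches before a single optimizer step simulates a larger batch size.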

Custom Dataset and DataLoader

Every PyTorch training pipeline follows the same pattern: a Dataset that knows how to fetch one sample, and a DataLoader that batches, shuffles, and parallelizes the fetching.

from torch.utils.data import Dataset, DataLoader
import polars as pl


class TabularDataset(Dataset):
    """Dataset for tabular data with numeric and categorical features."""

    def __init__(
        self,
        numeric_features: np.ndarray,  # Shape: (n_samples, n_numeric)
        categorical_features: np.ndarray,  # Shape: (n_samples, n_categorical), integer-encoded
        targets: np.ndarray,  # Shape: (n_samples,)
    ) -> None:
        self.numeric = torch.tensor(numeric_features, dtype=torch.float32)
        self.categorical = torch.tensor(categorical_features, dtype=torch.long)
        self.targets = torch.tensor(targets, dtype=torch.float32)

    def __len__(self) -> int:
        return len(self.targets)

    def __getitem__(self, idx: int) -> tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
        return self.numeric[idx], self.categorical[idx], self.targets[idx]


# Usage: wrap data and create loader
dataset = TabularDataset(
    numeric_features=np.random.randn(10_000, 15).astype(np.float32),
    categorical_features=np.random.randint(0, 100, size=(10_000, 5)),
    targets=np.random.rand(10_000).astype(np.float32),
)

train_loader = DataLoader(
    dataset,
    batch_size=256,
    shuffle=True,
    num_workers=4,       # Parallel data loading — crucial for GPU utilization
    pin_memory=True,     # Speeds up CPU → GPU transfer
    drop_last=True,      # Avoids small final batch causing BatchNorm instability
)

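The __getitem__ method returns a single sample. The DataLoader calls it batch_size times, stacks the results into batches, and optionally shuffles the order each epoch. Using num_workers > 0 spawns subprocesses that prefetch batches while the GPU is busy computing. This matters most when __getitem__ does real work (disk reads, decoding, augmentation), where a single-process loader can leave the GPU idle waiting for data. For small in-memory tensors like the ones above, worker overhead can outweigh the benefit, so measure both settings.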
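To see what batching and drop_last do in practice, here is a small sketch. It uses PyTorch's built-in TensorDataset as a stand-in for TabularDataset and random data, and it keeps num_workers at its default of 0 so it runs anywhere.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# 1,000 samples with 15 features each; the values are random placeholders
features = torch.randn(1000, 15)
targets = torch.rand(1000)
dataset = TensorDataset(features, targets)

keep_last = DataLoader(dataset, batch_size=256, shuffle=True, drop_last=False)
drop_last = DataLoader(dataset, batch_size=256, shuffle=True, drop_last=True)

print(len(keep_last))  # 4 batches: 256 + 256 + 256 + 232
print(len(drop_last))  # 3 batches: the partial batch is discarded

batch_features, batch_targets = next(iter(drop_last))
print(batch_features.shape)  # torch.Size([256, 15])
print(batch_targets.shape)   # torch.Size([256])

# Batch sizes when the partial batch is kept
batch_sizes = [len(t) for _, t in keep_last]
print(batch_sizes)  # [256, 256, 256, 232]
```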

The Training Loop

There is no .fit() method in PyTorch. You write the loop yourself. This is not a deficiency — it is the reason PyTorch won. Full control means you can implement any training strategy without fighting the framework.

import torch.nn as nn
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR


def train_model(
    model: nn.Module,
    train_loader: DataLoader,
    val_loader: DataLoader,
    device: torch.device,
    n_epochs: int = 50,
    lr: float = 1e-3,
    patience: int = 5,
) -> nn.Module:
    """Complete training loop with validation and early stopping."""
    model = model.to(device)
    optimizer = AdamW(model.parameters(), lr=lr, weight_decay=1e-2)
    scheduler = CosineAnnealingLR(optimizer, T_max=n_epochs)
    criterion = nn.MSELoss()

    best_val_loss: float = float("inf")
    epochs_without_improvement: int = 0

    for epoch in range(n_epochs):
        # --- Training phase ---
        model.train()
        train_loss_sum: float = 0.0
        train_samples: int = 0

        for numeric, categorical, targets in train_loader:
            numeric = numeric.to(device, non_blocking=True)
            categorical = categorical.to(device, non_blocking=True)
            targets = targets.to(device, non_blocking=True)

            optimizer.zero_grad()                       # Reset gradients
            predictions = model(numeric, categorical)   # Forward pass
            loss = criterion(predictions.squeeze(-1), targets)  # Compute loss; squeeze only the last dim so a batch of 1 keeps its shape
            loss.backward()                             # Backward pass
            optimizer.step()                            # Update weights

            train_loss_sum += loss.item() * len(targets)
            train_samples += len(targets)

        scheduler.step()

        # --- Validation phase ---
        model.eval()
        val_loss_sum: float = 0.0
        val_samples: int = 0

        with torch.no_grad():  # Disable gradient tracking for validation
            for numeric, categorical, targets in val_loader:
                numeric = numeric.to(device, non_blocking=True)
                categorical = categorical.to(device, non_blocking=True)
                targets = targets.to(device, non_blocking=True)

                predictions = model(numeric, categorical)
                loss = criterion(predictions.squeeze(-1), targets)
                val_loss_sum += loss.item() * len(targets)
                val_samples += len(targets)

        val_loss = val_loss_sum / val_samples
        train_loss = train_loss_sum / train_samples
        print(f"Epoch {epoch+1:3d} | Train: {train_loss:.4f} | Val: {val_loss:.4f}")

        # --- Early stopping ---
        if val_loss < best_val_loss:
            best_val_loss = val_loss
            epochs_without_improvement = 0
            torch.save(model.state_dict(), "best_model.pt")  # Checkpoint
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                print(f"Early stopping at epoch {epoch+1}")
                break

    # Restore best weights
    model.load_state_dict(torch.load("best_model.pt", weights_only=True))
    return model

Every line in that loop exists for a reason. model.train() and model.eval() toggle dropout and batch normalization behavior. torch.no_grad() disables gradient computation during validation, saving memory and compute. non_blocking=True on .to(device) allows asynchronous CPU-to-GPU transfers that overlap with computation, but only when the source tensor sits in pinned memory, which is why the training DataLoader sets pin_memory=True. The checkpoint saves the best model, not the last one, because when early stopping triggers, the last epoch is by construction not the best.
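The effect of the mode switches can be checked in isolation. This is a minimal sketch using a lone nn.Dropout layer and a toy tensor, not the model from this chapter: it shows that dropout is active only in training mode and that torch.no_grad() stops graph recording.

```python
import torch
import torch.nn as nn

layer = nn.Dropout(p=0.5)
x = torch.ones(1000)

layer.train()
train_out = layer(x)  # Each element is either zeroed or scaled by 1 / (1 - p) = 2

layer.eval()
eval_out = layer(x)   # Dropout is a no-op in eval mode

print(sorted(train_out.unique().tolist()))  # Values drawn from {0.0, 2.0}
print(torch.equal(eval_out, x))             # True

# torch.no_grad() prevents the operation from being recorded for backprop
w = torch.ones(3, requires_grad=True)
with torch.no_grad():
    y = (w * 2).sum()
print(y.requires_grad)  # False
```

Note that model.eval() and torch.no_grad() are independent: the first changes layer behavior, the second changes autograd bookkeeping. Validation code needs both.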

Figure: PyTorch Training Loop

6.2 — Tabular Deep Learning

Here is the honest assessment: on standard tabular benchmarks, neural networks lose to gradient-boosted trees. This is not a close call.

The 2022 paper by Grinsztajn et al. (“Why do tree-based models still outperform deep learning on tabular data?”) evaluated random forests, GBDTs, MLPs, ResNets, FT-Transformers, and SAINT across 45 datasets. Tree-based models won decisively in the medium-sized regime (training sets of roughly 10K samples). In the larger regime (up to about 50K training samples), the gap narrowed but did not close. The 2023 TabR paper achieved competitive results with a retrieval-augmented approach, but required careful tuning that GBDT did not need. The FT-Transformer (Gorishniy et al., 2021) and TabNet (Arik & Pfister, 2021) showed promise in specific settings but failed to establish consistent dominance.

The structural reasons were covered in Chapter 5: trees handle heterogeneous features, mixed scales, and irregular missing patterns natively. Neural networks require extensive preprocessing, careful architecture design, and more hyperparameter tuning to achieve comparable results.

When Tabular DL Earns Its Cost

The exceptions are real, and they cluster around three scenarios:

Very large datasets (>1M rows). When you have millions of training examples, neural networks can learn feature interactions that trees express less efficiently. The capacity advantage of wide, deep networks starts to matter when there is enough data to train them without overfitting.

High-cardinality categorical features. Entity embeddings — learned dense vector representations of categories — capture similarity structure that one-hot encoding and target encoding miss. A city embedding can learn that San Francisco and Seattle share tech-hub characteristics without explicit feature engineering. This is the strongest use case for tabular deep learning.

Multi-task learning. When you need to predict multiple related targets from the same features (customer lifetime value, churn probability, and product category preference), a shared-trunk neural network with multiple heads can leverage cross-task information. Gradient-boosted trees are usually trained as a separate model per target, so they cannot share learned representations across tasks in the same way.
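The multi-task scenario can be sketched as a shared trunk feeding one head per target. Everything here is illustrative: the class name MultiTaskNet, the head names, the layer sizes, and the unweighted sum of losses are assumptions, and real systems usually weight the task losses.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiTaskNet(nn.Module):
    """Hypothetical shared-trunk model with three task-specific heads."""

    def __init__(self, n_features: int, hidden_dim: int = 64, n_categories: int = 5) -> None:
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(n_features, hidden_dim), nn.ReLU())
        self.ltv_head = nn.Linear(hidden_dim, 1)                  # Regression
        self.churn_head = nn.Linear(hidden_dim, 1)                # Binary classification
        self.category_head = nn.Linear(hidden_dim, n_categories)  # Multi-class

    def forward(self, x: torch.Tensor) -> dict[str, torch.Tensor]:
        shared = self.trunk(x)  # Representation shared by all tasks
        return {
            "ltv": self.ltv_head(shared).squeeze(-1),
            "churn_logit": self.churn_head(shared).squeeze(-1),
            "category_logits": self.category_head(shared),
        }


model = MultiTaskNet(n_features=20)
x = torch.randn(8, 20)
outputs = model(x)

# Random placeholder targets, one per task
ltv_target = torch.rand(8)
churn_target = torch.randint(0, 2, (8,)).float()
category_target = torch.randint(0, 5, (8,))

# Unweighted sum of per-task losses; every head sends gradients into the trunk
loss = (
    F.mse_loss(outputs["ltv"], ltv_target)
    + F.binary_cross_entropy_with_logits(outputs["churn_logit"], churn_target)
    + F.cross_entropy(outputs["category_logits"], category_target)
)
loss.backward()
```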

Entity Embeddings

The key idea: replace each categorical feature with a learnable dense vector. Instead of representing a city as a one-hot vector of dimension 10,000 (one element per city), you learn a 50-dimensional embedding where similar cities end up near each other in the embedding space.
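One way to see the relationship between the two representations: an embedding lookup gives the same result as multiplying a one-hot vector by the embedding weight matrix, without ever materializing the one-hot vector. The sketch below uses the 10,000-city, 50-dimension figures from the paragraph above with untrained (random) weights.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

n_cities, embed_dim = 10_000, 50
city_embedding = nn.Embedding(num_embeddings=n_cities, embedding_dim=embed_dim)

city_ids = torch.tensor([0, 42, 9_999])  # Integer-encoded cities

# Direct lookup: one 50-dimensional row per id
dense = city_embedding(city_ids)
print(dense.shape)  # torch.Size([3, 50])

# Equivalent, but wasteful: build 10,000-wide one-hot rows and multiply
one_hot = F.one_hot(city_ids, num_classes=n_cities).float()
via_matmul = one_hot @ city_embedding.weight
print(one_hot.shape)                       # torch.Size([3, 10000])
print(torch.allclose(dense, via_matmul))   # True
```

The weight matrix is an ordinary trainable parameter, so the rows move during training in whatever direction reduces the loss. That is the mechanism by which similar categories can end up with similar vectors.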

class EmbeddingTabularModel(nn.Module):
    """Neural network with entity embeddings for categorical features."""

    def __init__(
        self,
        n_numeric: int,
        category_cardinalities: list[int],  # Number of unique values per categorical feature
        embedding_dims: list[int],          # Embedding dimension per categorical feature
        hidden_dims: tuple[int, ...] = (256, 128, 64),  # Tuple avoids a shared mutable default
        dropout: float = 0.3,
    ) -> None:
        super().__init__()

        # Create embedding layers — one per categorical feature
        self.embeddings = nn.ModuleList([
            nn.Embedding(num_embeddings=card, embedding_dim=dim)
            for card, dim in zip(category_cardinalities, embedding_dims)
        ])

        # Total input dimension = numeric features + sum of all embedding dimensions
        total_embed_dim: int = sum(embedding_dims)
        input_dim: int = n_numeric + total_embed_dim

        # Feedforward trunk
        layers: list[nn.Module] = []
        prev_dim: int = input_dim
        for hidden_dim in hidden_dims:
            layers.extend([
                nn.Linear(prev_dim, hidden_dim),
                nn.BatchNorm1d(hidden_dim),
                nn.ReLU(),
                nn.Dropout(dropout),
            ])
            prev_dim = hidden_dim
        layers.append(nn.Linear(prev_dim, 1))

        self.trunk = nn.Sequential(*layers)

    def forward(
        self, numeric: torch.Tensor, categorical: torch.Tensor
    ) -> torch.Tensor:
        # Embed each categorical feature and concatenate
        embedded: list[torch.Tensor] = [
            emb(categorical[:, i]) for i, emb in enumerate(self.embeddings)
        ]
        x = torch.cat([numeric] + embedded, dim=1)
        return self.trunk(x)


def embedding_dim_rule(cardinality: int) -> int:
    """Heuristic for embedding dimension: cardinality // 2, clamped to the range [2, 50]."""
    return min(50, max(2, cardinality // 2))


# Example: 15 numeric features, 5 categorical features
category_cards: list[int] = [1000, 50, 200, 30, 500]
embed_dims: list[int] = [embedding_dim_rule(c) for c in category_cards]

model = EmbeddingTabularModel(
    n_numeric=15,
    category_cardinalities=category_cards,
    embedding_dims=embed_dims,
)
print(f"Model parameters: {sum(p.numel() for p in model.parameters()):,}")

The embedding_dim_rule heuristic, min(50, cardinality // 2) with a floor of 2, is a widely used rule of thumb for sizing the entity embeddings introduced by Guo & Berkhahn (2016). A close variant was popularized by fast.ai course material rather than stated in the paper itself, and it works well in practice. A 1,000-category feature gets a 50-dimensional embedding. A 30-category feature gets a 15-dimensional embedding. Oversizing the embedding wastes parameters; undersizing loses representational capacity.
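Working the heuristic through for the five example cardinalities makes the resulting model size concrete. This sketch repeats embedding_dim_rule so it runs on its own and needs no PyTorch.

```python
def embedding_dim_rule(cardinality: int) -> int:
    """cardinality // 2, clamped to the range [2, 50]."""
    return min(50, max(2, cardinality // 2))


category_cards = [1000, 50, 200, 30, 500]
embed_dims = [embedding_dim_rule(c) for c in category_cards]
print(embed_dims)  # [50, 25, 50, 15, 50]

# Width of the vector fed into the first Linear layer
n_numeric = 15
input_dim = n_numeric + sum(embed_dims)
print(input_dim)  # 205

# Parameters held in the embedding tables alone (cardinality x dimension each)
embedding_params = sum(c * d for c, d in zip(category_cards, embed_dims))
print(embedding_params)  # 86700

# The floor of 2 only matters for very small cardinalities
print(embedding_dim_rule(3))  # 2
```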

Comparison: Entity Embeddings vs. XGBoost with Target Encoding

To make this comparison concrete, here is a side-by-side evaluation on the same data:

import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import root_mean_squared_error
from category_encoders import TargetEncoder


def compare_models(
    df: pl.DataFrame,
    numeric_cols: list[str],
    categorical_cols: list[str],
    target_col: str,
) -> dict[str, float]:
    """Compare entity embedding neural net vs. XGBoost with target encoding."""
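    # Split data. Caveat: both models below use test_df for early stopping, so the
    # reported RMSEs are optimistic; carve out a separate validation split for real use.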
    train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)

    # --- XGBoost with target encoding ---
    encoder = TargetEncoder(cols=categorical_cols)
    X_train_xgb = encoder.fit_transform(
        train_df.select(numeric_cols + categorical_cols).to_pandas(),
        train_df[target_col].to_pandas(),
    )
    X_test_xgb = encoder.transform(
        test_df.select(numeric_cols + categorical_cols).to_pandas()
    )

    xgb_model = xgb.XGBRegressor(
        n_estimators=500, learning_rate=0.05, max_depth=6,
        early_stopping_rounds=20, random_state=42,
    )
    xgb_model.fit(
        X_train_xgb, train_df[target_col].to_numpy(),
        eval_set=[(X_test_xgb, test_df[target_col].to_numpy())],
        verbose=False,
    )
    xgb_rmse = root_mean_squared_error(
        test_df[target_col].to_numpy(),
        xgb_model.predict(X_test_xgb),
    )

    # --- Entity Embedding Neural Net ---
    # (Uses TabularDataset and train_model from above)
    # Integer-encode categoricals for embeddings
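    cat_encoders: dict[str, dict] = {}
    unknown = object()  # Sentinel key that reserves index 0 for categories unseen in training
    for col in categorical_cols:
        unique_vals = train_df[col].unique().to_list()
        # Real categories start at 1; the sentinel entry makes len() include the unknown slot
        cat_encoders[col] = {unknown: 0, **{v: i + 1 for i, v in enumerate(unique_vals)}}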

    def encode_cats(data: pl.DataFrame) -> np.ndarray:
        encoded = np.zeros((len(data), len(categorical_cols)), dtype=np.int64)
        for j, col in enumerate(categorical_cols):
            mapping = cat_encoders[col]
            encoded[:, j] = [mapping.get(v, 0) for v in data[col].to_list()]
        return encoded

    train_dataset = TabularDataset(
        numeric_features=train_df.select(numeric_cols).to_numpy().astype(np.float32),
        categorical_features=encode_cats(train_df),
        targets=train_df[target_col].to_numpy().astype(np.float32),
    )
    test_dataset = TabularDataset(
        numeric_features=test_df.select(numeric_cols).to_numpy().astype(np.float32),
        categorical_features=encode_cats(test_df),
        targets=test_df[target_col].to_numpy().astype(np.float32),
    )

    cards = [len(cat_encoders[col]) for col in categorical_cols]
    dims = [embedding_dim_rule(c) for c in cards]
    nn_model = EmbeddingTabularModel(
        n_numeric=len(numeric_cols),
        category_cardinalities=cards,
        embedding_dims=dims,
    )

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    nn_model = train_model(
        nn_model,
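        # drop_last avoids a final batch of size 1, which BatchNorm1d rejects in training mode
        DataLoader(train_dataset, batch_size=256, shuffle=True, drop_last=True),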
        DataLoader(test_dataset, batch_size=512),
        device=device,
    )

    # Evaluate neural net
    nn_model.eval()
    with torch.no_grad():
        preds = nn_model(
            test_dataset.numeric.to(device),
            test_dataset.categorical.to(device),
        ).squeeze(-1).cpu().numpy()
    nn_rmse = root_mean_squared_error(test_df[target_col].to_numpy(), preds)

    return {"xgboost_rmse": xgb_rmse, "embedding_nn_rmse": nn_rmse}

On most datasets under 100K rows, XGBoost wins this comparison. The RMSE difference is typically 2–8% in favor of trees. On datasets with millions of rows and categorical features with thousands of unique values (user IDs, product SKUs, postal codes), the entity embedding model closes the gap and sometimes wins by 1–3%. The embedding model also produces a side benefit that XGBoost cannot: the learned embeddings themselves can be extracted and used as features in other models or as input to recommendation systems.
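Extracting the learned vectors takes only a few lines. The sketch below uses a freshly initialized nn.Embedding as a stand-in for a trained layer (in the model above, this would be one entry of model.embeddings), so the neighbors it finds are meaningless; only the mechanics carry over. The name sku_embedding and the query id are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Stand-in for a trained embedding layer covering 1,000 product SKUs
sku_embedding = nn.Embedding(num_embeddings=1000, embedding_dim=50)

# Detach from the autograd graph and move to CPU before exporting
vectors = sku_embedding.weight.detach().cpu()  # Shape: (1000, 50)
as_features = vectors.numpy()                  # Ready to join onto another model's inputs

# Nearest neighbors by cosine similarity, as a recommender might use them
query_id = 7
similarity = F.cosine_similarity(vectors[query_id].unsqueeze(0), vectors, dim=1)
nearest = similarity.topk(k=6).indices.tolist()  # The query itself ranks first
print(nearest[0])    # 7
print(nearest[1:])   # Five most similar SKUs (arbitrary here, since weights are random)
```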

The decision rule is straightforward. If your dataset has fewer than 500K rows and your categoricals have fewer than 100 unique values each, use XGBoost with target encoding. If you have millions of rows and categoricals with thousands of unique values, entity embeddings are worth the additional complexity. In between, benchmark both on your data — the answer depends on your specific feature distributions.
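For teams that want the rule written down, it can be expressed as a small helper. This is a sketch, not a validated policy: the function name and return labels are invented for illustration, and the cutoffs of 1,000,000 rows and 1,000 unique values are one reading of "millions" and "thousands" in the text.

```python
def recommend_tabular_model(n_rows: int, max_cardinality: int) -> str:
    """Encode the rule of thumb above.

    n_rows: number of training rows.
    max_cardinality: largest number of unique values among the categorical features.
    """
    # Small data and low-cardinality categoricals: trees with target encoding
    if n_rows < 500_000 and max_cardinality < 100:
        return "xgboost_target_encoding"
    # Large data and high-cardinality categoricals: embeddings earn their cost
    # (assumed cutoffs: at least 1M rows and at least 1,000 unique values)
    if n_rows >= 1_000_000 and max_cardinality >= 1_000:
        return "entity_embeddings"
    # Everything in between: no shortcut, compare both on your own data
    return "benchmark_both"


print(recommend_tabular_model(80_000, 40))        # xgboost_target_encoding
print(recommend_tabular_model(5_000_000, 20_000))  # entity_embeddings
print(recommend_tabular_model(700_000, 300))       # benchmark_both
```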