Monotonic Constraints and Imbalanced Data

5.3 — Monotonic Constraints

Your credit scoring model says that a customer earning $200,000 per year is a worse credit risk than a customer earning $50,000, all else being equal. You know this is wrong. Everyone who has ever underwritten a loan knows this is wrong. But the model found a pocket of noisy data where a few high-income applicants defaulted, and it learned a spurious non-monotonic relationship.

In production, this prediction will be challenged by regulators, rejected by underwriters, and will erode trust in your entire ML pipeline. The model is technically fitting the training data, but it is fitting noise that violates domain knowledge.

Monotonic constraints solve this by restricting the model’s hypothesis space. You tell the model: “For this feature, the predicted output must be non-decreasing (or non-increasing) as the feature value increases, with all other features held fixed.” The model is still free to learn the magnitude of the effect and how that magnitude varies with other features; only the direction is constrained.

When to Use Monotonic Constraints

Use them when the directional relationship between a feature and the target is known with certainty from domain expertise:

Credit scoring: Higher income → lower default risk. Longer employment history → lower default risk.
Pricing models: Higher mileage → lower vehicle value. Newer construction → higher property value.
Insurance risk: More prior claims → higher risk premium.
Medical risk: Higher BMI → higher diabetes risk (within the modeling range).

Do not use them when the relationship is genuinely non-monotonic. Age and default risk, for instance, may have a U-shaped relationship — very young and very old borrowers may both carry higher risk. Forcing monotonicity when the true relationship is non-monotonic will hurt performance.
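Before committing to a constraint, it can help to look at the raw direction in the data. The sketch below is one simple way to do that, not a formal test: it bins a feature into equal-frequency groups and checks whether the mean target moves in a single direction. The function names, the tolerance, and the U-shaped age example are all illustrative assumptions.

```python
import numpy as np


def binned_target_rate(feature, target, n_bins=10):
    """Mean of `target` within equal-frequency bins of `feature`."""
    feature = np.asarray(feature, dtype=float)
    target = np.asarray(target, dtype=float)
    order = np.argsort(feature, kind="stable")
    # Split the rows, sorted by feature value, into n_bins groups of
    # (nearly) equal size and average the target inside each group.
    groups = np.array_split(target[order], n_bins)
    return np.array([g.mean() for g in groups])


def monotonic_direction(rates, tol=0.0):
    """Return +1, -1, or 0 (no consistent direction) for a sequence of rates."""
    diffs = np.diff(rates)
    if np.all(diffs >= -tol):
        return 1
    if np.all(diffs <= tol):
        return -1
    return 0


# Hypothetical U-shaped example: default risk is highest for the youngest
# and the oldest borrowers and lowest in the middle of the age range.
rng = np.random.default_rng(0)
age = rng.uniform(18, 80, size=20_000)
p_default = 0.05 + 0.25 * ((age - 49) / 31) ** 2
default = rng.random(20_000) < p_default

rates = binned_target_rate(age, default, n_bins=5)
direction = monotonic_direction(rates, tol=0.01)
print(rates.round(3), direction)
```

A result of 0 is a hint that a monotonic constraint on that feature would fight the data. A result of +1 or -1 is only weak supporting evidence; the decision should still rest on domain knowledge.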

Implementation: XGBoost with Monotonic Constraints

import numpy as np
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score


def monotonic_constraint_demo(
    n_samples: int = 10_000,
    random_state: int = 42,
) -> dict[str, float]:
    """Compare XGBoost with and without monotonic constraints.

    Generates a credit scoring dataset where income has a true positive
    relationship with creditworthiness, but noise can cause the model
    to learn spurious non-monotonic patterns.
    """
    rng = np.random.default_rng(random_state)

    # Features: income, employment_years, debt_ratio, num_accounts
    income = rng.lognormal(mean=10.5, sigma=0.8, size=n_samples)
    employment_years = rng.exponential(scale=5, size=n_samples)
    debt_ratio = rng.beta(2, 5, size=n_samples)
    num_accounts = rng.poisson(lam=4, size=n_samples)

    X = np.column_stack([income, employment_years, debt_ratio, num_accounts])
    feature_names = ["income", "employment_years", "debt_ratio", "num_accounts"]

    # True relationship: higher income → better credit (lower default)
    # Independent Gaussian noise on the score can still fool an unconstrained model
    true_score = (
        0.3 * np.log(income / income.mean())       # Income: positive
        + 0.2 * np.log1p(employment_years)          # Employment: positive
        - 0.5 * debt_ratio                          # Debt ratio: negative
        - 0.1 * np.log1p(num_accounts)              # Accounts: negative
    )
    noise = rng.normal(0, 0.5, size=n_samples)
    y = (true_score + noise > 0).astype(int)  # 1 = good credit

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=random_state,
    )

    # --- Unconstrained model ---
    model_free = xgb.XGBClassifier(
        n_estimators=300,
        max_depth=6,
        learning_rate=0.1,
        random_state=random_state,
        eval_metric="logloss",
    )
    model_free.fit(X_train, y_train)
    auc_free = roc_auc_score(y_test, model_free.predict_proba(X_test)[:, 1])

    # --- Constrained model ---
    # Constraints: (1, 1, -1, -1) means:
    #   income:           must be non-decreasing (1)
    #   employment_years: must be non-decreasing (1)
    #   debt_ratio:       must be non-increasing (-1)
    #   num_accounts:     must be non-increasing (-1)
    model_constrained = xgb.XGBClassifier(
        n_estimators=300,
        max_depth=6,
        learning_rate=0.1,
        monotone_constraints=(1, 1, -1, -1),
        random_state=random_state,
        eval_metric="logloss",
    )
    model_constrained.fit(X_train, y_train)
    auc_constrained = roc_auc_score(
        y_test, model_constrained.predict_proba(X_test)[:, 1],
    )

    print(f"AUC without constraints: {auc_free:.4f}")
    print(f"AUC with constraints:    {auc_constrained:.4f}")
    print(f"AUC difference:          {auc_free - auc_constrained:+.4f}")

    # Verify monotonicity: predict across income range, holding others constant
    income_range = np.linspace(income.min(), income.max(), 50)
    X_check = np.column_stack([
        income_range,
        np.full(50, np.median(employment_years)),
        np.full(50, np.median(debt_ratio)),
        np.full(50, np.median(num_accounts)),
    ])

    preds_free = model_free.predict_proba(X_check)[:, 1]
    preds_constrained = model_constrained.predict_proba(X_check)[:, 1]

    # Check if predictions are monotonically non-decreasing with income
    mono_free = all(
        preds_free[i] <= preds_free[i + 1] for i in range(len(preds_free) - 1)
    )
    mono_constrained = all(
        preds_constrained[i] <= preds_constrained[i + 1]
        for i in range(len(preds_constrained) - 1)
    )

    print(f"\nMonotonicity check (income → credit score):")
    print(f"  Unconstrained model monotonic: {mono_free}")
    print(f"  Constrained model monotonic:   {mono_constrained}")

    return {
        "auc_free": auc_free,
        "auc_constrained": auc_constrained,
    }


result = monotonic_constraint_demo()
# Typical result: AUC difference < 0.005, but the constrained model
# is guaranteed to produce domain-consistent predictions.

The typical result: the constrained model’s AUC is within 0.005 of the unconstrained model — often identical, occasionally better. The constraint acts as a regularizer, preventing the model from fitting noise in directions that violate domain knowledge. You lose almost nothing in predictive accuracy and gain predictions that are explainable, defensible, and trustworthy to stakeholders.

Performance Tradeoff

Monotonic constraints restrict the model’s hypothesis space, which is a form of regularization. In low-noise settings, the unconstrained model may eke out a fractional advantage. In noisy settings — which is to say, most real-world data — the constraint improves generalization because it prevents the model from fitting spurious patterns.

The practical advice: if you have domain knowledge that a feature’s effect should be monotonic, apply the constraint. The downside risk (tiny AUC loss) is almost always smaller than the upside (trustworthy, regulation-compliant predictions). If a stakeholder asks “why does the model say higher income is worse?”, the answer cannot be “because the data says so” — it must be “it does not, because we encoded that constraint.”
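One practical hazard when applying constraints is that a positional tuple such as (1, 1, -1, -1) silently goes wrong if the column order changes. A minimal sketch of a guard follows; the helper name build_monotone_constraints is hypothetical, and it assumes the string form "(1,0,-1)" that XGBoost accepts for monotone_constraints.

```python
def build_monotone_constraints(feature_names, directions):
    """Return a constraint string such as "(1,0,-1)" aligned with feature_names.

    `directions` maps a feature name to +1 (non-decreasing) or -1
    (non-increasing). Features that are not listed stay unconstrained (0).
    """
    unknown = set(directions) - set(feature_names)
    if unknown:
        raise KeyError(f"constraints given for unknown features: {sorted(unknown)}")
    for name, direction in directions.items():
        if direction not in (-1, 0, 1):
            raise ValueError(f"{name}: direction must be -1, 0, or 1")
    # Emit one value per column, in the same order as the training matrix.
    values = [directions.get(name, 0) for name in feature_names]
    return "(" + ",".join(str(v) for v in values) + ")"


feature_names = ["income", "employment_years", "debt_ratio", "num_accounts"]
constraints = build_monotone_constraints(
    feature_names, {"income": 1, "debt_ratio": -1},
)
print(constraints)
# The result would then be passed as
# xgb.XGBClassifier(monotone_constraints=constraints, ...).
```

Keeping the directions keyed by feature name also gives reviewers a single place to see which domain assumptions were encoded.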

5.4 — Imbalanced Data

You are building a fraud detection model. Your dataset has 1,000,000 transactions: 999,000 legitimate and 1,000 fraudulent. You train a classifier and it achieves 99.9% accuracy. Your manager is thrilled.

The model has learned to predict “legitimate” for every single transaction. It has not detected a single fraud case. The 99.9% accuracy is the base rate — the accuracy you get by predicting the majority class for every observation.

This is the fundamental failure of accuracy on imbalanced data: the metric is dominated by the majority class. A model that learns nothing about the minority class can still achieve near-perfect accuracy.

Why Accuracy Is Meaningless

Consider the confusion matrix for the “predict all legitimate” model:

	Predicted Legit	Predicted Fraud
Actual Legit	999,000 (TN)	0 (FP)
Actual Fraud	1,000 (FN)	0 (TP)

Accuracy: 999,000 / 1,000,000 = 99.9%. Precision for fraud: 0/0 = undefined. Recall for fraud: 0/1,000 = 0%. The model is useless for its intended purpose.
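The arithmetic above can be reproduced directly from the four confusion-matrix counts. This is a small illustrative helper (the name confusion_metrics is an assumption); it returns None where a metric is undefined instead of guessing a value.

```python
def confusion_metrics(tn, fp, fn, tp):
    """Accuracy, precision, and recall for the positive class from raw counts."""
    total = tn + fp + fn + tp
    accuracy = (tn + tp) / total
    # Precision is undefined when the model never predicts the positive class.
    precision = tp / (tp + fp) if (tp + fp) > 0 else None
    # Recall is undefined when there are no actual positives.
    recall = tp / (tp + fn) if (tp + fn) > 0 else None
    return {"accuracy": accuracy, "precision": precision, "recall": recall}


# The "predict all legitimate" model from the table above.
m = confusion_metrics(tn=999_000, fp=0, fn=1_000, tp=0)
print(m)
```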

The metrics that matter for imbalanced classification:

Precision (of the positive class): Of all observations you predicted as fraud, what fraction actually is fraud?
Recall (of the positive class): Of all actual fraud cases, what fraction did you catch?
PR-AUC (Precision-Recall Area Under Curve): Summarizes the precision-recall tradeoff across all thresholds.
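To make PR-AUC less abstract, here is a minimal sketch of average precision, the step-wise estimate of PR-AUC that average_precision_score reports. It assumes at least one positive label and no tied scores; the function name is illustrative.

```python
import numpy as np


def average_precision(y_true, scores):
    """Mean of the precision values measured at the rank of each positive."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    # Rank observations from highest to lowest score.
    order = np.argsort(-scores, kind="stable")
    hits = y_true[order]
    # Precision after each of the top-k predictions, for k = 1..n.
    precision_at_k = np.cumsum(hits) / np.arange(1, len(hits) + 1)
    # Recall rises by 1/n_pos at every positive, so weighting precision by
    # the recall increments reduces to a plain mean over the positive ranks.
    return float(precision_at_k[hits == 1].mean())


ap = average_precision([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1])
print(round(ap, 4))
```

In this toy case the two positives sit at ranks 1 and 3, so the precisions are 1/1 and 2/3 and their mean is about 0.8333.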

ROC-AUC, the default metric for many practitioners, is misleading on imbalanced data. It measures the tradeoff between true positive rate and false positive rate, and each rate is normalized by the size of its own class. When the negative class is 1,000× larger than the positive class, a small false positive rate still translates into a large absolute number of false positives: a 1% false positive rate on 999,000 legitimate transactions is 9,990 false alarms, against at most 1,000 genuine fraud cases. A model with 0.99 ROC-AUC can therefore still flag roughly ten legitimate transactions for every fraud it catches. PR-AUC surfaces this problem; ROC-AUC hides it.
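The gap between a good-looking ROC operating point and the precision an analyst actually sees can be computed in a few lines. The operating point below (90% true positive rate, 1% false positive rate) is a hypothetical example applied to the class counts from the fraud scenario above.

```python
def precision_from_rates(tpr, fpr, n_pos, n_neg):
    """Convert an ROC operating point into precision for given class sizes."""
    tp = tpr * n_pos   # expected true positives
    fp = fpr * n_neg   # expected false positives
    return tp / (tp + fp) if (tp + fp) > 0 else float("nan")


p = precision_from_rates(tpr=0.90, fpr=0.01, n_pos=1_000, n_neg=999_000)
print(f"{p:.3f}")  # 0.083
```

The same operating point on a balanced dataset (1,000 positives, 1,000 negatives) would give a precision near 0.99, which is why the ROC curve alone does not reveal the problem.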

SMOTE: Why It Usually Fails

SMOTE (Synthetic Minority Oversampling Technique) generates synthetic minority class examples by interpolating between existing minority samples in feature space. The intuition is appealing: if you do not have enough fraud examples, create more by blending existing ones.

The problem is geometric. SMOTE picks a minority sample, picks one of its nearest minority-class neighbors, and places a synthetic point on the straight line between them. If the decision boundary between classes is non-linear (and it almost always is), some of those synthetic points fall on the wrong side of the boundary, in the majority class’s territory. These points add label noise to the training data and blur where the true boundary lies.
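The geometric failure can be shown with a deliberately extreme toy case. The sketch below implements only the interpolation step of SMOTE, not the neighbor search, and the ring-shaped layout is a hypothetical assumption chosen to make the effect obvious.

```python
import numpy as np


def smote_point(x, neighbor, lam):
    """One SMOTE-style synthetic sample on the segment from x to neighbor."""
    x = np.asarray(x, dtype=float)
    neighbor = np.asarray(neighbor, dtype=float)
    return x + lam * (neighbor - x)


# Hypothetical layout: fraud cases lie on a ring of radius 1, while
# legitimate transactions fill the disc inside that ring.
a = np.array([1.0, 0.0])
b = np.array([-1.0, 0.0])  # a minority neighbor on the far side of the ring

synthetic = smote_point(a, b, lam=0.5)
radius = float(np.linalg.norm(synthetic))
print(synthetic, radius)  # the point lands at the origin, radius 0.0
```

With sparse minority data, the nearest minority neighbor can easily be this far away, so the synthetic "fraud" point ends up in the middle of the legitimate region.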

SMOTE also fails in high-dimensional spaces, where the nearest-neighbor assumptions break down, and it creates artificial density in regions of feature space that may not correspond to real-world minority class distribution.

When does SMOTE work? On low-dimensional datasets with well-separated classes and moderate imbalance (maybe 90/10). For extreme imbalance (99.9/0.1) with complex feature interactions — which is the regime where you actually need help — SMOTE makes things worse more often than it helps.

What Actually Works: An Empirical Comparison

Let’s compare SMOTE, class weights, and threshold tuning on a synthetic fraud detection problem:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import (
    precision_recall_curve,
    average_precision_score,
    f1_score,
)
from imblearn.over_sampling import SMOTE
import xgboost as xgb


def compare_imbalance_strategies(
    n_samples: int = 100_000,
    fraud_rate: float = 0.005,
    random_state: int = 42,
) -> dict[str, dict[str, float]]:
    """Compare SMOTE, class weights, and threshold tuning for imbalanced data.

    Generates a synthetic fraud dataset with extreme class imbalance and
    evaluates each strategy on precision, recall, F1, and PR-AUC.
    """
    X, y = make_classification(
        n_samples=n_samples,
        n_features=30,
        n_informative=12,
        n_redundant=5,
        weights=[1 - fraud_rate, fraud_rate],
        flip_y=0.01,
        random_state=random_state,
    )

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=random_state, stratify=y,
    )

    results: dict[str, dict[str, float]] = {}

    # --- Strategy 1: Baseline (no correction) ---
    model_base = xgb.XGBClassifier(
        n_estimators=300, max_depth=6, learning_rate=0.1,
        random_state=random_state, eval_metric="logloss",
    )
    model_base.fit(X_train, y_train)
    y_pred_base = model_base.predict(X_test)
    y_prob_base = model_base.predict_proba(X_test)[:, 1]
    results["baseline"] = {
        "f1": f1_score(y_test, y_pred_base),
        "pr_auc": average_precision_score(y_test, y_prob_base),
    }

    # --- Strategy 2: SMOTE ---
    smote = SMOTE(random_state=random_state)
    X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)
    model_smote = xgb.XGBClassifier(
        n_estimators=300, max_depth=6, learning_rate=0.1,
        random_state=random_state, eval_metric="logloss",
    )
    model_smote.fit(X_train_smote, y_train_smote)
    y_pred_smote = model_smote.predict(X_test)
    y_prob_smote = model_smote.predict_proba(X_test)[:, 1]
    results["smote"] = {
        "f1": f1_score(y_test, y_pred_smote),
        "pr_auc": average_precision_score(y_test, y_prob_smote),
    }

    # --- Strategy 3: Class weights ---
    neg_count = int((y_train == 0).sum())
    pos_count = int((y_train == 1).sum())
    model_weighted = xgb.XGBClassifier(
        n_estimators=300, max_depth=6, learning_rate=0.1,
        scale_pos_weight=neg_count / pos_count,
        random_state=random_state, eval_metric="logloss",
    )
    model_weighted.fit(X_train, y_train)
    y_pred_weighted = model_weighted.predict(X_test)
    y_prob_weighted = model_weighted.predict_proba(X_test)[:, 1]
    results["class_weights"] = {
        "f1": f1_score(y_test, y_pred_weighted),
        "pr_auc": average_precision_score(y_test, y_prob_weighted),
    }

    # --- Strategy 4: Threshold tuning ---
    # Use the baseline model's probabilities but optimize the decision threshold.
    # NOTE: for brevity the threshold is selected on the test set, so the F1
    # reported for this strategy is optimistic. In practice, select it on a
    # validation split or on out-of-fold predictions.
    precisions, recalls, thresholds = precision_recall_curve(
        y_test, y_prob_base,
    )
    # precision_recall_curve returns one more precision/recall value than
    # thresholds; drop the final (precision=1, recall=0) point to align them.
    p, r = precisions[:-1], recalls[:-1]
    # Find threshold that maximizes F1; np.divide avoids a 0/0 warning.
    f1_scores = np.divide(
        2 * p * r, p + r, out=np.zeros_like(p), where=(p + r) > 0,
    )
    best_threshold = float(thresholds[np.argmax(f1_scores)])
    y_pred_tuned = (y_prob_base >= best_threshold).astype(int)
    results["threshold_tuned"] = {
        "f1": f1_score(y_test, y_pred_tuned),
        "pr_auc": average_precision_score(y_test, y_prob_base),
        "optimal_threshold": best_threshold,
    }

    # Print comparison
    print(f"{'Strategy':<20s} {'F1':>8s} {'PR-AUC':>8s}")
    print("-" * 38)
    for name, metrics in results.items():
        print(f"{name:<20s} {metrics['f1']:>8.4f} {metrics['pr_auc']:>8.4f}")

    print(f"\nOptimal threshold: {best_threshold:.4f} (vs. default 0.5)")
    print(f"\nClass distribution — train: {pos_count} pos / {neg_count} neg "
          f"({pos_count/len(y_train)*100:.2f}%)")

    return results


results = compare_imbalance_strategies()

The typical result: class weights and the threshold-tuned baseline match or beat SMOTE on PR-AUC. (Threshold tuning reuses the baseline probabilities, so its PR-AUC equals the baseline’s by construction; its gain shows up in F1.) SMOTE sometimes improves F1 at the default 0.5 threshold but degrades precision: it catches more fraud at the cost of many more false alarms. Class weights achieve a better precision-recall tradeoff because they modify the loss function rather than the data distribution.
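What "modifying the loss function" means can be sketched with a positive-weighted log-loss. This is an illustration of the idea, under the assumption that scale_pos_weight behaves like a multiplier on the loss of positive rows; it is not XGBoost's internal code, and the function name is hypothetical.

```python
import numpy as np


def weighted_log_loss(y_true, p_pred, pos_weight=1.0, eps=1e-12):
    """Binary log-loss where errors on positive rows count pos_weight times."""
    y = np.asarray(y_true, dtype=float)
    p = np.clip(np.asarray(p_pred, dtype=float), eps, 1 - eps)
    per_row = -(pos_weight * y * np.log(p) + (1 - y) * np.log(1 - p))
    return float(per_row.mean())


# 999 legitimate rows, 1 fraud row, and a lazy model that outputs the
# base rate (0.001) for every row.
y = np.array([0] * 999 + [1])
p = np.full(1000, 0.001)

plain = weighted_log_loss(y, p)
weighted = weighted_log_loss(y, p, pos_weight=999.0)
print(round(plain, 4), round(weighted, 4))
```

Under the plain loss the lazy model looks nearly perfect; once the single fraud row carries as much weight as all legitimate rows together, ignoring it becomes expensive, and the data itself is left untouched.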

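Threshold tuning is particularly powerful because it is free: you keep the same model and the same predicted probabilities and only adjust where you draw the line between “flag as fraud” and “pass through.” The optimal threshold for imbalanced problems is almost never 0.5. For a model trained without class weights it is typically much lower, because predicted probabilities for a rare class seldom reach 0.5. When the costs of false negatives and false positives are known, the threshold should reflect that asymmetry rather than F1 alone.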
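When the cost asymmetry between false negatives and false positives can be quantified, the threshold can be chosen to minimize total cost instead of maximizing F1. The following is a brute-force sketch under that assumption; the cost values and the function name are illustrative, and the loop is written for clarity rather than speed.

```python
import numpy as np


def min_cost_threshold(y_true, y_prob, cost_fn, cost_fp):
    """Return (threshold, total_cost) minimizing cost_fn * FN + cost_fp * FP."""
    y_true = np.asarray(y_true)
    y_prob = np.asarray(y_prob, dtype=float)
    # Candidate thresholds: every observed score, plus "flag nothing".
    candidates = np.append(np.unique(y_prob), np.inf)
    best_t, best_cost = None, np.inf
    for t in candidates:
        flagged = y_prob >= t
        fp = np.sum(flagged & (y_true == 0))
        fn = np.sum(~flagged & (y_true == 1))
        cost = cost_fp * fp + cost_fn * fn
        if cost < best_cost:
            best_t, best_cost = float(t), float(cost)
    return best_t, best_cost


y = [0, 0, 1, 0, 1]
prob = [0.1, 0.2, 0.3, 0.4, 0.9]

# Missing a fraud costs 10x a false alarm: the threshold drops to 0.3.
t_recall, c_recall = min_cost_threshold(y, prob, cost_fn=10, cost_fp=1)
# A false alarm costs 10x a missed fraud: the threshold rises to 0.9.
t_precision, c_precision = min_cost_threshold(y, prob, cost_fn=1, cost_fp=10)
print(t_recall, c_recall, t_precision, c_precision)
```

As with F1-based tuning, the threshold should be selected on validation or out-of-fold predictions, not on the data used for the final evaluation.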

A Complete Fraud Detection Pipeline

Here is how you put it all together in production — class weights, proper evaluation, and threshold optimization:

import numpy as np
import xgboost as xgb
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import (
    average_precision_score,
    precision_recall_curve,
    f1_score,
    precision_score,
    recall_score,
)


def fraud_detection_pipeline(
    n_samples: int = 200_000,
    fraud_rate: float = 0.001,
    random_state: int = 42,
) -> dict:
    """Production-grade fraud detection pipeline with proper evaluation.

    Uses stratified cross-validation, class weights, threshold optimization,
    and PR-AUC as the primary metric. Returns per-fold metrics and the
    optimal threshold.
    """
    X, y = make_classification(
        n_samples=n_samples,
        n_features=40,
        n_informative=15,
        n_redundant=5,
        weights=[1 - fraud_rate, fraud_rate],
        flip_y=0.005,
        random_state=random_state,
    )

    neg_count = int((y == 0).sum())
    pos_count = int((y == 1).sum())
    print(f"Dataset: {n_samples:,} samples, {pos_count:,} fraud "
          f"({pos_count/n_samples*100:.3f}%)")

    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=random_state)
    fold_metrics: list[dict[str, float]] = []
    all_y_true: list[np.ndarray] = []
    all_y_prob: list[np.ndarray] = []

    for fold_idx, (train_idx, val_idx) in enumerate(cv.split(X, y)):
        X_train, X_val = X[train_idx], X[val_idx]
        y_train, y_val = y[train_idx], y[val_idx]

        train_neg = int((y_train == 0).sum())
        train_pos = int((y_train == 1).sum())

        model = xgb.XGBClassifier(
            n_estimators=1_000,
            max_depth=6,
            learning_rate=0.05,
            scale_pos_weight=train_neg / train_pos,
            subsample=0.8,
            colsample_bytree=0.8,
            reg_alpha=0.1,
            reg_lambda=1.0,
            early_stopping_rounds=50,
            random_state=random_state,
            eval_metric="aucpr",    # PR-AUC as the early stopping metric
            verbosity=0,
        )
        model.fit(
            X_train, y_train,
            eval_set=[(X_val, y_val)],
            verbose=False,
        )

        y_prob = model.predict_proba(X_val)[:, 1]
        pr_auc = average_precision_score(y_val, y_prob)

        fold_metrics.append({
            "fold": fold_idx,
            "pr_auc": pr_auc,
            "best_iteration": model.best_iteration,
        })
        all_y_true.append(y_val)
        all_y_prob.append(y_prob)

    # Aggregate predictions for threshold optimization
    y_true_all = np.concatenate(all_y_true)
    y_prob_all = np.concatenate(all_y_prob)

    # Find optimal threshold maximizing F1
    precisions, recalls, thresholds = precision_recall_curve(
        y_true_all, y_prob_all,
    )
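    # precision_recall_curve returns one more precision/recall value than
    # thresholds; drop the final (precision=1, recall=0) point to align them.
    p, r = precisions[:-1], recalls[:-1]
    # np.divide with `where` avoids the 0/0 warning that np.where would emit.
    f1_scores = np.divide(
        2 * p * r, p + r, out=np.zeros_like(p), where=(p + r) > 0,
    )
    optimal_threshold = float(thresholds[np.argmax(f1_scores)])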

    # Final evaluation at optimal threshold
    y_pred_optimal = (y_prob_all >= optimal_threshold).astype(int)
    final_precision = precision_score(y_true_all, y_pred_optimal)
    final_recall = recall_score(y_true_all, y_pred_optimal)
    final_f1 = f1_score(y_true_all, y_pred_optimal)
    final_pr_auc = average_precision_score(y_true_all, y_prob_all)

    # Also show default threshold = 0.5 for comparison
    y_pred_default = (y_prob_all >= 0.5).astype(int)
    default_precision = precision_score(
        y_true_all, y_pred_default, zero_division=0,
    )
    default_recall = recall_score(y_true_all, y_pred_default)
    default_f1 = f1_score(y_true_all, y_pred_default)

    print(f"\nCross-validated PR-AUC per fold:")
    for m in fold_metrics:
        print(f"  Fold {m['fold']}: PR-AUC = {m['pr_auc']:.4f} "
              f"(best iter: {m['best_iteration']})")

    mean_pr_auc = np.mean([m["pr_auc"] for m in fold_metrics])
    print(f"\nMean PR-AUC: {mean_pr_auc:.4f}")

    print(f"\n{'Metric':<15s} {'Default (0.5)':>14s} {'Optimized':>14s}")
    print("-" * 45)
    print(f"{'Threshold':<15s} {'0.5000':>14s} {optimal_threshold:>14.4f}")
    print(f"{'Precision':<15s} {default_precision:>14.4f} {final_precision:>14.4f}")
    print(f"{'Recall':<15s} {default_recall:>14.4f} {final_recall:>14.4f}")
    print(f"{'F1':<15s} {default_f1:>14.4f} {final_f1:>14.4f}")
    print(f"{'PR-AUC':<15s} {final_pr_auc:>14.4f} {final_pr_auc:>14.4f}")

    return {
        "optimal_threshold": optimal_threshold,
        "mean_pr_auc": mean_pr_auc,
        "final_precision": final_precision,
        "final_recall": final_recall,
        "final_f1": final_f1,
    }


pipeline_results = fraud_detection_pipeline()

Four design decisions in this pipeline deserve attention:

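1. eval_metric="aucpr" for early stopping. We stop training when PR-AUC on the validation set stops improving, not when log-loss or ROC-AUC stops improving. The early stopping metric should match the metric you care about. Note that it only decides how many boosting rounds to keep; the trees are still fit to the (weighted) log-loss objective. Also, because the same fold is used for early stopping and for scoring, the per-fold PR-AUC is slightly optimistic.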

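2. scale_pos_weight computed per fold. Stratified splitting keeps the class ratio nearly identical across folds, so the per-fold values differ only marginally. The reason to compute the weight inside the loop is hygiene: every training-time quantity is derived from the training fold alone, and nothing from the validation fold leaks in.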

3. Threshold optimization on aggregated OOF predictions. We collect all out-of-fold predictions and optimize the threshold on the combined set. This gives us a more robust threshold estimate than optimizing on a single validation split.

4. Comparison of default vs. optimized threshold. The table makes the impact of threshold tuning explicit. For a model trained without class weights on extreme imbalance, the default 0.5 threshold usually gives high precision but low recall: the model predicts fraud only when it is extremely confident and misses most actual fraud cases, so lowering the threshold trades precision for recall. Because this pipeline also sets scale_pos_weight, which pushes predicted probabilities for the positive class upward, the F1-optimal threshold here may sit above 0.5 instead; the table shows which direction applies.

The Decision Framework for Imbalanced Data

When you encounter class imbalance, work through this checklist:

Step	Action	Why
1	Switch to PR-AUC as your primary metric	ROC-AUC hides false positive volume in imbalanced settings
2	Apply `scale_pos_weight` in XGBoost	Reweight the loss to treat minority class errors as more costly
3	Use early stopping on `aucpr`	Optimize for the metric you report
4	Tune the decision threshold on the PR curve	The optimal threshold is rarely 0.5 for imbalanced problems
5	Consider stratified ensemble methods	Bagging with balanced bootstrap samples can help at extreme ratios
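Step 5 is only named in the checklist, so here is one possible reading of "balanced bootstrap samples": each bag resamples the positives with replacement and draws an equally sized random set of negatives. The function name and the equal-size choice are assumptions, not a prescribed recipe.

```python
import numpy as np


def balanced_bootstrap_indices(y, n_bags, random_state=0):
    """Row indices for n_bags bootstrap samples with a 50/50 class balance."""
    y = np.asarray(y)
    rng = np.random.default_rng(random_state)
    pos_idx = np.flatnonzero(y == 1)
    neg_idx = np.flatnonzero(y == 0)
    bags = []
    for _ in range(n_bags):
        # Resample all positives with replacement.
        pos_sample = rng.choice(pos_idx, size=len(pos_idx), replace=True)
        # Draw the same number of negatives; each bag sees different ones.
        neg_sample = rng.choice(neg_idx, size=len(pos_idx), replace=True)
        bags.append(np.concatenate([pos_sample, neg_sample]))
    return bags


y = np.array([0] * 990 + [1] * 10)
bags = balanced_bootstrap_indices(y, n_bags=3)
print([int(y[b].sum()) for b in bags], [len(b) for b in bags])
```

One model would be trained per bag and the predicted probabilities averaged. Unlike a single random undersample, the ensemble as a whole still sees many different negatives, though its probabilities are shifted upward by the rebalancing and the threshold needs tuning accordingly.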

What to skip: SMOTE (unless you have empirically verified it helps on your specific dataset), random undersampling of the majority class (throws away data), and accuracy as a metric (lies to you).

The broader principle: imbalanced data is not a data problem — it is an evaluation and loss function problem. You do not need to change your data. You need to change how you measure performance and how you weight errors. The model can learn from 0.1% positive examples if you give it the right loss function and evaluate it with the right metrics.
