Calibration and A/B Testing

8.3 — Calibration: When Probabilities Must Mean Something

Your XGBoost model predicts a 90% probability that a customer will churn. Should you send them a $50 retention offer? That depends entirely on whether “90%” actually means 90%. If the model’s 90%-confidence predictions are only correct 60% of the time, your expected value calculation is wrong, your threshold is wrong, and your business decision is wrong.

Calibration is the property that a model’s predicted probabilities match observed frequencies. A perfectly calibrated model’s 30% predictions come true 30% of the time, its 70% predictions come true 70% of the time, and so on across the entire probability range.

Most models are not calibrated. And the models that data scientists reach for most often — gradient-boosted trees, random forests — are among the worst offenders.

Why Trees Are Badly Calibrated

Tree-based models output the proportion of positive examples in a leaf node. If a leaf contains 8 positives and 2 negatives during training, the tree outputs 0.8. But that 0.8 is the proportion in one specific leaf of one specific tree, shaped by the particular splits chosen during training. Random forests average these proportions across trees, pushing predictions toward the center (they rarely output values near 0 or 1). Gradient-boosted trees combine additive corrections through a sigmoid or softmax, producing a different — but still miscalibrated — distribution.

Logistic regression, by contrast, is calibrated by construction on the training data. Its objective function (log loss) directly optimizes the predicted probabilities to match observed frequencies. On well-specified problems, logistic regression’s probabilities are among the most reliable you will get.

Calibration Curves: Diagnosing the Problem

A calibration curve (reliability diagram) bins predictions by predicted probability, then plots the actual fraction of positives in each bin against the predicted probability. A perfectly calibrated model traces the diagonal.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.calibration import calibration_curve, CalibratedClassifierCV
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import brier_score_loss

# Generate a moderately difficult binary classification problem
X, y = make_classification(
    n_samples=15_000, n_features=20, n_informative=10,
    n_redundant=5, weights=[0.6, 0.4], random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y,
)

# Train both models
lr = LogisticRegression(max_iter=1_000, random_state=42)
lr.fit(X_train, y_train)

xgb = GradientBoostingClassifier(
    n_estimators=200, max_depth=5, learning_rate=0.1, random_state=42,
)
xgb.fit(X_train, y_train)

# Get probability predictions
lr_probs = lr.predict_proba(X_test)[:, 1]
xgb_probs = xgb.predict_proba(X_test)[:, 1]

# Compute calibration curves
lr_fraction, lr_mean_pred = calibration_curve(y_test, lr_probs, n_bins=10)
xgb_fraction, xgb_mean_pred = calibration_curve(y_test, xgb_probs, n_bins=10)

# Print calibration comparison
print("Logistic Regression calibration:")
print(f"  Brier score: {brier_score_loss(y_test, lr_probs):.4f}")
for pred, actual in zip(lr_mean_pred, lr_fraction):
    print(f"  Predicted: {pred:.2f} → Actual: {actual:.2f}")

print("\nGradient Boosting calibration:")
print(f"  Brier score: {brier_score_loss(y_test, xgb_probs):.4f}")
for pred, actual in zip(xgb_mean_pred, xgb_fraction):
    print(f"  Predicted: {pred:.2f} → Actual: {actual:.2f}")

You will see the logistic regression points cluster near the diagonal. The gradient boosting points will be systematically off — the model’s predicted probabilities do not match the observed frequencies. The Brier score quantifies this overall: lower is better, and it decomposes into calibration error plus refinement.

Expected Calibration Error (ECE)

ECE provides a single number summarizing miscalibration. It is the weighted average of the absolute difference between predicted and observed frequency across bins:

def expected_calibration_error(
    y_true: np.ndarray,
    y_prob: np.ndarray,
    n_bins: int = 10,
) -> float:
    """
    Expected Calibration Error.
    Weighted average of |accuracy - confidence| across bins.
    """
    bin_edges = np.linspace(0, 1, n_bins + 1)
    ece = 0.0

    for i in range(n_bins):
        mask = (y_prob > bin_edges[i]) & (y_prob <= bin_edges[i + 1])
        if mask.sum() == 0:
            continue

        bin_accuracy = y_true[mask].mean()
        bin_confidence = y_prob[mask].mean()
        bin_weight = mask.sum() / len(y_true)
        ece += bin_weight * abs(bin_accuracy - bin_confidence)

    return ece


print(f"LR ECE:  {expected_calibration_error(y_test, lr_probs):.4f}")
print(f"XGB ECE: {expected_calibration_error(y_test, xgb_probs):.4f}")

Fixing Calibration: Platt Scaling and Isotonic Regression

Two standard methods post-hoc calibrate a model’s outputs:

Platt scaling fits a logistic regression on the model’s output scores. It learns a sigmoid mapping from raw score to calibrated probability. Works well when the miscalibration is a smooth, monotonic distortion — which is common for boosted trees and SVMs.

Isotonic regression fits a non-parametric, monotone-increasing function. More flexible than Platt scaling, but requires more calibration data to avoid overfitting. Use when you have at least a few thousand calibration samples.

Sklearn’s CalibratedClassifierCV handles both methods with internal cross-validation to avoid overfitting the calibration to the same data used for training:

# Calibrate the XGBoost model using CalibratedClassifierCV
xgb_platt = CalibratedClassifierCV(xgb, method="sigmoid", cv=5)
xgb_platt.fit(X_train, y_train)

xgb_isotonic = CalibratedClassifierCV(xgb, method="isotonic", cv=5)
xgb_isotonic.fit(X_train, y_train)

# Compare calibration quality
platt_probs = xgb_platt.predict_proba(X_test)[:, 1]
iso_probs = xgb_isotonic.predict_proba(X_test)[:, 1]

print("After calibration:")
print(f"  Uncalibrated XGB Brier: {brier_score_loss(y_test, xgb_probs):.4f}")
print(f"  Platt-scaled Brier:     {brier_score_loss(y_test, platt_probs):.4f}")
print(f"  Isotonic Brier:         {brier_score_loss(y_test, iso_probs):.4f}")
print(f"  Uncalibrated XGB ECE:   {expected_calibration_error(y_test, xgb_probs):.4f}")
print(f"  Platt-scaled ECE:       {expected_calibration_error(y_test, platt_probs):.4f}")
print(f"  Isotonic ECE:           {expected_calibration_error(y_test, iso_probs):.4f}")

Both Brier score and ECE should improve substantially after calibration. The predicted probabilities now mean something.

When Calibration Matters

Calibration is critical in three scenarios:

Threshold-based decisions. If you set a threshold at 0.7 to trigger an action, miscalibrated probabilities mean your threshold is in the wrong place. The model’s “0.7” might correspond to a true probability of 0.5.
Expected value calculations. Any decision that multiplies probability by payoff — bid optimization, risk pricing, inventory planning — requires calibrated probabilities to produce correct expected values.
Model ensembles and stacking. If you average probabilities from multiple models, miscalibrated components will distort the ensemble. Calibrate each model before combining.

Calibration does not matter when you only care about rank ordering — pure sorting tasks where you take the top-k predictions regardless of their probability values.

Calibration Curves

8.4 — A/B Testing: From Offline Metrics to Business Outcomes

Your model beats the baseline on every offline metric. Calibration looks solid. Cross-validation is rigorous. You are confident. Should you deploy?

Not without a proper online experiment. Offline metrics are a necessary filter — they eliminate bad models cheaply. But they cannot tell you whether the model improves the business outcome you care about, because that outcome depends on user behavior, system interactions, and real-world distributions that your test set only approximates.

Statistical Significance vs. Practical Significance

A common trap: your A/B test produces p < 0.001. The result is “highly significant.” But the effect size is a 0.02% improvement in click-through rate. On a base rate of 3%, this is an increase to 3.0006%. Statistically, you can distinguish the two with enough data. Practically, this difference generates zero business value.

Always report effect sizes alongside p-values. A p-value tells you whether the difference is real. The effect size tells you whether the difference matters.

Sample Size Calculation

Calculate your required sample size before launching the test. If you do not, you will be tempted to peek at results daily and stop when you see significance — a practice that inflates your false positive rate far beyond the nominal 5%.

from scipy import stats
import math


def ab_test_sample_size(
    baseline_rate: float,
    minimum_detectable_effect: float,
    alpha: float = 0.05,
    power: float = 0.80,
) -> int:
    """
    Calculate required sample size per group for a two-proportion z-test.

    Args:
        baseline_rate: Current conversion rate (e.g., 0.03 for 3%)
        minimum_detectable_effect: Smallest relative change worth detecting
            (e.g., 0.05 for a 5% relative lift)
        alpha: Significance level (Type I error rate)
        power: Statistical power (1 - Type II error rate)
    """
    p1 = baseline_rate
    p2 = baseline_rate * (1 + minimum_detectable_effect)
    p_avg = (p1 + p2) / 2

    z_alpha = stats.norm.ppf(1 - alpha / 2)
    z_beta = stats.norm.ppf(power)

    numerator = (
        z_alpha * math.sqrt(2 * p_avg * (1 - p_avg))
        + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    denominator = (p2 - p1) ** 2

    return math.ceil(numerator / denominator)


# How many users to detect a 5% relative lift from 3% baseline?
n_per_group = ab_test_sample_size(
    baseline_rate=0.03,
    minimum_detectable_effect=0.05,
)
print(f"Required sample size per group: {n_per_group:,}")
print(f"Total users needed: {2 * n_per_group:,}")

# What about a 10% lift?
n_10pct = ab_test_sample_size(baseline_rate=0.03, minimum_detectable_effect=0.10)
print(f"For 10% lift: {2 * n_10pct:,} total users")

The numbers will be large — tens or hundreds of thousands per group for small effects on low base rates. This is not an inconvenience; it is reality. If you cannot afford the sample size, you cannot reliably detect the effect, and running the test anyway is theater.

Proper A/B Test Analysis

When the test reaches the pre-calculated sample size, analyze with confidence intervals — not just a p-value:

def analyze_ab_test(
    conversions_a: int,
    total_a: int,
    conversions_b: int,
    total_b: int,
    alpha: float = 0.05,
) -> dict:
    """
    Analyze an A/B test with confidence intervals and effect size.
    Reports the difference in proportions with a CI.
    """
    p_a = conversions_a / total_a
    p_b = conversions_b / total_b
    diff = p_b - p_a
    relative_lift = diff / p_a if p_a > 0 else float("inf")

    # Standard error of the difference
    se = math.sqrt(p_a * (1 - p_a) / total_a + p_b * (1 - p_b) / total_b)

    # Confidence interval
    z = stats.norm.ppf(1 - alpha / 2)
    ci_lower = diff - z * se
    ci_upper = diff + z * se

    # Two-proportion z-test
    p_pooled = (conversions_a + conversions_b) / (total_a + total_b)
    se_pooled = math.sqrt(p_pooled * (1 - p_pooled) * (1 / total_a + 1 / total_b))
    z_stat = diff / se_pooled
    p_value = 2 * (1 - stats.norm.cdf(abs(z_stat)))

    return {
        "control_rate": p_a,
        "treatment_rate": p_b,
        "absolute_diff": diff,
        "relative_lift": relative_lift,
        "ci_lower": ci_lower,
        "ci_upper": ci_upper,
        "p_value": p_value,
        "significant": p_value < alpha,
    }


# Example: experiment completed with pre-calculated sample size
result = analyze_ab_test(
    conversions_a=900, total_a=30_000,   # Control: 3.00%
    conversions_b=975, total_b=30_000,   # Treatment: 3.25%
)

print(f"Control rate:    {result['control_rate']:.4f}")
print(f"Treatment rate:  {result['treatment_rate']:.4f}")
print(f"Relative lift:   {result['relative_lift']:+.2%}")
print(f"95% CI:          [{result['ci_lower']:+.4f}, {result['ci_upper']:+.4f}]")
print(f"P-value:         {result['p_value']:.4f}")
print(f"Significant:     {result['significant']}")

The confidence interval is more informative than the p-value. If the CI is [+0.0005, +0.0045], you know the true effect is positive but could be anywhere in that range. Your business decision should depend on whether the lower bound of the CI is above your minimum worthwhile effect — not just whether p < 0.05.

Common Mistakes

Peeking. Checking results daily and stopping when p < 0.05 inflates your false positive rate from 5% to as high as 30%. Pre-commit to a sample size and analyze only when you reach it. If you must monitor continuously, use sequential testing methods (e.g., alpha spending functions) that control the overall error rate.

Imbalanced splits. A 90/10 traffic split reduces statistical power dramatically. The rare group determines the noise floor. A 50/50 split maximizes power for a given total sample.

Novelty effects. Users may interact differently with a new experience because it is new, not because it is better. Run experiments long enough to let novelty wear off — at least two full cycles of your users’ typical engagement pattern.

Bayesian A/B Testing

Frequentist testing answers an awkward question: “If there were no difference, how unlikely would this data be?” Bayesian testing answers the question you actually care about: “Given the data, what is the probability that B is better than A?”

from scipy import stats as sp_stats


def bayesian_ab_test(
    conversions_a: int,
    total_a: int,
    conversions_b: int,
    total_b: int,
    prior_alpha: float = 1.0,
    prior_beta: float = 1.0,
    n_simulations: int = 100_000,
) -> dict:
    """
    Bayesian A/B test using Beta-Binomial conjugate model.

    Prior: Beta(prior_alpha, prior_beta). Default is uniform (uninformative).
    Returns posterior probability that B > A and expected lift distribution.
    """
    rng = np.random.default_rng(42)

    # Posterior distributions (conjugate update: Beta + Binomial → Beta)
    alpha_a = prior_alpha + conversions_a
    beta_a = prior_beta + (total_a - conversions_a)

    alpha_b = prior_alpha + conversions_b
    beta_b = prior_beta + (total_b - conversions_b)

    # Monte Carlo sampling from posteriors
    samples_a = rng.beta(alpha_a, beta_a, size=n_simulations)
    samples_b = rng.beta(alpha_b, beta_b, size=n_simulations)

    # Probability that B > A
    prob_b_better = (samples_b > samples_a).mean()

    # Expected lift distribution
    lift_samples = (samples_b - samples_a) / samples_a
    expected_lift = lift_samples.mean()
    lift_ci = np.percentile(lift_samples, [2.5, 97.5])

    return {
        "prob_b_better": prob_b_better,
        "expected_lift": expected_lift,
        "lift_ci_lower": lift_ci[0],
        "lift_ci_upper": lift_ci[1],
        "posterior_a_mean": alpha_a / (alpha_a + beta_a),
        "posterior_b_mean": alpha_b / (alpha_b + beta_b),
    }


# Same data as before
bayes_result = bayesian_ab_test(
    conversions_a=900, total_a=30_000,
    conversions_b=975, total_b=30_000,
)

print(f"P(B > A):        {bayes_result['prob_b_better']:.4f}")
print(f"Expected lift:   {bayes_result['expected_lift']:+.2%}")
print(f"95% credible:    [{bayes_result['lift_ci_lower']:+.2%}, {bayes_result['lift_ci_upper']:+.2%}]")
print(f"Posterior A:     {bayes_result['posterior_a_mean']:.4f}")
print(f"Posterior B:     {bayes_result['posterior_b_mean']:.4f}")

The output directly states: “There is an X% probability that B is better than A.” No null hypotheses, no p-values, no awkward interpretation. You can set a decision threshold — “deploy B if P(B > A) > 0.95” — that maps directly to your risk tolerance.

The credible interval on the lift tells you the range of plausible effect sizes. If the interval includes both economically meaningful and negligible effects, you need more data, regardless of what the point estimate says.

Multi-Armed Bandits: Adaptive Experimentation

Standard A/B testing allocates traffic equally for the entire experiment. If variant A is clearly worse after 10% of the data, you still send it 50% of traffic for the remaining 90%. Multi-armed bandits adapt allocation based on observed performance — sending more traffic to the variant that appears to be winning.

The trade-off: bandits reduce regret (the cost of showing inferior variants) at the expense of statistical clarity. You get a faster practical answer but a less precise estimate of the true effect size. Use bandits when:

The cost of showing bad variants is high (revenue-impacting changes, user experience experiments)
You have many variants to test (bandits handle 10+ arms gracefully; A/B tests struggle past 3-4)
You care more about cumulative reward than precise effect estimation

Thompson Sampling is the most practical bandit algorithm — it naturally balances exploration and exploitation using the same Beta posterior we computed above:

def thompson_sampling_step(
    successes: list[int],
    failures: list[int],
    rng: np.random.Generator | None = None,
) -> int:
    """
    Thompson Sampling: choose an arm by sampling from each arm's
    posterior and picking the one with the highest sample.
    """
    if rng is None:
        rng = np.random.default_rng()

    samples = [
        rng.beta(s + 1, f + 1)
        for s, f in zip(successes, failures)
    ]
    return int(np.argmax(samples))


# Simulate a 3-arm bandit over 10,000 rounds
true_rates = [0.030, 0.033, 0.028]  # Arm 1 is the winner
n_arms = len(true_rates)
successes = [0] * n_arms
failures = [0] * n_arms
rng = np.random.default_rng(42)

for round_num in range(10_000):
    arm = thompson_sampling_step(successes, failures, rng)
    reward = rng.random() < true_rates[arm]

    if reward:
        successes[arm] += 1
    else:
        failures[arm] += 1

# Results
total_pulls = [s + f for s, f in zip(successes, failures)]
observed_rates = [s / t if t > 0 else 0 for s, t in zip(successes, total_pulls)]

for i in range(n_arms):
    print(
        f"Arm {i}: pulled {total_pulls[i]:,} times, "
        f"observed rate = {observed_rates[i]:.4f}, "
        f"true rate = {true_rates[i]:.4f}"
    )

Notice that the best arm gets pulled significantly more often than the others. Thompson Sampling automatically concentrates traffic on the winner while continuing to explore the others enough to catch up if the true rankings change. In production, this means less revenue lost to inferior variants during the experiment.

The key insight across this entire chapter: evaluation is not a step you do after modeling. It is a discipline that shapes every decision — from which metric you optimize, to how you split your data, to whether your probabilities mean anything, to how you prove that your model actually moves the business needle. Get evaluation wrong, and nothing else matters.