Difference-in-Differences and Uplift Modeling

9.3 — Difference-in-Differences

Propensity score methods assume you have measured every confounder. That is a strong claim, and you often cannot defend it. Difference-in-differences (DiD) takes a different approach: instead of controlling for confounders directly, it cancels them out by using the change in outcomes over time.

The Setup

You have two groups: one that receives a treatment (e.g., a new policy, a product change) and one that does not. You observe both groups before and after the treatment. The treatment effect is estimated as:

$$\text{DiD} = (\bar{Y}{\text{treated, after}} - \bar{Y}{\text{treated, before}}) - (\bar{Y}{\text{control, after}} - \bar{Y}{\text{control, before}})$$

The first difference removes time-invariant confounders for each group. The second difference removes common time trends. What remains — if the assumptions hold — is the causal effect of the treatment.

The Parallel Trends Assumption

DiD requires one critical assumption: absent treatment, the treated and control groups would have followed the same trajectory over time. They do not need the same level of outcomes — the treated group can start higher or lower — but the change over time must be parallel.

This assumption is partially testable. You can check whether the groups moved in parallel during the pre-treatment period. If pre-treatment trends diverge, DiD is not credible. If they track closely, the assumption becomes more plausible — though it is never provable, because you cannot observe the counterfactual post-treatment trajectory.

DiD with OLS Regression

The standard way to estimate DiD is with a linear regression including a treatment indicator, a post-period indicator, and their interaction. The interaction coefficient is the DiD estimate.

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from scipy import stats

rng = np.random.default_rng(42)


def simulate_did_data(
    n_units: int = 200,
    n_pre_periods: int = 6,
    n_post_periods: int = 6,
    true_effect: float = 15.0,
) -> pd.DataFrame:
    """Simulate panel data for difference-in-differences.

    Scenario: A retail chain rolls out a loyalty program in 100 stores
    (treated) while 100 stores serve as control. We observe monthly
    sales per store before and after the program launch.
    """
    n_treated = n_units // 2
    total_periods = n_pre_periods + n_post_periods
    records: list[dict] = []

    for unit in range(n_units):
        is_treated = int(unit < n_treated)
        # Unit-specific baseline (time-invariant confounder — canceled by DiD)
        unit_baseline = 100 + rng.normal(0, 20)
        if is_treated:
            unit_baseline += 10  # Treated stores start higher (confounder!)

        for period in range(total_periods):
            is_post = int(period >= n_pre_periods)
            # Common time trend (both groups follow)
            time_trend = 2.0 * period
            # Treatment effect: only for treated units in post-period
            treatment_effect = true_effect * is_treated * is_post
            # Noise
            noise = rng.normal(0, 8)

            sales = unit_baseline + time_trend + treatment_effect + noise

            records.append({
                "unit": unit,
                "period": period,
                "treated": is_treated,
                "post": is_post,
                "sales": sales,
            })

    return pd.DataFrame(records)


df = simulate_did_data()

# Naive comparison: just compare treated vs control in post-period
post = df[df["post"] == 1]
naive = (
    post[post["treated"] == 1]["sales"].mean()
    - post[post["treated"] == 0]["sales"].mean()
)
print(f"Naive post-period comparison: {naive:.1f}")
print(f"  (Biased: includes pre-existing level difference)\n")

# DiD estimation via OLS
# Y = β0 + β1*treated + β2*post + β3*(treated × post) + ε
# β3 is the DiD estimate
df["treated_x_post"] = df["treated"] * df["post"]

X = df[["treated", "post", "treated_x_post"]].values
y = df["sales"].values

# Use statsmodels-style OLS for proper standard errors
X_with_const = np.column_stack([np.ones(len(X)), X])
beta = np.linalg.lstsq(X_with_const, y, rcond=None)[0]
residuals = y - X_with_const @ beta
n, k = X_with_const.shape
mse = np.sum(residuals**2) / (n - k)
var_beta = mse * np.linalg.inv(X_with_const.T @ X_with_const)
se_beta = np.sqrt(np.diag(var_beta))

did_estimate = beta[3]
did_se = se_beta[3]
t_stat = did_estimate / did_se
p_value = 2 * (1 - stats.t.cdf(abs(t_stat), df=n - k))
ci_lower = did_estimate - 1.96 * did_se
ci_upper = did_estimate + 1.96 * did_se

print(f"=== Difference-in-Differences ===")
print(f"  DiD estimate (β3): {did_estimate:.2f}")
print(f"  Standard error:    {did_se:.2f}")
print(f"  95% CI:            [{ci_lower:.2f}, {ci_upper:.2f}]")
print(f"  p-value:           {p_value:.4f}")
print(f"  True effect:       15.0")
print(f"\n  Coefficient interpretation:")
print(f"    β0 (intercept):  {beta[0]:.2f}  (control group, pre-period mean)")
print(f"    β1 (treated):    {beta[1]:.2f}  (level difference, not causal)")
print(f"    β2 (post):       {beta[2]:.2f}  (common time change)")
print(f"    β3 (DiD):        {beta[3]:.2f}  (causal effect estimate)")

The naive comparison overestimates because treated stores were already higher-performing (a confounder). DiD strips out that level difference and isolates the treatment effect.

Checking Parallel Trends

The most important diagnostic for DiD is whether the treatment and control groups followed similar trajectories before the intervention. You check this visually and with a formal test.

def check_parallel_trends(df: pd.DataFrame) -> None:
    """Check the parallel trends assumption using pre-treatment data.

    Compute group means by period. In the pre-treatment window,
    the difference between groups should be roughly constant.
    """
    group_means = (
        df.groupby(["period", "treated"])["sales"]
        .mean()
        .unstack("treated")
    )
    group_means.columns = ["control", "treated"]
    group_means["difference"] = group_means["treated"] - group_means["control"]

    pre_treatment = group_means[group_means.index < 6]
    post_treatment = group_means[group_means.index >= 6]

    print("=== Parallel Trends Check ===")
    print("\nPre-treatment group differences (should be ~constant):")
    for period, row in pre_treatment.iterrows():
        print(f"  Period {period}: control={row['control']:.1f}, "
              f"treated={row['treated']:.1f}, diff={row['difference']:.1f}")

    # Formal test: regress pre-treatment difference on time
    # A significant slope means trends are NOT parallel
    pre_periods = pre_treatment.index.values.astype(float)
    pre_diffs = pre_treatment["difference"].values

    if len(pre_periods) > 2:
        slope, intercept, r_value, p_value, std_err = stats.linregress(
            pre_periods, pre_diffs
        )
        print(f"\n  Pre-trend slope: {slope:.3f} (p={p_value:.3f})")
        if p_value < 0.05:
            print("  WARNING: Pre-treatment trends are NOT parallel.")
            print("  DiD estimate may be biased.")
        else:
            print("  Pre-treatment trends are approximately parallel. ✓")

    print("\nPost-treatment (treatment effect visible):")
    for period, row in post_treatment.iterrows():
        print(f"  Period {period}: control={row['control']:.1f}, "
              f"treated={row['treated']:.1f}, diff={row['difference']:.1f}")


check_parallel_trends(df)

If pre-treatment trends diverge, you have three options: (1) find a better control group, (2) add unit-specific time trends to the regression, or (3) abandon DiD and use a different method. Forcing DiD onto data with non-parallel trends produces confident-looking but unreliable estimates.

9.4 — Synthetic Control

DiD works when you have many treated and control units. But what if only one unit received treatment? A single country enacted a policy. A single city launched a program. A single product line changed pricing. With $n=1$ in the treated group, standard DiD cannot estimate treatment effects.

Synthetic control solves this by constructing a synthetic version of the treated unit from a weighted combination of untreated units. The weights are chosen so that the synthetic unit matches the treated unit’s pre-treatment trajectory as closely as possible. After the treatment, the gap between the real and synthetic unit estimates the treatment effect.

from scipy.optimize import minimize


def synthetic_control(
    panel: pd.DataFrame,
    treated_unit: int,
    intervention_period: int,
    outcome_col: str = "outcome",
    unit_col: str = "unit",
    time_col: str = "period",
) -> dict:
    """Estimate treatment effect using the synthetic control method.

    Finds weights for donor (untreated) units such that the weighted
    combination best matches the treated unit's pre-treatment outcomes.
    """
    units = panel[unit_col].unique()
    donor_units = [u for u in units if u != treated_unit]
    periods = sorted(panel[time_col].unique())
    pre_periods = [p for p in periods if p < intervention_period]
    post_periods = [p for p in periods if p >= intervention_period]

    # Build outcome matrices: rows = time, columns = units
    def get_series(unit: int) -> np.ndarray:
        s = panel[panel[unit_col] == unit].sort_values(time_col)
        return s[outcome_col].values

    treated_series = get_series(treated_unit)
    donor_matrix = np.column_stack([get_series(u) for u in donor_units])

    n_pre = len(pre_periods)

    # Pre-treatment fit: minimize squared distance
    def objective(weights: np.ndarray) -> float:
        synthetic = donor_matrix[:n_pre] @ weights
        return float(np.sum((treated_series[:n_pre] - synthetic) ** 2))

    n_donors = len(donor_units)
    # Constraints: weights sum to 1, all non-negative
    constraints = {"type": "eq", "fun": lambda w: np.sum(w) - 1.0}
    bounds = [(0.0, 1.0)] * n_donors
    x0 = np.ones(n_donors) / n_donors

    result = minimize(objective, x0, bounds=bounds, constraints=constraints,
                      method="SLSQP")
    weights = result.x

    # Synthetic unit: full time series
    synthetic_series = donor_matrix @ weights

    # Treatment effect: gap between real and synthetic in post-period
    post_gaps = treated_series[n_pre:] - synthetic_series[n_pre:]
    pre_gaps = treated_series[:n_pre] - synthetic_series[:n_pre]

    return {
        "weights": dict(zip(donor_units, weights)),
        "pre_rmse": float(np.sqrt(np.mean(pre_gaps**2))),
        "post_treatment_effects": post_gaps.tolist(),
        "average_effect": float(np.mean(post_gaps)),
        "top_donors": sorted(
            zip(donor_units, weights), key=lambda x: -x[1]
        )[:5],
    }


# Simulate state-level policy data
n_states = 20
n_periods = 20
intervention = 12
true_effect = 8.0

records: list[dict] = []
state_baselines = rng.normal(50, 10, n_states)
for state in range(n_states):
    for period in range(n_periods):
        trend = 0.5 * period + rng.normal(0, 1)
        effect = true_effect if (state == 0 and period >= intervention) else 0
        outcome = state_baselines[state] + trend + effect + rng.normal(0, 2)
        records.append({
            "unit": state, "period": period, "outcome": outcome
        })

panel = pd.DataFrame(records)
sc_result = synthetic_control(panel, treated_unit=0,
                               intervention_period=intervention)

print("=== Synthetic Control ===")
print(f"  Pre-treatment RMSE: {sc_result['pre_rmse']:.2f}")
print(f"  (Lower = better pre-treatment fit)")
print(f"\n  Average post-treatment effect: {sc_result['average_effect']:.2f}")
print(f"  True effect: {true_effect}")
print(f"\n  Top donor weights:")
for unit, weight in sc_result["top_donors"]:
    if weight > 0.01:
        print(f"    State {unit}: {weight:.3f}")

Placebo tests are essential for validating synthetic control. Apply the same method to each untreated unit, pretending it was treated. If the estimated effect for the actually-treated unit is much larger than the placebo effects, you have evidence that the effect is real. If several untreated units show similarly large “effects,” your method is picking up noise, not signal.

Difference-in-Differences

9.5 — Uplift Modeling

Every method so far answers the same question: what is the average treatment effect? Uplift modeling asks a more actionable question: which individuals benefit from treatment, and by how much?

This distinction matters enormously in practice. A marketing campaign with a positive average effect might be wasting budget on customers who would have bought anyway (“sure things”) while failing to reach customers who only buy when prompted (“persuadables”). An average treatment effect of +5% could mask the fact that 80% of treated customers had zero response and 20% had a +25% response. Targeting those 20% generates the same revenue at one-fifth the cost.

The Four Customer Segments

When you cross treatment (yes/no) with response (yes/no), customers fall into four segments:

	Buys if treated	Does not buy if treated
Buys if not treated	Sure Things (treatment is wasted)	Sleeping Dogs (treatment hurts!)
Does not buy if not treated	Persuadables (target these)	Lost Causes (nothing works)

You want to spend your budget on the persuadables — those with high uplift (positive difference between treated and untreated outcomes). You want to avoid sleeping dogs — customers who would have bought but are turned off by the treatment (negative uplift). Sure things and lost causes get zero incremental value from treatment.

The problem: you never observe both cells for the same customer. This is the fundamental problem of causal inference again, now at the individual level.

The T-Learner

The T-learner is the simplest approach to estimating individual-level treatment effects. Train two separate models: one on treated units ($\hat{\mu}_1(X)$) and one on control units ($\hat{\mu}_0(X)$). The estimated uplift for an individual is:

$$\hat{\tau}(X) = \hat{\mu}_1(X) - \hat{\mu}_0(X)$$

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_predict


def simulate_uplift_data(n: int = 10_000) -> pd.DataFrame:
    """Simulate marketing campaign data with heterogeneous treatment effects.

    Some customers are persuadable, some are sure things,
    some are lost causes, and some are sleeping dogs.
    """
    age = rng.normal(40, 12, n).clip(18, 75)
    income = rng.normal(55_000, 18_000, n).clip(15_000, 130_000)
    recency = rng.exponential(30, n).clip(1, 365)  # Days since last purchase
    frequency = rng.poisson(4, n)  # Past purchases

    # Random treatment assignment (like an actual experiment)
    treatment = rng.binomial(1, 0.5, n)

    # Base purchase probability (without treatment)
    logit_base = (
        -1.5
        + 0.02 * (age - 40)
        + 0.00001 * (income - 55_000)
        - 0.005 * recency
        + 0.1 * frequency
    )

    # HETEROGENEOUS treatment effect:
    # - Young, low-frequency customers: highly persuadable (+20 pp)
    # - Old, high-frequency customers: sure things (0 pp uplift)
    # - Mid-age, high-recency: sleeping dogs (-5 pp, annoyed by spam)
    uplift_logit = (
        0.8
        - 0.015 * (age - 30)
        - 0.08 * frequency
        + 0.003 * recency
        + 0.000005 * (income - 55_000)
    )

    logit_treated = logit_base + uplift_logit * treatment
    prob = 1 / (1 + np.exp(-logit_treated))
    purchase = rng.binomial(1, prob)

    return pd.DataFrame({
        "age": age, "income": income, "recency": recency,
        "frequency": frequency, "treatment": treatment,
        "purchase": purchase,
    })


df = simulate_uplift_data()

# Average treatment effect (for reference)
ate = (
    df[df["treatment"] == 1]["purchase"].mean()
    - df[df["treatment"] == 0]["purchase"].mean()
)
print(f"Average Treatment Effect: {ate:.3f}")
print(f"  (Positive on average, but varies hugely by individual)\n")

# T-Learner: separate models for treated and control
features = ["age", "income", "recency", "frequency"]
X = df[features].values
treated_mask = df["treatment"] == 1
control_mask = df["treatment"] == 0

model_treated = GradientBoostingClassifier(
    n_estimators=100, max_depth=4, random_state=42
)
model_control = GradientBoostingClassifier(
    n_estimators=100, max_depth=4, random_state=42
)

model_treated.fit(X[treated_mask], df.loc[treated_mask, "purchase"])
model_control.fit(X[control_mask], df.loc[control_mask, "purchase"])

# Predict uplift: P(purchase | treated) - P(purchase | control)
prob_treated = model_treated.predict_proba(X)[:, 1]
prob_control = model_control.predict_proba(X)[:, 1]
df["uplift"] = prob_treated - prob_control

# Segment customers
df["segment"] = pd.cut(
    df["uplift"],
    bins=[-np.inf, -0.02, 0.02, np.inf],
    labels=["sleeping_dogs", "neutral", "persuadable"],
)

print("=== Customer Segments ===")
for segment in ["persuadable", "neutral", "sleeping_dogs"]:
    sub = df[df["segment"] == segment]
    actual_uplift = (
        sub[sub["treatment"] == 1]["purchase"].mean()
        - sub[sub["treatment"] == 0]["purchase"].mean()
    )
    print(f"  {segment:>15}: n={len(sub):>5}, "
          f"predicted_uplift={sub['uplift'].mean():.3f}, "
          f"actual_uplift={actual_uplift:.3f}")

Uplift Scoring and Targeting

The T-learner gives every customer an uplift score. You rank by uplift and target from the top. The uplift curve measures cumulative incremental gain as you increase the fraction of customers targeted — the causal analogue of a lift curve.

def evaluate_uplift(
    df: pd.DataFrame,
    uplift_col: str = "uplift",
    treatment_col: str = "treatment",
    outcome_col: str = "purchase",
    n_bins: int = 10,
) -> pd.DataFrame:
    """Evaluate uplift model by computing actual uplift within deciles.

    Sort by predicted uplift, bin into deciles, and compute actual
    treatment effect within each bin. A good uplift model concentrates
    the persuadables at the top.
    """
    df_sorted = df.sort_values(uplift_col, ascending=False).copy()
    df_sorted["decile"] = pd.qcut(
        range(len(df_sorted)), q=n_bins, labels=False
    )

    results: list[dict] = []
    for decile in range(n_bins):
        sub = df_sorted[df_sorted["decile"] == decile]
        n_total = len(sub)
        treated = sub[sub[treatment_col] == 1]
        control = sub[sub[treatment_col] == 0]

        if len(treated) > 0 and len(control) > 0:
            actual_uplift = treated[outcome_col].mean() - control[outcome_col].mean()
        else:
            actual_uplift = 0.0

        results.append({
            "decile": decile + 1,
            "n": n_total,
            "predicted_uplift": sub[uplift_col].mean(),
            "actual_uplift": actual_uplift,
        })

    return pd.DataFrame(results)


uplift_eval = evaluate_uplift(df)
print("\n=== Uplift by Decile ===")
print(f"  {'Decile':>7} {'N':>6} {'Pred Uplift':>12} {'Actual Uplift':>14}")
print(f"  {'-'*42}")
for _, row in uplift_eval.iterrows():
    print(f"  {row['decile']:>7.0f} {row['n']:>6.0f} "
          f"{row['predicted_uplift']:>12.3f} {row['actual_uplift']:>14.3f}")

# Targeting decision
top_deciles = uplift_eval[uplift_eval["actual_uplift"] > 0.02]
target_pct = len(top_deciles) / len(uplift_eval) * 100
print(f"\n  Recommendation: target top {target_pct:.0f}% of customers")
print(f"  These deciles have positive actual uplift > 2 pp.")
print(f"  Treating others wastes budget (neutral) or hurts (sleeping dogs).")

When to Use Uplift Modeling

Uplift modeling is most valuable when:

Treatment is costly. If sending an email costs nothing, you might as well send it to everyone (unless sleeping dogs exist). If treatment costs $50 per person or requires a sales rep’s time, targeting matters.
Treatment effects are heterogeneous. If the treatment helps everyone equally, you do not need individual-level targeting. But in practice, effects almost always vary by customer characteristics.
You have experimental data. Uplift modeling works best with randomized treatment assignment. Without randomization, you first need the propensity score methods from Section 9.1 to deconfound, then you estimate heterogeneous effects — adding complexity and assumptions.
You have sufficient sample size. You are estimating a difference between two conditional expectations ($\hat{\mu}_1(X) - \hat{\mu}_0(X)$). Each model needs enough data in its training set, and the difference amplifies noise from both.

The Decision Framework

Putting it together, here is how causal inference methods connect to business decisions:

Is the intervention worth doing at all? → Estimate the ATE using DiD, propensity score methods, or synthetic control. If the ATE is zero or negative, stop here.
Whom should we target? → Estimate CATE (Conditional Average Treatment Effect) using uplift modeling. Rank by uplift, target persuadables.
How confident are we? → Report confidence intervals, run placebo tests, check assumptions. A precise but wrong estimate is worse than an imprecise but honest one.

No method in this chapter gives you certainty. Every method rests on assumptions that could be wrong. The discipline of causal inference is not about finding the “right” technique — it is about being explicit about what you are assuming, testing those assumptions where possible, and communicating the limits of your conclusions alongside the conclusions themselves.

The alternative — treating correlations as causes and optimizing against spurious associations — is not just theoretically wrong. It leads to real decisions that waste real money and sometimes cause real harm. The tools in this chapter will not make causal inference easy, but they will make it possible — and that is a significant upgrade over the default.