Difference-in-Differences and Uplift Modeling
SummaryPropensity score methods require measuring every confounder —...
Propensity score methods require measuring every confounder —...
Propensity score methods require measuring every confounder — a strong assumption that frequently fails. This section covers two approaches that sidestep that requirement and one that takes causal inference to its most actionable conclusion. Difference-in-differences (DiD) exploits panel data and the parallel trends assumption to cancel out time-invariant confounders, making it the workhorse of policy evaluation in economics and increasingly in tech. Synthetic control extends DiD to single-treated-unit settings by constructing a weighted donor pool. Uplift modeling shifts the question from 'does treatment work on average?' to 'which individuals should we treat?' — using T-learners to estimate conditional treatment effects and target the persuadables while ignoring sure things and lost causes. Code examples cover DiD regression with OLS, parallel trends validation, synthetic control construction with placebo tests, and T-learner uplift scoring for targeted marketing.
Difference-in-Differences and Uplift Modeling
9.3 — Difference-in-Differences
Propensity score methods assume you have measured every confounder. That is a strong claim, and you often cannot defend it. Difference-in-differences (DiD) takes a different approach: instead of controlling for confounders directly, it cancels them out by using the change in outcomes over time.
The Setup
You have two groups: one that receives a treatment (e.g., a new policy, a product change) and one that does not. You observe both groups before and after the treatment. The treatment effect is estimated as:
$$\text{DiD} = (\bar{Y}{\text{treated, after}} - \bar{Y}{\text{treated, before}}) - (\bar{Y}{\text{control, after}} - \bar{Y}{\text{control, before}})$$
The first difference removes time-invariant confounders for each group. The second difference removes common time trends. What remains — if the assumptions hold — is the causal effect of the treatment.
The Parallel Trends Assumption
DiD requires one critical assumption: absent treatment, the treated and control groups would have followed the same trajectory over time. They do not need the same level of outcomes — the treated group can start higher or lower — but the change over time must be parallel.
This assumption is partially testable. You can check whether the groups moved in parallel during the pre-treatment period. If pre-treatment trends diverge, DiD is not credible. If they track closely, the assumption becomes more plausible — though it is never provable, because you cannot observe the counterfactual post-treatment trajectory.
DiD with OLS Regression
The standard way to estimate DiD is with a linear regression including a treatment indicator, a post-period indicator, and their interaction. The interaction coefficient is the DiD estimate.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from scipy import stats
rng = np.random.default_rng(42)
def simulate_did_data(
n_units: int = 200,
n_pre_periods: int = 6,
n_post_periods: int = 6,
true_effect: float = 15.0,
) -> pd.DataFrame:
"""Simulate panel data for difference-in-differences.
Scenario: A retail chain rolls out a loyalty program in 100 stores
(treated) while 100 stores serve as control. We observe monthly
sales per store before and after the program launch.
"""
n_treated = n_units // 2
total_periods = n_pre_periods + n_post_periods
records: list[dict] = []
for unit in range(n_units):
is_treated = int(unit < n_treated)
# Unit-specific baseline (time-invariant confounder — canceled by DiD)
unit_baseline = 100 + rng.normal(0, 20)
if is_treated:
unit_baseline += 10 # Treated stores start higher (confounder!)
for period in range(total_periods):
is_post = int(period >= n_pre_periods)
# Common time trend (both groups follow)
time_trend = 2.0 * period
# Treatment effect: only for treated units in post-period
treatment_effect = true_effect * is_treated * is_post
# Noise
noise = rng.normal(0, 8)
sales = unit_baseline + time_trend + treatment_effect + noise
records.append({
"unit": unit,
"period": period,
"treated": is_treated,
"post": is_post,
"sales": sales,
})
return pd.DataFrame(records)
df = simulate_did_data()
# Naive comparison: just compare treated vs control in post-period
post = df[df["post"] == 1]
naive = (
post[post["treated"] == 1]["sales"].mean()
- post[post["treated"] == 0]["sales"].mean()
)
print(f"Naive post-period comparison: {naive:.1f}")
print(f" (Biased: includes pre-existing level difference)\n")
# DiD estimation via OLS
# Y = β0 + β1*treated + β2*post + β3*(treated × post) + ε
# β3 is the DiD estimate
df["treated_x_post"] = df["treated"] * df["post"]
X = df[["treated", "post", "treated_x_post"]].values
y = df["sales"].values
# Use statsmodels-style OLS for proper standard errors
X_with_const = np.column_stack([np.ones(len(X)), X])
beta = np.linalg.lstsq(X_with_const, y, rcond=None)[0]
residuals = y - X_with_const @ beta
n, k = X_with_const.shape
mse = np.sum(residuals**2) / (n - k)
var_beta = mse * np.linalg.inv(X_with_const.T @ X_with_const)
se_beta = np.sqrt(np.diag(var_beta))
did_estimate = beta[3]
did_se = se_beta[3]
t_stat = did_estimate / did_se
p_value = 2 * (1 - stats.t.cdf(abs(t_stat), df=n - k))
ci_lower = did_estimate - 1.96 * did_se
ci_upper = did_estimate + 1.96 * did_se
print(f"=== Difference-in-Differences ===")
print(f" DiD estimate (β3): {did_estimate:.2f}")
print(f" Standard error: {did_se:.2f}")
print(f" 95% CI: [{ci_lower:.2f}, {ci_upper:.2f}]")
print(f" p-value: {p_value:.4f}")
print(f" True effect: 15.0")
print(f"\n Coefficient interpretation:")
print(f" β0 (intercept): {beta[0]:.2f} (control group, pre-period mean)")
print(f" β1 (treated): {beta[1]:.2f} (level difference, not causal)")
print(f" β2 (post): {beta[2]:.2f} (common time change)")
print(f" β3 (DiD): {beta[3]:.2f} (causal effect estimate)")
The naive comparison overestimates because treated stores were already higher-performing (a confounder). DiD strips out that level difference and isolates the treatment effect.
Checking Parallel Trends
The most important diagnostic for DiD is whether the treatment and control groups followed similar trajectories before the intervention. You check this visually and with a formal test.
def check_parallel_trends(df: pd.DataFrame) -> None:
"""Check the parallel trends assumption using pre-treatment data.
Compute group means by period. In the pre-treatment window,
the difference between groups should be roughly constant.
"""
group_means = (
df.groupby(["period", "treated"])["sales"]
.mean()
.unstack("treated")
)
group_means.columns = ["control", "treated"]
group_means["difference"] = group_means["treated"] - group_means["control"]
pre_treatment = group_means[group_means.index < 6]
post_treatment = group_means[group_means.index >= 6]
print("=== Parallel Trends Check ===")
print("\nPre-treatment group differences (should be ~constant):")
for period, row in pre_treatment.iterrows():
print(f" Period {period}: control={row['control']:.1f}, "
f"treated={row['treated']:.1f}, diff={row['difference']:.1f}")
# Formal test: regress pre-treatment difference on time
# A significant slope means trends are NOT parallel
pre_periods = pre_treatment.index.values.astype(float)
pre_diffs = pre_treatment["difference"].values
if len(pre_periods) > 2:
slope, intercept, r_value, p_value, std_err = stats.linregress(
pre_periods, pre_diffs
)
print(f"\n Pre-trend slope: {slope:.3f} (p={p_value:.3f})")
if p_value < 0.05:
print(" WARNING: Pre-treatment trends are NOT parallel.")
print(" DiD estimate may be biased.")
else:
print(" Pre-treatment trends are approximately parallel. ✓")
print("\nPost-treatment (treatment effect visible):")
for period, row in post_treatment.iterrows():
print(f" Period {period}: control={row['control']:.1f}, "
f"treated={row['treated']:.1f}, diff={row['difference']:.1f}")
check_parallel_trends(df)
If pre-treatment trends diverge, you have three options: (1) find a better control group, (2) add unit-specific time trends to the regression, or (3) abandon DiD and use a different method. Forcing DiD onto data with non-parallel trends produces confident-looking but unreliable estimates.
9.4 — Synthetic Control
DiD works when you have many treated and control units. But what if only one unit received treatment? A single country enacted a policy. A single city launched a program. A single product line changed pricing. With $n=1$ in the treated group, standard DiD cannot estimate treatment effects.
Synthetic control solves this by constructing a synthetic version of the treated unit from a weighted combination of untreated units. The weights are chosen so that the synthetic unit matches the treated unit’s pre-treatment trajectory as closely as possible. After the treatment, the gap between the real and synthetic unit estimates the treatment effect.
from scipy.optimize import minimize
def synthetic_control(
panel: pd.DataFrame,
treated_unit: int,
intervention_period: int,
outcome_col: str = "outcome",
unit_col: str = "unit",
time_col: str = "period",
) -> dict:
"""Estimate treatment effect using the synthetic control method.
Finds weights for donor (untreated) units such that the weighted
combination best matches the treated unit's pre-treatment outcomes.
"""
units = panel[unit_col].unique()
donor_units = [u for u in units if u != treated_unit]
periods = sorted(panel[time_col].unique())
pre_periods = [p for p in periods if p < intervention_period]
post_periods = [p for p in periods if p >= intervention_period]
# Build outcome matrices: rows = time, columns = units
def get_series(unit: int) -> np.ndarray:
s = panel[panel[unit_col] == unit].sort_values(time_col)
return s[outcome_col].values
treated_series = get_series(treated_unit)
donor_matrix = np.column_stack([get_series(u) for u in donor_units])
n_pre = len(pre_periods)
# Pre-treatment fit: minimize squared distance
def objective(weights: np.ndarray) -> float:
synthetic = donor_matrix[:n_pre] @ weights
return float(np.sum((treated_series[:n_pre] - synthetic) ** 2))
n_donors = len(donor_units)
# Constraints: weights sum to 1, all non-negative
constraints = {"type": "eq", "fun": lambda w: np.sum(w) - 1.0}
bounds = [(0.0, 1.0)] * n_donors
x0 = np.ones(n_donors) / n_donors
result = minimize(objective, x0, bounds=bounds, constraints=constraints,
method="SLSQP")
weights = result.x
# Synthetic unit: full time series
synthetic_series = donor_matrix @ weights
# Treatment effect: gap between real and synthetic in post-period
post_gaps = treated_series[n_pre:] - synthetic_series[n_pre:]
pre_gaps = treated_series[:n_pre] - synthetic_series[:n_pre]
return {
"weights": dict(zip(donor_units, weights)),
"pre_rmse": float(np.sqrt(np.mean(pre_gaps**2))),
"post_treatment_effects": post_gaps.tolist(),
"average_effect": float(np.mean(post_gaps)),
"top_donors": sorted(
zip(donor_units, weights), key=lambda x: -x[1]
)[:5],
}
# Simulate state-level policy data
n_states = 20
n_periods = 20
intervention = 12
true_effect = 8.0
records: list[dict] = []
state_baselines = rng.normal(50, 10, n_states)
for state in range(n_states):
for period in range(n_periods):
trend = 0.5 * period + rng.normal(0, 1)
effect = true_effect if (state == 0 and period >= intervention) else 0
outcome = state_baselines[state] + trend + effect + rng.normal(0, 2)
records.append({
"unit": state, "period": period, "outcome": outcome
})
panel = pd.DataFrame(records)
sc_result = synthetic_control(panel, treated_unit=0,
intervention_period=intervention)
print("=== Synthetic Control ===")
print(f" Pre-treatment RMSE: {sc_result['pre_rmse']:.2f}")
print(f" (Lower = better pre-treatment fit)")
print(f"\n Average post-treatment effect: {sc_result['average_effect']:.2f}")
print(f" True effect: {true_effect}")
print(f"\n Top donor weights:")
for unit, weight in sc_result["top_donors"]:
if weight > 0.01:
print(f" State {unit}: {weight:.3f}")
Placebo tests are essential for validating synthetic control. Apply the same method to each untreated unit, pretending it was treated. If the estimated effect for the actually-treated unit is much larger than the placebo effects, you have evidence that the effect is real. If several untreated units show similarly large “effects,” your method is picking up noise, not signal.
9.5 — Uplift Modeling
Every method so far answers the same question: what is the average treatment effect? Uplift modeling asks a more actionable question: which individuals benefit from treatment, and by how much?
This distinction matters enormously in practice. A marketing campaign with a positive average effect might be wasting budget on customers who would have bought anyway (“sure things”) while failing to reach customers who only buy when prompted (“persuadables”). An average treatment effect of +5% could mask the fact that 80% of treated customers had zero response and 20% had a +25% response. Targeting those 20% generates the same revenue at one-fifth the cost.
The Four Customer Segments
When you cross treatment (yes/no) with response (yes/no), customers fall into four segments:
| Buys if treated | Does not buy if treated | |
|---|---|---|
| Buys if not treated | Sure Things (treatment is wasted) | Sleeping Dogs (treatment hurts!) |
| Does not buy if not treated | Persuadables (target these) | Lost Causes (nothing works) |
You want to spend your budget on the persuadables — those with high uplift (positive difference between treated and untreated outcomes). You want to avoid sleeping dogs — customers who would have bought but are turned off by the treatment (negative uplift). Sure things and lost causes get zero incremental value from treatment.
The problem: you never observe both cells for the same customer. This is the fundamental problem of causal inference again, now at the individual level.
The T-Learner
The T-learner is the simplest approach to estimating individual-level treatment effects. Train two separate models: one on treated units ($\hat{\mu}_1(X)$) and one on control units ($\hat{\mu}_0(X)$). The estimated uplift for an individual is:
$$\hat{\tau}(X) = \hat{\mu}_1(X) - \hat{\mu}_0(X)$$
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_predict
def simulate_uplift_data(n: int = 10_000) -> pd.DataFrame:
"""Simulate marketing campaign data with heterogeneous treatment effects.
Some customers are persuadable, some are sure things,
some are lost causes, and some are sleeping dogs.
"""
age = rng.normal(40, 12, n).clip(18, 75)
income = rng.normal(55_000, 18_000, n).clip(15_000, 130_000)
recency = rng.exponential(30, n).clip(1, 365) # Days since last purchase
frequency = rng.poisson(4, n) # Past purchases
# Random treatment assignment (like an actual experiment)
treatment = rng.binomial(1, 0.5, n)
# Base purchase probability (without treatment)
logit_base = (
-1.5
+ 0.02 * (age - 40)
+ 0.00001 * (income - 55_000)
- 0.005 * recency
+ 0.1 * frequency
)
# HETEROGENEOUS treatment effect:
# - Young, low-frequency customers: highly persuadable (+20 pp)
# - Old, high-frequency customers: sure things (0 pp uplift)
# - Mid-age, high-recency: sleeping dogs (-5 pp, annoyed by spam)
uplift_logit = (
0.8
- 0.015 * (age - 30)
- 0.08 * frequency
+ 0.003 * recency
+ 0.000005 * (income - 55_000)
)
logit_treated = logit_base + uplift_logit * treatment
prob = 1 / (1 + np.exp(-logit_treated))
purchase = rng.binomial(1, prob)
return pd.DataFrame({
"age": age, "income": income, "recency": recency,
"frequency": frequency, "treatment": treatment,
"purchase": purchase,
})
df = simulate_uplift_data()
# Average treatment effect (for reference)
ate = (
df[df["treatment"] == 1]["purchase"].mean()
- df[df["treatment"] == 0]["purchase"].mean()
)
print(f"Average Treatment Effect: {ate:.3f}")
print(f" (Positive on average, but varies hugely by individual)\n")
# T-Learner: separate models for treated and control
features = ["age", "income", "recency", "frequency"]
X = df[features].values
treated_mask = df["treatment"] == 1
control_mask = df["treatment"] == 0
model_treated = GradientBoostingClassifier(
n_estimators=100, max_depth=4, random_state=42
)
model_control = GradientBoostingClassifier(
n_estimators=100, max_depth=4, random_state=42
)
model_treated.fit(X[treated_mask], df.loc[treated_mask, "purchase"])
model_control.fit(X[control_mask], df.loc[control_mask, "purchase"])
# Predict uplift: P(purchase | treated) - P(purchase | control)
prob_treated = model_treated.predict_proba(X)[:, 1]
prob_control = model_control.predict_proba(X)[:, 1]
df["uplift"] = prob_treated - prob_control
# Segment customers
df["segment"] = pd.cut(
df["uplift"],
bins=[-np.inf, -0.02, 0.02, np.inf],
labels=["sleeping_dogs", "neutral", "persuadable"],
)
print("=== Customer Segments ===")
for segment in ["persuadable", "neutral", "sleeping_dogs"]:
sub = df[df["segment"] == segment]
actual_uplift = (
sub[sub["treatment"] == 1]["purchase"].mean()
- sub[sub["treatment"] == 0]["purchase"].mean()
)
print(f" {segment:>15}: n={len(sub):>5}, "
f"predicted_uplift={sub['uplift'].mean():.3f}, "
f"actual_uplift={actual_uplift:.3f}")
Uplift Scoring and Targeting
The T-learner gives every customer an uplift score. You rank by uplift and target from the top. The uplift curve measures cumulative incremental gain as you increase the fraction of customers targeted — the causal analogue of a lift curve.
def evaluate_uplift(
df: pd.DataFrame,
uplift_col: str = "uplift",
treatment_col: str = "treatment",
outcome_col: str = "purchase",
n_bins: int = 10,
) -> pd.DataFrame:
"""Evaluate uplift model by computing actual uplift within deciles.
Sort by predicted uplift, bin into deciles, and compute actual
treatment effect within each bin. A good uplift model concentrates
the persuadables at the top.
"""
df_sorted = df.sort_values(uplift_col, ascending=False).copy()
df_sorted["decile"] = pd.qcut(
range(len(df_sorted)), q=n_bins, labels=False
)
results: list[dict] = []
for decile in range(n_bins):
sub = df_sorted[df_sorted["decile"] == decile]
n_total = len(sub)
treated = sub[sub[treatment_col] == 1]
control = sub[sub[treatment_col] == 0]
if len(treated) > 0 and len(control) > 0:
actual_uplift = treated[outcome_col].mean() - control[outcome_col].mean()
else:
actual_uplift = 0.0
results.append({
"decile": decile + 1,
"n": n_total,
"predicted_uplift": sub[uplift_col].mean(),
"actual_uplift": actual_uplift,
})
return pd.DataFrame(results)
uplift_eval = evaluate_uplift(df)
print("\n=== Uplift by Decile ===")
print(f" {'Decile':>7} {'N':>6} {'Pred Uplift':>12} {'Actual Uplift':>14}")
print(f" {'-'*42}")
for _, row in uplift_eval.iterrows():
print(f" {row['decile']:>7.0f} {row['n']:>6.0f} "
f"{row['predicted_uplift']:>12.3f} {row['actual_uplift']:>14.3f}")
# Targeting decision
top_deciles = uplift_eval[uplift_eval["actual_uplift"] > 0.02]
target_pct = len(top_deciles) / len(uplift_eval) * 100
print(f"\n Recommendation: target top {target_pct:.0f}% of customers")
print(f" These deciles have positive actual uplift > 2 pp.")
print(f" Treating others wastes budget (neutral) or hurts (sleeping dogs).")
When to Use Uplift Modeling
Uplift modeling is most valuable when:
- Treatment is costly. If sending an email costs nothing, you might as well send it to everyone (unless sleeping dogs exist). If treatment costs $50 per person or requires a sales rep’s time, targeting matters.
- Treatment effects are heterogeneous. If the treatment helps everyone equally, you do not need individual-level targeting. But in practice, effects almost always vary by customer characteristics.
- You have experimental data. Uplift modeling works best with randomized treatment assignment. Without randomization, you first need the propensity score methods from Section 9.1 to deconfound, then you estimate heterogeneous effects — adding complexity and assumptions.
- You have sufficient sample size. You are estimating a difference between two conditional expectations ($\hat{\mu}_1(X) - \hat{\mu}_0(X)$). Each model needs enough data in its training set, and the difference amplifies noise from both.
The Decision Framework
Putting it together, here is how causal inference methods connect to business decisions:
- Is the intervention worth doing at all? → Estimate the ATE using DiD, propensity score methods, or synthetic control. If the ATE is zero or negative, stop here.
- Whom should we target? → Estimate CATE (Conditional Average Treatment Effect) using uplift modeling. Rank by uplift, target persuadables.
- How confident are we? → Report confidence intervals, run placebo tests, check assumptions. A precise but wrong estimate is worse than an imprecise but honest one.
No method in this chapter gives you certainty. Every method rests on assumptions that could be wrong. The discipline of causal inference is not about finding the “right” technique — it is about being explicit about what you are assuming, testing those assumptions where possible, and communicating the limits of your conclusions alongside the conclusions themselves.
The alternative — treating correlations as causes and optimizing against spurious associations — is not just theoretically wrong. It leads to real decisions that waste real money and sometimes cause real harm. The tools in this chapter will not make causal inference easy, but they will make it possible — and that is a significant upgrade over the default.