Feedback Loops and Safe Deployment
Drift detection tells you that the world has changed. Feedback loops tell you whether your model's predictions were actually correct. This section addresses the ground truth delay — the days, weeks, or months between prediction and outcome — and builds a feedback pipeline that pairs every prediction with its eventual result. Three retraining strategies are compared: schedule-based (predictable but wasteful), drift-triggered (reactive but sometimes too late), and performance-triggered (precise but requires ground truth). The retraining trap — blindly appending new data without recency weighting — is exposed with a concrete failure scenario. The second half covers safe deployment: shadow deployments that compare models with zero user impact, canary releases that route a small fraction of traffic to the new model and expand gradually, feature flags for instant toggling, and automatic rollback based on health metrics. The chapter closes with the complete ML lifecycle and the argument that the best ML systems are not the ones with the fanciest models — they are the ones that degrade gracefully and recover automatically.
11.3 — Feedback Loops and Retraining
Drift detection tells you the inputs have changed. But the question you actually care about is: are the outputs still correct? Answering that requires ground truth — the actual outcome that your model was trying to predict.
The Ground Truth Delay
In some applications, ground truth arrives quickly. A user clicks or does not click on a recommendation within minutes. A spam classifier’s prediction is confirmed or overridden within hours. In these settings, you can build tight feedback loops and detect model degradation almost in real time.
In most applications, ground truth is delayed. A loan default prediction is not resolved for 3–5 years. A churn prediction is validated over 90 days. A medical diagnosis is confirmed after lab results, follow-up appointments, and specialist referrals. A demand forecast is validated when the quarter ends.
During the delay, your model is making predictions you cannot evaluate. This is not a problem you can engineer away — it is a fundamental constraint of the domain. What you can do is build the infrastructure that collects, stores, and matches ground truth when it eventually arrives.
Building the Feedback Pipeline
Every prediction your model makes should be logged with enough context to match it against the eventual outcome:
import json
import sqlite3
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class PredictionRecord:
    prediction_id: str
    model_version: str
    timestamp: str  # assumed ISO 8601 in UTC, so text order matches time order
    features: dict[str, float]
    prediction: float
    confidence: float
    ground_truth: float | None = None
    ground_truth_timestamp: str | None = None
    latency_ms: float = 0.0


class FeedbackStore:
    """
    Stores predictions and matches them with ground truth outcomes.

    Uses SQLite for simplicity. In production, you would use a proper
    analytics database (Postgres, BigQuery, etc.), but the schema is
    the same.
    """

    def __init__(self, db_path: str = "feedback.db") -> None:
        self.conn = sqlite3.connect(db_path)
        self._create_tables()

    def _create_tables(self) -> None:
        self.conn.execute("""
            CREATE TABLE IF NOT EXISTS predictions (
                prediction_id TEXT PRIMARY KEY,
                model_version TEXT NOT NULL,
                timestamp TEXT NOT NULL,
                features TEXT NOT NULL,
                prediction REAL NOT NULL,
                confidence REAL NOT NULL,
                ground_truth REAL,
                ground_truth_timestamp TEXT,
                latency_ms REAL DEFAULT 0.0
            )
        """)
        self.conn.commit()

    def log_prediction(self, record: PredictionRecord) -> None:
        # Features are stored as a JSON string; ground truth is filled in later.
        self.conn.execute(
            """INSERT INTO predictions
               (prediction_id, model_version, timestamp, features,
                prediction, confidence, latency_ms)
               VALUES (?, ?, ?, ?, ?, ?, ?)""",
            (record.prediction_id, record.model_version,
             record.timestamp, json.dumps(record.features),
             record.prediction, record.confidence, record.latency_ms),
        )
        self.conn.commit()

    def record_outcome(
        self, prediction_id: str, ground_truth: float,
    ) -> None:
        """Called when ground truth becomes available, possibly days later."""
        now = datetime.now(timezone.utc).isoformat()
        cursor = self.conn.execute(
            """UPDATE predictions
               SET ground_truth = ?, ground_truth_timestamp = ?
               WHERE prediction_id = ?""",
            (ground_truth, now, prediction_id),
        )
        # An outcome that matches no logged prediction should not pass silently.
        if cursor.rowcount == 0:
            raise KeyError(f"Unknown prediction_id: {prediction_id}")
        self.conn.commit()

    def get_recent_performance(
        self, model_version: str, limit: int = 1000,
    ) -> list[PredictionRecord]:
        """
        Retrieve recent predictions that have ground truth,
        for performance evaluation.
        """
        cursor = self.conn.execute(
            """SELECT prediction_id, model_version, timestamp, features,
                      prediction, confidence, ground_truth,
                      ground_truth_timestamp, latency_ms
               FROM predictions
               WHERE model_version = ? AND ground_truth IS NOT NULL
               ORDER BY timestamp DESC LIMIT ?""",
            (model_version, limit),
        )
        results = []
        for row in cursor:
            results.append(PredictionRecord(
                prediction_id=row[0], model_version=row[1],
                timestamp=row[2], features=json.loads(row[3]),
                prediction=row[4], confidence=row[5],
                ground_truth=row[6], ground_truth_timestamp=row[7],
                latency_ms=row[8],
            ))
        return results
The critical design choice: log predictions and outcomes in the same table, matched by prediction_id. When ground truth arrives, you update the record. When you need to evaluate model performance over a time window, you query for records where ground_truth IS NOT NULL.
Retraining Triggers: When to Pull the Lever
Three strategies, each with trade-offs:
Schedule-based retraining. Retrain every week, every month, or every quarter regardless of whether anything has changed. Predictable and easy to implement. Wasteful when the data distribution is stable — you spend compute for no improvement. Insufficient when drift is rapid — you serve a degraded model for the entire interval between retrains.
Drift-triggered retraining. Retrain when the drift detectors from Section 11.2 fire. Responsive to actual distribution changes. But drift does not always degrade performance — a feature distribution can shift without affecting prediction quality. You may retrain unnecessarily, or the drift detector may fire too late.
Performance-triggered retraining. Retrain when actual model performance on production data with ground truth drops below a threshold. The most precise trigger. But it depends on ground truth being available, which can take days, months, or (for outcomes such as loan defaults) years. By the time you know performance has degraded, you have been serving bad predictions for the entire ground truth delay.
The pragmatic approach: combine drift-triggered and performance-triggered. Use drift detection as an early warning — it fires fast but may be noisy. Use performance monitoring as confirmation — it is slow but definitive. Retrain when drift is detected and confirmed by a performance drop, or when performance drops regardless of whether drift was detected.
from dataclasses import dataclass


@dataclass
class RetrainingDecision:
    should_retrain: bool
    reason: str
    urgency: str  # "routine", "elevated", "critical"


def evaluate_retraining_need(
    drift_detected: bool,
    drift_severity: float,
    performance_current: float,
    performance_baseline: float,
    performance_threshold: float = 0.05,
    critical_threshold: float = 0.10,
) -> RetrainingDecision:
    """
    Decide whether to trigger retraining based on drift and performance.

    Args:
        drift_detected: Whether the drift detector fired.
        drift_severity: Maximum PSI across features (0 = no drift).
        performance_current: Current model metric (e.g., F1 on recent data).
            Higher is assumed to be better.
        performance_baseline: Metric at time of training (must be positive).
        performance_threshold: Relative drop that triggers retraining.
        critical_threshold: Relative drop that triggers urgent retraining.
    """
    if performance_baseline <= 0:
        raise ValueError("performance_baseline must be positive")
    perf_drop = (performance_baseline - performance_current) / performance_baseline

    # A large drop triggers retraining regardless of the drift signal.
    if perf_drop >= critical_threshold:
        return RetrainingDecision(
            should_retrain=True,
            reason=f"Performance dropped {perf_drop:.1%} from baseline "
                   f"({performance_baseline:.3f} -> {performance_current:.3f}). "
                   f"Exceeds critical threshold of {critical_threshold:.1%}.",
            urgency="critical",
        )
    # Drift confirmed by a performance drop.
    if drift_detected and perf_drop >= performance_threshold:
        return RetrainingDecision(
            should_retrain=True,
            reason=f"Drift detected (PSI={drift_severity:.3f}) and performance "
                   f"dropped {perf_drop:.1%}. Both signals confirm degradation.",
            urgency="elevated",
        )
    # Performance dropped although no drift detector fired.
    if perf_drop >= performance_threshold:
        return RetrainingDecision(
            should_retrain=True,
            reason=f"Performance dropped {perf_drop:.1%} without detected drift. "
                   f"Retrain and review drift detector coverage.",
            urgency="elevated",
        )
    # Drift alone is an early warning, not a trigger.
    if drift_detected:
        return RetrainingDecision(
            should_retrain=False,
            reason=f"Drift detected (PSI={drift_severity:.3f}) but performance "
                   f"is stable (drop={perf_drop:.1%}). Monitor closely.",
            urgency="routine",
        )
    return RetrainingDecision(
        should_retrain=False,
        reason="No drift detected, performance within baseline.",
        urgency="routine",
    )
The Retraining Trap
Your model’s performance has degraded. You collect the latest three months of production data, append it to the original training set, and retrain. The new model performs worse.
This happens because more data is not better data when the distribution has shifted. If user behavior changed in month two, your training set now contains the original training data from six months ago, one month of production data that still follows the old distribution, and two months of transitional and new-distribution data. The model tries to fit all of these distributions at once and fits none of them well.
The fix is recency weighting. Give recent data higher weight during training, so the model emphasizes the current distribution while retaining general patterns from historical data. Exponential decay is a common approach: for example, observations from last week get a weight near 1.0, last month about 0.7, and last quarter about 0.3.
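One way to compute such weights is sketched below. The helper name `recency_weights` and the 60-day half-life are assumptions chosen for illustration; the half-life was picked only because it roughly reproduces the example weights in the text, and a real value should be tuned against validation data.

```python
import math
from datetime import date, timedelta


def recency_weights(
    observed: list[date], today: date, half_life_days: float = 60.0,
) -> list[float]:
    """Exponential-decay weights: 1.0 for today, 0.5 at one half-life."""
    decay = math.log(2) / half_life_days
    # Clamp negative ages (future-dated rows) to zero so weights never exceed 1.
    return [math.exp(-decay * max((today - d).days, 0)) for d in observed]


today = date(2024, 7, 1)
ages = {"last week": 7, "last month": 30, "last quarter": 90}
dates = [today - timedelta(days=age) for age in ages.values()]
weights = recency_weights(dates, today)
for label, weight in zip(ages, weights):
    print(f"{label}: {weight:.2f}")
```

Many scikit-learn estimators accept a `sample_weight` argument to `fit`, which is one place weights like these can be passed.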
Active Learning and Human-in-the-Loop
When you have a limited budget for collecting labeled data, spend it on the examples your model is most uncertain about. This is active learning: instead of labeling random production samples, query the predictions where the model’s confidence was lowest. A calibrated binary classifier that outputs 0.51 is wrong on roughly half of such cases, so each label there is highly informative. A prediction of 0.99 is rarely wrong, so labeling it provides far less marginal information.
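A minimal sketch of this selection rule (often called uncertainty sampling) for a binary classifier follows. The function name `select_for_labeling` and the example scores are hypothetical; the only assumption about the model is that it outputs a probability between 0 and 1.

```python
def select_for_labeling(scores: dict[str, float], budget: int) -> list[str]:
    """Return the `budget` prediction IDs whose score is closest to 0.5."""
    ranked = sorted(scores, key=lambda pid: abs(scores[pid] - 0.5))
    return ranked[:budget]


# Made-up predicted probabilities keyed by prediction ID.
scores = {"a": 0.99, "b": 0.51, "c": 0.10, "d": 0.45, "e": 0.70}
queue = select_for_labeling(scores, budget=2)
print(queue)  # ['b', 'd']
```

In practice it is common to mix in a small share of random samples as well, so that the labeled set still supports an unbiased performance estimate.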
Sometimes the model is not uncertain — it is confidently wrong. This happens when the production data contains patterns the training data did not cover at all. No amount of retraining on the existing label space will fix a model that encounters a genuinely new category. This is when you need human-in-the-loop review: flag predictions that fall in low-density regions of the feature space and route them to domain experts for manual review and labeling.
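One simple proxy for "low-density region" is the distance from a new input to its k-th nearest training example. The sketch below uses that proxy; the function names, `k`, and the distance threshold are all illustrative assumptions, and it presumes the features are already on comparable scales. It is a brute-force version meant to show the idea, not an approach that scales to large training sets.

```python
import math


def kth_neighbor_distance(
    point: list[float], reference: list[list[float]], k: int,
) -> float:
    """Euclidean distance from `point` to its k-th nearest reference row."""
    distances = sorted(math.dist(point, row) for row in reference)
    return distances[k - 1]


def needs_human_review(
    point: list[float], reference: list[list[float]],
    k: int = 3, threshold: float = 1.0,
) -> bool:
    """Flag inputs that sit far from the training data (low density)."""
    return kth_neighbor_distance(point, reference, k) > threshold


# Made-up 2-D training sample clustered near the origin.
training = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [0.3, 0.3], [0.1, 0.1]]
familiar = needs_human_review([0.15, 0.15], training)  # inside the cluster
novel = needs_human_review([5.0, 5.0], training)       # far from anything seen
print(familiar, novel)
```

A reasonable way to set the threshold is to compute the same k-th neighbor distance for held-out training rows and take a high percentile of that distribution.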
11.4 — Shadow and Canary Deployments
You have retrained your model. It performs better on your evaluation set. Congratulations — you have established that the new model is better on a sample. Production traffic is the population, and the population has surprises that samples do not.
Deploying the new model to 100% of traffic immediately is gambling. If it is better, you win fast. If it is worse, every user sees degraded predictions until someone notices and rolls back. The blast radius is 100%.
Shadow Deployments: Risk-Free Comparison
In a shadow deployment, the new model runs alongside the production model. Both models receive every request. The production model’s prediction is returned to the user. The shadow model’s prediction is logged but discarded. There is zero user impact.
import logging
import time
from dataclasses import dataclass
from typing import Protocol

logger = logging.getLogger(__name__)


class Model(Protocol):
    def predict(self, features: dict[str, float]) -> float: ...

    @property
    def version(self) -> str: ...


@dataclass
class ShadowResult:
    production_prediction: float
    shadow_prediction: float
    production_latency_ms: float
    shadow_latency_ms: float
    agreement: bool


class ShadowDeployment:
    """
    Run two models on every request. Return the production model's
    prediction. Log the shadow model's prediction for comparison.

    When you have enough shadow data, compare:
    - Prediction agreement rate
    - Distribution of prediction differences
    - Latency comparison
    - If ground truth is available: actual performance comparison
    """

    def __init__(
        self, production_model: Model, shadow_model: Model,
        agreement_threshold: float = 0.05,
    ) -> None:
        self.production = production_model
        self.shadow = shadow_model
        self.agreement_threshold = agreement_threshold
        self.results: list[ShadowResult] = []

    def predict(self, features: dict[str, float]) -> float:
        """Return production prediction, log shadow prediction."""
        # Production model: this is what the user sees
        start = time.perf_counter()
        prod_pred = self.production.predict(features)
        prod_latency = (time.perf_counter() - start) * 1000

        # Shadow model: this is logged but not returned
        try:
            start = time.perf_counter()
            shadow_pred = self.shadow.predict(features)
            shadow_latency = (time.perf_counter() - start) * 1000
        except Exception:
            logger.exception(
                "Shadow model failed; production unaffected",
            )
            return prod_pred

        agreement = abs(prod_pred - shadow_pred) < self.agreement_threshold
        result = ShadowResult(
            production_prediction=prod_pred,
            shadow_prediction=shadow_pred,
            production_latency_ms=prod_latency,
            shadow_latency_ms=shadow_latency,
            agreement=agreement,
        )
        self.results.append(result)
        if not agreement:
            logger.info(
                "Shadow divergence: prod=%.4f shadow=%.4f diff=%.4f",
                prod_pred, shadow_pred, abs(prod_pred - shadow_pred),
            )
        return prod_pred

    def get_agreement_rate(self) -> float:
        if not self.results:
            return 0.0
        return sum(r.agreement for r in self.results) / len(self.results)
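Notice the try/except around the shadow model. If the shadow model crashes, the production model’s prediction is still returned. The shadow model must never affect the production path. This is the cardinal rule of shadow deployments. One caveat: this simplified version calls the shadow model synchronously, so its latency is added to every response. To honor the rule fully, run the shadow call asynchronously or off the request path.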
Run the shadow deployment for at least one full business cycle (a week for most consumer applications, a month for B2B). Compare not just prediction agreement but the distribution of shadow predictions against production predictions. A shadow model that agrees 95% of the time but disagrees catastrophically on the remaining 5% is not safe to promote.
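To make the point about catastrophic disagreement concrete, the sketch below summarizes the size of the differences rather than just the agreement rate. The helper `summarize_divergence` and the 0.5 "catastrophic" cutoff are assumptions for illustration, and the prediction pairs are made up.

```python
import statistics


def summarize_divergence(
    pairs: list[tuple[float, float]], catastrophic: float = 0.5,
) -> dict[str, float]:
    """Summarize |production - shadow| over logged prediction pairs."""
    if not pairs:
        raise ValueError("need at least one prediction pair")
    diffs = sorted(abs(prod - shadow) for prod, shadow in pairs)
    return {
        "median_diff": statistics.median(diffs),
        "max_diff": diffs[-1],
        "catastrophic_rate": sum(d >= catastrophic for d in diffs) / len(diffs),
    }


# Illustrative pairs: 19 near-identical predictions and one large disagreement.
pairs = [(0.30, 0.31)] * 19 + [(0.10, 0.95)]
report = summarize_divergence(pairs)
print(report)
```

With these pairs, the agreement rate at a 0.05 tolerance would be 95%, which looks healthy, while the maximum difference exposes the one case where the two models give opposite answers.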
Canary Releases: Gradual Rollout
Once shadow comparison gives you confidence, a canary release routes a small percentage of live traffic to the new model. Unlike shadow mode, the new model’s predictions are returned to users — but only a fraction of them.
import hashlib
from typing import Protocol


class Model(Protocol):
    def predict(self, features: dict[str, float]) -> float: ...

    @property
    def version(self) -> str: ...


class CanaryRouter:
    """
    Route traffic between production and canary models based on
    a stable hash of the request ID.

    Using a hash instead of random selection makes routing deterministic:
    the same key always hits the same model. Pass a user ID as the key
    to keep each user on one model, which matters for user experience
    and for clean performance comparison.
    """

    def __init__(
        self, production_model: Model, canary_model: Model,
        canary_percentage: float = 5.0,
    ) -> None:
        self.production = production_model
        self.canary = canary_model
        self.canary_percentage = canary_percentage
        self.canary_count = 0
        self.production_count = 0

    def route(
        self, request_id: str, features: dict[str, float],
    ) -> tuple[float, str]:
        """
        Returns (prediction, model_version).

        The request_id hash determines routing, so the same key always
        gets the same model for consistency.
        """
        # Map the hash to a bucket in [0, 100) with 0.01 resolution, so
        # fractional percentages such as 0.5 are honored.
        bucket = int(hashlib.sha256(
            request_id.encode(),
        ).hexdigest(), 16) % 10_000 / 100.0
        if bucket < self.canary_percentage:
            self.canary_count += 1
            prediction = self.canary.predict(features)
            return prediction, self.canary.version
        else:
            self.production_count += 1
            prediction = self.production.predict(features)
            return prediction, self.production.version

    def promote_canary(self, new_percentage: float) -> None:
        """Gradually increase canary traffic: 5% -> 10% -> 25% -> 50% -> 100%."""
        self.canary_percentage = max(0.0, min(new_percentage, 100.0))

    def rollback(self) -> None:
        """Kill the canary: route all traffic to production."""
        self.canary_percentage = 0.0
The hash-based routing is deliberate. Random routing means the same user might see different models on consecutive requests, producing inconsistent behavior. Hashing a stable key makes assignment deterministic: the same key always maps to the same model. Note that a per-request ID only makes routing reproducible for that request; to keep each user on one model, pass the user ID as the routing key.
The canary progression should be: 5% → 10% → 25% → 50% → 100%. At each stage, monitor for at least one hour (or one business cycle, whichever is longer) before increasing. If any monitored metric degrades, halt the rollout.
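The progression can be encoded as a small schedule, sketched below. The function `next_canary_percentage` is a hypothetical helper, and the `healthy` flag stands in for whatever metric checks you run at the end of each observation period. Here "halt" is interpreted as holding at the current stage; rolling back to zero is a separate decision.

```python
STAGES = [5.0, 10.0, 25.0, 50.0, 100.0]


def next_canary_percentage(current: float, healthy: bool) -> float:
    """Advance one stage when healthy; hold at the current stage otherwise."""
    if not healthy:
        return current  # halt: do not expand exposure
    for stage in STAGES:
        if stage > current:
            return stage
    return 100.0  # already fully promoted


# Walk the schedule, then simulate a degraded metric at the 25% stage.
percentage = 0.0
history = []
for healthy in [True, True, True, False]:
    percentage = next_canary_percentage(percentage, healthy)
    history.append(percentage)
print(history)  # [5.0, 10.0, 25.0, 25.0]
```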
Automatic Rollback: Self-Healing Systems
Manual rollback requires someone to notice the problem, diagnose it, decide to roll back, and execute the rollback. At 3 AM, that sequence takes 30 minutes if you are lucky. Automatic rollback watches health metrics continuously and rolls back in seconds:
import logging
import time
from collections import deque
from dataclasses import dataclass

logger = logging.getLogger(__name__)


@dataclass
class HealthConfig:
    error_rate_threshold: float = 0.05  # 5% error rate triggers rollback
    latency_p99_threshold_ms: float = 500.0
    window_size: int = 100  # Number of recent requests to evaluate
    min_requests: int = 20  # Minimum requests before evaluating


class HealthMonitor:
    """
    Track model health and trigger automatic rollback when thresholds
    are breached.

    Monitors two signals:
    - Error rate: fraction of requests that raise exceptions
    - Latency: p99 latency exceeds threshold

    Uses a sliding window to be responsive to sudden degradation
    without overreacting to single failures.
    """

    def __init__(self, config: HealthConfig | None = None) -> None:
        self.config = config or HealthConfig()
        self.errors: deque[bool] = deque(maxlen=self.config.window_size)
        self.latencies: deque[float] = deque(maxlen=self.config.window_size)

    def record(self, is_error: bool, latency_ms: float) -> None:
        self.errors.append(is_error)
        self.latencies.append(latency_ms)

    def should_rollback(self) -> tuple[bool, str]:
        """Check if current health metrics warrant a rollback."""
        if len(self.errors) < self.config.min_requests:
            return False, "Insufficient data for evaluation."

        error_rate = sum(self.errors) / len(self.errors)
        if error_rate > self.config.error_rate_threshold:
            return True, (
                f"Error rate {error_rate:.1%} exceeds threshold "
                f"{self.config.error_rate_threshold:.1%}."
            )

        sorted_latencies = sorted(self.latencies)
        p99_index = int(len(sorted_latencies) * 0.99)
        p99_latency = sorted_latencies[min(p99_index, len(sorted_latencies) - 1)]
        if p99_latency > self.config.latency_p99_threshold_ms:
            return True, (
                f"P99 latency {p99_latency:.0f}ms exceeds threshold "
                f"{self.config.latency_p99_threshold_ms:.0f}ms."
            )
        return False, "All health metrics within bounds."


def serve_with_rollback(
    router: "CanaryRouter",
    monitor: HealthMonitor,
    request_id: str,
    features: dict[str, float],
) -> float:
    """
    Serve a prediction with automatic rollback protection.

    If the health monitor detects degradation, the canary is killed
    and all traffic returns to the production model.
    """
    start = time.perf_counter()
    try:
        prediction, _version = router.route(request_id, features)
    except Exception:
        latency = (time.perf_counter() - start) * 1000
        monitor.record(is_error=True, latency_ms=latency)
        logger.exception("Prediction failed for request %s", request_id)
        raise
    else:
        latency = (time.perf_counter() - start) * 1000
        monitor.record(is_error=False, latency_ms=latency)
        return prediction
    finally:
        # Runs on both the success and the failure path, so a failing
        # request can still trigger the rollback before its error propagates.
        should_rollback, reason = monitor.should_rollback()
        if should_rollback:
            logger.warning("AUTOMATIC ROLLBACK triggered: %s", reason)
            router.rollback()
Feature Flags: The Escape Hatch
Feature flags let you enable or disable a model without redeploying. Store a flag in a configuration service (Redis, a database, even an environment variable for small deployments). Check the flag before routing to the new model. When something goes wrong, flip the flag. No container rebuild, no CI/CD pipeline, no deployment wait. The flag check adds microseconds of latency and provides minutes of incident response time savings.
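A minimal sketch of the flag check, using an environment variable as the flag store, is shown below. The flag name `CANARY_MODEL_ENABLED` and both function names are hypothetical. In a real deployment the `os.environ.get` call would typically be replaced by a lookup against your configuration service.

```python
import os


def canary_enabled(flag_name: str = "CANARY_MODEL_ENABLED") -> bool:
    """Read the flag on each call rather than caching it at import time."""
    value = os.environ.get(flag_name, "false")
    return value.strip().lower() in {"1", "true", "on"}


def choose_model(flag_name: str = "CANARY_MODEL_ENABLED") -> str:
    """Return which model to serve; production is the default when unset."""
    return "canary" if canary_enabled(flag_name) else "production"


os.environ["CANARY_MODEL_ENABLED"] = "true"
first = choose_model()
os.environ["CANARY_MODEL_ENABLED"] = "false"  # incident: flip the flag
second = choose_model()
print(first, second)  # canary production
```

Defaulting to production when the flag is missing or malformed means a misconfigured flag fails safe. Note that an environment variable changed from outside a running process usually takes effect only after that process restarts, which is why a shared store such as Redis or a database suits larger deployments better.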
The Complete Lifecycle
Every concept in this book — from data manipulation in Chapter 2 through evaluation in Chapter 8 to deployment in Chapter 10 — converges into a single cycle:
1. Train on historical data with proper evaluation (Chapters 5–6, 8)
2. Validate with calibration, cross-validation, and subgroup analysis (Chapter 8)
3. Register the model with full metadata in MLflow (Section 11.1)
4. Shadow the new model against the production model (Section 11.4)
5. Canary: route 5% of traffic, monitor, gradually increase (Section 11.4)
6. Promote to 100% when canary metrics are stable (Section 11.4)
7. Monitor for drift and degradation (Section 11.2)
8. Collect feedback: pair predictions with ground truth (Section 11.3)
9. Retrain when monitoring signals trigger (Section 11.3)
10. Return to step 1.
This cycle has no end state. The model is never “done.” The system is never “finished.” There is always the next drift event, the next data pipeline change, the next feature that degrades, the next shift in user behavior.
Closing: What Makes an ML System Survive
The best ML systems in production are not the ones with the highest accuracy on a benchmark. They are not the ones using the latest architecture. They are not the ones with the most features or the most data.
The best ML systems are the ones that degrade gracefully and recover automatically. They detect when predictions are getting worse before the business dashboard does. They retrain on the right data at the right time. They deploy new models without risking the entire user base. They roll back in seconds when something goes wrong.
Building these systems is not glamorous work. It does not involve novel architectures or state-of-the-art research. It involves logging predictions, running statistical tests on feature distributions, writing health check endpoints, and setting up automatic rollback rules. It is plumbing.
But plumbing is what separates a model that impresses in a demo from a model that runs in production for years. And production — messy, hostile, surprising production — is where your model creates value.