Fraud Detection and Transaction Risk Scoring
Payment fraud is an adversarial machine learning problem. Unlike recommendation systems where wrong predictions cost engagement, wrong fraud predictions cost real money — both false negatives (missed fraud, direct financial loss) and false positives (legitimate transactions declined, customer churn and revenue loss).
The diagram above shows the real-time fraud detection pipeline. Every transaction must clear this pipeline in under 100 milliseconds, the budget set by the authorization timeout. If the pipeline has not returned a risk score before the card network’s authorization deadline, the transaction proceeds without scoring.
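The deadline behavior described above can be sketched with a timeout around the scoring call. This is a minimal illustration rather than a production design: `score_with_deadline`, `SCORING_BUDGET_SECONDS`, and the two toy scorers are hypothetical names, and the 80 ms budget is an assumed value that leaves headroom below the 100 ms limit.

```python
import concurrent.futures
import time
from typing import Callable, Optional

# Assumed budget: a little below the ~100 ms limit described above.
SCORING_BUDGET_SECONDS = 0.08

# Shared worker pool so that each request does not pay thread start-up cost.
_executor = concurrent.futures.ThreadPoolExecutor(max_workers=4)


def score_with_deadline(
    scorer: Callable[[dict], float],
    transaction: dict,
    budget_seconds: float = SCORING_BUDGET_SECONDS,
) -> Optional[float]:
    """Return the risk score, or None if the scorer misses the budget.

    None means "unscored": the caller lets the transaction proceed
    without a score.
    """
    future = _executor.submit(scorer, transaction)
    try:
        return future.result(timeout=budget_seconds)
    except concurrent.futures.TimeoutError:
        future.cancel()  # Only has an effect if the task has not started yet.
        return None


def fast_scorer(txn: dict) -> float:
    """Toy scorer that answers immediately."""
    return 120.0


def slow_scorer(txn: dict) -> float:
    """Toy scorer that simulates a stalled feature lookup."""
    time.sleep(0.2)
    return 900.0


print(score_with_deadline(fast_scorer, {"amount": "42.00"}))
print(score_with_deadline(slow_scorer, {"amount": "42.00"}, budget_seconds=0.05))
```

A timed-out task keeps running in its worker thread, so a real system would also need to bound the work inside the scorer itself (for example, with timeouts on each feature lookup).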
The Economics of Fraud Detection
Before diving into the architecture, understand the economics that drive every design decision:
from dataclasses import dataclass
from decimal import Decimal


@dataclass
class FraudEconomics:
    """
    The cost model that drives fraud detection system design.
    These numbers are based on industry averages for a mid-size
    payment processor handling $10B annual volume.
    """
    # Direct costs
    fraud_rate_bps: float = 8.0  # Basis points (0.08% of volume)
    annual_volume: Decimal = Decimal("10_000_000_000")

    # False positive costs
    false_positive_rate: float = 0.02  # 2% of legitimate transactions
    avg_transaction_value: Decimal = Decimal("85.00")
    customer_churn_per_decline: float = 0.15  # 15% of falsely declined customers churn
    customer_lifetime_value: Decimal = Decimal("2500.00")

    @property
    def annual_fraud_loss(self) -> Decimal:
        """Direct fraud losses."""
        return self.annual_volume * Decimal(str(self.fraud_rate_bps)) / 10000

    @property
    def annual_false_positive_cost(self) -> Decimal:
        """
        Cost of false positives = lost revenue + customer churn cost.
        This is often LARGER than direct fraud losses. A processor
        that aggressively blocks fraud but generates high false positives
        may lose more money than one with higher fraud but better
        customer experience.
        """
        total_transactions = self.annual_volume / self.avg_transaction_value
        false_declines = total_transactions * Decimal(str(self.false_positive_rate))

        # Immediate revenue loss (the full declined amount)
        revenue_loss = false_declines * self.avg_transaction_value

        # Customer churn cost (long-term)
        churn_cost = (
            false_declines
            * Decimal(str(self.customer_churn_per_decline))
            * self.customer_lifetime_value
        )
        return revenue_loss + churn_cost

    @property
    def optimal_detection_rate(self) -> str:
        """
        The optimal detection rate balances fraud losses against
        false positive costs. This is NOT "catch all fraud":
        it's "catch fraud up to the point where the marginal cost
        of another false positive exceeds the marginal benefit of
        another fraud detection."
        Industry target: 90-95% detection rate with < 1% false positive rate.
        """
        return "90-95% detection, <1% FPR"


# Calculate real numbers
economics = FraudEconomics()
print(f"Annual fraud loss: ${economics.annual_fraud_loss:,.0f}")
print(f"Annual false positive cost: ${economics.annual_false_positive_cost:,.0f}")
# Annual fraud loss: $8M
# Annual false positive cost: ~$1.08B with these defaults
#   ($200M in declined volume + ~$882M in expected churn cost).
# The model counts every false decline as a distinct customer and the
# full declined amount as lost, so treat this figure as an upper bound.
# Even a small fraction of it exceeds the fraud loss, which is why
# false positive reduction is often more valuable than fraud
# detection improvement.
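The marginal-cost argument above can be made concrete by choosing the decline threshold that minimizes total cost on labeled history. The following is a sketch under stated assumptions: `total_cost`, `best_threshold`, and the eight sample transactions are hypothetical and purely illustrative, a missed fraud is assumed to cost its full amount, and a false decline is assumed to cost the declined amount plus the expected churn cost (15% of a $2,500 lifetime value, as in the defaults above).

```python
from decimal import Decimal

CHURN_PROBABILITY = Decimal("0.15")
CUSTOMER_LIFETIME_VALUE = Decimal("2500.00")


def total_cost(threshold, scored_history):
    """Cost of declining every transaction whose score is >= threshold."""
    cost = Decimal(0)
    for score, amount, is_fraud in scored_history:
        declined = score >= threshold
        if is_fraud and not declined:
            # Missed fraud: the amount is lost.
            cost += amount
        elif declined and not is_fraud:
            # False decline: lost sale plus expected churn cost.
            cost += amount + CHURN_PROBABILITY * CUSTOMER_LIFETIME_VALUE
    return cost


def best_threshold(scored_history, candidates):
    """Return the candidate threshold with the lowest total cost."""
    return min(candidates, key=lambda t: total_cost(t, scored_history))


# Made-up history: (risk score, amount, confirmed fraud?)
history = [
    (950, Decimal("400"), True),
    (870, Decimal("1200"), True),
    (640, Decimal("90"), False),
    (610, Decimal("2500"), True),
    (450, Decimal("60"), False),
    (300, Decimal("300"), True),
    (120, Decimal("85"), False),
    (80, Decimal("40"), False),
]
candidates = [100, 200, 400, 600, 800, 1000]

for t in candidates:
    print(t, total_cost(t, history))
print("best threshold:", best_threshold(history, candidates))
```

On this toy data the cheapest threshold is 600, which deliberately lets the lowest-scoring fraud through: catching it would require a threshold of 200, and the extra false decline costs more than the fraud it prevents.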
Scoring Architecture
Production fraud systems use a layered architecture combining rules, ML models, and network intelligence:
from abc import ABC, abstractmethod
from dataclasses import dataclass
from decimal import Decimal
import time


@dataclass
class TransactionContext:
    """
    All available information about a transaction at scoring time.
    This is assembled from multiple sources in < 10ms:
    - Transaction message fields
    - Cardholder profile (from profile store)
    - Device data (from 3DS or device fingerprint)
    - Merchant risk profile
    - Network intelligence (consortium data)
    """
    # Transaction fields
    transaction_id: str
    card_hash: str  # Tokenized PAN
    amount: Decimal
    currency: str
    merchant_id: str
    merchant_category_code: str
    pos_entry_mode: str  # "chip", "contactless", "ecom", "moto"

    # Cardholder profile
    avg_transaction_amount: Decimal = Decimal(0)
    transaction_count_24h: int = 0
    transaction_count_7d: int = 0
    distinct_merchants_24h: int = 0
    days_since_last_transaction: int = 0
    home_country: str = ""

    # Device data
    device_fingerprint: str = ""
    ip_address: str = ""
    ip_country: str = ""
    is_vpn: bool = False
    is_tor: bool = False

    # Merchant profile
    merchant_fraud_rate: float = 0.0
    merchant_chargeback_rate: float = 0.0


@dataclass
class ScoringResult:
    score: float  # 0-1000 (higher = more risky)
    decision: str  # "approve", "challenge", "decline", "review"
    reasons: list[str]  # Top contributing factors
    model_version: str
    latency_ms: float
class ScoringLayer(ABC):
    """Base class for scoring layers."""

    @abstractmethod
    def score(self, ctx: TransactionContext) -> tuple[float, list[str]]:
        """Return (score_contribution, reasons)."""
        pass


class RulesEngine(ScoringLayer):
    """
    Hard rules that override ML scores.
    Rules handle known fraud patterns that ML might miss due to
    training data lag. They also implement regulatory requirements
    (sanctions screening, velocity limits) that can't be left to
    probabilistic models.
    """

    def score(self, ctx: TransactionContext) -> tuple[float, list[str]]:
        score = 0.0
        reasons = []

        # Velocity rules
        if ctx.transaction_count_24h > 20:
            score += 300
            reasons.append(f"High velocity: {ctx.transaction_count_24h} txns in 24h")
        if ctx.distinct_merchants_24h > 10:
            score += 200
            reasons.append(f"High merchant diversity: {ctx.distinct_merchants_24h} in 24h")

        # Amount rules (skipped when the cardholder has no history)
        if ctx.avg_transaction_amount > 0 and ctx.amount > ctx.avg_transaction_amount * 5:
            score += 150
            reasons.append(f"Amount {ctx.amount} exceeds 5x average {ctx.avg_transaction_amount}")

        # Geographic rules
        if ctx.ip_country and ctx.home_country and ctx.ip_country != ctx.home_country:
            score += 100
            reasons.append(f"Cross-border: card country {ctx.home_country}, IP country {ctx.ip_country}")

        # Network rules
        if ctx.is_tor:
            score += 200
            reasons.append("Tor exit node detected")
        elif ctx.is_vpn:
            score += 50
            reasons.append("VPN detected")

        # Merchant risk
        if ctx.merchant_fraud_rate > 0.05:  # > 5% fraud rate
            score += 150
            reasons.append(f"High-risk merchant (fraud rate: {ctx.merchant_fraud_rate:.1%})")

        # Cap at the top of the 0-1000 scale
        return min(score, 1000), reasons
class MLScorer(ScoringLayer):
    """
    ML model scoring layer.
    In production, this loads a pre-trained model (XGBoost, LightGBM,
    or neural network) and runs inference. The model is trained on
    historical transaction data with confirmed fraud labels.
    Model characteristics:
    - Training data: 50M-500M transactions, ~0.1% positive (fraud) rate
    - Features: 200-500 engineered features
    - Retraining frequency: daily (incremental) + weekly (full)
    - Inference latency: < 5ms per transaction
    """

    def __init__(self, model_path: str = ""):
        self._model = None  # In production: load XGBoost/LightGBM model
        self._feature_names: list[str] = []

    def score(self, ctx: TransactionContext) -> tuple[float, list[str]]:
        features = self._extract_features(ctx)

        # Model prediction (probability of fraud)
        # In production: self._model.predict_proba(features)[0][1]
        fraud_probability = self._mock_predict(features)

        # Scale to 0-1000 score
        score = fraud_probability * 1000

        # Feature importance for explainability
        reasons = self._explain_prediction(features, fraud_probability)
        return score, reasons

    def _extract_features(self, ctx: TransactionContext) -> dict:
        """
        Extract ML features from the transaction context.
        Feature categories:
        1. Transaction features (amount, currency, channel)
        2. Velocity features (counts over time windows)
        3. Behavioral features (deviation from normal patterns)
        4. Device features (fingerprint match, VPN/proxy detection)
        5. Graph features (connection to known fraud accounts)
        """
        return {
            "amount": float(ctx.amount),
            "amount_to_avg_ratio": (
                float(ctx.amount / ctx.avg_transaction_amount)
                if ctx.avg_transaction_amount > 0 else 0
            ),
            "txn_count_24h": ctx.transaction_count_24h,
            "txn_count_7d": ctx.transaction_count_7d,
            "distinct_merchants_24h": ctx.distinct_merchants_24h,
            "days_since_last_txn": ctx.days_since_last_transaction,
            "is_cross_border": int(
                ctx.ip_country != ctx.home_country
            ) if ctx.ip_country and ctx.home_country else 0,
            "is_vpn": int(ctx.is_vpn),
            "is_tor": int(ctx.is_tor),
            "merchant_fraud_rate": ctx.merchant_fraud_rate,
            "is_ecommerce": int(ctx.pos_entry_mode == "ecom"),
        }

    def _mock_predict(self, features: dict) -> float:
        """Placeholder for model inference."""
        # In production: model.predict_proba()
        base = 0.001  # 0.1% base fraud rate
        if features["is_tor"]:
            base *= 50
        if features["amount_to_avg_ratio"] > 5:
            base *= 10
        if features["txn_count_24h"] > 15:
            base *= 5
        return min(base, 1.0)

    def _explain_prediction(
        self, features: dict, probability: float
    ) -> list[str]:
        """
        Generate human-readable explanations for the score.
        In production: use SHAP values or feature importance
        from the model to identify top contributing features.
        """
        reasons = []
        if probability > 0.5:
            reasons.append(f"ML score: {probability:.3f} (high risk)")
        elif probability > 0.1:
            reasons.append(f"ML score: {probability:.3f} (elevated risk)")
        return reasons
class FraudScoringPipeline:
    """
    Orchestrates multiple scoring layers and produces a final decision.
    Architecture:
    1. Rules engine runs first (hard blocks / approvals)
    2. ML model scores the transaction (skipped on a hard rule block)
    3. Scores are combined with configurable weights
    4. Decision thresholds determine action
    """

    def __init__(self):
        self._rules = RulesEngine()
        self._ml = MLScorer()
        # Decision thresholds (configurable per merchant/segment)
        self._thresholds = {
            "approve": 200,    # score < 200 → approve
            "challenge": 500,  # 200 ≤ score < 500 → 3DS challenge
            "review": 800,     # 500 ≤ score < 800 → manual review queue
            "decline": 800,    # score ≥ 800 → decline
        }

    def score_transaction(
        self, ctx: TransactionContext
    ) -> ScoringResult:
        start = time.monotonic()

        # Layer 1: Rules
        rules_score, rules_reasons = self._rules.score(ctx)

        # Combine scores (rules get priority for hard blocks)
        if rules_score >= 800:
            # Hard rule block: skip ML inference entirely
            final_score = rules_score
            all_reasons = rules_reasons
        else:
            # Layer 2: ML, then a weighted combination
            ml_score, ml_reasons = self._ml.score(ctx)
            final_score = rules_score * 0.4 + ml_score * 0.6
            all_reasons = rules_reasons + ml_reasons

        # Decision
        if final_score < self._thresholds["approve"]:
            decision = "approve"
        elif final_score < self._thresholds["challenge"]:
            decision = "challenge"
        elif final_score < self._thresholds["decline"]:
            decision = "review"
        else:
            decision = "decline"

        elapsed = (time.monotonic() - start) * 1000
        return ScoringResult(
            score=final_score,
            decision=decision,
            reasons=all_reasons[:5],  # First 5 reasons (rules first, then ML)
            model_version="v3.2.1",
            latency_ms=elapsed,
        )
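The pipeline's thresholds are described as configurable per merchant or segment. One way this might look is a lookup table of cut points per segment, sketched below. The segment names and their cut points other than `"default"` are assumptions made up for illustration, and `decide` is a hypothetical helper, not part of the pipeline above.

```python
from bisect import bisect_right

# Decisions in order of increasing risk.
DECISIONS = ["approve", "challenge", "review", "decline"]

# Cut points between decisions. "default" mirrors the pipeline's
# 200 / 500 / 800 thresholds; the other segments are hypothetical.
SEGMENT_CUTOFFS = {
    "default": [200, 500, 800],
    "digital_goods": [150, 400, 700],  # assumed stricter segment
    "grocery": [300, 600, 900],        # assumed more lenient segment
}


def decide(score: float, segment: str = "default") -> str:
    """Map a 0-1000 risk score to a decision for the given segment.

    Unknown segments fall back to the default cut points. A score equal
    to a cut point takes the higher-risk decision, so 200 is a challenge.
    """
    cutoffs = SEGMENT_CUTOFFS.get(segment, SEGMENT_CUTOFFS["default"])
    return DECISIONS[bisect_right(cutoffs, score)]


for segment in SEGMENT_CUTOFFS:
    print(segment, [decide(s, segment) for s in (100, 250, 650, 850)])
```

Keeping the cut points in data rather than in branching logic means a segment can be tuned without touching the scoring code, at the cost of needing validation that each list has exactly three ascending values.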
The Feedback Loop
A fraud detection system without a feedback loop degrades over time as fraudsters adapt. The feedback loop connects confirmed fraud labels back to model training:
| Stage | Latency or cadence | Source |
|---|---|---|
| Transaction scoring | Real-time | Scoring pipeline |
| Cardholder dispute (chargeback) | 30-120 days | Card network |
| Bank investigation | 1-14 days | Issuer fraud team |
| Law enforcement report | Months | Regulatory |
| Model retraining (incremental) | Daily | Training pipeline |
| Model retraining (full) | Weekly | Training pipeline |
| Model deployment | Hours | MLOps pipeline |
The 30-120 day delay between a fraudulent transaction and its chargeback label is the critical challenge: the newest fully labeled training data is always 1-4 months old. Fraudsters who discover a new attack vector have a window of opportunity until enough chargebacks accumulate to shift the model.
Sophisticated systems supplement chargeback labels with “early indicators” — transactions flagged by the issuer within hours, declined authorizations at other merchants, and consortium-shared intelligence. These early signals shrink the adaptation window from months to days.
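The label-delay problem can be illustrated with a small labeling function for the training pipeline. This is a sketch under stated assumptions: `LabeledTransaction`, `training_label`, and the field names are hypothetical, the 120-day maturity window is taken from the upper end of the chargeback range above, and treating any early indicator as a provisional fraud label is a simplification.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional

# Assumed maturity window: the upper end of the 30-120 day chargeback delay.
CHARGEBACK_WINDOW = timedelta(days=120)


@dataclass
class LabeledTransaction:
    transaction_id: str
    transaction_date: date
    chargeback_received: bool = False
    early_fraud_signal: bool = False  # e.g. issuer flag or consortium report


def training_label(txn: LabeledTransaction, today: date) -> Optional[int]:
    """Return 1 (fraud), 0 (legitimate), or None (not yet usable)."""
    if txn.chargeback_received:
        return 1  # Confirmed fraud label.
    if txn.early_fraud_signal:
        return 1  # Provisional label from an early indicator.
    if today - txn.transaction_date >= CHARGEBACK_WINDOW:
        return 0  # Window has passed with no dispute: treat as legitimate.
    return None  # Too recent to call legitimate; exclude from training.


today = date(2024, 6, 1)
transactions = [
    LabeledTransaction("t1", date(2024, 1, 1)),
    LabeledTransaction("t2", date(2024, 5, 20)),
    LabeledTransaction("t3", date(2024, 5, 20), early_fraud_signal=True),
    LabeledTransaction("t4", date(2024, 3, 1), chargeback_received=True),
]
for txn in transactions:
    print(txn.transaction_id, training_label(txn, today))
```

Without the early-signal branch, recent fraud such as `t3` would stay unlabeled until a chargeback arrived, which is the months-long adaptation window the text describes.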