Building a Payment Gateway
A payment gateway sits between merchants and payment processors (PSPs), abstracting the complexity of payment processing behind a unified API. Building one that handles real money requires getting the state machine exactly right — every edge case around timeouts, partial failures, and concurrent operations must have a defined behavior.
The diagram above shows the layered architecture. The merchant-facing API layer accepts payment requests; the routing layer selects the optimal PSP; the connector layer translates to PSP-specific protocols; and the ledger records every financial event for reconciliation.
Transaction State Machine
The payment lifecycle is a state machine with well-defined transitions. Getting this right is the single most important design decision in a payment gateway:
from dataclasses import dataclass, field
from datetime import datetime
from decimal import Decimal
from enum import Enum
from typing import Optional
import uuid
class PaymentStatus(Enum):
    """
    Payment states.

    Each state represents a financially meaningful condition:
    - CREATED: intent recorded, no financial action
    - PROCESSING: sent to PSP, awaiting response
    - AUTHORIZED: funds reserved, not yet captured
    - CAPTURED: funds transferred (or will be in settlement)
    - PARTIALLY_CAPTURED: portion of authorization captured
    - VOIDED: authorization cancelled before capture
    - REFUNDED: funds returned to cardholder
    - PARTIALLY_REFUNDED: portion of captured amount refunded
    - FAILED: terminal failure
    - EXPIRED: authorization expired without capture
    """
    CREATED = "created"
    PROCESSING = "processing"
    AUTHORIZED = "authorized"
    CAPTURED = "captured"
    PARTIALLY_CAPTURED = "partially_captured"
    VOIDED = "voided"
    REFUNDED = "refunded"
    PARTIALLY_REFUNDED = "partially_refunded"
    FAILED = "failed"
    EXPIRED = "expired"
# Valid state transitions
VALID_TRANSITIONS: dict[PaymentStatus, set[PaymentStatus]] = {
    PaymentStatus.CREATED: {
        PaymentStatus.PROCESSING,
        PaymentStatus.FAILED,
    },
    PaymentStatus.PROCESSING: {
        PaymentStatus.AUTHORIZED,
        PaymentStatus.CAPTURED,  # Direct capture (no separate auth)
        PaymentStatus.FAILED,
    },
    PaymentStatus.AUTHORIZED: {
        PaymentStatus.CAPTURED,
        PaymentStatus.PARTIALLY_CAPTURED,
        PaymentStatus.VOIDED,
        PaymentStatus.EXPIRED,
    },
    PaymentStatus.CAPTURED: {
        PaymentStatus.REFUNDED,
        PaymentStatus.PARTIALLY_REFUNDED,
    },
    PaymentStatus.PARTIALLY_CAPTURED: {
        # Self-transition: a further partial capture that still leaves a
        # remainder must be allowed, or the second partial capture fails.
        PaymentStatus.PARTIALLY_CAPTURED,
        PaymentStatus.CAPTURED,  # Capture remaining
        PaymentStatus.VOIDED,    # Void remaining authorization
        PaymentStatus.REFUNDED,
        PaymentStatus.PARTIALLY_REFUNDED,
    },
    PaymentStatus.PARTIALLY_REFUNDED: {
        # Self-transition: a further partial refund that still leaves a remainder
        PaymentStatus.PARTIALLY_REFUNDED,
        PaymentStatus.REFUNDED,  # Refund remaining
    },
    # Terminal states: no transitions out
    PaymentStatus.VOIDED: set(),
    PaymentStatus.REFUNDED: set(),
    PaymentStatus.FAILED: set(),
    PaymentStatus.EXPIRED: set(),
}
@dataclass
class Payment:
    """
    A payment record in the gateway.
    """
    payment_id: str
    merchant_id: str
    idempotency_key: str

    # Financial
    amount: Decimal
    currency: str
    captured_amount: Decimal = Decimal(0)
    refunded_amount: Decimal = Decimal(0)

    # Status
    status: PaymentStatus = PaymentStatus.CREATED

    # PSP routing
    psp_id: str = ""
    psp_reference: str = ""

    # Payment method
    payment_method_type: str = ""   # "card", "bank_transfer", "wallet"
    payment_method_token: str = ""  # Tokenized payment method

    # Metadata
    created_at: datetime = field(default_factory=datetime.utcnow)
    updated_at: datetime = field(default_factory=datetime.utcnow)

    # Audit trail
    events: list[dict] = field(default_factory=list)

    def transition_to(self, new_status: PaymentStatus, reason: str = ""):
        """
        Transition to a new state with validation.

        Raises InvalidStateTransition if the transition is not allowed.
        This is the ONLY way to change payment status; direct
        field assignment should never be used.
        """
        valid = VALID_TRANSITIONS.get(self.status, set())
        if new_status not in valid:
            raise InvalidStateTransition(
                f"Cannot transition from {self.status.value} to "
                f"{new_status.value}. Valid transitions: "
                f"{[s.value for s in valid]}"
            )
        old_status = self.status
        self.status = new_status
        self.updated_at = datetime.utcnow()
        # Every transition is appended to the audit trail
        self.events.append({
            "timestamp": self.updated_at.isoformat(),
            "from_status": old_status.value,
            "to_status": new_status.value,
            "reason": reason,
        })


class InvalidStateTransition(Exception):
    pass
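A transition table like this can be checked mechanically as it grows. The following is a minimal sketch rather than part of the gateway code: it uses a simplified, string-keyed version of the table, and the helper names (`reachable_from`, `terminal_states`, `undefined_targets`) are hypothetical. It confirms that every state is reachable from `created`, lists the terminal states, and flags any target state that has no entry of its own.

```python
from collections import deque

# Simplified, string-keyed version of the transition table so that the
# sketch is self-contained. Helper names below are hypothetical.
TRANSITIONS: dict[str, set[str]] = {
    "created": {"processing", "failed"},
    "processing": {"authorized", "captured", "failed"},
    "authorized": {"captured", "partially_captured", "voided", "expired"},
    "captured": {"refunded", "partially_refunded"},
    "partially_captured": {"captured", "voided", "refunded", "partially_refunded"},
    "partially_refunded": {"refunded"},
    "voided": set(),
    "refunded": set(),
    "failed": set(),
    "expired": set(),
}


def reachable_from(start: str) -> set[str]:
    """Breadth-first search over the transition graph."""
    seen = {start}
    queue = deque([start])
    while queue:
        state = queue.popleft()
        for nxt in TRANSITIONS[state]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def terminal_states() -> set[str]:
    """States with no outgoing transitions."""
    return {state for state, targets in TRANSITIONS.items() if not targets}


def undefined_targets() -> set[str]:
    """Target states that are referenced but have no entry of their own."""
    targets = {t for ts in TRANSITIONS.values() for t in ts}
    return targets - set(TRANSITIONS)


print("unreachable:", set(TRANSITIONS) - reachable_from("created"))
print("terminal:", sorted(terminal_states()))
print("undefined:", undefined_targets())
```

Running a check like this in a unit test catches a state that was added to the enum but never wired into the table.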
Multi-PSP Routing
A production gateway routes to multiple PSPs for cost optimization, reliability, and geographic coverage:
@dataclass
class PSPConfig:
    psp_id: str
    name: str
    supported_currencies: set[str]
    supported_card_brands: set[str]
    supported_countries: set[str]

    # Cost structure
    transaction_fee_pct: Decimal    # Percentage fee (e.g., 2.9%)
    transaction_fee_fixed: Decimal  # Fixed fee per transaction (e.g., $0.30)

    # Performance
    avg_latency_ms: float
    success_rate: float  # Historical success rate (0-1)

    # Operational
    is_active: bool = True
    max_tps: int = 1000  # Rate limit
    current_tps: int = 0

    # Failover
    priority: int = 1  # Lower = higher priority
class PaymentRouter:
    """
    Routes payments to the optimal PSP based on cost, performance,
    and availability.

    Routing strategies:
    1. Cost-optimized: choose cheapest PSP that supports the payment
    2. Performance-optimized: choose PSP with highest success rate
    3. Balanced: weighted score of cost + performance
    4. Failover: try primary, fall back to secondary on failure

    The router also handles:
    - Rate limiting per PSP
    - Geographic routing (route EU cards to EU PSPs)
    - Card brand routing (some PSPs have better Amex rates)
    - A/B testing for new PSP integrations
    """

    def __init__(self, psps: list[PSPConfig]):
        self._psps = {p.psp_id: p for p in psps}
        self._circuit_breakers: dict[str, 'CircuitBreaker'] = {
            p.psp_id: CircuitBreaker(failure_threshold=5, reset_timeout=60)
            for p in psps
        }

    def select_psp(
        self, amount: Decimal, currency: str,
        card_brand: str, card_country: str,
        merchant_routing_rules: dict | None = None
    ) -> list[PSPConfig]:
        """
        Select PSPs in priority order (primary + fallbacks).

        Returns a ranked list of eligible PSPs. The gateway
        tries the first one; if it fails, it tries the next.
        """
        # Filter eligible PSPs.
        # Note: card_country is only used in the error message here;
        # supported_countries is not applied as a filter in this version.
        eligible = []
        for psp in self._psps.values():
            if not psp.is_active:
                continue
            if currency not in psp.supported_currencies:
                continue
            if card_brand not in psp.supported_card_brands:
                continue
            if self._circuit_breakers[psp.psp_id].is_open:
                continue
            if psp.current_tps >= psp.max_tps:
                continue
            eligible.append(psp)

        if not eligible:
            raise NoPSPAvailable(
                f"No PSP available for {currency}/{card_brand}/{card_country}"
            )

        # Score and rank eligible PSPs
        scored = []
        for psp in eligible:
            cost = float(
                psp.transaction_fee_pct / 100 * amount +
                psp.transaction_fee_fixed
            )
            # Balanced scoring: 60% failure rate + 40% cost as a
            # fraction of the amount. Lower score is better.
            score = (
                (1 - psp.success_rate) * 0.6 +
                (cost / float(amount)) * 0.4
            )
            scored.append((score, psp))
        scored.sort(key=lambda x: x[0])
        ranked = [psp for _, psp in scored]

        # Apply merchant-specific routing rules after ranking. Doing this
        # before the sort would discard the preference, because the sort
        # reorders every PSP by score.
        if merchant_routing_rules:
            preferred = merchant_routing_rules.get("preferred_psp")
            psp = self._psps.get(preferred) if preferred else None
            if psp in ranked:
                ranked.remove(psp)
                ranked.insert(0, psp)
        return ranked


class NoPSPAvailable(Exception):
    pass
import time


class CircuitBreaker:
    """
    Circuit breaker for PSP connections.

    States:
    - CLOSED: normal operation, requests pass through
    - OPEN: PSP is failing, requests are immediately rejected
    - HALF_OPEN: testing if PSP has recovered

    Transitions:
    - CLOSED → OPEN: failure_threshold consecutive failures
    - OPEN → HALF_OPEN: after reset_timeout seconds
    - HALF_OPEN → CLOSED: first success
    - HALF_OPEN → OPEN: first failure
    """

    def __init__(self, failure_threshold: int = 5, reset_timeout: int = 60):
        self._state = "closed"
        self._failure_count = 0
        self._failure_threshold = failure_threshold
        self._reset_timeout = reset_timeout
        self._last_failure_time: float = 0

    @property
    def is_open(self) -> bool:
        if self._state == "open":
            # Check if reset timeout has elapsed; if so, move to
            # half-open and let requests through as probes
            if time.time() - self._last_failure_time > self._reset_timeout:
                self._state = "half_open"
                return False
            return True
        return False

    def record_success(self):
        self._failure_count = 0
        self._state = "closed"

    def record_failure(self):
        self._failure_count += 1
        self._last_failure_time = time.time()
        if self._failure_count >= self._failure_threshold:
            self._state = "open"
        # A failed probe in half-open reopens the circuit immediately
        if self._state == "half_open":
            self._state = "open"
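Because the breaker depends on wall-clock time, its transitions are easier to follow with a clock that can be advanced by hand. The sketch below is a simplified variant, not the class above: the `clock` parameter, the `FakeClock` helper, and the `allow_request` method (which stands in for the `is_open` check) are assumptions added for illustration.

```python
import time
from typing import Callable


class FakeClock:
    """Manually advanced clock for deterministic examples (hypothetical helper)."""

    def __init__(self) -> None:
        self.now = 0.0

    def __call__(self) -> float:
        return self.now

    def advance(self, seconds: float) -> None:
        self.now += seconds


class SketchBreaker:
    """Closed -> open -> half-open breaker with an injectable clock."""

    def __init__(
        self,
        failure_threshold: int = 3,
        reset_timeout: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.state = "closed"
        self._failures = 0
        self._threshold = failure_threshold
        self._reset_timeout = reset_timeout
        self._clock = clock
        self._opened_at = 0.0

    def allow_request(self) -> bool:
        # While open, reject until the reset timeout has elapsed; after
        # that, switch to half-open and let requests through as probes.
        if self.state == "open":
            if self._clock() - self._opened_at >= self._reset_timeout:
                self.state = "half_open"
                return True
            return False
        return True

    def record_success(self) -> None:
        self._failures = 0
        self.state = "closed"

    def record_failure(self) -> None:
        self._failures += 1
        # A failed probe reopens immediately; otherwise wait for the threshold.
        if self.state == "half_open" or self._failures >= self._threshold:
            self.state = "open"
            self._opened_at = self._clock()


clock = FakeClock()
breaker = SketchBreaker(failure_threshold=3, reset_timeout=60.0, clock=clock)

for _ in range(3):
    breaker.record_failure()
state_after_failures = breaker.state     # threshold reached: breaker is open

blocked = not breaker.allow_request()    # still inside the timeout window
clock.advance(61)
probe_allowed = breaker.allow_request()  # timeout elapsed: half-open probe
breaker.record_success()
state_after_probe = breaker.state        # probe succeeded: closed again
```

Like the class above, this sketch does not limit how many requests pass while half-open; a stricter variant would admit a single probe at a time.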
Payment Processing Engine
The processing engine orchestrates the payment lifecycle — authorization, capture, refund — with retry logic and PSP failover:
class PaymentProcessingEngine:
    """
    Orchestrates payment processing with retry and failover.
    """

    def __init__(
        self, router: PaymentRouter,
        connectors: dict[str, 'PSPConnector'],
        payment_store: 'PaymentStore',
    ):
        self._router = router
        self._connectors = connectors
        self._store = payment_store

    def authorize(
        self, payment: Payment, card_brand: str, card_country: str
    ) -> Payment:
        """
        Authorize a payment: reserve funds on the cardholder's account.
        """
        payment.transition_to(PaymentStatus.PROCESSING, "authorization_started")
        self._store.save(payment)

        # Get ranked PSP list. If nothing is eligible, fail the payment
        # rather than leaving it stuck in PROCESSING.
        try:
            psps = self._router.select_psp(
                payment.amount, payment.currency,
                card_brand, card_country
            )
        except NoPSPAvailable as e:
            payment.transition_to(PaymentStatus.FAILED, str(e))
            self._store.save(payment)
            return payment

        last_error = None
        for psp in psps:
            connector = self._connectors.get(psp.psp_id)
            if not connector:
                continue
            try:
                result = connector.authorize(
                    amount=payment.amount,
                    currency=payment.currency,
                    payment_method_token=payment.payment_method_token,
                    merchant_reference=payment.payment_id,
                )
                if result["status"] == "authorized":
                    payment.psp_id = psp.psp_id
                    payment.psp_reference = result["psp_reference"]
                    payment.transition_to(
                        PaymentStatus.AUTHORIZED,
                        f"Authorized via {psp.name}"
                    )
                    self._circuit_breakers_record_success(psp.psp_id)
                    self._store.save(payment)
                    return payment
                elif result["status"] == "declined":
                    # Hard decline: don't retry with another PSP
                    payment.transition_to(
                        PaymentStatus.FAILED,
                        f"Declined by {psp.name}: {result.get('decline_reason', 'unknown')}"
                    )
                    self._store.save(payment)
                    return payment
            except PSPTimeoutError as e:
                # A timeout is ambiguous: the PSP may still have authorized.
                # This version simply tries the next PSP; the abandoned
                # attempt has to be voided or reconciled separately.
                last_error = e
                self._circuit_breakers_record_failure(psp.psp_id)
                continue
            except PSPConnectionError as e:
                last_error = e
                self._circuit_breakers_record_failure(psp.psp_id)
                continue

        # All PSPs failed
        payment.transition_to(
            PaymentStatus.FAILED,
            f"All PSPs failed. Last error: {last_error}"
        )
        self._store.save(payment)
        return payment
    def capture(
        self, payment_id: str, amount: Optional[Decimal] = None
    ) -> Payment:
        """
        Capture a previously authorized payment.

        Can capture the full amount or a partial amount.
        Partial capture leaves the remaining authorization
        available for subsequent captures.
        """
        payment = self._store.get(payment_id)
        if payment.status not in (
            PaymentStatus.AUTHORIZED,
            PaymentStatus.PARTIALLY_CAPTURED
        ):
            raise InvalidStateTransition(
                f"Cannot capture payment in {payment.status.value} state"
            )

        remaining = payment.amount - payment.captured_amount
        # Compare against None explicitly: Decimal(0) is falsy, so
        # `amount or remaining` would turn a zero amount into a full capture.
        capture_amount = remaining if amount is None else amount
        if capture_amount <= 0:
            raise ValueError("Capture amount must be positive")
        if capture_amount > remaining:
            raise ValueError(
                f"Capture amount {capture_amount} exceeds remaining "
                f"authorization {remaining}"
            )

        connector = self._connectors[payment.psp_id]
        result = connector.capture(
            psp_reference=payment.psp_reference,
            amount=capture_amount,
            currency=payment.currency,
        )
        if result["status"] == "captured":
            payment.captured_amount += capture_amount
            if payment.captured_amount >= payment.amount:
                payment.transition_to(PaymentStatus.CAPTURED, "full_capture")
            else:
                payment.transition_to(
                    PaymentStatus.PARTIALLY_CAPTURED,
                    f"Partial capture: {capture_amount}"
                )
            self._store.save(payment)
        return payment
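    def refund(
        self, payment_id: str, amount: Optional[Decimal] = None
    ) -> Payment:
        """
        Refund a captured payment.
        """
        payment = self._store.get(payment_id)
        # Check the state before calling the PSP, so an invalid refund
        # is rejected before any money moves.
        if payment.status not in (
            PaymentStatus.CAPTURED,
            PaymentStatus.PARTIALLY_CAPTURED,
            PaymentStatus.PARTIALLY_REFUNDED,
        ):
            raise InvalidStateTransition(
                f"Cannot refund payment in {payment.status.value} state"
            )

        refundable = payment.captured_amount - payment.refunded_amount
        # Compare against None explicitly: Decimal(0) is falsy
        refund_amount = refundable if amount is None else amount
        if refund_amount <= 0:
            raise ValueError("Refund amount must be positive")
        if refund_amount > refundable:
            raise ValueError("Refund exceeds captured amount")

        connector = self._connectors[payment.psp_id]
        result = connector.refund(
            psp_reference=payment.psp_reference,
            amount=refund_amount,
            currency=payment.currency,
        )
        if result["status"] == "refunded":
            payment.refunded_amount += refund_amount
            if payment.refunded_amount >= payment.captured_amount:
                payment.transition_to(PaymentStatus.REFUNDED, "full_refund")
            else:
                payment.transition_to(
                    PaymentStatus.PARTIALLY_REFUNDED,
                    f"Partial refund: {refund_amount}"
                )
            self._store.save(payment)
        return payment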
    def _circuit_breakers_record_success(self, psp_id: str):
        cb = self._router._circuit_breakers.get(psp_id)
        if cb:
            cb.record_success()

    def _circuit_breakers_record_failure(self, psp_id: str):
        cb = self._router._circuit_breakers.get(psp_id)
        if cb:
            cb.record_failure()


class PSPTimeoutError(Exception):
    pass


class PSPConnectionError(Exception):
    pass
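The failover rule in `authorize` (move to the next PSP on a timeout or connection error, stop on a decline) can be exercised without real PSPs. The following is a minimal sketch under stated assumptions: `StubConnector`, `try_in_order`, and `PSPUnavailable` are hypothetical names, and the result dictionaries only mimic the `status` field used above.

```python
from decimal import Decimal


class PSPUnavailable(Exception):
    """Stand-in for a timeout or connection error (hypothetical)."""


class StubConnector:
    """Fake connector that always produces a preset outcome."""

    def __init__(self, name: str, outcome: str) -> None:
        self.name = name
        self.outcome = outcome  # "authorized", "declined", or "unavailable"
        self.calls = 0

    def authorize(self, amount: Decimal, currency: str) -> dict:
        self.calls += 1
        if self.outcome == "unavailable":
            raise PSPUnavailable(self.name)
        return {"status": self.outcome, "psp": self.name}


def try_in_order(
    connectors: list[StubConnector], amount: Decimal, currency: str
) -> dict:
    """Try each connector in ranked order.

    Technical failures fall through to the next connector. An explicit
    decline is treated as final, mirroring the hard-decline branch above.
    """
    last_error = None
    for connector in connectors:
        try:
            result = connector.authorize(amount, currency)
        except PSPUnavailable as exc:
            last_error = exc
            continue
        if result["status"] in ("authorized", "declined"):
            return result
    return {"status": "failed", "error": str(last_error)}


primary = StubConnector("psp_a", "unavailable")
secondary = StubConnector("psp_b", "authorized")
tertiary = StubConnector("psp_c", "authorized")

result = try_in_order([primary, secondary, tertiary], Decimal("25.00"), "EUR")
print(result["status"], "via", result["psp"])
print("calls:", primary.calls, secondary.calls, tertiary.calls)
```

The stub treats an unavailable PSP as a clean failure. A real timeout is less clear-cut, since the PSP may have authorized the payment before the response was lost, so a production failover path also needs a way to void or reconcile the abandoned attempt.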
Webhook Delivery
Merchants need asynchronous status updates. Webhook delivery must be reliable — lost notifications mean merchants and customers don’t know if a payment succeeded:
import random
from datetime import datetime, timedelta


@dataclass
class WebhookDelivery:
    delivery_id: str
    payment_id: str
    merchant_id: str
    url: str
    payload: dict

    # Delivery tracking
    attempt_count: int = 0
    max_attempts: int = 14  # Matches len(RETRY_DELAYS)
    next_attempt_at: datetime = field(default_factory=datetime.utcnow)
    last_response_code: int = 0
    status: str = "pending"  # "pending", "delivered", "failed", "expired"

    # Retry schedule: escalating backoff with jitter.
    # Delay before each attempt, in seconds:
    # 0s, 30s, 1m, 5m, 15m, 30m, 1h, 2h, 4h, 8h, 12h, 24h, 48h, 72h
    RETRY_DELAYS = [
        0, 30, 60, 300, 900, 1800, 3600, 7200,
        14400, 28800, 43200, 86400, 172800, 259200
    ]

    def compute_next_retry(self) -> datetime:
        """
        Compute the next retry time from the backoff schedule.

        The schedule allows 14 attempts. Each delay is measured from the
        previous attempt, so the final 72-hour delay puts the last attempt
        about a week after the first. After that, the webhook is marked
        as failed and the merchant must poll.
        """
        if self.attempt_count >= len(self.RETRY_DELAYS):
            self.status = "failed"
            return self.next_attempt_at
        delay = self.RETRY_DELAYS[self.attempt_count]
        # Add jitter (±10%) to prevent thundering herd
        jitter = delay * random.uniform(-0.1, 0.1)
        return datetime.utcnow() + timedelta(seconds=delay + jitter)
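To see what the retry schedule amounts to in wall-clock terms, the delays can be summed. This is a small worked calculation, assuming (as `compute_next_retry` does) that each delay is measured from the previous attempt; `schedule_span` and `jitter_bounds` are hypothetical helper names.

```python
from datetime import timedelta

# Same values as RETRY_DELAYS above, in seconds.
RETRY_DELAYS = [
    0, 30, 60, 300, 900, 1800, 3600, 7200,
    14400, 28800, 43200, 86400, 172800, 259200,
]


def schedule_span(delays: list[int]) -> timedelta:
    """Total time from the first attempt to the last, ignoring jitter."""
    return timedelta(seconds=sum(delays))


def jitter_bounds(delay: int, fraction: float = 0.1) -> tuple[float, float]:
    """Earliest and latest delay once +/- `fraction` jitter is applied."""
    return delay * (1 - fraction), delay * (1 + fraction)


span = schedule_span(RETRY_DELAYS)
print(len(RETRY_DELAYS), "attempts over", span)  # 14 attempts over 7 days, 3:51:30
print("30-minute delay with jitter (seconds):", jitter_bounds(1800))
```

The individual delays top out at 72 hours, but because they accumulate, a merchant endpoint that stays down will keep receiving attempts for roughly a week.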
A payment gateway is ultimately a state machine with money attached. Every decision — retry strategy, timeout value, failover logic — has financial consequences. A 30-second timeout that’s too long means the merchant’s customer waits unnecessarily. A timeout that’s too short means a legitimate authorization is abandoned and the card is charged without the merchant knowing. The state machine enforces invariants that prevent financial inconsistencies, and the audit trail ensures every penny can be accounted for.