digital payment systems cryptography banking protocols and blockchain internals

Building a Payment Gateway

9 min read Chapter 19 of 21

A payment gateway sits between merchants and payment service providers (PSPs), hiding the differences between providers behind a unified API. Building one that handles real money requires getting the state machine exactly right: every edge case around timeouts, partial failures, and concurrent operations must have a defined behavior.

Payment Gateway Architecture

The diagram above shows the layered architecture. The merchant-facing API layer accepts payment requests; the routing layer selects the optimal PSP; the connector layer translates to PSP-specific protocols; and the ledger records every financial event for reconciliation.
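As a rough illustration of how the four layers hand off to each other, the sketch below wires them together in memory. Everything in it is hypothetical: `Ledger`, `FakeConnector`, `route`, and `handle_payment` are illustrative names, the routing rule is a placeholder, and the connector always authorizes.

```python
from dataclasses import dataclass, field
from decimal import Decimal


@dataclass
class Ledger:
    """Append-only record of financial events (hypothetical, in-memory)."""
    entries: list[dict] = field(default_factory=list)

    def record(self, event: str, **details) -> None:
        self.entries.append({"event": event, **details})


class FakeConnector:
    """Stand-in for a PSP-specific connector; always authorizes."""

    def __init__(self, psp_id: str):
        self.psp_id = psp_id

    def authorize(self, amount: Decimal, currency: str) -> dict:
        # A real connector would translate to the PSP's wire protocol here.
        return {"status": "authorized", "psp_reference": f"{self.psp_id}-ref-1"}


def route(currency: str) -> str:
    """Routing layer: placeholder rule mapping a currency to a PSP id."""
    return "psp_eu" if currency == "EUR" else "psp_us"


def handle_payment(request: dict, connectors: dict[str, FakeConnector],
                   ledger: Ledger) -> dict:
    """API layer entry point: route, call the connector, record the result."""
    psp_id = route(request["currency"])
    ledger.record("routed", psp_id=psp_id)
    result = connectors[psp_id].authorize(request["amount"], request["currency"])
    ledger.record(result["status"], psp_id=psp_id,
                  psp_reference=result["psp_reference"])
    return result


connectors = {pid: FakeConnector(pid) for pid in ("psp_eu", "psp_us")}
ledger = Ledger()
response = handle_payment(
    {"amount": Decimal("25.00"), "currency": "EUR"}, connectors, ledger
)
```

The point of the shape is that the API layer never talks to a PSP directly, and every step leaves a ledger entry that reconciliation can replay later.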

Transaction State Machine

The payment lifecycle is a state machine with well-defined transitions. Getting this right is the single most important design decision in a payment gateway:

import random
import time
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from decimal import Decimal
from enum import Enum
from typing import Optional

class PaymentStatus(Enum):
    """
    Payment states.
    
    Each state represents a financially meaningful condition:
    - CREATED: intent recorded, no financial action
    - PROCESSING: sent to PSP, awaiting response
    - AUTHORIZED: funds reserved, not yet captured
    - CAPTURED: funds transferred (or will be in settlement)
    - PARTIALLY_CAPTURED: portion of authorization captured
    - VOIDED: authorization cancelled before capture
    - REFUNDED: funds returned to cardholder
    - PARTIALLY_REFUNDED: portion of captured amount refunded
    - FAILED: terminal failure
    - EXPIRED: authorization expired without capture
    """
    CREATED = "created"
    PROCESSING = "processing"
    AUTHORIZED = "authorized"
    CAPTURED = "captured"
    PARTIALLY_CAPTURED = "partially_captured"
    VOIDED = "voided"
    REFUNDED = "refunded"
    PARTIALLY_REFUNDED = "partially_refunded"
    FAILED = "failed"
    EXPIRED = "expired"

# Valid state transitions
VALID_TRANSITIONS: dict[PaymentStatus, set[PaymentStatus]] = {
    PaymentStatus.CREATED: {
        PaymentStatus.PROCESSING,
        PaymentStatus.FAILED,
    },
    PaymentStatus.PROCESSING: {
        PaymentStatus.AUTHORIZED,
        PaymentStatus.CAPTURED,      # Direct capture (no separate auth)
        PaymentStatus.FAILED,
    },
    PaymentStatus.AUTHORIZED: {
        PaymentStatus.CAPTURED,
        PaymentStatus.PARTIALLY_CAPTURED,
        PaymentStatus.VOIDED,
        PaymentStatus.EXPIRED,
    },
    PaymentStatus.CAPTURED: {
        PaymentStatus.REFUNDED,
        PaymentStatus.PARTIALLY_REFUNDED,
    },
    PaymentStatus.PARTIALLY_CAPTURED: {
        PaymentStatus.PARTIALLY_CAPTURED,  # Further partial capture
        PaymentStatus.CAPTURED,       # Capture remaining
        PaymentStatus.VOIDED,         # Void remaining authorization
        PaymentStatus.REFUNDED,
        PaymentStatus.PARTIALLY_REFUNDED,
    },
    PaymentStatus.PARTIALLY_REFUNDED: {
        PaymentStatus.PARTIALLY_REFUNDED,  # Further partial refund
        PaymentStatus.REFUNDED,       # Refund remaining
    },
    # Terminal states: no transitions out
    PaymentStatus.VOIDED: set(),
    PaymentStatus.REFUNDED: set(),
    PaymentStatus.FAILED: set(),
    PaymentStatus.EXPIRED: set(),
}

@dataclass
class Payment:
    """
    A payment record in the gateway.
    """
    payment_id: str
    merchant_id: str
    idempotency_key: str
    
    # Financial
    amount: Decimal
    currency: str
    captured_amount: Decimal = Decimal(0)
    refunded_amount: Decimal = Decimal(0)
    
    # Status
    status: PaymentStatus = PaymentStatus.CREATED
    
    # PSP routing
    psp_id: str = ""
    psp_reference: str = ""
    
    # Payment method
    payment_method_type: str = ""   # "card", "bank_transfer", "wallet"
    payment_method_token: str = ""  # Tokenized payment method
    
    # Metadata
    created_at: datetime = field(default_factory=datetime.utcnow)
    updated_at: datetime = field(default_factory=datetime.utcnow)
    
    # Audit trail
    events: list[dict] = field(default_factory=list)
    
    def transition_to(self, new_status: PaymentStatus, reason: str = ""):
        """
        Transition to a new state with validation.
        
        Raises InvalidStateTransition if the transition is not allowed.
        This is the ONLY way to change payment status — direct
        field assignment should never be used.
        """
        valid = VALID_TRANSITIONS.get(self.status, set())
        if new_status not in valid:
            raise InvalidStateTransition(
                f"Cannot transition from {self.status.value} to "
                f"{new_status.value}. Valid transitions: "
                f"{[s.value for s in valid]}"
            )
        
        old_status = self.status
        self.status = new_status
        self.updated_at = datetime.utcnow()
        
        self.events.append({
            "timestamp": self.updated_at.isoformat(),
            "from_status": old_status.value,
            "to_status": new_status.value,
            "reason": reason,
        })

class InvalidStateTransition(Exception):
    pass
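One advantage of keeping transitions in a table is that the table itself can be checked mechanically. The sketch below is one way to do that, assuming a condensed copy of the table keyed by the status strings; `terminal_states`, `reachable_from`, and `undefined_targets` are illustrative helper names, not part of the gateway code. Self-transitions, if a table has any, are ignored when deciding whether a state is terminal.

```python
from collections import deque

# Condensed copy of the transition table, using the status strings.
TRANSITIONS: dict[str, set[str]] = {
    "created": {"processing", "failed"},
    "processing": {"authorized", "captured", "failed"},
    "authorized": {"captured", "partially_captured", "voided", "expired"},
    "captured": {"refunded", "partially_refunded"},
    "partially_captured": {"captured", "voided", "refunded", "partially_refunded"},
    "partially_refunded": {"refunded"},
    "voided": set(),
    "refunded": set(),
    "failed": set(),
    "expired": set(),
}


def terminal_states(table: dict[str, set[str]]) -> set[str]:
    """States with no way out (a self-loop does not count as a way out)."""
    return {s for s, nxt in table.items() if not (nxt - {s})}


def reachable_from(table: dict[str, set[str]], start: str) -> set[str]:
    """Breadth-first search over the transition graph."""
    seen = {start}
    queue = deque([start])
    while queue:
        state = queue.popleft()
        for nxt in table[state]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def undefined_targets(table: dict[str, set[str]]) -> set[str]:
    """Targets named in a transition set that have no entry of their own."""
    return {t for nxt in table.values() for t in nxt} - set(table)
```

Run as a unit test, checks like these catch a state that can never be reached, a transition into a state nobody defined, or a state that was meant to be terminal but quietly gained an exit.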

Multi-PSP Routing

A production gateway routes to multiple PSPs for cost optimization, reliability, and geographic coverage:

@dataclass
class PSPConfig:
    psp_id: str
    name: str
    supported_currencies: set[str]
    supported_card_brands: set[str]
    supported_countries: set[str]
    
    # Cost structure
    transaction_fee_pct: Decimal     # Percentage fee (e.g., 2.9%)
    transaction_fee_fixed: Decimal   # Fixed fee per transaction (e.g., $0.30)
    
    # Performance
    avg_latency_ms: float
    success_rate: float              # Historical success rate (0-1)
    
    # Operational
    is_active: bool = True
    max_tps: int = 1000              # Rate limit
    current_tps: int = 0
    
    # Failover
    priority: int = 1                # Lower = higher priority

class PaymentRouter:
    """
    Routes payments to the optimal PSP based on cost, performance,
    and availability.
    
    Routing strategies:
    1. Cost-optimized: choose cheapest PSP that supports the payment
    2. Performance-optimized: choose PSP with highest success rate
    3. Balanced: weighted score of cost + performance
    4. Failover: try primary, fall back to secondary on failure
    
    The router also handles:
    - Rate limiting per PSP
    - Geographic routing (route EU cards to EU PSPs)
    - Card brand routing (some PSPs have better Amex rates)
    - A/B testing for new PSP integrations
    """
    
    def __init__(self, psps: list[PSPConfig]):
        self._psps = {p.psp_id: p for p in psps}
        self._circuit_breakers: dict[str, 'CircuitBreaker'] = {
            p.psp_id: CircuitBreaker(failure_threshold=5, reset_timeout=60)
            for p in psps
        }
    
    def select_psp(
        self, amount: Decimal, currency: str,
        card_brand: str, card_country: str,
        merchant_routing_rules: dict | None = None
    ) -> list[PSPConfig]:
        """
        Select PSPs in priority order (primary + fallbacks).
        
        Returns a ranked list of eligible PSPs. The gateway
        tries the first one; if it fails, it tries the next.
        """
        if amount <= 0:
            raise ValueError("amount must be positive")

        # Filter eligible PSPs
        eligible = []
        for psp in self._psps.values():
            if not psp.is_active:
                continue
            if currency not in psp.supported_currencies:
                continue
            if card_brand not in psp.supported_card_brands:
                continue
            # Assumes supported_countries lists card-issuing countries
            if card_country not in psp.supported_countries:
                continue
            if self._circuit_breakers[psp.psp_id].is_open:
                continue
            if psp.current_tps >= psp.max_tps:
                continue
            eligible.append(psp)

        if not eligible:
            raise NoPSPAvailable(
                f"No PSP available for {currency}/{card_brand}/{card_country}"
            )

        # Score and rank eligible PSPs
        scored = []
        for psp in eligible:
            cost = (
                psp.transaction_fee_pct / 100 * amount +
                psp.transaction_fee_fixed
            )

            # Balanced scoring: 60% success rate + 40% cost
            # Lower score is better
            score = (
                (1 - psp.success_rate) * 0.6 +
                float(cost / amount) * 0.4
            )
            scored.append((score, psp))

        scored.sort(key=lambda x: x[0])
        ranked = [psp for _, psp in scored]

        # Apply merchant-specific routing rules after scoring, so the
        # preferred PSP stays first instead of being re-sorted by score
        if merchant_routing_rules:
            preferred = merchant_routing_rules.get("preferred_psp")
            if preferred and preferred in self._psps:
                psp = self._psps[preferred]
                if psp in ranked:
                    ranked.remove(psp)
                    ranked.insert(0, psp)

        return ranked

class NoPSPAvailable(Exception):
    pass

class CircuitBreaker:
    """
    Circuit breaker for PSP connections.
    
    States:
    - CLOSED: normal operation, requests pass through
    - OPEN: PSP is failing, requests are immediately rejected
    - HALF_OPEN: testing if PSP has recovered
    
    Transitions:
    - CLOSED → OPEN: failure_threshold consecutive failures
    - OPEN → HALF_OPEN: after reset_timeout seconds
    - HALF_OPEN → CLOSED: first success
    - HALF_OPEN → OPEN: first failure
    """
    
    def __init__(self, failure_threshold: int = 5, reset_timeout: int = 60):
        self._state = "closed"
        self._failure_count = 0
        self._failure_threshold = failure_threshold
        self._reset_timeout = reset_timeout
        self._last_failure_time: float = 0
    
    @property
    def is_open(self) -> bool:
        if self._state == "open":
            # Check if reset timeout has elapsed
            if time.time() - self._last_failure_time > self._reset_timeout:
                self._state = "half_open"
                return False
            return True
        return False
    
    def record_success(self):
        self._failure_count = 0
        self._state = "closed"
    
    def record_failure(self):
        self._failure_count += 1
        self._last_failure_time = time.time()
        
        if self._failure_count >= self._failure_threshold:
            self._state = "open"
        
        if self._state == "half_open":
            self._state = "open"
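To see how the balanced score in `select_psp` behaves, it helps to work one case by hand. The sketch below restates the same formula for two hypothetical PSPs on a 100.00 payment; `balanced_score` and both fee schedules are made up for illustration, and it uses `Decimal` throughout (rather than the float arithmetic in the router) so the numbers come out exact.

```python
from decimal import Decimal


def balanced_score(success_rate: Decimal, fee_pct: Decimal,
                   fee_fixed: Decimal, amount: Decimal) -> Decimal:
    """Lower is better: 60% weight on failure rate, 40% on relative cost."""
    cost = fee_pct / 100 * amount + fee_fixed
    return (1 - success_rate) * Decimal("0.6") + (cost / amount) * Decimal("0.4")


amount = Decimal("100.00")

# Hypothetical PSP A: 2.9% + 0.30 per transaction, 95% historical success rate
score_a = balanced_score(Decimal("0.95"), Decimal("2.9"), Decimal("0.30"), amount)

# Hypothetical PSP B: 2.5% + 0.25 per transaction, 90% historical success rate
score_b = balanced_score(Decimal("0.90"), Decimal("2.5"), Decimal("0.25"), amount)

# Rank as the router does: ascending score
ranking = sorted([("A", score_a), ("B", score_b)], key=lambda pair: pair[1])
```

PSP A scores 0.0428 and PSP B scores 0.071, so A ranks first even though it is more expensive. With these weights, a five-point gap in success rate contributes 0.03 to the score, while the 0.45 fee difference on a 100.00 payment contributes only 0.0018. Success rate dominates unless fees differ by far more than is typical.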

Payment Processing Engine

The processing engine orchestrates the payment lifecycle — authorization, capture, refund — with retry logic and PSP failover:

class PaymentProcessingEngine:
    """
    Orchestrates payment processing with retry and failover.
    """
    
    def __init__(
        self, router: PaymentRouter,
        connectors: dict[str, 'PSPConnector'],
        payment_store: 'PaymentStore',
    ):
        self._router = router
        self._connectors = connectors
        self._store = payment_store
    
    def authorize(
        self, payment: Payment, card_brand: str, card_country: str
    ) -> Payment:
        """
        Authorize a payment: reserve funds on the cardholder's account.
        """
        payment.transition_to(PaymentStatus.PROCESSING, "authorization_started")
        self._store.save(payment)
        
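        # Get ranked PSP list. If nothing is eligible, fail the payment
        # rather than leaving it stuck in PROCESSING.
        try:
            psps = self._router.select_psp(
                payment.amount, payment.currency,
                card_brand, card_country
            )
        except NoPSPAvailable as e:
            payment.transition_to(PaymentStatus.FAILED, str(e))
            self._store.save(payment)
            return payment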
        
        last_error = None
        for psp in psps:
            connector = self._connectors.get(psp.psp_id)
            if not connector:
                continue
            
            try:
                result = connector.authorize(
                    amount=payment.amount,
                    currency=payment.currency,
                    payment_method_token=payment.payment_method_token,
                    merchant_reference=payment.payment_id,
                )
                
                if result["status"] == "authorized":
                    payment.psp_id = psp.psp_id
                    payment.psp_reference = result["psp_reference"]
                    payment.transition_to(
                        PaymentStatus.AUTHORIZED,
                        f"Authorized via {psp.name}"
                    )
                    self._circuit_breakers_record_success(psp.psp_id)
                    self._store.save(payment)
                    return payment
                
                elif result["status"] == "declined":
                    # Hard decline — don't retry with another PSP
                    payment.transition_to(
                        PaymentStatus.FAILED,
                        f"Declined by {psp.name}: {result.get('decline_reason', 'unknown')}"
                    )
                    self._store.save(payment)
                    return payment
                
            except PSPTimeoutError as e:
                last_error = e
                self._circuit_breakers_record_failure(psp.psp_id)
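                # Try next PSP. Caution: a timeout is ambiguous. This PSP
                # may still have authorized, so the attempt should be
                # reconciled or voided using merchant_reference.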
                continue
                
            except PSPConnectionError as e:
                last_error = e
                self._circuit_breakers_record_failure(psp.psp_id)
                continue
        
        # All PSPs failed
        payment.transition_to(
            PaymentStatus.FAILED,
            f"All PSPs failed. Last error: {last_error}"
        )
        self._store.save(payment)
        return payment
    
    def capture(
        self, payment_id: str, amount: Optional[Decimal] = None
    ) -> Payment:
        """
        Capture a previously authorized payment.
        
        Can capture the full amount or a partial amount.
        Partial capture leaves the remaining authorization
        available for subsequent captures.
        """
        payment = self._store.get(payment_id)
        
        if payment.status not in (
            PaymentStatus.AUTHORIZED, 
            PaymentStatus.PARTIALLY_CAPTURED
        ):
            raise InvalidStateTransition(
                f"Cannot capture payment in {payment.status.value} state"
            )
        
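        remaining = payment.amount - payment.captured_amount
        # Compare against None: Decimal(0) is falsy, so `amount or ...`
        # would turn an explicit zero into a full capture
        capture_amount = remaining if amount is None else amount
        if capture_amount <= 0:
            raise ValueError("Capture amount must be positive")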
        
        if capture_amount > payment.amount - payment.captured_amount:
            raise ValueError(
                f"Capture amount {capture_amount} exceeds remaining "
                f"authorization {payment.amount - payment.captured_amount}"
            )
        
        connector = self._connectors[payment.psp_id]
        result = connector.capture(
            psp_reference=payment.psp_reference,
            amount=capture_amount,
            currency=payment.currency,
        )
        
        if result["status"] == "captured":
            payment.captured_amount += capture_amount
            
            if payment.captured_amount >= payment.amount:
                payment.transition_to(PaymentStatus.CAPTURED, "full_capture")
            else:
                payment.transition_to(
                    PaymentStatus.PARTIALLY_CAPTURED,
                    f"Partial capture: {capture_amount}"
                )
            
            self._store.save(payment)
        
        return payment
    
    def refund(
        self, payment_id: str, amount: Optional[Decimal] = None
    ) -> Payment:
        """
        Refund a captured payment.
        """
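        payment = self._store.get(payment_id)

        if payment.status not in (
            PaymentStatus.CAPTURED,
            PaymentStatus.PARTIALLY_CAPTURED,
            PaymentStatus.PARTIALLY_REFUNDED,
        ):
            raise InvalidStateTransition(
                f"Cannot refund payment in {payment.status.value} state"
            )

        refundable = payment.captured_amount - payment.refunded_amount
        # Compare against None so an explicit zero is not read as "refund all"
        refund_amount = refundable if amount is None else amount

        if refund_amount <= 0:
            raise ValueError("Refund amount must be positive")
        if refund_amount > refundable:
            raise ValueError("Refund exceeds captured amount")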
        
        connector = self._connectors[payment.psp_id]
        result = connector.refund(
            psp_reference=payment.psp_reference,
            amount=refund_amount,
            currency=payment.currency,
        )
        
        if result["status"] == "refunded":
            payment.refunded_amount += refund_amount
            
            if payment.refunded_amount >= payment.captured_amount:
                payment.transition_to(PaymentStatus.REFUNDED, "full_refund")
            else:
                payment.transition_to(
                    PaymentStatus.PARTIALLY_REFUNDED,
                    f"Partial refund: {refund_amount}"
                )
            
            self._store.save(payment)
        
        return payment
    
    def _circuit_breakers_record_success(self, psp_id: str):
        cb = self._router._circuit_breakers.get(psp_id)
        if cb:
            cb.record_success()
    
    def _circuit_breakers_record_failure(self, psp_id: str):
        cb = self._router._circuit_breakers.get(psp_id)
        if cb:
            cb.record_failure()

class PSPTimeoutError(Exception):
    pass

class PSPConnectionError(Exception):
    pass
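A timeout during authorization does not say whether the PSP approved the request, so failing over blindly risks reserving the funds twice. One possible guard is to ask the first PSP what happened before moving on. The sketch below assumes the connector exposes a `lookup` call keyed by the merchant reference; that method, the `StatusLookup` protocol, `StubConnector`, and `resolve_timeout` are all hypothetical, and real PSPs differ in whether and how they offer such a query.

```python
from typing import Optional, Protocol


class StatusLookup(Protocol):
    """Hypothetical connector capability: find a transaction by our reference."""

    def lookup(self, merchant_reference: str) -> Optional[dict]: ...


def resolve_timeout(connector: StatusLookup, merchant_reference: str) -> str:
    """
    Decide what to do after an authorization call timed out.

    Returns one of:
      "adopt"    - the PSP did authorize; keep that result
      "fail"     - the PSP declined; a hard decline is not retried elsewhere
      "failover" - the PSP never saw the request; safe to try another PSP
      "hold"     - outcome still unknown; do not risk a second authorization
    """
    try:
        found = connector.lookup(merchant_reference)
    except ConnectionError:
        return "hold"
    if found is None:
        return "failover"
    status = found.get("status")
    if status == "authorized":
        return "adopt"
    if status == "declined":
        return "fail"
    return "hold"


class StubConnector:
    """Test double that returns a canned lookup answer."""

    def __init__(self, answer: Optional[dict] = None, unreachable: bool = False):
        self.answer = answer
        self.unreachable = unreachable

    def lookup(self, merchant_reference: str) -> Optional[dict]:
        if self.unreachable:
            raise ConnectionError("PSP unreachable")
        return self.answer
```

The "hold" outcome is the important one: when the gateway cannot find out what happened, leaving the payment in PROCESSING for a reconciliation job is safer than guessing.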

Webhook Delivery

Merchants need asynchronous status updates. Webhook delivery must be reliable — lost notifications mean merchants and customers don’t know if a payment succeeded:

@dataclass
class WebhookDelivery:
    delivery_id: str
    payment_id: str
    merchant_id: str
    url: str
    payload: dict
    
    # Delivery tracking
    attempt_count: int = 0
    max_attempts: int = 14   # One attempt per entry in RETRY_DELAYS
    next_attempt_at: datetime = field(default_factory=datetime.utcnow)
    last_response_code: int = 0
    status: str = "pending"  # "pending", "delivered", "failed", "expired"
    
    # Retry schedule: escalating delays with jitter.
    # Wait before each attempt, measured from the previous one:
    # 0s, 30s, 1m, 5m, 15m, 30m, 1h, 2h, 4h, 8h, 12h, 24h, 48h, 72h
    RETRY_DELAYS = [
        0, 30, 60, 300, 900, 1800, 3600, 7200,
        14400, 28800, 43200, 86400, 172800, 259200
    ]
    
    def compute_next_retry(self) -> datetime:
        """
        Compute the next retry time from the backoff schedule.
        
        The schedule allows 14 attempts. Each delay is added to the
        current time, so the longest single wait is 72 hours and the
        last attempt lands about 7 days after the first. After that,
        the webhook is marked as failed and the merchant must poll.
        """
        if self.attempt_count >= len(self.RETRY_DELAYS):
            self.status = "failed"
            return self.next_attempt_at
        
        delay = self.RETRY_DELAYS[self.attempt_count]
        # Add jitter (±10%) to prevent thundering herd
        import random
        jitter = delay * random.uniform(-0.1, 0.1)
        
        return datetime.utcnow() + timedelta(seconds=delay + jitter)
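Because `compute_next_retry` adds each delay to the current time, the delays accumulate. The sketch below adds them up to show when each attempt would fire relative to the first, assuming every retry runs exactly on schedule and ignoring jitter. `offsets` and `jittered` are illustrative names; `jittered` mirrors the ±10% rule from the method above.

```python
import random
from datetime import timedelta
from itertools import accumulate

RETRY_DELAYS = [
    0, 30, 60, 300, 900, 1800, 3600, 7200,
    14400, 28800, 43200, 86400, 172800, 259200,
]

# Offset of each attempt from the first, if every retry fires on schedule.
offsets = [timedelta(seconds=s) for s in accumulate(RETRY_DELAYS)]


def jittered(delay: float, spread: float = 0.1, rng=random) -> float:
    """Apply uniform jitter of +/- spread (10% by default) to a delay."""
    return delay + delay * rng.uniform(-spread, spread)


for attempt, offset in enumerate(offsets, start=1):
    print(f"attempt {attempt:2d}: +{offset}")
```

The list has 14 entries, and the final attempt lands 7 days, 3 hours, 51 minutes, and 30 seconds after the first. Merchants therefore need a polling fallback that covers at least a week of webhook silence.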

A payment gateway is ultimately a state machine with money attached. Every decision, whether retry strategy, timeout value, or failover logic, has financial consequences. A timeout that is too long (30 seconds, say) leaves the merchant’s customer waiting unnecessarily. A timeout that is too short means the gateway abandons an authorization that the PSP goes on to approve, so funds are held on the card without the merchant knowing. The state machine enforces invariants that prevent financial inconsistencies, and the audit trail ensures every penny can be accounted for.