resilience patterns in production

Fallback Design

Chapter 7 of 40

A fallback is the answer to a question you must ask before writing the first line of resilience code: what does the system do when this dependency is unavailable?

If you cannot answer that question for every dependency in your service, adding a circuit breaker or a retry policy is decoration. The circuit breaker opens. The retry exhausts. Then what? If the answer is “throw an exception and return HTTP 500,” then the fallback is “break.” That might be correct for the balance service, which cannot serve stale data for a payment decision. It is not correct for the notification service, where a delayed email is acceptable.

Fallback design is the discipline of deciding, per dependency and per operation, what the acceptable degraded behavior is.

The Fallback Decision Framework

[Figure: Fallback Decision Tree]

The decision tree shows three distinct paths when a primary call fails. If the caller can tolerate stale data, serve from cache, with a maximum staleness (TTL) set per endpoint. If the call is not on the critical path, skip it and proceed (log a metric, deliver later). If the call is critical and stale data is not acceptable, fail fast with a clear error. The bottom row maps these decisions to the transaction platform: balance reads that can tolerate staleness use cached last-known values (the payment decision, covered below, cannot), fraud detection allows transactions with a manual review flag, and notifications enqueue for later delivery.
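The first path, serving stale data up to a maximum age, can be sketched as a small last-known-good cache. This is a minimal illustration, not the platform's implementation: the class name StaleValueCache and its methods are assumptions, and a real service would likely use an existing caching library.

```java
import java.time.Clock;
import java.time.Duration;
import java.time.Instant;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/** Hypothetical last-known-good cache that refuses entries older than maxStaleness. */
final class StaleValueCache<K, V> {

    private record Entry<T>(T value, Instant storedAt) {}

    private final ConcurrentMap<K, Entry<V>> entries = new ConcurrentHashMap<>();
    private final Duration maxStaleness;
    private final Clock clock;

    StaleValueCache(Duration maxStaleness, Clock clock) {
        this.maxStaleness = maxStaleness;
        this.clock = clock;
    }

    /** Call after every successful primary call. */
    void remember(K key, V value) {
        entries.put(key, new Entry<>(value, clock.instant()));
    }

    /** Call from the fallback path. Empty means there is no acceptable stale value. */
    Optional<V> lastKnown(K key) {
        Entry<V> entry = entries.get(key);
        if (entry == null) {
            return Optional.empty();
        }
        Duration age = Duration.between(entry.storedAt(), clock.instant());
        return age.compareTo(maxStaleness) <= 0
                ? Optional.of(entry.value())
                : Optional.empty();
    }
}
```

An empty result from lastKnown means the staleness budget is exhausted, at which point the caller falls through to one of the other two paths.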

The key insight: the fallback decision is a business decision, not a technical one. Engineering determines what is technically possible (caching, queuing, skipping). The business determines what is acceptable. “Can we process a payment without fraud scoring?” is a risk question, not an engineering question. Engineering’s job is to make the fallback work correctly once the decision is made.
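One way to keep the business decision visible is to record it as data instead of burying it in catch blocks. The sketch below is an assumption about how that could look: the operation names, the OperationProfile record, and the FallbackPolicy class are all hypothetical, and the two entries simply mirror the balance and notification decisions described in this chapter.

```java
import java.util.Map;

/** The three outcomes of the decision tree. */
enum FallbackStrategy { SERVE_STALE, SKIP_AND_DEFER, FAIL_FAST }

/** Hypothetical per-operation answers supplied by the business owners. */
record OperationProfile(boolean toleratesStaleData, boolean onCriticalPath) {}

final class FallbackPolicy {

    private FallbackPolicy() {}

    /** Walks the decision tree in the same order as the text. */
    static FallbackStrategy choose(OperationProfile profile) {
        if (profile.toleratesStaleData()) {
            return FallbackStrategy.SERVE_STALE;
        }
        if (!profile.onCriticalPath()) {
            return FallbackStrategy.SKIP_AND_DEFER;
        }
        return FallbackStrategy.FAIL_FAST;
    }

    // Illustrative entries only; operation names are made up for this sketch.
    static final Map<String, OperationProfile> PROFILES = Map.of(
            "balance.reserve", new OperationProfile(false, true),
            "notification.send", new OperationProfile(false, false));
}
```

With the answers in one table, a change in risk appetite becomes a reviewed change to a profile, not a code change scattered across services.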

Fallback Strategies by Dependency

Fraud Detection Fallback

When fraud detection is unavailable, the payment service has two options: block all payments or allow payments with reduced fraud protection. For the transaction platform, the business decision is to allow payments under a configurable amount with a flag for manual review.

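// PRODUCTION - Fraud detection fallback
import java.math.BigDecimal;

import io.micrometer.core.instrument.MeterRegistry;
import org.springframework.stereotype.Component;

@Component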
public class FraudFallback {

    // Hardcoded for brevity; in production this limit would come from configuration.
    private static final BigDecimal AUTO_APPROVE_LIMIT = new BigDecimal("50.00");
    private final MeterRegistry registry;

    public FraudFallback(MeterRegistry registry) {
        this.registry = registry;
    }

    public FraudScore fallbackScore(PaymentRequest request, Throwable cause) {
        registry.counter("fraud.fallback.invoked",
                "reason", cause.getClass().getSimpleName()).increment();

        if (request.amount().compareTo(AUTO_APPROVE_LIMIT) <= 0) {
            // Low-value transaction: approve with flag
            registry.counter("fraud.fallback.auto_approved").increment();
            return new FraudScore(
                    0.0,         // No score available
                    true,        // Approved
                    true,        // Flagged for manual review
                    "Fraud service unavailable. Auto-approved (amount <= "
                    + AUTO_APPROVE_LIMIT + "). Queued for manual review."
            );
        } else {
            // High-value transaction: reject
            registry.counter("fraud.fallback.rejected").increment();
            return new FraudScore(
                    1.0,         // Maximum risk score
                    false,       // Not approved
                    true,        // Flagged
                    "Fraud service unavailable. High-value transaction rejected. "
                    + "Retry when fraud service recovers."
            );
        }
    }
}

This fallback has measurable properties. The fraud.fallback.invoked counter tells you how often it fires. The fraud.fallback.auto_approved counter tells you how many transactions bypassed fraud scoring. These metrics feed into the risk team’s dashboards.

Notification Fallback

Notifications are not critical path. A failed notification should never block a payment.

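// PRODUCTION - Notification fallback: queue for later delivery
import java.time.Instant;
import java.util.concurrent.BlockingQueue;

import io.micrometer.core.instrument.MeterRegistry;
import org.springframework.beans.factory.annotation.Qualifier;
import org.springframework.stereotype.Component;

@Component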
public class NotificationFallback {

    private final BlockingQueue<NotificationRequest> retryQueue;
    private final MeterRegistry registry;

    public NotificationFallback(
            @Qualifier("notificationRetryQueue") BlockingQueue<NotificationRequest> retryQueue,
            MeterRegistry registry) {
        this.retryQueue = retryQueue;
        this.registry = registry;
    }

    public void fallbackNotify(String userId, PaymentConfirmation confirmation,
                                Throwable cause) {
        registry.counter("notification.fallback.invoked",
                "reason", cause.getClass().getSimpleName()).increment();

        NotificationRequest retryRequest = new NotificationRequest(
                userId, confirmation, Instant.now(), 0);

        // offer() returns false immediately when a bounded queue is full
        if (retryQueue.offer(retryRequest)) {
            registry.counter("notification.fallback.queued").increment();
        } else {
            // Queue is full: the notification is lost.
            // This is acceptable: payment succeeded, notification is best-effort.
            registry.counter("notification.fallback.dropped").increment();
        }
    }
}

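A background scheduled task drains the retry queue and attempts redelivery. Queued notifications arrive late instead of being lost, and the payment is never delayed. A notification is lost only if the queue is full when the fallback runs or, with an in-memory queue, if the process restarts before the queue drains.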
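The drain task itself is not shown in this chapter. A minimal sketch, assuming a generic queue and a delivery callback that returns true on success, might look like the following. The class name RetryQueueDrainer, the maxPerRun budget, and the callback shape are assumptions; attempt counting and backoff are left out.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Predicate;

/** Hypothetical drainer: retries queued items, re-queuing the ones that still fail. */
final class RetryQueueDrainer<T> {

    private final BlockingQueue<T> retryQueue;
    private final Predicate<T> deliver;   // returns true when delivery succeeded
    private final int maxPerRun;

    RetryQueueDrainer(BlockingQueue<T> retryQueue, Predicate<T> deliver, int maxPerRun) {
        this.retryQueue = retryQueue;
        this.deliver = deliver;
        this.maxPerRun = maxPerRun;
    }

    /** Processes each currently queued item at most once; returns how many were delivered. */
    int drainOnce() {
        int budget = Math.min(maxPerRun, retryQueue.size());
        int delivered = 0;
        for (int i = 0; i < budget; i++) {
            T item = retryQueue.poll();
            if (item == null) {
                break;
            }
            boolean ok;
            try {
                ok = deliver.test(item);
            } catch (RuntimeException e) {
                // Never let an exception escape: a scheduled task that throws is not run again.
                ok = false;
            }
            if (ok) {
                delivered++;
            } else {
                retryQueue.offer(item);   // best effort: dropped if the queue filled up meanwhile
            }
        }
        return delivered;
    }

    /** Runs drainOnce repeatedly on a single background thread. */
    ScheduledExecutorService start(long period, TimeUnit unit) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleWithFixedDelay(this::drainOnce, period, period, unit);
        return scheduler;
    }
}
```

Capping each run at the current queue size keeps a persistently failing item from being retried in a tight loop within one run.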

Balance Service: No Fallback

The balance service checks whether funds are available. Serving stale balance data for a payment decision is not acceptable: you could approve a payment against insufficient funds. The fallback for the balance service is to fail the payment.

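// PRODUCTION - Balance fallback: fail fast
import io.micrometer.core.instrument.MeterRegistry;
import org.springframework.stereotype.Component;

@Component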
public class BalanceFallback {

    private final MeterRegistry registry;

    public BalanceFallback(MeterRegistry registry) {
        this.registry = registry;
    }

    public Balance fallbackCheck(PaymentRequest request, Throwable cause) {
        registry.counter("balance.fallback.payment_rejected",
                "reason", cause.getClass().getSimpleName()).increment();

        throw new PaymentRejectedException(
                "Balance service unavailable. Cannot verify funds. Payment rejected.",
                cause);
    }
}

This is a fallback that fails. That is a valid design. The important thing is that it fails deliberately, with a clear error message, a counter metric, and the information the caller needs to communicate the failure to the user. Letting the exception propagate unhandled is not the same thing. An unhandled exception produces a stack trace in the logs and an HTTP 500 with no useful information.
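To show what "the information the caller needs" could mean at the API boundary, here is a hedged sketch of mapping the deliberate failure to a response. Everything in it is an assumption: the PaymentRejectedException below is a stand-in for the chapter's class (which is not shown), and the ErrorResponse shape, the error codes, and the choice of HTTP 503 are illustrative.

```java
/** Stand-in for the chapter's PaymentRejectedException, whose definition is not shown. */
class PaymentRejectedException extends RuntimeException {
    PaymentRejectedException(String message, Throwable cause) {
        super(message, cause);
    }
}

/** Hypothetical error body returned to the API caller. */
record ErrorResponse(int status, String code, String message) {}

final class PaymentErrorMapper {

    private PaymentErrorMapper() {}

    static ErrorResponse toResponse(Throwable error) {
        if (error instanceof PaymentRejectedException) {
            // Deliberate failure: a dependency was down, so tell the caller it can retry later.
            return new ErrorResponse(503, "PAYMENT_REJECTED_DEPENDENCY_UNAVAILABLE",
                    error.getMessage());
        }
        // Anything else is unexpected; do not leak internals to the caller.
        return new ErrorResponse(500, "INTERNAL_ERROR", "Unexpected error");
    }
}
```

The difference from an unhandled exception is that the first branch carries a stable code and a message written for the caller.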

Composing Fallbacks

When multiple dependencies are unavailable simultaneously, the fallback logic must compose correctly.

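// PRODUCTION - Payment orchestration with composed fallbacks
import org.springframework.stereotype.Service;

@Service
public class PaymentService {

    private final FraudDetectionClient fraudClient;
    private final FraudFallback fraudFallback;
    private final BalanceClient balanceClient;
    private final BalanceFallback balanceFallback;
    // Gateway used in step 3; the PaymentGateway type name is assumed.
    private final PaymentGateway paymentGateway;
    private final NotificationClient notificationClient;
    private final NotificationFallback notificationFallback;

    public PaymentService(FraudDetectionClient fraudClient,
                          FraudFallback fraudFallback,
                          BalanceClient balanceClient,
                          BalanceFallback balanceFallback,
                          PaymentGateway paymentGateway,
                          NotificationClient notificationClient,
                          NotificationFallback notificationFallback) {
        this.fraudClient = fraudClient;
        this.fraudFallback = fraudFallback;
        this.balanceClient = balanceClient;
        this.balanceFallback = balanceFallback;
        this.paymentGateway = paymentGateway;
        this.notificationClient = notificationClient;
        this.notificationFallback = notificationFallback;
    }

    public PaymentResult processPayment(PaymentRequest request) {
        // Step 1: Fraud check (fallback: conditional approval)
        FraudScore fraudScore;
        try {
            fraudScore = fraudClient.score(request);
        } catch (Exception e) {
            fraudScore = fraudFallback.fallbackScore(request, e);
        }
        // Applies to real scores and fallback scores alike
        if (!fraudScore.approved()) {
            return PaymentResult.rejected(fraudScore.reason());
        }

        // Step 2: Balance check (fallback: reject payment)
        Balance balance;
        try {
            balance = balanceClient.reserve(request);
        } catch (Exception e) {
            // balanceFallback.fallbackCheck always throws PaymentRejectedException,
            // which propagates to the caller
            balance = balanceFallback.fallbackCheck(request, e);
        }

        // Step 3: Execute payment
        PaymentConfirmation confirmation = paymentGateway.charge(request);

        // Step 4: Notification (fallback: queue for later)
        try {
            notificationClient.notify(request.userId(), confirmation);
        } catch (Exception e) {
            notificationFallback.fallbackNotify(
                    request.userId(), confirmation, e);
        }

        return PaymentResult.success(confirmation, fraudScore, balance);
    }
}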

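Each dependency has its own fallback strategy, chosen based on the dependency’s role in the transaction. The fraud fallback sometimes succeeds and sometimes fails, depending on the transaction amount. The balance fallback always fails. The notification fallback always lets the payment succeed: it queues the notification when it can and drops it when the queue is full. Under partial dependency failure the payment still behaves predictably: it completes when notifications are down or when fraud detection is down for a low-value transaction, and it is rejected deliberately when the balance service is down.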
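The composed behavior can be written down as a small truth table, which is useful as a source of test cases. The model below is a simplification and an assumption: it treats a healthy fraud service as approving and a healthy balance service as finding sufficient funds, and the names PaymentOutcome and ComposedFallbackModel are hypothetical.

```java
import java.math.BigDecimal;

enum PaymentOutcome { SUCCESS, SUCCESS_FLAGGED_FOR_REVIEW, REJECTED }

/** Hypothetical model of the outcome for each combination of dependency outages. */
final class ComposedFallbackModel {

    static final BigDecimal AUTO_APPROVE_LIMIT = new BigDecimal("50.00");

    private ComposedFallbackModel() {}

    static PaymentOutcome expected(boolean fraudUp, boolean balanceUp,
                                   boolean notificationUp, BigDecimal amount) {
        boolean flagged = false;
        if (!fraudUp) {
            if (amount.compareTo(AUTO_APPROVE_LIMIT) > 0) {
                return PaymentOutcome.REJECTED;       // high value, no fraud score
            }
            flagged = true;                           // low value, manual review
        }
        if (!balanceUp) {
            return PaymentOutcome.REJECTED;           // cannot verify funds
        }
        // notificationUp is deliberately ignored: notifications never block a payment.
        return flagged ? PaymentOutcome.SUCCESS_FLAGGED_FOR_REVIEW : PaymentOutcome.SUCCESS;
    }
}
```

For example, with fraud detection and notifications both down, a 20.00 payment is expected to succeed with a review flag, while a 75.00 payment is rejected.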

The Metrics That Matter

Every fallback must record:

  1. How often it fires. The *.fallback.invoked counter. If this is zero in production, the fallback has never been tested by real traffic. That does not mean it works.

  2. What it decided. The *.fallback.auto_approved, *.fallback.rejected, *.fallback.queued counters. These tell you the business impact of the fallback.

  3. Why it fired. The reason tag on the invocation counter. Was it a timeout? A connection refused? A circuit breaker open? The reason determines whether the fallback is handling transient issues or a prolonged outage.
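The listings in this chapter tag the counter with the exception's simple class name. Because clients often wrap the low-level exception, a slightly more robust reason tag can walk the cause chain first. This is a sketch under assumptions: the FallbackReason class and the tag values are hypothetical, and detecting an open circuit breaker would need the exception type of whichever breaker library is in use, which is omitted here.

```java
import java.net.ConnectException;
import java.net.SocketTimeoutException;
import java.util.concurrent.TimeoutException;

/** Hypothetical helper that turns a failure into a low-cardinality reason tag. */
final class FallbackReason {

    private static final int MAX_DEPTH = 10;   // guards against pathological cause chains

    private FallbackReason() {}

    static String tagFor(Throwable cause) {
        if (cause == null) {
            return "unknown";
        }
        Throwable current = cause;
        for (int depth = 0; current != null && depth < MAX_DEPTH; depth++) {
            if (current instanceof SocketTimeoutException
                    || current instanceof TimeoutException) {
                return "timeout";
            }
            if (current instanceof ConnectException) {
                // Covers "connection refused" and other connect-time failures.
                return "connection_failed";
            }
            current = current.getCause();
        }
        // Fall back to the class name, as the chapter's listings do.
        return cause.getClass().getSimpleName();
    }
}
```

A small, fixed set of tag values also keeps the metric's cardinality under control.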

Fallbacks that fire and are never reviewed are worse than no fallback. They silently change system behavior. The fraud fallback auto-approving transactions without fraud scoring is a risk exposure. The notification fallback dropping messages because the retry queue is full is data loss. These are acceptable during short outages and unacceptable during prolonged ones. The metrics tell you which scenario you are in.
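Telling a short outage from a prolonged one comes down to how long the fallback has been firing without the primary recovering. The sketch below tracks that in process; it is an assumption for illustration (the class name FallbackOutageTracker and the threshold are hypothetical), and in practice the same signal would more likely come from an alert rule on the counters above.

```java
import java.time.Clock;
import java.time.Duration;
import java.time.Instant;

/** Hypothetical tracker: reports when a fallback has fired continuously for too long. */
final class FallbackOutageTracker {

    private final Duration prolongedAfter;
    private final Clock clock;
    private Instant firingSince;   // null while the primary is healthy

    FallbackOutageTracker(Duration prolongedAfter, Clock clock) {
        this.prolongedAfter = prolongedAfter;
        this.clock = clock;
    }

    /** Call whenever the fallback is invoked. */
    synchronized void recordFallback() {
        if (firingSince == null) {
            firingSince = clock.instant();
        }
    }

    /** Call whenever the primary call succeeds. Simplification: one success ends the outage. */
    synchronized void recordPrimarySuccess() {
        firingSince = null;
    }

    /** True once the fallback has been active for at least prolongedAfter. */
    synchronized boolean isProlonged() {
        return firingSince != null
                && Duration.between(firingSince, clock.instant()).compareTo(prolongedAfter) >= 0;
    }
}
```

When isProlonged turns true, the fallback's business decision (for example, auto-approving without fraud scoring) is due for human review.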