surviving the spike

When to Rewrite vs When to Scale: The Honest Conversation

7 min read Chapter 64 of 66

The Symptom

The ride-hailing platform’s fare calculation service hits 10,000 requests per second during the Friday evening surge. p99 jumps from 120ms to 2,400ms. Ride bookings start timing out. The error rate hits 8%. Revenue loss: $12,000 per minute.

The on-call engineer restarts pods. p99 drops to 900ms. Still above the 500ms SLO. The team adds 4 more pods. p99 drops to 600ms. Still above. The engineering manager calls an emergency meeting. The senior engineer opens the meeting with: “We need to rewrite the fare service in Rust. The JVM cannot handle this throughput.”

The JVM can handle this throughput. It handles 10x this throughput for companies processing financial transactions. The fare service does not have a language problem. It has an undiagnosed operational problem that a rewrite will carry forward into the new codebase.

The principal engineer asks: “Before we spend 18 months rewriting, can someone run EXPLAIN ANALYZE on the fare calculation query?”

The Cause

The instinct to rewrite comes from frustration. The system is slow. The code is messy. The architecture diagram has arrows pointing everywhere. The rewrite promises a clean start. New language, new framework, new patterns. The team will get it right this time.

They will not. Because the throughput problem is rarely in the language or framework. It is in the queries, the pool sizes, the missing caches, the synchronous calls that should be async, the absent autoscaling rules. These problems follow the team into the new codebase because the team carries the same mental model of how the system should work.

The rewrite trap works like this:

Month 0:   "The system is slow. Let's rewrite in [new language]."
Month 6:   New service handles 3 endpoints. Old system still
           runs the other 47. Two systems to maintain.
Month 12:  New service hits the same throughput wall because
           the same query patterns and pool sizes were used.
Month 18:  Rewrite is "done." Same p99. Different language.
           Team is demoralized. Budget is gone.

The diagnostic must happen before the decision. Every scaling technique in this book, from CH1 through CH21, exists on a cost spectrum. The cheapest fixes (indexes, pool sizes) take hours. The most expensive (multi-region, service extraction) take months. A systematic walk through the spectrum, starting from the cheapest, answers the question: is this an operational problem or an architectural one?

Cost Spectrum (cheapest → most expensive):

Hours:    Indexes, pool sizes, thread config
Days:     Caching layers (Redis, Caffeine), autoscaling
Weeks:    Async processing (Kafka)
Weeks:    Rate limiting, circuit breakers, read replicas
Months:   Database sharding
Months:   Service extraction (strangler fig)
18+ mo:   Full rewrite

Most scaling problems are solved in the "hours" row.

The Baseline

The fare calculation service at 10,000 RPS. The query:

-- BOTTLENECK: Fare calculation query
-- No composite index. Full table scan on zone lookups.
SELECT
    base_fare, per_km_rate, per_min_rate,
    surge_multiplier, time_of_day_factor
FROM fare_config fc
JOIN surge_zones sz ON fc.zone_id = sz.zone_id
JOIN time_factors tf ON fc.time_slot = tf.time_slot
WHERE sz.zone_id = ?
  AND tf.day_of_week = ?
  AND tf.hour_of_day = ?
  AND fc.vehicle_type = ?
  AND fc.effective_date <= CURRENT_DATE
  AND (fc.expiry_date IS NULL
       OR fc.expiry_date > CURRENT_DATE);

EXPLAIN ANALYZE output:
  Seq Scan on fare_config  (rows=340,000)
  Hash Join on surge_zones
  Hash Join on time_factors
  Filter: effective_date <= CURRENT_DATE
  Execution time: 45ms per query

At 10,000 RPS: 45ms * 10,000 = 450,000ms of PG
compute per second. That is 450 CPU-seconds per real second.
The database needs 450 CPUs just for this query.

The team proposed an 18-month rewrite. The actual fix:

-- SCALED: Composite index
CREATE INDEX idx_fare_config_lookup
    ON fare_config (zone_id, vehicle_type, effective_date)
    INCLUDE (base_fare, per_km_rate, per_min_rate);

CREATE INDEX idx_time_factors_lookup
    ON time_factors (day_of_week, hour_of_day, time_slot);

EXPLAIN ANALYZE after index:
  Index Scan on idx_fare_config_lookup  (rows=1)
  Nested Loop on time_factors using idx_time_factors_lookup
  Execution time: 0.3ms per query

At 10,000 RPS: 0.3ms * 10,000 = 3,000ms of PG compute
per second. 3 CPU-seconds. Down from 450.

A 150x improvement. Two CREATE INDEX statements. Executed in 12 seconds. No rewrite.
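
The before-and-after arithmetic can be reproduced in a few lines. This is a minimal sketch that follows the chapter's simplification: it treats the whole query execution time as CPU time, which ignores I/O waits and lock contention. The function name `cpu_seconds_per_second` is an illustrative choice, not something from the fare service.

```python
def cpu_seconds_per_second(query_ms: float, rps: int) -> float:
    """Database compute demanded per wall-clock second, in CPU-seconds.

    Assumes every millisecond of query execution is CPU time.
    """
    return query_ms * rps / 1000.0


RPS = 10_000  # Friday evening surge load from the chapter

before = cpu_seconds_per_second(45.0, RPS)  # sequential scan, 45ms per query
after = cpu_seconds_per_second(0.3, RPS)    # index scan, 0.3ms per query
speedup = before / after

print(f"before index: {before:.0f} CPU-seconds per second")
print(f"after index:  {after:.0f} CPU-seconds per second")
print(f"improvement:  {speedup:.0f}x")
```

The same function works for any query: plug in the execution time from EXPLAIN ANALYZE and the peak request rate, and compare the result with the number of cores the database actually has.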

The Fix

The Scaling Ceiling Diagnostic

Before discussing rewrites, run the diagnostic. Ten steps, ordered by cost. Stop when the SLO is met.

// SCALED: Diagnostic results tracking
import java.time.Duration;

// One row per diagnostic step: what changed and what it bought.
public record DiagnosticResult(
    String step,
    String change,
    Duration executionTime,
    double p99Before,
    double p99After,
    double rpsCapacity,
    boolean sloMet
) {}

Step  Technique              CH   Cost    Time
1     Check indexes          CH8  Free    Hours
2     Connection pool sizing CH4  Free    Hours
3     Thread pool config     CH4  Free    Hours
4     Add caching            CH5  Low     Days
5     Autoscaling            CH13 Low     Days
6     Async (Kafka)          CH9  Medium  Weeks
7     Rate limiting          CH10 Medium  Weeks
8     Circuit breakers       CH18 Medium  Weeks
9     Read replicas          CH8  High    Weeks
10    Shard database         CH8  High    Months

If all 10 are applied and the SLO is still violated → architectural.

The ride-hailing fare calculation stopped at step 1. The index fixed it. No further steps needed. No rewrite needed.
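
The "stop when the SLO is met" rule can be sketched as a loop. This is an illustration under stated assumptions, not the book's tooling: `Step`, `apply_and_measure`, and `run_diagnostic` are hypothetical names, and the p99 values returned by the lambdas are made-up placeholders standing in for a real load test after each change.

```python
from dataclasses import dataclass
from typing import Callable, List

SLO_P99_MS = 500.0  # p99 SLO stated in the chapter


@dataclass
class Step:
    name: str
    # Hypothetical hook: applies the change, reruns the load test,
    # and returns the new p99 in milliseconds.
    apply_and_measure: Callable[[], float]


def run_diagnostic(steps: List[Step], slo_ms: float = SLO_P99_MS) -> str:
    """Walk the steps cheapest-first and stop at the first one that meets the SLO."""
    for number, step in enumerate(steps, start=1):
        p99 = step.apply_and_measure()
        print(f"step {number}: {step.name} -> p99 {p99:.0f}ms")
        if p99 < slo_ms:
            return f"operational: fixed at step {number} ({step.name})"
    return "architectural: every step applied, SLO still violated"


# Placeholder measurements, for illustration only.
fixed_early = [
    Step("Check indexes", lambda: 90.0),
    Step("Connection pool sizing", lambda: 80.0),
]
never_fixed = [
    Step("Check indexes", lambda: 2300.0),
    Step("Connection pool sizing", lambda: 2100.0),
]

verdict_early = run_diagnostic(fixed_early)
verdict_never = run_diagnostic(never_fixed)
print(verdict_early)
print(verdict_never)
```

The ordering of the list is the point of the design: a cheaper step always runs before a more expensive one, so the loop never pays for a technique the system did not need.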

The Strangler Fig Pattern

When the diagnostic reveals an architectural problem, the answer is still not a full rewrite. The strangler fig pattern extracts components incrementally:

Figure: The strangler fig pattern in three phases: monolith (red), partial extraction with facade router (yellow), and fully decomposed microservices (green).

The strangler fig pattern extracts services incrementally without a risky full rewrite. Phase 1 shows the starting state: all functionality packed into a single monolith (red) with a shared database. Phase 2 introduces a facade router (yellow) that transparently directs traffic: most requests still hit the shrinking monolith, while the newly extracted driver matching service (green) handles its own data. Phase 3 shows the end state: an API gateway routes to four independent services, each with its own database, fully decoupled and independently deployable.

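// SCALED: Strangler fig facade router
import org.springframework.beans.factory.annotation.Qualifier;
import org.springframework.stereotype.Component;
import org.springframework.web.reactive.function.client.WebClient;
import reactor.core.publisher.Mono;

@Component
public class StranglerFigRouter {

    private final WebClient monolithClient;
    private final WebClient driverMatchingClient;
    private final FeatureFlagService featureFlags;

    // Constructor injection. The qualifier names are assumptions:
    // two WebClient beans need some way to be told apart.
    public StranglerFigRouter(
            @Qualifier("monolithClient")
                WebClient monolithClient,
            @Qualifier("driverMatchingClient")
                WebClient driverMatchingClient,
            FeatureFlagService featureFlags) {
        this.monolithClient = monolithClient;
        this.driverMatchingClient = driverMatchingClient;
        this.featureFlags = featureFlags;
    }

    public Mono<DriverMatch> matchDriver(
            MatchRequest request) {
        if (featureFlags.isEnabled(
                "driver-matching-extraction")) {
            // Route to new service
            return driverMatchingClient.post()
                .uri("/api/match")
                .bodyValue(request)
                .retrieve()
                .bodyToMono(DriverMatch.class);
        }
        // Route to monolith
        return monolithClient.post()
            .uri("/internal/drivers/match")
            .bodyValue(request)
            .retrieve()
            .bodyToMono(DriverMatch.class);
    }
}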
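
The phase 2 facade boils down to a routing decision: extracted paths go to the new service, everything else still goes to the monolith. The sketch below shows that decision as a prefix table, in Python for brevity. The backend URLs, the `EXTRACTED` table, and `resolve_backend` are hypothetical names chosen for illustration; the chapter's own router uses a feature flag rather than a table.

```python
from typing import Dict

# Hypothetical backend addresses, for illustration only.
MONOLITH = "http://monolith:8080"
EXTRACTED: Dict[str, str] = {
    "/api/match": "http://driver-matching:8080",
}


def resolve_backend(path: str, extracted: Dict[str, str] = EXTRACTED) -> str:
    """Return the backend base URL for a request path.

    A path belongs to an extracted service when it equals a registered
    prefix or sits underneath it. The longest matching prefix wins.
    Anything unmatched falls through to the monolith.
    """
    matches = [
        prefix for prefix in extracted
        if path == prefix or path.startswith(prefix.rstrip("/") + "/")
    ]
    if not matches:
        return MONOLITH
    return extracted[max(matches, key=len)]


print(resolve_backend("/api/match"))           # extracted service
print(resolve_backend("/api/fares/estimate"))  # still the monolith
```

Each extraction adds one entry to the table. When the table covers every path, the monolith default is never reached and the facade has become the phase 3 gateway.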

The Proof

The Final Locust Test

The ride-hailing platform after 21 chapters of optimizations. Full scenario at 3x production load:

# SCALED: Final Locust test - full ride-hailing at 3x load
from locust import HttpUser, task, between
import random

class FinalSpikeRider(HttpUser):
    wait_time = between(0.1, 0.3)
    host = "http://rider-api:8080"

    @task(10)
    def full_ride_flow(self):
        # Step 1: Fare estimate
        self.client.get(
            "/api/fares/estimate?zoneId=manhattan-midtown"
            + "&vehicleType=standard")

        # Step 2: Book ride
        booking = self.client.post("/api/rides/book", json={
            "riderId": f"rider-{random.randint(1, 100000)}",
            "pickupLat": 40.7128, "pickupLng": -74.0060,
            "dropoffLat": 40.7580, "dropoffLng": -73.9855,
            "zoneId": "manhattan-midtown",
            "vehicleType": "standard"
        })

        if booking.status_code == 200:
            trip_id = booking.json().get("tripId")
            if trip_id:
                # Step 3: Track trip. The name argument groups
                # every trip ID under one row in the stats.
                self.client.get(
                    f"/api/trips/{trip_id}/status",
                    name="/api/trips/{id}/status")

    @task(3)
    def surge_check(self):
        self.client.get(
            "/api/surge/zone/manhattan-midtown")

    @task(2)
    def trip_history(self):
        self.client.get(
            "/api/trips/history?riderId=rider-1&limit=10")


class FinalSpikeDriver(HttpUser):
    wait_time = between(0.5, 1.0)
    host = "http://driver-api:8080"

    @task(10)
    def update_location(self):
        self.client.post("/api/drivers/location", json={
            "driverId": f"driver-{random.randint(1, 5000)}",
            "lat": 40.7128 + random.uniform(-0.05, 0.05),
            "lng": -74.0060 + random.uniform(-0.05, 0.05)
        })

    @task(2)
    def accept_ride(self):
        self.client.post("/api/drivers/accept", json={
            "driverId": f"driver-{random.randint(1, 5000)}",
            "tripId": f"trip-{random.randint(1, 10000)}"
        })

3x Production Load Results (30,000 RPS, 30 minutes):

Endpoint                  p50    p99    Error%   RPS
/api/fares/estimate       12ms   85ms   0.01%    9,000
/api/rides/book           35ms   180ms  0.03%    6,000
/api/trips/{id}/status    8ms    45ms   0.00%    3,000
/api/surge/zone/{zone}    5ms    28ms   0.00%    3,000
/api/drivers/location     6ms    32ms   0.01%    6,000
/api/drivers/accept       18ms   95ms   0.02%    3,000

All endpoints within SLO (p99 < 500ms).
Total error rate: 0.02%.
Zero pod restarts. Zero circuit breaker openings.
HPA scaled from 6 to 14 pods during the ramp.
PostgreSQL CPU: 62% (headroom remaining).
Redis hit rate: 94%.

The system survived the spike.
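
The SLO claim can be checked mechanically against the results table. This is a small sketch that hard-codes the p99 and RPS columns from the table above; in a real run the numbers would come from Locust's stats export, which is not shown here. The variable names are illustrative.

```python
SLO_P99_MS = 500  # p99 SLO from the chapter

# endpoint -> (p99 in ms, RPS), copied from the results table
results = {
    "/api/fares/estimate": (85, 9_000),
    "/api/rides/book": (180, 6_000),
    "/api/trips/{id}/status": (45, 3_000),
    "/api/surge/zone/{zone}": (28, 3_000),
    "/api/drivers/location": (32, 6_000),
    "/api/drivers/accept": (95, 3_000),
}

# Endpoints whose p99 breaches the SLO (expected to be empty).
violations = [ep for ep, (p99, _) in results.items() if p99 >= SLO_P99_MS]

# The per-endpoint RPS values should add up to the stated total load.
total_rps = sum(rps for _, rps in results.values())

# The slowest endpoint shows how much headroom is left under the SLO.
worst_endpoint = max(results, key=lambda ep: results[ep][0])
headroom_ms = SLO_P99_MS - results[worst_endpoint][0]

print(f"violations: {violations}")
print(f"total RPS: {total_rps}")
print(f"worst p99: {worst_endpoint} with {headroom_ms}ms of headroom")
```

A check like this is useful as a pass/fail gate at the end of a load test, so that a regression in any single endpoint fails the run instead of hiding in an average.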

The detailed diagnostic process is covered in CH22-S1. The decision framework and the rewrite that failed are in CH22-S2.