surviving the spike

The Honest Decision Framework

Chapter 66 of 66

The Symptom

A team of 14 engineers spent 18 months rewriting the ride-hailing platform’s monolith in Go. The original was Java, Spring Boot, PostgreSQL. The rewrite was Go, gRPC, PostgreSQL. Same database. Same schema. Same query patterns. Same connection pool sizes (they copied the config). Same missing indexes (they copied the schema without analyzing it).

Launch day. The rewrite handles 3,200 RPS before p99 exceeds 500ms. The original monolith, with the optimizations from the diagnostic checklist, handles 12,000 RPS with a p99 of 180ms.

The Go rewrite was faster at parsing JSON. It used less memory per goroutine than Java used per virtual thread. The raw overhead was lower. But raw overhead was never the bottleneck. The bottleneck was a sequential scan on fare_config that neither the old codebase nor the new one had an index for. The rewrite reproduced every operational problem in a different language, then added new ones (the team was learning Go’s concurrency model on the job, and the first three months of production exposed race conditions that would not have existed in the original).

Cost of the rewrite: 14 engineers, 18 months, $2.1M in salary alone. Cost of the diagnostic: 1 engineer, 3 days, $1,800.

The Cause

“We need to rewrite” is an emotional statement dressed as a technical one. It usually means one of:

  1. “I am frustrated with the code quality.” (This is a refactoring problem, not a scaling problem.)
  2. “I want to use a newer technology.” (This is a career development interest, not a scaling problem.)
  3. “The system is slow and I don’t know why.” (This is a diagnostic problem.)
  4. “We have hit an architectural ceiling and cannot scale further.” (This might be a rewrite problem.)

Only #4 justifies a rewrite. And even #4 usually justifies targeted extraction, not a full rewrite.

The honest decision framework:

Question 1: Have you applied all 10 steps
            of the diagnostic checklist?
  No  → Apply them first. Stop here.
  Yes → Continue.

Question 2: Is the SLO still violated after
            all 10 steps?
  No  → The problem was operational. No rewrite needed.
  Yes → Continue.

Question 3: Can you extract the bottleneck
            using the strangler fig pattern?
  Yes, and it affects < 60% of the codebase
      → Extract it. Not a rewrite.
  Yes, but it affects > 60% of the codebase
      → This is effectively a rewrite.
      → Proceed with full cost analysis.
  No  → Revisit the diagnostic.
        Something was missed.
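
The three questions can be sketched as a small function. This is only an illustration of the framework above: the `Assessment` type and its field names are hypothetical, and treating exactly 60% as "effectively a rewrite" is an assumption of this sketch.

```python
from dataclasses import dataclass


@dataclass
class Assessment:
    """Answers to the three questions (hypothetical field names)."""
    checklist_steps_applied: int        # out of 10
    slo_violated_after_checklist: bool
    can_extract_with_strangler_fig: bool
    codebase_fraction_affected: float   # 0.0 to 1.0


def decide(a: Assessment) -> str:
    # Question 1: have all 10 diagnostic steps been applied?
    if a.checklist_steps_applied < 10:
        return "apply the remaining diagnostic steps"
    # Question 2: is the SLO still violated afterwards?
    if not a.slo_violated_after_checklist:
        return "no rewrite needed: the problem was operational"
    # Question 3: can the bottleneck be extracted?
    if not a.can_extract_with_strangler_fig:
        return "revisit the diagnostic: something was missed"
    # Boundary choice (assumption): 60% or more counts as a rewrite
    if a.codebase_fraction_affected < 0.60:
        return "extract the bottleneck: not a rewrite"
    return "effectively a rewrite: run the full cost analysis"


print(decide(Assessment(10, True, True, 0.15)))
```

Note that the function never returns "rewrite" outright. The most it can say is that a full cost analysis is warranted, which matches the framework's intent.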

The Baseline

The engineering team after the failed rewrite. Morale is low. The Go service is in production handling 30% of traffic. The Java monolith handles the other 70%. Two systems to maintain. Two deployment pipelines. Two on-call rotations. Two sets of bugs.

Current state (post-failed-rewrite):
  Java monolith:  70% of traffic, p99=180ms, 8 pods
  Go service:     30% of traffic, p99=320ms, 12 pods
  Total ops cost: 2 deployment pipelines, 2 on-call rotations
  Total infra:    20 pods (was 8 before the rewrite started)

If the monolith had received the diagnostic instead:
  Java monolith:  100% of traffic, p99=180ms, 8 pods
  Total ops cost: 1 deployment pipeline, 1 on-call rotation
  Total infra:    8 pods

The Fix

Strangler Fig Pattern: Driver Matching Extraction

The one component that genuinely needed extraction was driver matching. Not because of a language problem. Because its scaling requirements conflicted with the rest of the monolith. Driver matching needs CPU-heavy scoring across thousands of candidates. The trip service needs memory for caching. Scaling the monolith for driver matching wastes memory. Scaling it for trip caching wastes CPU.

Six steps:

Step 1: Facade Routing

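// SCALED: Step 1 - Route through a facade
import org.springframework.stereotype.Component;
import org.springframework.web.reactive.function.client.WebClient;
import reactor.core.publisher.Mono;

@Component
public class DriverMatchingFacade {

    private final DriverMatchingService monolithService;
    private final WebClient extractedService;
    private final FeatureFlagService flags;

    // Constructor injection assigns the final fields
    public DriverMatchingFacade(
            DriverMatchingService monolithService,
            WebClient extractedService,
            FeatureFlagService flags) {
        this.monolithService = monolithService;
        this.extractedService = extractedService;
        this.flags = flags;
    }

    public Mono<DriverMatch> match(MatchRequest request) {
        // Zones with the flag on go to the extracted service
        if (flags.isEnabled("driver-matching-v2",
                request.getZoneId())) {
            return extractedService.post()
                .uri("/api/v2/match")
                .bodyValue(request)
                .retrieve()
                .bodyToMono(DriverMatch.class)
                // Fallback to monolith if new
                // service fails
                .onErrorResume(ex -> monolithService
                    .findBestDriver(request));
        }
        return monolithService.findBestDriver(request);
    }
}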

Step 2: New Service with Its Own Database

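// SCALED: Step 2 - Extracted driver matching service
// Own database, own connection pool, own scaling
// (two source files shown together; ScoredDriver and
// calculateScore are not shown in this listing)
import static org.springframework.data.relational.core.query.Criteria.where;
import static org.springframework.data.relational.core.query.Query.query;

import java.util.Comparator;
import java.util.List;
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.data.geo.Circle;
import org.springframework.data.geo.Distance;
import org.springframework.data.geo.Metrics;
import org.springframework.data.geo.Point;
import org.springframework.data.r2dbc.core.R2dbcEntityTemplate;
import org.springframework.data.redis.core.ReactiveRedisTemplate;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestBody;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RestController;
import reactor.core.publisher.Mono;

@SpringBootApplication
public class DriverMatchingApplication {

    public static void main(String[] args) {
        SpringApplication.run(
            DriverMatchingApplication.class, args);
    }
}

@RestController
@RequestMapping("/api/v2/match")
public class DriverMatchingController {

    private final ReactiveRedisTemplate<String, String>
        driverCache;
    private final R2dbcEntityTemplate driverDb;

    public DriverMatchingController(
            ReactiveRedisTemplate<String, String> driverCache,
            R2dbcEntityTemplate driverDb) {
        this.driverCache = driverCache;
        this.driverDb = driverDb;
    }

    @PostMapping
    public Mono<DriverMatch> match(
            @RequestBody MatchRequest request) {
        // Active drivers within 5 km of the pickup point
        return driverCache.opsForGeo()
            .radius("drivers:active:" + request.getZoneId(),
                new Circle(
                    new Point(request.getLng(),
                              request.getLat()),
                    new Distance(5, Metrics.KILOMETERS)))
            .map(r -> r.getContent().getName())
            .collectList()
            .flatMap(this::scoreAndRank);
    }

    private Mono<DriverMatch> scoreAndRank(
            List<String> driverIds) {
        if (driverIds.isEmpty()) {
            return Mono.empty(); // no candidates in range
        }
        // One IN query for all candidates instead of one
        // query per driver (avoids an N+1 pattern)
        return driverDb.select(
                query(where("id").in(driverIds)),
                DriverProfile.class)
            .map(profile -> new ScoredDriver(
                profile,
                calculateScore(profile)))
            .sort(Comparator.comparing(
                ScoredDriver::score).reversed())
            .next()
            .map(scored -> new DriverMatch(
                scored.profile().getId(),
                scored.score()));
    }
}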

Step 3: Dual-Write for Data Migration

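// SCALED: Step 3 - Dual-write during migration
import static org.springframework.data.relational.core.query.Criteria.where;
import static org.springframework.data.relational.core.query.Query.query;
import static org.springframework.data.relational.core.query.Update.update;

import org.springframework.data.r2dbc.core.R2dbcEntityTemplate;
import org.springframework.kafka.annotation.KafkaHandler;
import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.stereotype.Component;

@Component
@KafkaListener(topics = "driver-events",
    groupId = "matching-migration")
public class DriverDataMigrator {

    private final R2dbcEntityTemplate matchingDb;

    public DriverDataMigrator(R2dbcEntityTemplate matchingDb) {
        this.matchingDb = matchingDb;
    }

    @KafkaHandler
    public void onDriverEvent(DriverEvent event) {
        // block() so a failed write surfaces to the listener's
        // error handler instead of being silently dropped
        switch (event.getType()) {
            case PROFILE_UPDATED -> matchingDb.update(
                DriverProfile.fromEvent(event)).block();
            case TRIP_COMPLETED -> matchingDb.update(
                query(where("id").is(event.getDriverId())),
                update("completedTrips",
                    event.getTotalCompleted()),
                DriverProfile.class).block();
            default -> { } // other event types are ignored
        }
    }
}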

Step 4: Verify with Locust

# SCALED: Step 4 - Verify extracted service
from locust import HttpUser, task, between

class DriverMatchingTest(HttpUser):
    wait_time = between(0.05, 0.1)
    host = "http://driver-matching-v2:8080"

    @task
    def match_driver(self):
        self.client.post("/api/v2/match", json={
            "zoneId": "manhattan-midtown",
            "lat": 40.7128, "lng": -74.0060,
            "vehicleType": "standard"
        })

Extracted service at 50,000 RPS:
  p50: 12ms
  p99: 65ms
  Error rate: 0.01%
  CPU per pod: 72%
  Pods: 8 (CPU-optimized instances)

Monolith driver matching at 50,000 RPS:
  p50: 180ms
  p99: 520ms (pool contention with trip queries)
  Error rate: 2.3%
  CPU per pod: 94%
  Pods: 24 (general-purpose instances)
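
As a quick worked check of the two result blocks above, the per-pod throughput can be computed directly from the reported figures. The helper name is illustrative, and because the two deployments use different instance types, this compares pod counts, not cost.

```python
def rps_per_pod(total_rps: int, pods: int) -> float:
    """Average requests per second carried by each pod."""
    return total_rps / pods


extracted = rps_per_pod(50_000, 8)    # extracted service, CPU-optimized
monolith = rps_per_pod(50_000, 24)    # monolith, general-purpose

print(f"extracted: {extracted:.0f} RPS/pod")
print(f"monolith:  {monolith:.0f} RPS/pod")
print(f"ratio:     {extracted / monolith:.1f}x")
```

At the same 50,000 RPS, each extracted pod carries about three times the load of a monolith pod, while also running at lower CPU and a lower error rate.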

Step 5: Cut Over

// SCALED: Step 5 - Cut over zone by zone
// Feature flag: enable per zone, not globally
flags.enable("driver-matching-v2", "manhattan-midtown");
// Monitor for 24 hours
flags.enable("driver-matching-v2", "manhattan-downtown");
// Monitor for 24 hours
flags.enable("driver-matching-v2", "ALL_ZONES");
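
The source does not show `FeatureFlagService`, so the following is only a minimal in-memory sketch of how a per-zone flag with an `ALL_ZONES` sentinel might behave. The class name `ZoneFlags` and the sentinel semantics are assumptions; a real deployment would more likely use an existing feature-flag system.

```python
from typing import Dict, Set


class ZoneFlags:
    """Hypothetical in-memory stand-in for a per-zone feature flag store."""

    ALL_ZONES = "ALL_ZONES"  # assumed sentinel meaning "every zone"

    def __init__(self) -> None:
        self._enabled: Dict[str, Set[str]] = {}

    def enable(self, flag: str, zone: str) -> None:
        self._enabled.setdefault(flag, set()).add(zone)

    def disable(self, flag: str, zone: str) -> None:
        # Rolling a zone back is a flag change, not a deployment
        self._enabled.get(flag, set()).discard(zone)

    def is_enabled(self, flag: str, zone_id: str) -> bool:
        zones = self._enabled.get(flag, set())
        return self.ALL_ZONES in zones or zone_id in zones


flags = ZoneFlags()
flags.enable("driver-matching-v2", "manhattan-midtown")
print(flags.is_enabled("driver-matching-v2", "manhattan-midtown"))
print(flags.is_enabled("driver-matching-v2", "brooklyn-heights"))
```

The point of enabling zone by zone is that a bad rollout affects one zone and can be reversed with a single `disable` call while the monolith path is still in place.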

Step 6: Decommission Old Code Path

// SCALED: Step 6 - Remove monolith driver matching
// After 30 days with 100% traffic on extracted service:
// 1. Remove DriverMatchingService from monolith
// 2. Remove driver_profiles table from monolith DB
// 3. Remove feature flag (extracted service is the
//    only path)
// 4. Archive the monolith code in git history

The Rewrite That Failed: Postmortem

Rewrite Postmortem:

Duration:        18 months
Engineers:       14
Cost:            $2.1M (salary only, excludes infra)
Lines of Go:     142,000
Lines of Java
  replaced:      89,000 (63% of monolith)

Performance comparison at 10,000 RPS:
  Go rewrite:    p99=320ms   errors=1.8%
  Java + index:  p99=180ms   errors=0.1%

Root causes (why the rewrite was no faster):
  1. Same database schema (including missing indexes)
  2. Same query patterns (N+1 in driver scoring)
  3. Same pool size (15, copied from Java config)
  4. New race conditions in concurrent map access
  5. No caching layer (team planned to "add it later")

What the team learned:
  The bottleneck was never Java.
  The bottleneck was the operational configuration
  that surrounded Java. That configuration followed
  them to Go because it was in the database, the pool
  sizes, and the query patterns. Not the language.

The Final Locust Test

The complete ride-hailing platform. Every optimization from 22 chapters applied. 3x production load. 30 minutes.

# SCALED: The Final Locust Test
from locust import HttpUser, task, between
import random

class ProductionRider(HttpUser):
    wait_time = between(0.1, 0.3)
    host = "http://rider-api:8080"
    weight = 3  # 3x riders to drivers

    @task(8)
    def complete_ride_flow(self):
        zone = random.choice([
            "manhattan-midtown", "manhattan-downtown",
            "brooklyn-heights", "queens-astoria"])
        est = self.client.get(
            f"/api/fares/estimate?zoneId={zone}"
            + "&vehicleType=standard",
            name="/api/fares/estimate")
        if est.status_code != 200:
            return

        booking = self.client.post(
            "/api/rides/book", json={
                "riderId": f"r-{random.randint(1,100000)}",
                "pickupLat": 40.71 + random.uniform(
                    -0.05, 0.05),
                "pickupLng": -74.00 + random.uniform(
                    -0.05, 0.05),
                "dropoffLat": 40.75 + random.uniform(
                    -0.05, 0.05),
                "dropoffLng": -73.98 + random.uniform(
                    -0.05, 0.05),
                "zoneId": zone,
                "vehicleType": "standard"
            })
        if booking.status_code == 200:
            tid = booking.json().get("tripId", "t-0")
            self.client.get(f"/api/trips/{tid}/status",
                name="/api/trips/[id]/status")

    @task(2)
    def browse_surge(self):
        zone = random.choice([
            "manhattan-midtown", "manhattan-downtown"])
        self.client.get(f"/api/surge/zone/{zone}",
            name="/api/surge/zone/[zone]")

    @task(1)
    def trip_history(self):
        self.client.get(
            "/api/trips/history"
            + f"?riderId=r-{random.randint(1,100000)}"
            + "&limit=10",
            name="/api/trips/history")


class ProductionDriver(HttpUser):
    wait_time = between(0.3, 0.8)
    host = "http://driver-api:8080"
    weight = 1

    @task(10)
    def location_update(self):
        self.client.post("/api/drivers/location", json={
            "driverId": f"d-{random.randint(1, 10000)}",
            "lat": 40.71 + random.uniform(-0.08, 0.08),
            "lng": -74.00 + random.uniform(-0.08, 0.08)
        })

    @task(3)
    def accept_ride(self):
        self.client.post("/api/drivers/accept", json={
            "driverId": f"d-{random.randint(1, 10000)}",
            "tripId": f"trip-{random.randint(1, 50000)}"
        })

    @task(1)
    def earnings(self):
        self.client.get(
            "/api/drivers/"
            + f"d-{random.randint(1,10000)}/earnings"
            + "?period=today",
            name="/api/drivers/[id]/earnings")

THE FINAL TEST: 3x Production (30,000 RPS, 30 minutes)

Rider Endpoints:
  /api/fares/estimate         p50=12ms   p99=85ms    errors=0.01%
  /api/rides/book             p50=35ms   p99=175ms   errors=0.02%
  /api/trips/[id]/status      p50=8ms    p99=42ms    errors=0.00%
  /api/surge/zone/[zone]      p50=5ms    p99=28ms    errors=0.00%
  /api/trips/history          p50=22ms   p99=110ms   errors=0.01%

Driver Endpoints:
  /api/drivers/location       p50=6ms    p99=30ms    errors=0.01%
  /api/drivers/accept         p50=18ms   p99=92ms    errors=0.01%
  /api/drivers/[id]/earnings  p50=15ms   p99=78ms    errors=0.00%

Infrastructure:
  Rider API pods:     6 → 16 (HPA scaled)
  Driver API pods:    4 → 10 (HPA scaled)
  Matching pods:      4 → 12 (CPU-bound, scaled)
  PostgreSQL CPU:     58% (headroom)
  Redis memory:       62% (headroom)
  Redis hit rate:     94.2%
  Kafka consumer lag: 0 (caught up)
  Circuit breakers:   0 openings
  Rate limit hits:    247 (bots, correctly limited)
  Pod restarts:       0

All SLOs met. Error rate: 0.02%. The system survived.
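
A check like "all SLOs met" can be automated against the per-endpoint results. The sketch below uses the p99 and error-rate figures reported above, but the two thresholds are assumptions for illustration only: this chapter does not state the actual SLO values.

```python
# (p99 in ms, error rate in %) per endpoint, from the results above
RESULTS = {
    "/api/fares/estimate": (85, 0.01),
    "/api/rides/book": (175, 0.02),
    "/api/trips/[id]/status": (42, 0.00),
    "/api/surge/zone/[zone]": (28, 0.00),
    "/api/trips/history": (110, 0.01),
    "/api/drivers/location": (30, 0.01),
    "/api/drivers/accept": (92, 0.01),
    "/api/drivers/[id]/earnings": (78, 0.00),
}

P99_SLO_MS = 500      # assumed threshold, not from the source
ERROR_SLO_PCT = 0.1   # assumed threshold, not from the source


def slo_violations(results, p99_slo_ms, error_slo_pct):
    """Return the endpoints that breach either threshold."""
    return [
        endpoint
        for endpoint, (p99_ms, error_pct) in results.items()
        if p99_ms > p99_slo_ms or error_pct > error_slo_pct
    ]


violations = slo_violations(RESULTS, P99_SLO_MS, ERROR_SLO_PCT)
print("All SLOs met" if not violations else f"Violations: {violations}")
```

Running a check like this at the end of every load test keeps the stopping rule objective: once the list is empty, no further fix is needed.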

Cumulative Improvement: CH1 Through CH22

Figure: Line chart showing p99 latency dropping from 8,400ms (CH1) to 175ms (CH22) across all chapters, with annotations at the pool tuning, caching, and indexing inflection points.

The cumulative improvement chart tells the story of the entire book in a single curve. The steepest drops come from the cheapest changes: connection pool tuning cut p99 latency by 50% (CH4), adding Redis and Caffeine caches cut it by another 57% (CH5-7), and database indexes with read replicas delivered a 75% reduction (CH8). After CH8, the curve flattens: each subsequent technique contributes incremental gains. The final result is 8,400ms to 175ms, a 48x improvement with no rewrite, no language change, and the same database. Different configuration.
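
The arithmetic behind the curve can be checked with a few lines. This sketch assumes the three percentages compound, each applying to the latency left by the previous step. That is one reading of "another 57%", not something the chart data confirms.

```python
def latency_after(start_ms: float, reductions: list) -> float:
    """Apply each fractional reduction to the remaining latency."""
    latency = start_ms
    for cut in reductions:
        latency *= (1 - cut)
    return latency


START_MS = 8400
FINAL_MS = 175

# CH4 pool tuning, CH5-7 caching, CH8 indexes and read replicas
after_ch8 = latency_after(START_MS, [0.50, 0.57, 0.75])

print(f"after CH8:     {after_ch8:.1f} ms")
print(f"overall gain:  {START_MS / FINAL_MS:.0f}x")
print(f"left for CH9+: {after_ch8 / FINAL_MS:.2f}x")
```

Under that assumption, the first three techniques account for roughly 8,400ms to 450ms, and everything after CH8 contributes the remaining factor of about 2.6. That is the flattening the chart shows.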

The Proof

The ride-hailing platform serves 30,000 requests per second at p99 of 175ms. It survives 3x production load with headroom remaining. It handles zone failures in 15 seconds. It handles service failures with circuit breakers, fallbacks, and cached data. It handles traffic spikes with autoscaling. It handles data consistency across two regions with explicit per-data-type SLOs.

None of this required a rewrite.

The fare calculation that hit 2,400ms at 10,000 RPS needed a composite index. The driver matching that contended for connections needed its own database. The European riders who saw 520ms latency needed a regional deployment. The Friday evening surge that overwhelmed 3 pods needed an HPA rule.

Each problem had a specific, diagnosable, fixable cause. Each fix took hours, days, or weeks. Not months. Not years.

The rewrite that was proposed at month 0 would have consumed 14 engineers for 18 months. It would have delivered the same throughput problems in a different language. The team would have reached the same scaling ceiling and proposed the same solution: another rewrite.

The diagnostic breaks this cycle. Measure. Identify the bottleneck. Apply the cheapest fix. Measure again. Move to the next cheapest fix only if the SLO is still violated.
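
The loop described above can be sketched in a few lines. Everything here is illustrative: the function name, the fix list, the cost figures, and the simulated latency effects are hypothetical stand-ins, not measurements from the platform.

```python
from typing import Callable, List, Tuple

# (name, estimated cost in dollars, function that applies the fix)
Fix = Tuple[str, float, Callable[[], None]]


def run_diagnostic(measure_p99_ms: Callable[[], float],
                   fixes: List[Fix],
                   slo_ms: float) -> List[str]:
    """Apply fixes cheapest-first, re-measuring before each one."""
    applied = []
    for name, _cost, apply_fix in sorted(fixes, key=lambda f: f[1]):
        if measure_p99_ms() <= slo_ms:
            break  # SLO met: stop, do not apply costlier fixes
        apply_fix()
        applied.append(name)
    return applied


# Simulated system: each fix scales the current p99 latency
state = {"p99_ms": 8400.0}


def scale(factor: float) -> Callable[[], None]:
    def apply_fix() -> None:
        state["p99_ms"] *= factor
    return apply_fix


fixes = [
    ("rewrite in another language", 2_100_000, scale(1.0)),
    ("tune connection pool", 100, scale(0.5)),
    ("add missing index", 200, scale(0.25)),
]

applied = run_diagnostic(lambda: state["p99_ms"], fixes, slo_ms=1500)
print(applied, state["p99_ms"])
```

In this simulated run the loop stops after the two cheap fixes, and the most expensive option is never reached because the SLO is already met.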

Most systems do not need a rewrite. They need the techniques in this book applied systematically, starting from the cheapest, stopping when the SLO is met.