The Honest Decision Framework
The Symptom
A team of 14 engineers spent 18 months rewriting the ride-hailing platform’s monolith in Go. The original was Java, Spring Boot, PostgreSQL. The rewrite was Go, gRPC, PostgreSQL. Same database. Same schema. Same query patterns. Same connection pool sizes (they copied the config). Same missing indexes (they copied the schema without analyzing it).
Launch day. The rewrite handles 3,200 RPS before p99 exceeds 500ms. The original monolith, with the optimizations from the diagnostic checklist, handles 12,000 RPS with a p99 of 180ms.
The Go rewrite was faster at parsing JSON. It used less memory per goroutine than Java used per virtual thread. The raw overhead was lower. But raw overhead was never the bottleneck. The bottleneck was a sequential scan on fare_config that neither the old codebase nor the new one had an index for. The rewrite reproduced every operational problem in a different language, then added new ones (the team was learning Go’s concurrency model on the job, and the first three months of production exposed race conditions that would not have existed in the original).
Cost of the rewrite: 14 engineers, 18 months, $2.1M in salary alone. Cost of the diagnostic: 1 engineer, 3 days, $1,800.
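The gap between those two figures is easier to feel as a ratio. A minimal sketch of the arithmetic, using only the numbers quoted above; the per-day rate is derived from them and is not stated in the text:

```python
# Figures quoted in the text; the day rate below is derived, not stated.
REWRITE_COST_USD = 2_100_000     # 14 engineers, 18 months, salary only
DIAGNOSTIC_COST_USD = 1_800      # 1 engineer, 3 days

rewrite_engineer_months = 14 * 18                   # total effort spent on the rewrite
diagnostic_day_rate = DIAGNOSTIC_COST_USD / 3       # implied cost of one engineer-day
cost_ratio = REWRITE_COST_USD / DIAGNOSTIC_COST_USD

print(f"Rewrite effort: {rewrite_engineer_months} engineer-months")
print(f"Implied diagnostic day rate: ${diagnostic_day_rate:,.0f}")
print(f"Rewrite cost / diagnostic cost: {cost_ratio:,.0f}x")
```

The rewrite cost more than a thousand times what the diagnostic did, before counting the infrastructure it added.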
The Cause
“We need to rewrite” is an emotional statement dressed as a technical one. It usually means one of:
- “I am frustrated with the code quality.” (This is a refactoring problem, not a scaling problem.)
- “I want to use a newer technology.” (This is a career development interest, not a scaling problem.)
- “The system is slow and I don’t know why.” (This is a diagnostic problem.)
- “We have hit an architectural ceiling and cannot scale further.” (This might be a rewrite problem.)
Only the last of these justifies a rewrite. And even that one usually justifies targeted extraction, not a full rewrite.
The honest decision framework:
Question 1: Have you applied all 10 steps of the diagnostic checklist?
    No  → Apply them first. Stop here.
    Yes → Continue.

Question 2: Is the SLO still violated after all 10 steps?
    No  → The problem was operational. No rewrite needed.
    Yes → Continue.

Question 3: Can you extract the bottleneck using the strangler fig pattern?
    Yes, and it affects < 60% of the codebase
        → Extract it. Not a rewrite.
    Yes, but it affects > 60% of the codebase
        → This is effectively a rewrite.
        → Proceed with full cost analysis.
    No  → Revisit the diagnostic. Something was missed.
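The three questions can be sketched as a single function. This is an illustration rather than a tool from the book: the `Assessment` type and its field names are hypothetical, and because the framework does not say what happens at exactly 60%, the sketch assumes that boundary counts as a rewrite.

```python
from dataclasses import dataclass


@dataclass
class Assessment:
    """Hypothetical inputs to the framework; the field names are illustrative."""
    checklist_steps_applied: int       # how many of the 10 diagnostic steps were done
    slo_violated: bool                 # measured after the checklist was applied
    can_extract: bool                  # is a strangler fig extraction feasible?
    codebase_fraction_affected: float  # 0.0 to 1.0


def decide(a: Assessment) -> str:
    # Question 1: the checklist comes first, always
    if a.checklist_steps_applied < 10:
        return "apply the remaining diagnostic steps"
    # Question 2: if the SLO is met, the problem was operational
    if not a.slo_violated:
        return "no rewrite needed"
    # Question 3: extraction versus rewrite
    if not a.can_extract:
        return "revisit the diagnostic"
    if a.codebase_fraction_affected < 0.60:
        return "extract the bottleneck"
    # Assumption: exactly 60% is treated the same as more than 60%
    return "effectively a rewrite: run a full cost analysis"


print(decide(Assessment(10, True, True, 0.15)))
```

Note that a full rewrite is reachable only on the last line, after every cheaper option has been ruled out.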
The Baseline
The engineering team after the failed rewrite. Morale is low. The Go service is in production handling 30% of traffic. The Java monolith handles the other 70%. Two systems to maintain. Two deployment pipelines. Two on-call rotations. Two sets of bugs.
Current state (post-failed-rewrite):
    Java monolith:  70% of traffic, p99=180ms, 8 pods
    Go service:     30% of traffic, p99=320ms, 12 pods
    Total ops cost: 2 deployment pipelines, 2 on-call rotations
    Total infra:    20 pods (was 8 before the rewrite started)

If the monolith had received the diagnostic instead:
    Java monolith:  100% of traffic, p99=180ms, 8 pods
    Total ops cost: 1 deployment pipeline, 1 on-call rotation
    Total infra:    8 pods
The Fix
Strangler Fig Pattern: Driver Matching Extraction
The one component that genuinely needed extraction was driver matching. Not because of a language problem. Because its scaling requirements conflicted with the rest of the monolith. Driver matching needs CPU-heavy scoring across thousands of candidates. The trip service needs memory for caching. Scaling the monolith for driver matching wastes memory. Scaling it for trip caching wastes CPU.
Six steps:
Step 1: Facade Routing
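// SCALED: Step 1 - Route through a facade
import org.springframework.stereotype.Component;
import org.springframework.web.reactive.function.client.WebClient;
import reactor.core.publisher.Mono;

@Component
public class DriverMatchingFacade {

    private final DriverMatchingService monolithService;
    private final WebClient extractedService;
    private final FeatureFlagService flags;

    // Constructor injection: the final fields must be assigned here
    public DriverMatchingFacade(DriverMatchingService monolithService,
                                WebClient extractedService,
                                FeatureFlagService flags) {
        this.monolithService = monolithService;
        this.extractedService = extractedService;
        this.flags = flags;
    }

    public Mono<DriverMatch> match(MatchRequest request) {
        // The flag is evaluated per zone, so traffic can move one zone at a time
        if (flags.isEnabled("driver-matching-v2", request.getZoneId())) {
            return extractedService.post()
                .uri("/api/v2/match")
                .bodyValue(request)
                .retrieve()
                .bodyToMono(DriverMatch.class)
                // Fall back to the monolith if the new service fails
                .onErrorResume(ex -> monolithService.findBestDriver(request));
        }
        return monolithService.findBestDriver(request);
    }
}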
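The routing logic is small enough to restate without the framework around it. The sketch below is in Python for brevity and is not the book's code: `MatchingFacade`, the plain-callable services, and the `fallbacks` counter are all assumptions made for illustration. The counter is there because a silent fallback can hide a failing new service behind healthy-looking responses.

```python
from typing import Callable, Set

MatchFn = Callable[[dict], dict]


class MatchingFacade:
    """Routes a match request to the extracted service for enabled zones,
    falling back to the legacy path when the extracted service fails."""

    def __init__(self, legacy: MatchFn, extracted: MatchFn, enabled_zones: Set[str]):
        self.legacy = legacy
        self.extracted = extracted
        self.enabled_zones = enabled_zones
        self.fallbacks = 0  # hypothetical metric: how often the new path failed

    def match(self, request: dict) -> dict:
        if request["zoneId"] in self.enabled_zones:
            try:
                return self.extracted(request)
            except Exception:
                # Same idea as onErrorResume: serve the request from the monolith
                self.fallbacks += 1
                return self.legacy(request)
        return self.legacy(request)


def legacy_match(request: dict) -> dict:
    return {"driverId": "d-legacy", "source": "monolith"}


def failing_extracted_match(request: dict) -> dict:
    raise RuntimeError("extracted service unavailable")


facade = MatchingFacade(legacy_match, failing_extracted_match, {"manhattan-midtown"})
result = facade.match({"zoneId": "manhattan-midtown"})
print(result["source"], facade.fallbacks)
```

Callers see the same response type on both paths, which is what lets the cut-over happen without touching them.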
Step 2: New Service with Its Own Database
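// SCALED: Step 2 - Extracted driver matching service
// Own database, own connection pool, own scaling

// --- DriverMatchingApplication.java ---
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;

@SpringBootApplication
public class DriverMatchingApplication {
    public static void main(String[] args) {
        SpringApplication.run(DriverMatchingApplication.class, args);
    }
}

// --- DriverMatchingController.java ---
import java.util.Comparator;
import java.util.List;
import org.springframework.data.geo.Circle;
import org.springframework.data.geo.Distance;
import org.springframework.data.geo.Metrics;
import org.springframework.data.geo.Point;
import org.springframework.data.r2dbc.core.R2dbcEntityTemplate;
import org.springframework.data.redis.core.ReactiveRedisTemplate;
import org.springframework.web.bind.annotation.*;
import reactor.core.publisher.Flux;
import reactor.core.publisher.Mono;
import static org.springframework.data.relational.core.query.Criteria.where;
import static org.springframework.data.relational.core.query.Query.query;

@RestController
@RequestMapping("/api/v2/match")
public class DriverMatchingController {

    private final ReactiveRedisTemplate<String, String> driverCache;
    private final R2dbcEntityTemplate driverDb;

    public DriverMatchingController(
            ReactiveRedisTemplate<String, String> driverCache,
            R2dbcEntityTemplate driverDb) {
        this.driverCache = driverCache;
        this.driverDb = driverDb;
    }

    @PostMapping
    public Mono<DriverMatch> match(@RequestBody MatchRequest request) {
        // Geo query: active drivers within 5 km of the pickup point
        return driverCache.opsForGeo()
            .radius("drivers:active:" + request.getZoneId(),
                new Circle(
                    new Point(request.getLng(), request.getLat()),
                    new Distance(5, Metrics.KILOMETERS)))
            .map(r -> r.getContent().getName())
            .collectList()
            .flatMap(this::scoreAndRank);
    }

    // Loads each candidate's profile, scores it, and returns the top scorer.
    // calculateScore(...) is not shown in this listing.
    private Mono<DriverMatch> scoreAndRank(List<String> driverIds) {
        return Flux.fromIterable(driverIds)
            .flatMap(id -> driverDb.selectOne(
                query(where("id").is(id)), DriverProfile.class))
            .map(profile -> new ScoredDriver(profile, calculateScore(profile)))
            .sort(Comparator.comparing(ScoredDriver::score).reversed())
            .next()
            .map(scored -> new DriverMatch(
                scored.profile().getId(), scored.score()));
    }
}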
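One thing worth watching in `scoreAndRank` is that it looks up each candidate profile individually, which is the same N+1 shape the postmortem later lists as a root cause. A minimal sketch of the batched alternative, in Python with an in-memory dict standing in for the database; `fetch_profiles`, the scoring formula, and the profile fields are hypothetical, since the book does not show `calculateScore`:

```python
from typing import Dict, Iterable, List, Optional, Tuple


def fetch_profiles(store: Dict[str, dict], ids: Iterable[str]) -> List[dict]:
    """Stand-in for one batched lookup (a single WHERE id IN (...) query)
    instead of one query per candidate driver."""
    return [store[i] for i in ids if i in store]


def calculate_score(profile: dict) -> float:
    # Hypothetical scoring: reward rating, penalize distance to pickup
    return profile["rating"] * 10 - profile["distance_km"]


def best_driver(store: Dict[str, dict], ids: Iterable[str]) -> Optional[Tuple[str, float]]:
    profiles = fetch_profiles(store, ids)
    if not profiles:
        return None  # no candidates in range
    top = max(profiles, key=calculate_score)
    return top["id"], calculate_score(top)


STORE = {
    "d-1": {"id": "d-1", "rating": 4.5, "distance_km": 2.0},
    "d-2": {"id": "d-2", "rating": 5.0, "distance_km": 1.0},
    "d-3": {"id": "d-3", "rating": 4.0, "distance_km": 0.5},
}
best = best_driver(STORE, ["d-1", "d-2", "d-3", "d-missing"])
print(best)
```

Whether batching matters in practice depends on how many candidates the 5 km radius returns; that is something to measure, not assume.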
Step 3: Dual-Write for Data Migration
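// SCALED: Step 3 - Dual-write during migration
import org.springframework.data.r2dbc.core.R2dbcEntityTemplate;
import org.springframework.kafka.annotation.KafkaHandler;
import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.stereotype.Component;
import static org.springframework.data.relational.core.query.Criteria.where;
import static org.springframework.data.relational.core.query.Query.query;
import static org.springframework.data.relational.core.query.Update.update;

@Component
@KafkaListener(topics = "driver-events",
               groupId = "matching-migration")
public class DriverDataMigrator {

    private final R2dbcEntityTemplate matchingDb;

    public DriverDataMigrator(R2dbcEntityTemplate matchingDb) {
        this.matchingDb = matchingDb;
    }

    @KafkaHandler
    public void onDriverEvent(DriverEvent event) {
        // block() instead of subscribe(): the listener returns only after
        // the write completes, so a failed write is not silently dropped
        switch (event.getType()) {
            case PROFILE_UPDATED -> matchingDb.update(
                DriverProfile.fromEvent(event)).block();
            case TRIP_COMPLETED -> matchingDb.update(
                query(where("id").is(event.getDriverId())),
                update("completedTrips", event.getTotalCompleted()),
                DriverProfile.class).block();
            default -> { /* other event types do not affect matching */ }
        }
    }
}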
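A detail in the listener that is easy to miss: `TRIP_COMPLETED` writes the absolute total from the event rather than incrementing a counter. Kafka consumers can see the same event more than once, and an absolute write makes replay harmless. The sketch below illustrates that property in Python; the event dictionaries and field names are assumptions modeled on the listing, not the book's schema.

```python
from typing import Dict, List


def apply_event(profiles: Dict[str, dict], event: dict) -> None:
    """Apply one driver event to the matching service's copy of the data."""
    profile = profiles.setdefault(event["driverId"], {"completedTrips": 0})
    if event["type"] == "TRIP_COMPLETED":
        # Absolute total, not += 1: applying the event twice changes nothing
        profile["completedTrips"] = event["totalCompleted"]
    elif event["type"] == "PROFILE_UPDATED":
        profile.update(event["fields"])


def replay(events: List[dict], times: int = 1) -> Dict[str, dict]:
    """Replay the whole event log, in order, the given number of times."""
    profiles: Dict[str, dict] = {}
    for _ in range(times):
        for event in events:
            apply_event(profiles, event)
    return profiles


EVENTS = [
    {"driverId": "d-1", "type": "PROFILE_UPDATED", "fields": {"vehicleType": "standard"}},
    {"driverId": "d-1", "type": "TRIP_COMPLETED", "totalCompleted": 41},
    {"driverId": "d-1", "type": "TRIP_COMPLETED", "totalCompleted": 42},
]

once = replay(EVENTS, times=1)
thrice = replay(EVENTS, times=3)
print(once == thrice, once["d-1"]["completedTrips"])
```

Had the handler incremented instead, the second run would have drifted from the monolith's data, and the verification in Step 4 would be comparing against a corrupted copy.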
Step 4: Verify with Locust
# SCALED: Step 4 - Verify extracted service
from locust import HttpUser, task, between


class DriverMatchingTest(HttpUser):
    wait_time = between(0.05, 0.1)
    host = "http://driver-matching-v2:8080"

    @task
    def match_driver(self):
        self.client.post("/api/v2/match", json={
            "zoneId": "manhattan-midtown",
            "lat": 40.7128, "lng": -74.0060,
            "vehicleType": "standard"
        })
Extracted service at 50,000 RPS:
    p50:         12ms
    p99:         65ms
    Error rate:  0.01%
    CPU per pod: 72%
    Pods:        8 (CPU-optimized instances)

Monolith driver matching at 50,000 RPS:
    p50:         180ms
    p99:         520ms (pool contention with trip queries)
    Error rate:  2.3%
    CPU per pod: 94%
    Pods:        24 (general-purpose instances)
Step 5: Cut Over
// SCALED: Step 5 - Cut over zone by zone
// Feature flag: enable per zone, not globally
flags.enable("driver-matching-v2", "manhattan-midtown");
// Monitor for 24 hours
flags.enable("driver-matching-v2", "manhattan-downtown");
// Monitor for 24 hours
flags.enable("driver-matching-v2", "ALL_ZONES");
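The "monitor for 24 hours" comments carry the real logic of the cut-over: each zone is a gate, and a failed gate should stop the rollout. A minimal sketch of that loop in Python, assuming hypothetical `enable`, `disable`, and `is_healthy` callables; in practice `is_healthy` would wrap the monitoring window and whatever SLO checks the team uses.

```python
from typing import Callable, List, Optional, Tuple


def roll_out(
    zones: List[str],
    enable: Callable[[str], None],
    disable: Callable[[str], None],
    is_healthy: Callable[[str], bool],
) -> Tuple[List[str], Optional[str]]:
    """Enable zones in order. On the first unhealthy zone, switch it back
    off and halt. Returns (zones left live, zone that halted the rollout)."""
    live: List[str] = []
    for zone in zones:
        enable(zone)
        if not is_healthy(zone):   # stands in for the monitoring window
            disable(zone)          # roll back only the zone that failed
            return live, zone
        live.append(zone)
    return live, None


flags = set()
live, halted_at = roll_out(
    ["manhattan-midtown", "manhattan-downtown", "brooklyn-heights"],
    enable=flags.add,
    disable=flags.discard,
    is_healthy=lambda zone: zone != "manhattan-downtown",  # simulated failure
)
print(live, halted_at)
```

Zones that already passed their gate stay on the new service; only the failing zone returns to the monolith, which the facade from Step 1 makes a flag flip rather than a deploy.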
Step 6: Decommission Old Code Path
// SCALED: Step 6 - Remove monolith driver matching
// After 30 days with 100% traffic on extracted service:
// 1. Remove DriverMatchingService from monolith
// 2. Remove driver_profiles table from monolith DB
// 3. Remove feature flag (extracted service is the
// only path)
// 4. Archive the monolith code in git history
The Rewrite That Failed: Postmortem
Rewrite Postmortem:
    Duration:               18 months
    Engineers:              14
    Cost:                   $2.1M (salary only, excludes infra)
    Lines of Go:            142,000
    Lines of Java replaced: 89,000 (63% of monolith)

Performance comparison at 10,000 RPS:
    Go rewrite:   p99=320ms  errors=1.8%
    Java + index: p99=180ms  errors=0.1%

Root causes of the rewrite's failure to outperform the monolith:
    1. Same database schema (including missing indexes)
    2. Same query patterns (N+1 in driver scoring)
    3. Same pool size (15, copied from Java config)
    4. New race conditions in concurrent map access
    5. No caching layer (team planned to "add it later")

What the team learned:
    The bottleneck was never Java.
    The bottleneck was the operational configuration
    that surrounded Java. That configuration followed
    them to Go because it was in the database, the pool
    sizes, and the query patterns. Not the language.
The Final Locust Test
The complete ride-hailing platform. Every optimization from 22 chapters applied. 3x production load. 30 minutes.
# SCALED: The Final Locust Test
from locust import HttpUser, task, between
import random


class ProductionRider(HttpUser):
    wait_time = between(0.1, 0.3)
    host = "http://rider-api:8080"
    weight = 3  # 3x riders to drivers

    @task(8)
    def complete_ride_flow(self):
        zone = random.choice([
            "manhattan-midtown", "manhattan-downtown",
            "brooklyn-heights", "queens-astoria"])
        est = self.client.get(
            f"/api/fares/estimate?zoneId={zone}"
            + "&vehicleType=standard",
            name="/api/fares/estimate")
        if est.status_code != 200:
            return
        booking = self.client.post(
            "/api/rides/book", json={
                "riderId": f"r-{random.randint(1, 100000)}",
                "pickupLat": 40.71 + random.uniform(-0.05, 0.05),
                "pickupLng": -74.00 + random.uniform(-0.05, 0.05),
                "dropoffLat": 40.75 + random.uniform(-0.05, 0.05),
                "dropoffLng": -73.98 + random.uniform(-0.05, 0.05),
                "zoneId": zone,
                "vehicleType": "standard"
            })
        if booking.status_code == 200:
            tid = booking.json().get("tripId", "t-0")
            self.client.get(f"/api/trips/{tid}/status",
                            name="/api/trips/[id]/status")

    @task(2)
    def browse_surge(self):
        zone = random.choice([
            "manhattan-midtown", "manhattan-downtown"])
        self.client.get(f"/api/surge/zone/{zone}",
                        name="/api/surge/zone/[zone]")

    @task(1)
    def trip_history(self):
        self.client.get(
            "/api/trips/history"
            + f"?riderId=r-{random.randint(1, 100000)}"
            + "&limit=10",
            name="/api/trips/history")


class ProductionDriver(HttpUser):
    wait_time = between(0.3, 0.8)
    host = "http://driver-api:8080"
    weight = 1

    @task(10)
    def location_update(self):
        self.client.post("/api/drivers/location", json={
            "driverId": f"d-{random.randint(1, 10000)}",
            "lat": 40.71 + random.uniform(-0.08, 0.08),
            "lng": -74.00 + random.uniform(-0.08, 0.08)
        })

    @task(3)
    def accept_ride(self):
        self.client.post("/api/drivers/accept", json={
            "driverId": f"d-{random.randint(1, 10000)}",
            "tripId": f"trip-{random.randint(1, 50000)}"
        })

    @task(1)
    def earnings(self):
        self.client.get(
            "/api/drivers/"
            + f"d-{random.randint(1, 10000)}/earnings"
            + "?period=today",
            name="/api/drivers/[id]/earnings")
THE FINAL TEST: 3x Production (30,000 RPS, 30 minutes)

Rider Endpoints:               p50     p99      errors
    /api/fares/estimate        12ms    85ms     0.01%
    /api/rides/book            35ms    175ms    0.02%
    /api/trips/[id]/status     8ms     42ms     0.00%
    /api/surge/zone/[zone]     5ms     28ms     0.00%
    /api/trips/history         22ms    110ms    0.01%

Driver Endpoints:              p50     p99      errors
    /api/drivers/location      6ms     30ms     0.01%
    /api/drivers/accept        18ms    92ms     0.01%
    /api/drivers/[id]/earnings 15ms    78ms     0.00%

Infrastructure:
    Rider API pods:     6 → 16 (HPA scaled)
    Driver API pods:    4 → 10 (HPA scaled)
    Matching pods:      4 → 12 (CPU-bound, scaled)
    PostgreSQL CPU:     58% (headroom)
    Redis memory:       62% (headroom)
    Redis hit rate:     94.2%
    Kafka consumer lag: 0 (caught up)
    Circuit breakers:   0 openings
    Rate limit hits:    247 (bots, correctly limited)
    Pod restarts:       0

All SLOs met. Error rate: 0.02%. The system survived.
Cumulative Improvement: CH1 Through CH22
The cumulative improvement chart tells the story of the entire book in a single curve. The steepest drops come from the cheapest changes: connection pool tuning cut latency 50% (CH4), adding Redis and Caffeine caches dropped it another 57% (CH5-7), and database indexes with read replicas delivered a 75% reduction (CH8). After CH8, the curve flattens: each subsequent technique contributes incremental gains. The final result: 8,400ms to 175ms, a 48x improvement with no rewrite, no language change, and the same database. Different configuration.
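The percentages compound, which is why three early chapters account for most of the curve. A rough check of the arithmetic follows, under one assumption the text does not state outright: that each reduction applies to the latency left by the previous step, starting from 8,400ms. The intermediate values are therefore derived, not measured.

```python
# Assumption: each quoted reduction applies to the latency remaining after
# the previous step. Intermediate values are derived, not from the book.
START_MS = 8400.0
FINAL_MS = 175.0

steps = [
    ("CH4 connection pool tuning", 0.50),
    ("CH5-7 Redis and Caffeine caches", 0.57),
    ("CH8 indexes and read replicas", 0.75),
]

latency_ms = START_MS
for label, reduction in steps:
    latency_ms *= (1 - reduction)
    print(f"after {label}: {latency_ms:.1f}ms")

improvement_through_ch8 = START_MS / latency_ms    # share of the gain from the first three steps
remaining_factor = latency_ms / FINAL_MS           # what CH9-CH22 still had to deliver
overall_improvement = START_MS / FINAL_MS

print(f"through CH8: {improvement_through_ch8:.1f}x")
print(f"CH9 onward:  {remaining_factor:.1f}x")
print(f"overall:     {overall_improvement:.0f}x")
```

On that reading, the first eight chapters deliver roughly an 18-19x improvement and everything after them shares the remaining factor of about 2.6, which matches the flattening the chart shows.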
The Proof
The ride-hailing platform serves 30,000 requests per second at p99 of 175ms. It survives 3x production load with headroom remaining. It handles zone failures in 15 seconds. It handles service failures with circuit breakers, fallbacks, and cached data. It handles traffic spikes with autoscaling. It handles data consistency across two regions with explicit per-data-type SLOs.
None of this required a rewrite.
The fare calculation that hit 2,400ms at 10,000 RPS needed a composite index. The driver matching that contended for connections needed its own database. The European riders who saw 520ms latency needed a regional deployment. The Friday evening surge that overwhelmed 3 pods needed an HPA rule.
Each problem had a specific, diagnosable, fixable cause. Each fix took hours, days, or weeks. Not months. Not years.
The rewrite that was proposed at month 0 would have consumed 14 engineers for 18 months. It would have delivered the same throughput problems in a different language. The team would have reached the same scaling ceiling and proposed the same solution: another rewrite.
The diagnostic breaks this cycle. Measure. Identify the bottleneck. Apply the cheapest fix. Measure again. Move to the next cheapest fix only if the SLO is still violated.
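That loop is simple enough to write down. The sketch below is an illustration, not a tool from the book: `measure` and the `(name, cost, apply)` fix tuples are hypothetical, the costs and latency effects in the usage example are invented for the demonstration, and the "identify the bottleneck" step is assumed to have already produced the candidate list.

```python
from typing import Callable, List, Tuple

Fix = Tuple[str, float, Callable[[], None]]  # (name, relative cost, apply)


def run_diagnostic(
    measure: Callable[[], float],
    fixes: List[Fix],
    slo_ms: float,
) -> Tuple[float, List[str]]:
    """Apply candidate fixes cheapest-first, re-measuring after each one,
    and stop as soon as the measured p99 meets the SLO."""
    applied: List[str] = []
    p99 = measure()
    for name, _cost, apply in sorted(fixes, key=lambda fix: fix[1]):
        if p99 <= slo_ms:
            break              # SLO met: more expensive fixes are never touched
        apply()
        applied.append(name)
        p99 = measure()        # measure again before deciding anything else
    return p99, applied


# Simulated system: invented numbers, for demonstration only
system = {"p99_ms": 8400.0}


def halve_latency() -> None:
    system["p99_ms"] /= 2


candidate_fixes: List[Fix] = [
    ("full rewrite", 1000.0, halve_latency),
    ("add missing index", 1.0, halve_latency),
    ("tune connection pool", 2.0, halve_latency),
]

final_p99, applied_fixes = run_diagnostic(
    lambda: system["p99_ms"], candidate_fixes, slo_ms=2500.0)
print(final_p99, applied_fixes)
```

The rewrite sits in the candidate list the whole time. It is simply sorted last, and the loop exits before reaching it.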
Most systems do not need a rewrite. They need the techniques in this book applied systematically, starting from the cheapest, stopping when the SLO is met.