surviving the spike

OpenTelemetry Instrumentation for the Ride-Hailing Platform

7 min read Chapter 47 of 66

The Symptom

The team adds the OpenTelemetry Java agent to the rider API. Traces appear in Tempo. The waterfall shows Spring WebFlux handler spans, Lettuce Redis spans, R2DBC PostgreSQL spans. But when a ride request takes 3 seconds, the trace shows a 2.8-second gap between the WebFlux handler span and the first database span. Something is taking 2.8 seconds and it is invisible.

The gap is the surge pricing calculation. It runs in-memory, calling no external services for 95% of requests (cached multipliers). The auto-instrumenter does not see it because there is no framework call to hook into. The most critical business logic in the request path is a blind spot.

The Cause

Auto-instrumentation covers the I/O boundary: HTTP handlers, database drivers, cache clients, message brokers. It does not cover application logic that happens between those boundaries. The surge pricing engine, the driver matching algorithm, and the fare computation pipeline are all invisible to the auto-instrumenter.

Two solutions exist:

  1. @WithSpan annotation: add to any method, get a span automatically
  2. Manual Tracer API: full control over span lifecycle, attributes, events

Use @WithSpan for simple methods where you want timing. Use the manual API when you need to add attributes, record events, or manage the span across reactive operators.

The Baseline

Trace coverage with auto-instrumentation only:

Operation                    Instrumented?    Why?
WebFlux handler              Yes              Agent hooks ServerWebExchange
Redis GET surge:zone:123     Yes              Agent hooks Lettuce client
R2DBC SELECT pricing_rules   Yes              Agent hooks R2DBC driver
Kafka produce ride-events    Yes              Agent hooks KafkaTemplate
Surge pricing calculation    No               Pure application logic
Driver matching algorithm    No               Pure application logic
Fare computation pipeline    No               Pure application logic
Promotion application        No               Pure application logic

Four of eight critical operations are invisible. The trace waterfall has gaps where the most important work happens.

Target: every business-critical operation has a span with relevant attributes.

The Fix

Java Agent Setup

# SCALED: Multi-stage build with OTel agent
FROM eclipse-temurin:21-jdk-alpine AS build
WORKDIR /app
COPY . .
RUN ./mvnw package -DskipTests

FROM eclipse-temurin:21-jre-alpine
WORKDIR /app
COPY --from=build /app/target/*.jar app.jar

# Pin the agent version for reproducibility
ARG OTEL_AGENT_VERSION=2.5.0
ADD https://github.com/open-telemetry/opentelemetry-java-instrumentation/releases/download/v${OTEL_AGENT_VERSION}/opentelemetry-javaagent.jar opentelemetry-javaagent.jar

ENTRYPOINT ["java", \
  "-javaagent:/app/opentelemetry-javaagent.jar", \
  "-jar", "app.jar"]

Application properties for the exporter:

# SCALED: application-otel.yml
otel:
  service:
    name: rider-api
  exporter:
    otlp:
      endpoint: http://otel-collector:4317
      protocol: grpc
  traces:
    sampler: parentbased_traceidratio
    sampler-arg: "0.1"
  resource:
    attributes:
      deployment.environment: production
      service.version: ${APP_VERSION:unknown}
      k8s.namespace.name: ${K8S_NAMESPACE:default}

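The agent itself reads OTEL_* environment variables or -Dotel.* system properties (for example otel.traces.sampler.arg=0.1). A Spring YAML file like the one above is read by the OpenTelemetry Spring Boot starter rather than by the agent, so with the agent alone the same values have to be supplied as environment variables or system properties, as the Kubernetes manifest later in this chapter does. The service name, exporter endpoint, and sampling rate are the minimum configuration.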
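What a parentbased_traceidratio sampler with an argument of 0.1 decides can be sketched in plain Java. This is a simplified illustration of the idea, not the SDK's sampler: SamplingSketch, ratioSampled, and shouldSample are hypothetical names, and treating the low 64 bits of the trace ID as the random part is an assumption made for the sketch.

```java
// Simplified sketch of parent-based, trace-ID-ratio head sampling.
// Hypothetical names; this is not the OpenTelemetry SDK implementation.
final class SamplingSketch {

    /** Ratio decision: deterministic for a given trace ID and ratio. */
    static boolean ratioSampled(String traceIdHex, double ratio) {
        if (ratio <= 0.0) {
            return false;
        }
        if (ratio >= 1.0) {
            return true;
        }
        // Assumption: the last 16 hex characters (low 64 bits) act as the random part.
        long randomPart = Long.parseUnsignedLong(traceIdHex.substring(16), 16);
        long upperBound = (long) (ratio * Long.MAX_VALUE);
        // Clear the sign bit so the comparison works on a non-negative value.
        return (randomPart & Long.MAX_VALUE) < upperBound;
    }

    /** Parent-based wrapper: follow the caller's decision when there is one. */
    static boolean shouldSample(Boolean parentSampled, String traceIdHex, double ratio) {
        if (parentSampled != null) {
            return parentSampled; // an upstream service already decided
        }
        return ratioSampled(traceIdHex, ratio); // root span: decide by ratio
    }

    public static void main(String[] args) {
        String traceId = "0af7651916cd43dd0000000000000001";
        System.out.println("root span sampled: " + shouldSample(null, traceId, 0.1));
        System.out.println("child of unsampled parent: " + shouldSample(Boolean.FALSE, traceId, 0.1));
    }
}
```

Because the ratio decision depends only on the trace ID, a trace is either kept whole or dropped whole, and the parent-based wrapper keeps downstream services from overriding the decision made at the entry point.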

Auto-Instrumented Spans

With the agent attached, these spans appear automatically:

Span Name                              Source Library
POST /api/rides/request                spring-webflux
redis.GET                              lettuce
SELECT pricing_rules                   r2dbc-postgresql
kafka.produce ride-events              spring-kafka
kafka.consume ride-events              spring-kafka

Each span includes timing, status, and library-specific attributes. The R2DBC span includes the SQL statement (parameterized). The Redis span includes the command and key. The Kafka span includes the topic, partition, and offset.

Custom Spans with @WithSpan

// SCALED: @WithSpan for method-level tracing
@Service
public class DriverMatchingService {

    @WithSpan("driver.matching.find_nearest")
    public Mono<List<Driver>> findNearestDrivers(
            @SpanAttribute("location.lat") double lat,
            @SpanAttribute("location.lng") double lng,
            @SpanAttribute("radius.km") double radiusKm,
            @SpanAttribute("vehicle.type") String vehicleType) {

        return driverLocationCache.getDriversInRadius(lat, lng, radiusKm)
            .filter(driver -> driver.getVehicleType().equals(vehicleType))
            .filter(Driver::isAvailable)
            .sort(Comparator.comparingDouble(d ->
                haversine(lat, lng, d.getLat(), d.getLng())))
            .take(10)
            .collectList();
    }

    @WithSpan("driver.matching.score_candidates")
    public Mono<Driver> scoreCandidates(
            @SpanAttribute("candidate.count") int candidateCount,
            List<Driver> candidates,
            RideRequest request) {

        return Flux.fromIterable(candidates)
            .flatMap(driver -> scoreDriver(driver, request))
            .sort(Comparator.comparingDouble(ScoredDriver::getScore).reversed())
            .next()
            .map(ScoredDriver::getDriver);
    }
}

@SpanAttribute binds method parameters to span attributes. When you search for slow traces in Tempo, you can filter by vehicle.type=SUV or candidate.count > 5 to narrow down the problem.
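The findNearestDrivers listing sorts candidates with a haversine helper that is not shown. A minimal sketch of such a helper follows, assuming it takes two latitude/longitude pairs in degrees and returns kilometres; the GeoSketch class name and the 6371 km mean Earth radius are assumptions, not taken from the platform's code.

```java
// Hypothetical helper: great-circle distance between two points in kilometres.
final class GeoSketch {

    private static final double EARTH_RADIUS_KM = 6371.0; // mean radius, an approximation

    static double haversine(double lat1, double lng1, double lat2, double lng2) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLng = Math.toRadians(lng2 - lng1);
        // Haversine formula: a is the squared half-chord length between the points.
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                * Math.sin(dLng / 2) * Math.sin(dLng / 2);
        return 2 * EARTH_RADIUS_KM * Math.asin(Math.sqrt(a));
    }

    public static void main(String[] args) {
        // Pickup and dropoff coordinates reused from the test request in this chapter.
        double km = haversine(40.7128, -74.0060, 40.7589, -73.9851);
        System.out.printf("pickup to dropoff: %.2f km%n", km);
    }
}
```

The computation is pure arithmetic with no I/O, which is exactly the kind of work the auto-instrumenter cannot see and the enclosing @WithSpan method makes visible.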

Manual Tracer for Complex Logic

// SCALED: Manual span management for fare calculation
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.common.AttributeKey;
import io.opentelemetry.api.common.Attributes;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.StatusCode;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Scope;
import org.springframework.stereotype.Service;
import reactor.core.publisher.Mono;

@Service
public class FareCalculationService {

    private final Tracer tracer = GlobalOpenTelemetry.getTracer("ride-hailing");

    public Mono<FareEstimate> calculate(RideRequest request) {
        // Mono.defer starts a fresh span for every subscription
        return Mono.defer(() -> {
            Span fareSpan = tracer.spanBuilder("fare.calculate.full")
                .setAttribute("rider.id", request.getRiderId())
                .setAttribute("pickup.zone", request.getPickupZoneId())
                .setAttribute("dropoff.zone", request.getDropoffZoneId())
                .startSpan();

            // The scope is current only while the chain is assembled. It closes
            // before the operators run, so the lambdas use fareSpan directly.
            try (Scope scope = fareSpan.makeCurrent()) {
                return calculateDistance(request)
                    .flatMap(distance -> {
                        fareSpan.setAttribute("distance.km", distance);
                        return getBaseRate(request.getPickupZoneId());
                    })
                    .flatMap(rate -> {
                        fareSpan.addEvent("base_rate_resolved",
                            Attributes.of(
                                AttributeKey.doubleKey("rate.per_km"), rate));
                        return applySurge(request, rate);
                    })
                    .flatMap(surgedRate -> {
                        fareSpan.addEvent("surge_applied");
                        return applyPromotions(request, surgedRate);
                    })
                    .map(finalFare -> {
                        fareSpan.setAttribute("fare.amount", finalFare.doubleValue());
                        fareSpan.setAttribute("fare.currency", "USD");
                        fareSpan.setStatus(StatusCode.OK);
                        return new FareEstimate(finalFare, request);
                    })
                    .doOnError(err -> {
                        fareSpan.setStatus(StatusCode.ERROR, err.getMessage());
                        fareSpan.recordException(err);
                    })
                    // Runs on completion, error, and cancellation
                    .doFinally(signal -> fareSpan.end());
            }
        });
    }
}

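The manual approach gives you span events (timestamped log entries within the span), dynamic attributes set at different stages, and exception recording. The doFinally ensures the span ends on success, error, or cancellation. Note that the makeCurrent() scope is active only while the chain is assembled; the operators run after it has closed, which is why the lambdas reference fareSpan directly instead of Span.current().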
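The lifetime of a thread-bound scope in asynchronous code can be illustrated with a stdlib-only sketch. It uses a ThreadLocal and a CompletableFuture as stand-ins for the tracing context and the reactive chain; ScopeLifetimeSketch and its members are hypothetical and do not use the OpenTelemetry API.

```java
import java.util.concurrent.CompletableFuture;

// Hypothetical stand-in for a tracing library's thread-bound "current span".
final class ScopeLifetimeSketch {

    static final ThreadLocal<String> CURRENT = new ThreadLocal<>();

    /** Closeable handle that restores the previous "current" value. */
    interface Scope extends AutoCloseable {
        @Override
        void close();
    }

    static Scope makeCurrent(String spanName) {
        String previous = CURRENT.get();
        CURRENT.set(spanName);
        return () -> CURRENT.set(previous);
    }

    /**
     * Returns two values observed inside the async callback:
     * [0] the thread-bound "current" span, [1] the span captured by reference.
     */
    static String[] run() {
        CompletableFuture<Integer> source = new CompletableFuture<>();
        String span = "fare.calculate.full";
        CompletableFuture<String[]> result;
        try (Scope scope = makeCurrent(span)) {
            // Only the pipeline is assembled here; the callback has not run yet.
            result = source.thenApply(v -> new String[] {CURRENT.get(), span});
        }
        // The callback runs now, after the scope has already been closed.
        source.complete(1);
        return result.join();
    }

    public static void main(String[] args) {
        String[] seen = run();
        System.out.println("thread-bound current: " + seen[0]);
        System.out.println("captured reference:   " + seen[1]);
    }
}
```

The callback no longer sees the thread-bound value but still sees the captured reference, which is the reason the fare calculation listing sets attributes on the span variable it holds rather than looking the span up at execution time.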

The choice between @WithSpan and manual Tracer:

Criteria                    @WithSpan           Manual Tracer
Simple timing               Yes                 Overkill
Method parameters as attrs  Yes (@SpanAttribute) Yes (setAttribute)
Dynamic attributes          No                  Yes (set during execution)
Span events                 No                  Yes (addEvent)
Reactive chain spans        Fragile             Correct (doFinally)
Error recording             Automatic           Manual (recordException)

For the driver matching service, @WithSpan is sufficient: every attribute is known when the method is called, and the agent ends the span when the returned Mono terminates. For fare calculation, the manual API is required because attributes like fare.amount are only known at the end of the reactive chain, and span events mark the progress through each computation stage.

Context Propagation Across Kafka

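// SCALED: Kafka producer - OTel agent handles context injection
import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.stereotype.Service;
import reactor.core.publisher.Mono;

@Service
public class RideEventPublisher {

    private final KafkaTemplate<String, RideEvent> kafkaTemplate;

    public RideEventPublisher(KafkaTemplate<String, RideEvent> kafkaTemplate) {
        this.kafkaTemplate = kafkaTemplate;
    }

    public Mono<Void> publishRideRequested(RideRequest request, FareEstimate fare) {
        RideEvent event = new RideEvent(
            request.getRideId(),
            "RIDE_REQUESTED",
            request.getRiderId(),
            fare.getAmount()
        );

        // The OTel agent injects traceparent into Kafka headers automatically.
        // The supplier form defers the send until the Mono is subscribed.
        return Mono.fromFuture(
            () -> kafkaTemplate.send("ride-events", request.getRideId(), event)
        ).then();
    }
}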

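// SCALED: Kafka consumer - agent extracts context and continues the trace
import io.opentelemetry.instrumentation.annotations.SpanAttribute;
import io.opentelemetry.instrumentation.annotations.WithSpan;
import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.kafka.support.KafkaHeaders;
import org.springframework.messaging.handler.annotation.Header;
import org.springframework.messaging.handler.annotation.Payload;
import org.springframework.stereotype.Component;

@Component
public class TripAnalyticsConsumer {

    // AnalyticsStore is an assumed type name for the store used below
    private final AnalyticsStore analyticsStore;

    public TripAnalyticsConsumer(AnalyticsStore analyticsStore) {
        this.analyticsStore = analyticsStore;
    }

    @WithSpan("analytics.process_ride_event")
    @KafkaListener(topics = "ride-events", groupId = "trip-analytics")
    public void processRideEvent(
            @Header(KafkaHeaders.RECEIVED_KEY)
            @SpanAttribute("event.ride_id") String key,
            @Payload RideEvent event) {
        // This span joins the producer's trace
        // The trace connects rider-api → kafka → trip-analytics
        analyticsStore.recordRideRequest(event);
    }
}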

The agent handles W3C traceparent injection on the producer and extraction on the consumer. No code needed. The consumer’s analytics.process_ride_event span shares the same trace ID as the producer’s kafka.produce span.
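To make the traceparent header concrete, here is a small sketch of what injection and extraction amount to. The agent does this internally; the code below is only an illustration of the W3C Trace Context header layout (version 00), and TraceparentSketch, TraceContext, inject, and extract are hypothetical names.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration of the W3C traceparent header, version 00:
//   00-<32 hex trace id>-<16 hex parent span id>-<2 hex flags>
final class TraceparentSketch {

    record TraceContext(String traceId, String parentSpanId, boolean sampled) {}

    private static final Pattern FORMAT =
        Pattern.compile("^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$");

    /** Producer side: serialize the current context into a header value. */
    static String inject(TraceContext ctx) {
        return "00-" + ctx.traceId() + "-" + ctx.parentSpanId()
            + "-" + (ctx.sampled() ? "01" : "00");
    }

    /** Consumer side: parse the header; empty when it is missing or malformed. */
    static Optional<TraceContext> extract(String header) {
        if (header == null) {
            return Optional.empty();
        }
        Matcher m = FORMAT.matcher(header.trim());
        if (!m.matches()) {
            return Optional.empty();
        }
        // All-zero trace or span IDs are invalid per the specification.
        if (m.group(1).equals("0".repeat(32)) || m.group(2).equals("0".repeat(16))) {
            return Optional.empty();
        }
        boolean sampled = (Integer.parseInt(m.group(3), 16) & 0x01) == 0x01;
        return Optional.of(new TraceContext(m.group(1), m.group(2), sampled));
    }

    public static void main(String[] args) {
        TraceContext producer = new TraceContext(
            "4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7", true);
        String header = inject(producer);
        System.out.println(header);
        System.out.println(extract(header).orElseThrow());
    }
}
```

The consumer creates its own span ID but keeps the extracted trace ID, which is what ties the two services into one trace.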

Kubernetes Manifest for OTel Collector Sidecar

# SCALED: OTel Collector as sidecar in Kubernetes
apiVersion: apps/v1
kind: Deployment
metadata:
  name: rider-api
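spec:
  # A Deployment needs a selector that matches the pod template labels;
  # the app label value here is an assumed name
  selector:
    matchLabels:
      app: rider-api
  template:
    metadata:
      labels:
        app: rider-api
    spec: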
      containers:
        - name: rider-api
          image: rider-api:latest
          env:
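            - name: OTEL_EXPORTER_OTLP_ENDPOINT
              value: "http://localhost:4317"
            # Agent 2.x defaults to http/protobuf; 4317 is the gRPC port
            - name: OTEL_EXPORTER_OTLP_PROTOCOL
              value: "grpc"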
            - name: OTEL_SERVICE_NAME
              value: "rider-api"
            - name: OTEL_TRACES_SAMPLER
              value: "parentbased_traceidratio"
            - name: OTEL_TRACES_SAMPLER_ARG
              value: "0.1"
          ports:
            - containerPort: 8080

        - name: otel-collector
          image: otel/opentelemetry-collector-contrib:0.100.0
          args: ["--config=/etc/otel/config.yaml"]
          volumeMounts:
            - name: otel-config
              mountPath: /etc/otel
          ports:
            - containerPort: 4317
            - containerPort: 4318
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
            limits:
              cpu: 500m
              memory: 256Mi

      volumes:
        - name: otel-config
          configMap:
            name: otel-collector-config

The sidecar pattern means the application sends spans to localhost:4317. No cross-network latency. The Collector handles batching, retry, and export to Tempo. If Tempo is unavailable, the Collector buffers spans in memory up to the configured limit. The application never blocks on trace export.
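The "buffer up to a limit, never block the caller" behaviour described above can be sketched with a bounded queue. This is an illustration of the idea only, not the Collector's or the SDK's implementation; BoundedSpanBuffer and its methods are hypothetical, and spans are represented as plain strings.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch of a non-blocking, bounded export buffer.
final class BoundedSpanBuffer {

    private final BlockingQueue<String> queue;
    private final AtomicLong dropped = new AtomicLong();

    BoundedSpanBuffer(int capacity) {
        this.queue = new ArrayBlockingQueue<>(capacity);
    }

    /** Request path: offer() returns immediately; a full buffer drops the span. */
    boolean enqueue(String span) {
        boolean accepted = queue.offer(span);
        if (!accepted) {
            dropped.incrementAndGet();
        }
        return accepted;
    }

    /** Export path: take up to maxBatchSize spans for one export call. */
    List<String> drainBatch(int maxBatchSize) {
        List<String> batch = new ArrayList<>();
        queue.drainTo(batch, maxBatchSize);
        return batch;
    }

    long droppedCount() {
        return dropped.get();
    }

    public static void main(String[] args) {
        BoundedSpanBuffer buffer = new BoundedSpanBuffer(2);
        buffer.enqueue("fare.calculate.full");
        buffer.enqueue("driver.matching.find_nearest");
        buffer.enqueue("driver.matching.score_candidates"); // buffer is full here
        System.out.println("batch: " + buffer.drainBatch(10));
        System.out.println("dropped: " + buffer.droppedCount());
    }
}
```

The trade-off is explicit: when the backend is down for longer than the buffer can absorb, telemetry is lost rather than request latency, so the drop counter is worth exporting as a metric.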

Performance Impact

The OTel Java agent adds overhead. Measure it:

Metric                    Without Agent    With Agent    Delta
p50 latency               142ms            145ms         +2.1%
p99 latency               310ms            318ms         +2.6%
CPU usage (avg)           34%              36%           +2 pts
Memory (heap)             412MB            438MB         +26MB
Throughput (RPS)          2,840            2,790         -1.8%

The overhead is under 3% for latency and under 2% for throughput. The 26MB heap increase comes from span buffering before export. At a 10% sampling rate the instrumentation still runs on every request, because trace context has to be created and propagated, but only the sampled 10% of traces record attributes and get exported. The instrumentation cost is roughly fixed. The recording and export cost scales with the sampling rate.
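The deltas in the table can be reproduced from the before and after columns. The sketch below recomputes them; OverheadSketch and relativeDeltaPercent are hypothetical names, and the input values are the ones from the table. Note that the CPU row moves from 34% to 36%, which is 2 percentage points, a larger change when expressed relative to the baseline.

```java
// Hypothetical helper that recomputes the relative deltas from the table.
final class OverheadSketch {

    /** Relative change from before to after, in percent of the baseline. */
    static double relativeDeltaPercent(double before, double after) {
        return (after - before) / before * 100.0;
    }

    public static void main(String[] args) {
        System.out.printf("p50 latency: %+.1f%%%n", relativeDeltaPercent(142, 145));
        System.out.printf("p99 latency: %+.1f%%%n", relativeDeltaPercent(310, 318));
        System.out.printf("throughput:  %+.1f%%%n", relativeDeltaPercent(2840, 2790));
        // CPU: the difference in percentage points vs. the relative change
        System.out.printf("CPU: %+.0f points, %+.1f%% relative%n",
            36.0 - 34.0, relativeDeltaPercent(34, 36));
    }
}
```

Keeping the calculation next to the load-test results makes it easy to rerun the comparison after an agent upgrade or after disabling an instrumentation.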

If 3% overhead is unacceptable, disable specific instrumentations:

# Disable auto-instrumentation for low-value spans
otel.instrumentation.lettuce.enabled=false
otel.instrumentation.r2dbc.enabled=false

Disable only after confirming those spans are not needed for diagnosis. Disabling R2DBC instrumentation would have made the connection pool wait time invisible.

The Proof

Deploy the instrumented rider API. Send 100 ride requests:

# SCALED: Verify instrumentation coverage
for i in $(seq 1 100); do
  curl -s -X POST http://rider-api:8080/api/rides/request \
    -H "Content-Type: application/json" \
    -d '{
      "rider_id": "rider-'$i'",
      "pickup": {"lat": 40.7128, "lng": -74.0060},
      "dropoff": {"lat": 40.7589, "lng": -73.9851}
    }' &
done
wait

Query Tempo for traces from the rider API. With the 10% sampler configured above, expect roughly 10 of the 100 requests to be recorded:

{resource.service.name="rider-api" && name="POST /api/rides/request"}

Each trace should contain:

Span                                  Attributes
POST /api/rides/request               http.method, http.route
  fare.calculate.full                 rider.id, pickup.zone, fare.amount
    fare.surge_pricing                zone.id
    SELECT pricing_rules              db.statement
  driver.matching.find_nearest        location.lat, location.lng, radius.km
    driver.matching.score_candidates  candidate.count
  redis.GET                           db.system=redis
  kafka.produce ride-events           messaging.destination

Eight spans per trace. Four auto-instrumented, four custom. Zero blind spots in the critical path.

Before instrumentation: 2.8-second gap in trace waterfall, invisible business logic. After instrumentation: complete span tree, every operation visible, filterable by rider ID, zone, fare amount.