Building a Vendor-Neutral ML Observability Stack with OpenTelemetry and VictoriaMetrics

These articles are AI-generated summaries. Please check the original sources for full details.

Monitoring an ML Pipeline in Production: Anatomy of an Open-Source Stack

Samuel Desseaux, founder of Erythix, details a field-tested observability architecture built on the AI Observability Hub platform. The stack uses OpenTelemetry and VictoriaMetrics to monitor ML pipelines, including real-time GPU cost estimation and model drift detection.

Why This Matters

Standard infrastructure monitoring often misses the ‘silent degradation’ of ML models, where systems remain operational but produce low-quality outputs. Without monitoring model confidence, input data drift, and inference costs together, organizations risk running models that are technically functional but economically destructive, such as a process costing €3 per inference for a use case generating only €0.50 in value. Moving from basic CPU/RAM metrics to a four-layer observability strategy (infrastructure, data pipeline, model quality, and cost) is essential for maintaining production-grade AI systems and complying with governance requirements like the EU AI Act.

Key Insights

  • OpenTelemetry provides a vendor-agnostic instrumentation standard, allowing teams to swap backends without re-instrumenting application code.
  • VictoriaMetrics is selected over Prometheus for its ability to handle high-cardinality labels (model versions, environments) with significantly lower memory and disk footprints.
  • Tail sampling in the OpenTelemetry Collector preserves 100% of error traces for debugging while sampling successful requests to reduce storage volume.
  • Model drift detection requires establishing a 30-day baseline during the initial production run to calculate adaptive thresholds for confidence scores.
  • GPU cost monitoring via custom OTel counters enables linear extrapolation of end-of-month costs and efficiency ratios like tokens-per-euro.
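The baseline-and-threshold idea behind drift detection can be sketched in a few lines. This is a minimal illustration, not the author's implementation: the "mean minus k standard deviations" rule, the function names, and the sample scores are all assumptions chosen for the example, and a real baseline would cover the full 30-day window rather than seven values.

```python
from statistics import mean, stdev

def adaptive_threshold(baseline_scores, k=3.0):
    """Lower alert bound: baseline mean minus k sample standard deviations.

    The k-sigma rule is an assumption for illustration; the source only says
    that thresholds are derived from a 30-day baseline.
    """
    return mean(baseline_scores) - k * stdev(baseline_scores)

def is_drifting(recent_scores, threshold):
    """Flag drift when the mean confidence of a recent window drops below the threshold."""
    return mean(recent_scores) < threshold

# Hypothetical daily mean confidence scores collected during the baseline period.
baseline = [0.90, 0.92, 0.91, 0.89, 0.93, 0.90, 0.91]
threshold = adaptive_threshold(baseline, k=3.0)
```

Because the threshold is derived from the model's own production behaviour, a naturally noisy model gets a wider tolerance band than a stable one, which is what makes the threshold adaptive rather than a fixed number.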

Working Examples

Custom Python instrumentation for capturing ML-specific metrics like duration and token count using OpenTelemetry.

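import time

from opentelemetry import metrics

# Instruments are created once and reused for every request.
meter = metrics.get_meter("ml.inference")
inference_duration = meter.create_histogram("ml.inference.duration", unit="ms")
inference_tokens = meter.create_counter("ml.inference.tokens.total")

def predict(request):
    # `model` is assumed to be loaded elsewhere in the serving process.
    start = time.perf_counter()  # monotonic clock, safe for measuring durations
    result = model.generate(request.prompt)
    duration_ms = (time.perf_counter() - start) * 1000
    # Keep attributes low-cardinality: model and use case, never per-user values.
    labels = {"model_name": "llama-3-8b", "use_case": "maintenance_assistant"}
    inference_duration.record(duration_ms, labels)
    inference_tokens.add(result.token_count, labels)
    return result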
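The GPU cost insight listed under Key Insights (linear end-of-month extrapolation and a tokens-per-euro ratio) reduces to simple arithmetic once the counter totals are available. The sketch below assumes the month-to-date cost and token totals have already been queried from the metrics backend; the function names and the sample figures are hypothetical.

```python
import calendar
from datetime import datetime

def project_month_end_cost(cost_so_far, now):
    """Linearly extrapolate month-to-date spend to the full month."""
    days_in_month = calendar.monthrange(now.year, now.month)[1]
    month_start = datetime(now.year, now.month, 1)
    elapsed_days = (now - month_start).total_seconds() / 86400
    if elapsed_days <= 0:
        raise ValueError("need some elapsed time in the month to extrapolate")
    return cost_so_far / elapsed_days * days_in_month

def tokens_per_euro(total_tokens, total_cost_eur):
    """Efficiency ratio: how many tokens one euro of GPU spend produced."""
    if total_cost_eur <= 0:
        raise ValueError("total cost must be positive")
    return total_tokens / total_cost_eur

# Hypothetical figures: 1200 EUR spent in the first 10 days of a 30-day month.
projected = project_month_end_cost(1200.0, datetime(2024, 4, 11))
efficiency = tokens_per_euro(5_000_000, 1200.0)
```

Linear extrapolation assumes the spending rate stays constant for the rest of the month, so it is best read as an early warning rather than a forecast.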

OpenTelemetry Collector configuration for tail sampling and exporting metrics to VictoriaMetrics.

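receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
processors:
  # Tail sampling applies to traces; a trace is kept if any policy matches.
  tail_sampling:
    policies:
      # Keep every trace that contains an error.
      - name: errors-always
        type: status_code
        status_code: {status_codes: [ERROR]}
      # Keep roughly 10% of the remaining traces.
      - name: sample-rest
        type: probabilistic
        probabilistic: {sampling_percentage: 10}
exporters:
  # VictoriaMetrics accepts the Prometheus remote-write protocol.
  prometheusremotewrite:
    endpoint: "http://victoriametrics:8428/api/v1/write"
# Assumed wiring, not shown in the source. tail_sampling must also be attached
# to a traces pipeline with a trace exporter, which the source does not specify.
service:
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [prometheusremotewrite]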
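To make the effect of the two sampling policies concrete, here is a small Python simulation of the decision they describe: keep every trace containing an error, and keep about 10% of the rest. It is only an illustration of the policy logic, not how the Collector implements it; the `keep_trace` function, the span dictionaries, and the hash-based bucketing are assumptions made for this example.

```python
import hashlib

def keep_trace(trace_id, spans, sampling_percentage=10.0):
    """Decide once the whole trace has been seen: keep errors, sample the rest."""
    # Policy 1: any span with an ERROR status keeps the whole trace.
    if any(span.get("status") == "ERROR" for span in spans):
        return True
    # Policy 2: map the trace ID to a stable bucket in [0, 10000) so the same
    # trace always gets the same decision, then keep the lowest buckets.
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return bucket < sampling_percentage * 100

failed = keep_trace("trace-a", [{"status": "OK"}, {"status": "ERROR"}])
```

The decision can only be made at the end of a trace because an error may appear in the last span, which is why this style of sampling runs in the Collector rather than in the application SDK.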

Practical Applications

  • Predictive Maintenance: Detecting confidence score degradation within 72 hours of sensor hardware changes on a factory floor.
  • Cost Optimization: Identifying marginal use cases that consume 35% of the GPU budget, enabling model rightsizing and a 30% reduction in the total bill.
  • Pitfall: Cardinality Explosion - Adding high-cardinality labels like ‘user_id’ to metrics instead of logs, which can crash time-series databases.
  • Pitfall: Cargo Cult Monitoring - Implementing dashboards with dozens of panels that lack baselines, leading to alert fatigue and ignored signals.
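One way to guard against the cardinality pitfall is to filter attributes before they reach a metric instrument. The sketch below is a hypothetical helper, not part of OpenTelemetry: it keeps an allow-list of low-cardinality labels for metrics and routes everything else (such as ‘user_id’) to log fields. The allow-list contents are an assumption based on the labels mentioned in this article.

```python
# Labels considered safe for metrics: each has a small, bounded set of values.
ALLOWED_METRIC_LABELS = frozenset(
    {"model_name", "model_version", "use_case", "environment"}
)

def split_attributes(attributes, allowed=ALLOWED_METRIC_LABELS):
    """Split request attributes into metric labels and log-only fields."""
    metric_labels = {k: v for k, v in attributes.items() if k in allowed}
    log_fields = {k: v for k, v in attributes.items() if k not in allowed}
    return metric_labels, log_fields

labels, log_fields = split_attributes(
    {"model_name": "llama-3-8b", "use_case": "maintenance_assistant", "user_id": "u-48213"}
)
```

The returned `labels` dictionary can be passed to a histogram or counter, while `log_fields` belongs in a structured log line, where unbounded values do not create new time series.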
