Building a Vendor-Neutral ML Observability Stack with OpenTelemetry and VictoriaMetrics
These articles are AI-generated summaries. Please check the original sources for full details.
Monitoring an ML Pipeline in Production: Anatomy of an Open-Source Stack
Samuel Desseaux, founder of Erythix, details a field-tested observability architecture built on the AI Observability Hub platform. This stack utilizes OpenTelemetry and VictoriaMetrics to provide high-fidelity monitoring of ML pipelines, including real-time GPU cost estimation and model drift detection.
Why This Matters
Standard infrastructure monitoring often fails to capture the ‘silent degradation’ of ML models, where systems remain operational but produce low-quality outputs. Without monitoring the intersection of model confidence, input data drift, and inference costs, organizations risk running models that are technically functional but economically destructive, such as a process costing €3 per inference for a use case generating only €0.50 in value. Moving from basic CPU/RAM metrics to a four-layer observability strategy (infrastructure, data pipeline, model quality, and cost) is essential for maintaining production-grade AI systems and complying with governance requirements like the EU AI Act.
Key Insights
- OpenTelemetry provides a vendor-agnostic instrumentation standard, allowing teams to swap backends without re-instrumenting application code.
- VictoriaMetrics is selected over Prometheus for its ability to handle high-cardinality labels (model versions, environments) with significantly lower memory and disk footprints.
- Tail sampling in the OpenTelemetry Collector preserves 100% of error traces for debugging while sampling successful requests to reduce storage volume.
- Model drift detection requires establishing a 30-day baseline during the initial production run to calculate adaptive thresholds for confidence scores.
- GPU cost monitoring via custom OTel counters enables linear extrapolation of end-of-month costs and efficiency ratios like tokens-per-euro.
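The cost arithmetic in the last point can be sketched in a few lines of Python. This is a minimal illustration, not the author's implementation: the `CostSnapshot` type, its field names, and the sample figures are hypothetical, and the month-to-date totals are assumed to have been read back from the cost and token counters.

```python
from dataclasses import dataclass


@dataclass
class CostSnapshot:
    """Month-to-date totals (hypothetical structure, e.g. queried from the counters)."""
    cost_eur: float      # GPU spend accumulated so far this month
    tokens: int          # tokens generated so far this month
    days_elapsed: float  # days since the start of the month
    days_in_month: int


def projected_month_cost(s: CostSnapshot) -> float:
    """Linear extrapolation: assume the average daily spend so far continues."""
    if s.days_elapsed <= 0:
        raise ValueError("days_elapsed must be positive")
    return s.cost_eur / s.days_elapsed * s.days_in_month


def tokens_per_euro(s: CostSnapshot) -> float:
    """Efficiency ratio: tokens produced per euro of GPU spend."""
    if s.cost_eur <= 0:
        raise ValueError("cost_eur must be positive")
    return s.tokens / s.cost_eur


# Illustrative numbers only, not taken from the article.
snapshot = CostSnapshot(cost_eur=1200.0, tokens=48_000_000,
                        days_elapsed=10, days_in_month=30)
print(projected_month_cost(snapshot))  # 3600.0
print(tokens_per_euro(snapshot))       # 40000.0
```

Linear extrapolation ignores weekly seasonality and traffic growth, so it is best read as a rough early warning rather than a forecast.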
Working Examples
Custom Python instrumentation for capturing ML-specific metrics like duration and token count using OpenTelemetry.
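import time

from opentelemetry import metrics

# Instruments are created once and reused for every request.
meter = metrics.get_meter("ml.inference")
inference_duration = meter.create_histogram("ml.inference.duration", unit="ms")
inference_tokens = meter.create_counter("ml.inference.tokens.total")

def predict(request):
    # `model` is assumed to be loaded elsewhere in the service.
    start = time.perf_counter()  # monotonic clock, suited to measuring durations
    result = model.generate(request.prompt)
    duration_ms = (time.perf_counter() - start) * 1000
    # Keep attribute values low-cardinality (model and use case, not user IDs).
    labels = {"model_name": "llama-3-8b", "use_case": "maintenance_assistant"}
    inference_duration.record(duration_ms, labels)
    inference_tokens.add(result.token_count, labels)
    return result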
OpenTelemetry Collector configuration for tail sampling and exporting metrics to VictoriaMetrics.
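receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
processors:
  # tail_sampling applies to traces; the decision is made once the trace is buffered.
  tail_sampling:
    policies:
      # Keep every trace that contains an error.
      - name: errors-always
        type: status_code
        status_code: {status_codes: [ERROR]}
      # Probabilistically keep 10% of traces (covers the non-error ones).
      - name: sample-rest
        type: probabilistic
        probabilistic: {sampling_percentage: 10}
exporters:
  # VictoriaMetrics ingests Prometheus remote write on this path.
  prometheusremotewrite:
    endpoint: "http://victoriametrics:8428/api/v1/write"
# A service.pipelines section (not shown here) is still needed to wire these components together.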
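To make the effect of the two sampling policies concrete, here is a rough Python sketch of the decision they describe. It is a simplification under stated assumptions: the `Span` type and `keep_trace` function are hypothetical names, and a random draw stands in for whatever selection mechanism the Collector uses internally. The key idea it shows is that the decision is taken on the complete trace, so one failed span is enough to keep all of it.

```python
import random
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Span:
    name: str
    status: str  # "OK", "ERROR", or "UNSET"


def keep_trace(spans: Sequence[Span], sampling_percentage: float = 10.0,
               rng: Optional[random.Random] = None) -> bool:
    """Decide whether to keep a trace after all of its spans have arrived."""
    rng = rng or random.Random()
    # Policy 1 (errors-always): any span with ERROR status keeps the trace.
    if any(span.status == "ERROR" for span in spans):
        return True
    # Policy 2 (sample-rest): keep roughly sampling_percentage of the rest.
    return rng.random() * 100 < sampling_percentage


failed = [Span("load_features", "OK"), Span("generate", "ERROR")]
print(keep_trace(failed))  # True
```

Head sampling cannot do this, because it decides when the first span starts, before anyone knows whether the request will fail.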
Practical Applications
- Predictive Maintenance: Detecting confidence score degradation within 72 hours of sensor hardware changes on a factory floor.
- Cost Optimization: Identifying marginal use cases that consume 35% of the GPU budget, which allows model rightsizing and a 30% reduction in the total bill.
- Pitfall: Cardinality Explosion - Attaching high-cardinality labels like ‘user_id’ to metrics, rather than keeping them in logs, can crash time-series databases.
- Pitfall: Cargo Cult Monitoring - Implementing dashboards with dozens of panels that lack baselines, leading to alert fatigue and ignored signals.
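One way to give an alert the baseline that the last pitfall says is missing is to derive its threshold from recorded data. The sketch below is an assumption-laden illustration, not the method used in the original stack: it takes 30 daily mean confidence scores as the baseline and alerts when recent scores fall more than `k` standard deviations below the baseline mean. The function names, the choice of `k = 3`, and the sample values are all hypothetical.

```python
from statistics import mean, stdev
from typing import Sequence


def adaptive_threshold(baseline_scores: Sequence[float], k: float = 3.0) -> float:
    """Lower alert bound: baseline mean minus k sample standard deviations."""
    if len(baseline_scores) < 2:
        raise ValueError("need at least two baseline points")
    return mean(baseline_scores) - k * stdev(baseline_scores)


def is_degraded(recent_scores: Sequence[float], threshold: float) -> bool:
    """Flag drift when the recent average confidence drops below the threshold."""
    return mean(recent_scores) < threshold


# 30 illustrative daily mean confidence scores (made-up values).
baseline = [0.90, 0.92, 0.91, 0.89, 0.93] * 6
threshold = adaptive_threshold(baseline)

print(is_degraded([0.90, 0.91, 0.92], threshold))  # False
print(is_degraded([0.84, 0.85, 0.83], threshold))  # True
```

A threshold computed this way adapts to each model's normal variability, so a naturally noisy model does not page anyone for routine fluctuations.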