Essential Observability: 3 Critical Alerts for LLM Systems

The 3 Alerts Every LLM Team Should Have Set Up by Tomorrow

LLM systems can rack up four-figure model spend in 90 seconds from a single runaway conversation. Gabriel Anhaia details three specific alerts focusing on cost, quality, and retrieval that catch failures before users churn.

Why This Matters

Real-world LLM systems fail silently; a retriever might return 200 OK while providing irrelevant context that forces a model to synthesize nonsense. Without per-conversation cost tracking, an agent loop can execute hundreds of cheap calls that aggregate into a massive financial hit, often discovered hours too late by finance teams rather than engineering on-call responders.

Key Insights

OpenTelemetry GenAI semantic conventions were significantly revised by March 2026, including the rename of gen_ai.usage.prompt_tokens to gen_ai.usage.input_tokens.
Per-conversation cost monitoring sums spend over a rolling 5-minute window, which catches runaway agent loops that per-call thresholds miss.
Judge-score drift detection compares 1-day averages against 7-day baselines to identify subtle prompt regressions or provider model shifts.
Retrieval-relevance alerts use a relevance_score to detect index drift, preventing scenarios where models produce fluent but factually incorrect answers.
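The judge-score drift check described above can be sketched in a few lines of Python. This is a minimal illustration, not the article's implementation: the function names, the 0.1 drop threshold, and the choice to build the 7-day baseline from the seven days before the most recent day are all assumptions.

```python
from datetime import datetime, timedelta
from statistics import mean
from typing import Iterable, Optional, Tuple

ONE_DAY = timedelta(days=1)
EIGHT_DAYS = timedelta(days=8)


def judge_score_drop(
    samples: Iterable[Tuple[datetime, float]], now: datetime
) -> Optional[float]:
    """Return baseline mean minus recent mean, or None if either window is empty.

    Assumption: "recent" is the last 24 hours and "baseline" is the seven
    days before that, so the two windows do not overlap.
    """
    recent, baseline = [], []
    for ts, score in samples:
        age = now - ts
        if timedelta(0) <= age < ONE_DAY:
            recent.append(score)
        elif ONE_DAY <= age < EIGHT_DAYS:
            baseline.append(score)
    if not recent or not baseline:
        return None
    return mean(baseline) - mean(recent)


def should_alert(drop: Optional[float], max_drop: float = 0.1) -> bool:
    """Fire only when there is enough data and the drop exceeds the threshold."""
    return drop is not None and drop > max_drop
```

Keeping the windows disjoint means a sudden regression does not drag the baseline down with it. Computing the drop per tenant or per route rather than globally would address the averaging pitfall noted under Practical Applications.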

Working Examples

Python emitter for GenAI spans with cost calculation and conversation tracking.

from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode
tracer = trace.get_tracer("app.llm")
# USD per 1,000 tokens as (input, output). Models missing from the table are priced at 0.0.
COSTS = {"gpt-4o-2024-11-20": (0.0025, 0.0100), "gpt-4o-mini": (0.00015, 0.00060)}
def usd(model: str, in_tok: int, out_tok: int) -> float:
    cin, cout = COSTS.get(model, (0.0, 0.0))
    return (in_tok / 1000) * cin + (out_tok / 1000) * cout
def emit_llm_span(model, provider, usage, conv_id):
    # One span per model call, tagged with the conversation it belongs to.
    with tracer.start_as_current_span("gen_ai.chat") as span:
        span.set_attribute("gen_ai.request.model", model)
        span.set_attribute("gen_ai.usage.input_tokens", usage["in"])
        span.set_attribute("gen_ai.usage.output_tokens", usage["out"])
        span.set_attribute("gen_ai.conversation.id", conv_id)
        # Cost is computed at emit time so alerts can aggregate it per conversation.
        span.set_attribute("app.llm.cost_usd", usd(model, usage["in"], usage["out"]))
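To make the "hundreds of cheap calls" risk concrete, the following sketch reuses the price table and usd formula from the emitter to estimate how many identical calls it takes to reach a budget. The helper calls_to_reach and the per-call token counts (8,000 input, 500 output) are hypothetical values chosen for illustration.

```python
import math

# Per-1,000-token prices (input, output), copied from the emitter's COSTS table.
COSTS = {"gpt-4o-2024-11-20": (0.0025, 0.0100), "gpt-4o-mini": (0.00015, 0.00060)}


def usd(model: str, in_tok: int, out_tok: int) -> float:
    """Same cost formula as the emitter: tokens / 1000 * price per 1K tokens."""
    cin, cout = COSTS.get(model, (0.0, 0.0))
    return (in_tok / 1000) * cin + (out_tok / 1000) * cout


def calls_to_reach(budget_usd: float, model: str, in_tok: int, out_tok: int) -> int:
    """Hypothetical helper: identical calls needed to reach budget_usd."""
    per_call = usd(model, in_tok, out_tok)
    if per_call <= 0:
        # Unknown models price at 0.0, which would hide their spend entirely.
        raise ValueError(f"no price configured for model {model!r}")
    # Round first so floating-point noise does not add a spurious extra call.
    return math.ceil(round(budget_usd / per_call, 6))


# Assumed call shape: 8,000 input tokens and 500 output tokens per call.
per_call = usd("gpt-4o-2024-11-20", 8000, 500)
calls = calls_to_reach(25.0, "gpt-4o-2024-11-20", 8000, 500)
```

Under these assumed token counts each call costs about $0.025, so roughly 1,000 calls reach $25. No single call looks expensive, which is why a per-call threshold does not fire.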

Prometheus alert for any single conversation exceeding $25 in a rolling 5-minute window.

sum by (gen_ai_conversation_id) (rate(app_llm_cost_usd_sum[5m]) * 300) > 25
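In the rule above, rate(...[5m]) * 300 converts a per-second rate into an estimated dollar increase over the 300-second window. The same check can be approximated in-process, which is useful when a request path needs to react before the alert pipeline does. The class below is a sketch under assumptions: the name ConversationCostWindow and its interface are hypothetical, and timestamps are passed in explicitly so the logic is easy to test.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, Tuple


class ConversationCostWindow:
    """Track per-conversation spend over a rolling time window (hypothetical helper)."""

    def __init__(self, limit_usd: float = 25.0, window_s: float = 300.0) -> None:
        self.limit_usd = limit_usd
        self.window_s = window_s
        # conversation id -> (timestamp_seconds, cost_usd) events, oldest first
        self._events: Dict[str, Deque[Tuple[float, float]]] = defaultdict(deque)

    def record(self, conv_id: str, cost_usd: float, now: float) -> bool:
        """Record one call's cost; return True if the window total exceeds the limit."""
        events = self._events[conv_id]
        events.append((now, cost_usd))
        # Drop events that have aged out of the rolling window.
        while events and now - events[0][0] > self.window_s:
            events.popleft()
        # Re-sum instead of keeping a running total to avoid float drift.
        return sum(cost for _, cost in events) > self.limit_usd
```

A caller would pass something like time.monotonic() as now and stop the agent loop when record returns True. The Prometheus rule remains the source of truth for paging; this sketch only shows the windowing logic.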

Practical Applications

Use Case: Conversation Kill Switch. Systems that tag spans with gen_ai.conversation.id can automatically terminate runaway loops once a conversation exceeds $25 in spend. Pitfall: Alerting on per-tenant cost instead of per-conversation cost, which masks high-velocity individual failures.
Use Case: Model Version Management. Tracking app.llm.judge.score helps detect regressions when providers rotate model aliases. Pitfall: Using global averages that smooth over critical regressions affecting only specific tenants.
Use Case: RAG Index Validation. Monitoring app.rag.relevance_score detects index drift where re-indexing produces worse chunks. Pitfall: Skipping relevance alerts because latency and status codes appear normal.
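The RAG pitfall above, where status codes look healthy while relevance degrades, can be illustrated with a short sketch. The RetrievalEvent type, the 0.5 relevance floor, and the 20-sample minimum are assumptions for illustration; the source does not state a scale or threshold for app.rag.relevance_score.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Sequence


@dataclass
class RetrievalEvent:
    status_code: int
    relevance_score: float  # assumed to lie in [0, 1], higher is better


def status_healthy(events: Sequence[RetrievalEvent]) -> bool:
    """The conventional check: every retrieval returned 200."""
    return all(e.status_code == 200 for e in events)


def relevance_alert(
    events: Sequence[RetrievalEvent], floor: float = 0.5, min_samples: int = 20
) -> bool:
    """Fire when mean relevance over the window drops below the floor.

    min_samples avoids paging on a handful of requests.
    """
    if len(events) < min_samples:
        return False
    return mean(e.relevance_score for e in events) < floor


# A window where every request succeeds but the retrieved chunks are poor.
window = [RetrievalEvent(status_code=200, relevance_score=0.3) for _ in range(30)]
```

For this window status_healthy returns True while relevance_alert also returns True, which is the silent-failure case a status-code dashboard cannot show.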

References:

https://dev.to/gabrielanhaia/the-3-alerts-every-llm-team-should-have-set-up-by-tomorrow-2o45
