Essential Observability: 3 Critical Alerts for LLM Systems

These articles are AI-generated summaries. Please check the original sources for full details.

The 3 Alerts Every LLM Team Should Have Set Up by Tomorrow

LLM systems can rack up four-figure model spend in 90 seconds from a single runaway conversation. Gabriel Anhaia details three specific alerts focusing on cost, quality, and retrieval that catch failures before users churn.

Why This Matters

Real-world LLM systems fail silently; a retriever might return 200 OK while providing irrelevant context that forces a model to synthesize nonsense. Without per-conversation cost tracking, an agent loop can execute hundreds of cheap calls that aggregate into a massive financial hit, often discovered hours too late by finance teams rather than engineering on-call responders.

Key Insights

  • OpenTelemetry GenAI semantic conventions were significantly revised by March 2026, including renaming gen_ai.usage.prompt_tokens to gen_ai.usage.input_tokens.
  • Per-conversation cost monitoring sums spend over a rolling 5-minute window, so it catches runaway agent loops that per-call thresholds miss.
  • Judge-score drift detection compares 1-day averages against 7-day baselines to identify subtle prompt regressions or provider model shifts.
  • Retrieval-relevance alerts use a relevance_score to detect index drift, preventing scenarios where models produce fluent but factually incorrect answers.
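The judge-score drift comparison can be sketched in a few lines of Python. This is a minimal illustration, not the article's implementation: the function name judge_score_drift, the max_drop threshold of 0.05, and the choice to exclude the most recent day from the 7-day baseline are all assumptions made for the example.

```python
from datetime import datetime, timedelta
from statistics import mean


def judge_score_drift(samples, now, max_drop=0.05):
    """Compare the last day's mean judge score against the prior baseline.

    samples: iterable of (timestamp, score) pairs.
    max_drop: hypothetical tolerance; tune it to your score scale.
    Returns (drifted, recent_avg, baseline_avg).
    """
    day_ago = now - timedelta(days=1)
    week_ago = now - timedelta(days=7)
    samples = list(samples)

    # Scores from the last 24 hours.
    recent = [s for t, s in samples if day_ago <= t <= now]
    # Assumption: the baseline covers days 2-7 and excludes the recent day,
    # so a fresh regression does not dilute its own baseline.
    baseline = [s for t, s in samples if week_ago <= t < day_ago]

    if not recent or not baseline:
        # Not enough data to decide; report no drift rather than guess.
        return False, None, None

    recent_avg = mean(recent)
    baseline_avg = mean(baseline)
    return (baseline_avg - recent_avg) > max_drop, recent_avg, baseline_avg
```

Running the check per tenant or per model alias, rather than once globally, avoids averaging a localized regression away.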

Working Examples

Python emitter for GenAI spans with cost calculation and conversation tracking.

from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode

tracer = trace.get_tracer("app.llm")

# (input, output) price in USD per 1,000 tokens, keyed by model name.
COSTS = {
    "gpt-4o-2024-11-20": (0.0025, 0.0100),
    "gpt-4o-mini": (0.00015, 0.00060),
}

def usd(model: str, in_tok: int, out_tok: int) -> float:
    # Models missing from COSTS are priced at zero, so keep the table current.
    cin, cout = COSTS.get(model, (0.0, 0.0))
    return (in_tok / 1000) * cin + (out_tok / 1000) * cout

def emit_llm_span(model, provider, usage, conv_id):
    # provider is accepted but not recorded on the span in this excerpt.
    with tracer.start_as_current_span("gen_ai.chat") as span:
        span.set_attribute("gen_ai.request.model", model)
        span.set_attribute("gen_ai.usage.input_tokens", usage["in"])
        span.set_attribute("gen_ai.usage.output_tokens", usage["out"])
        span.set_attribute("gen_ai.conversation.id", conv_id)
        span.set_attribute("app.llm.cost_usd", usd(model, usage["in"], usage["out"]))

Prometheus alert for any single conversation exceeding $25 in a rolling 5-minute window.

sum by (gen_ai_conversation_id) (rate(app_llm_cost_usd_sum[5m]) * 300) > 25
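The same rolling-window logic can also run in-process, which is one way a kill switch might act before the alert pipeline fires. The sketch below is an assumption-laden illustration: ConversationCostWindow and its record method are hypothetical names, and only the $25 limit and 5-minute window come from the alert above.

```python
from collections import defaultdict, deque


class ConversationCostWindow:
    """Tracks per-conversation spend over a rolling time window (hypothetical helper)."""

    def __init__(self, limit_usd=25.0, window_s=300.0):
        self.limit_usd = limit_usd
        self.window_s = window_s
        # conv_id -> deque of (timestamp_seconds, cost_usd), oldest first
        self._events = defaultdict(deque)

    def record(self, conv_id, cost_usd, ts):
        """Add one call's cost; return True if the window total exceeds the limit."""
        events = self._events[conv_id]
        events.append((ts, cost_usd))

        # Evict calls that have aged out of the window.
        cutoff = ts - self.window_s
        while events and events[0][0] <= cutoff:
            events.popleft()

        return sum(cost for _, cost in events) > self.limit_usd


# Usage: many cheap calls in one conversation eventually trip the limit.
window = ConversationCostWindow()
tripped_at = None
for second in range(100):
    if window.record("conv-1", 0.30, ts=float(second)):
        tripped_at = second
        break
```

No single $0.30 call would trip a per-call threshold here; only the cumulative total does. What to do when record returns True (abort the loop, page on-call, or both) is left to the caller.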

Practical Applications

  • Use Case: Conversation Kill Switch. Systems using gen_ai.conversation.id can automatically terminate runaway loops that exceed $25 in spend. Pitfall: Alerting on per-tenant cost instead of per-conversation cost, which masks high-velocity individual failures.
  • Use Case: Model Version Management. Tracking app.llm.judge.score helps detect regressions when providers rotate model aliases. Pitfall: Using global averages that smooth over critical regressions affecting only specific tenants.
  • Use Case: RAG Index Validation. Monitoring app.rag.relevance_score detects index drift where re-indexing produces worse chunks. Pitfall: Skipping relevance alerts because latency and status codes appear normal.
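The pitfall of global averages applies to retrieval relevance as much as to judge scores. As a rough sketch, the function below groups relevance scores by a key such as tenant or index and flags only the groups whose mean falls below a floor. The function name, the 0.5 floor, the 20-sample minimum, and the assumption that scores lie on a 0 to 1 scale are all illustrative choices, not values from the article.

```python
from collections import defaultdict
from statistics import mean


def low_relevance_groups(samples, floor=0.5, min_samples=20):
    """Return {group_key: mean_score} for groups whose mean relevance is below floor.

    samples: iterable of (group_key, relevance_score) pairs, e.g. keyed by tenant.
    floor, min_samples: hypothetical thresholds; tune them to your scorer.
    """
    by_group = defaultdict(list)
    for key, score in samples:
        by_group[key].append(score)

    flagged = {}
    for key, scores in by_group.items():
        # Skip groups with too few samples to give a stable mean.
        if len(scores) < min_samples:
            continue
        avg = mean(scores)
        if avg < floor:
            flagged[key] = avg
    return flagged


# Usage: one tenant's index has degraded while the global mean still looks healthy.
scores = [("tenant-a", 0.8)] * 90 + [("tenant-b", 0.3)] * 30
global_mean = mean(score for _, score in scores)
flagged = low_relevance_groups(scores)
```

In this example the global mean stays above the floor while tenant-b is flagged, which is the failure mode that latency and status-code dashboards cannot show.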
