LLM Observability Audits: Reducing Error Rates and Exposing Rubric Disagreements

Three LLM Observability Audits in Five Days: Each Fix Exposed the Next Bug

Julio Molina Soler audited a self-hosted Langfuse instance and cut its error rate from 32% to zero within three days. With the infrastructure stable, a second problem became visible: LLM judges disagreed on the same outputs 17% of the time.

Why This Matters

In LLM systems, infrastructure noise often masks deeper evaluation failures. Fixing bugs like context-overflow rejections (max_tokens=720000) or $1.11 single-call retrieval errors is necessary but insufficient; once the noise floor drops, systemic rubric flaws emerge. For example, a Correctness judge may award 1.0 to a prompt-echo that a Hallucination judge scores 0.0. A router that relies on a single metric can therefore treat 1.2B-parameter models as equivalent to 120B-parameter models.
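The disagreement described above can be surfaced with a simple per-trace comparison. The following is a minimal sketch, not the author's tooling: the `JudgedTrace` record, the `gap` threshold of 0.5, and the assumption that both judges score on a 0 to 1 scale where higher means "passed" are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass


@dataclass
class JudgedTrace:
    """Hypothetical record pairing two judge scores for one trace."""
    trace_id: str
    correctness: float    # assumed scale: 0.0 (fail) to 1.0 (pass)
    hallucination: float  # assumed scale: 0.0 (fail) to 1.0 (pass)


def find_disagreements(traces, gap=0.5):
    """Return traces whose two judge scores differ by at least `gap`."""
    return [t for t in traces if abs(t.correctness - t.hallucination) >= gap]


def disagreement_rate(traces, gap=0.5):
    """Fraction of traces on which the two judges disagree."""
    if not traces:
        return 0.0
    return len(find_disagreements(traces, gap)) / len(traces)


if __name__ == "__main__":
    sample = [
        JudgedTrace("echo", 1.0, 0.0),   # prompt-echo: judges split
        JudgedTrace("good", 1.0, 1.0),
        JudgedTrace("bad", 0.0, 0.0),
        JudgedTrace("close", 0.8, 0.6),
    ]
    for t in find_disagreements(sample):
        print(t.trace_id, t.correctness, t.hallucination)
    print(f"disagreement rate: {disagreement_rate(sample):.2f}")
```

Flagged traces are the ones worth reading by hand, since a split verdict usually points at a rubric gap rather than a model failure.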

Key Insights

Infrastructure stabilization reduced application error rates from 32% to 0.0% and token ratios from 97:1 to 1.8:1 (Molina Soler, 2026).
The Prompt-Echo failure mode occurs when models repeat input verbatim; this satisfied Correctness rubrics while failing Hallucination rubrics in 17% of cases.
Pearson correlation between Correctness and Hallucination metrics remained near zero (r = -0.027) across three independent audit windows.
Langfuse and OpenRouter were used to track 400 traces, showing that Toxicity judges provided constant zero signals on agent workloads.
The Correctness leaderboard saturated at 1.000 for nine different models, hiding performance gaps between 1.2B and 120B parameter models.
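Two of the checks behind these insights, the Pearson correlation between judges and the constant-signal test for the Toxicity judge, can be sketched in a few lines of Python. This is an illustrative implementation under assumptions: the function names are hypothetical, scores are assumed to arrive as equal-length lists of floats, and returning `None` for a zero-variance input is a design choice made here, not something the source specifies.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length score lists.

    Returns None when either list has zero variance, because the
    coefficient is undefined for a constant signal.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    sxx = sum(d * d for d in dx)
    syy = sum(d * d for d in dy)
    if sxx == 0 or syy == 0:
        return None
    sxy = sum(a * b for a, b in zip(dx, dy))
    return sxy / math.sqrt(sxx * syy)


def is_constant(scores):
    """True when a judge emits the same value for every trace."""
    return len(set(scores)) <= 1


if __name__ == "__main__":
    correctness = [1.0, 1.0, 0.0, 1.0]
    hallucination = [0.0, 1.0, 1.0, 1.0]
    toxicity = [0.0, 0.0, 0.0, 0.0]
    print("r =", pearson(correctness, hallucination))
    print("toxicity constant:", is_constant(toxicity))
```

A judge for which `is_constant` returns true carries no discriminative information on that workload, and a correlation near zero between two judges suggests they are measuring unrelated things rather than two views of the same quality.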

Working Examples

Distribution of traces during the stabilization phase.

trace.name distribution (today, 400 traces):
  OpenRouter Request: 100
  Execute evaluator: Correctness: 100
  Execute evaluator: Hallucination: 100
  Execute evaluator: Toxicity: 100

Practical Applications

Use case: Implement deterministic echo detection using Levenshtein distance (threshold 0.85) to catch input copies. Pitfall: Relying solely on LLM judges to detect substantive responses, leading to artificial 1.0 scores.
Use case: Replace Toxicity judges with Format Compliance or Refusal Detection for agent-instruction workloads. Pitfall: Wasting token budget on Gemini-2.5-flash for constant signals with no discriminative value.
Use case: Add an anti-echo clause to evaluation rubrics to prevent models from scoring high on verbatim copies. Pitfall: Maintaining invalid model slugs like gemma-4-26b-a4b-it:free in the routing pool due to inertia.
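The deterministic echo check from the first use case could look like the sketch below. The source gives only "Levenshtein distance (threshold 0.85)"; this example assumes the threshold applies to a normalized similarity, defined as 1 minus the edit distance divided by the longer string's length, and that inputs are compared after trimming and lowercasing. The function names are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b using a two-row dynamic program."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Normalized similarity in [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def is_echo(prompt: str, output: str, threshold: float = 0.85) -> bool:
    """Flag outputs that are near-verbatim copies of the prompt.

    The 0.85 default follows the threshold named in the text; treating
    it as a similarity cutoff is an assumption of this sketch.
    """
    return similarity(prompt.strip().lower(), output.strip().lower()) >= threshold


if __name__ == "__main__":
    prompt = "Summarize the report."
    print(is_echo(prompt, "Summarize the report."))
    print(is_echo(prompt, "The report covers Q3 revenue growth."))
```

Running this check before any LLM judge means an echoed output can be scored or discarded deterministically, without spending judge tokens on it. The two-row program costs time proportional to the product of the two lengths, so very long prompts may need truncation or a faster library.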

References:

https://dev.to/jmolinasoler/three-llm-observability-audits-in-five-days-each-fix-exposed-the-next-bug-1of6
