Debugging LLM-as-a-Judge: Why 42% of Hallucinations are Actually Pipeline Failures

Your LLM-as-a-Judge Sees 86% Hallucinations. 42% Are Your Pipeline.

Julio Molina Soler audited a self-hosted Langfuse instance using a custom LLM-as-a-judge evaluator. The initial data showed an 86% hallucination rate, but 26 of the flagged cases sat on observations where the model never actually produced a response.

Why This Matters

It is easy to mistake infrastructure noise for model unreliability, because an LLM-as-a-judge evaluator is structurally blind to the HTTP layer. When a gateway rejects a request and the SDK logs the request envelope as the output, the judge sees no real response and scores the artifact as a failure to follow instructions. The result is contaminated quality metrics that can inflate hallucination rates by over 20 points.
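One way to keep transport failures away from the judge is to gate on observation metadata before scoring. The sketch below assumes observations arrive as dicts with `level` and `output` fields, as in the Langfuse public API responses used later in this article. `should_judge` is a hypothetical helper name, and the "dict without a completion" rule is a heuristic borrowed from the audit script, not a general Langfuse guarantee.

```python
from typing import Any, Mapping


def should_judge(observation: Mapping[str, Any]) -> bool:
    """Return True only when the observation appears to hold a real model response."""
    # The SDK or gateway already marked this call as failed.
    if observation.get("level") == "ERROR":
        return False
    output = observation.get("output")
    # No output, or whitespace only: there is nothing for the judge to score.
    if output is None or (isinstance(output, str) and not output.strip()):
        return False
    # Assumption: a dict output with no completion is a logged request envelope.
    if isinstance(output, dict) and output.get("completion") is None:
        return False
    return True


# Illustrative observations only, not data from the audit.
observations = [
    {"id": "a", "level": "DEFAULT", "output": "Paris is the capital of France."},
    {"id": "b", "level": "ERROR", "output": {"messages": []}},
    {"id": "c", "level": "DEFAULT", "output": {"completion": None}},
]
judgeable = [o["id"] for o in observations if should_judge(o)]
print(judgeable)
```

Skipped observations are better counted separately as pipeline errors than silently dropped, so the infrastructure failure rate stays visible next to the quality metrics.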

Key Insights

Infrastructure Blindness: LLM judges score artifacts without seeing the transport layer; in this audit, 26 out of 72 flagged scores occurred on ‘level=ERROR’ observations where the model never ran.
Pearson Correlation Divergence: Across 72 traces, Hallucination and Correctness scores showed a near-zero correlation (r=0.018), which suggests the two metrics capture different failure modes rather than one shared notion of quality.
Prompt Echoing Failures: Models in the 3B–30B range, such as llama-3.2-3b-instruct, frequently return input verbatim instead of executing structured tasks.
Tool Binding Confabulation: Agents fabricate REST shapes when tool schemas are missing, a behavior correctly caught by Gemini-2.5-Flash judges.
Instruction Skipping: Long system prompts for multi-step procedures often result in partial execution when processed by smaller free-tier model fleets.
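The correlation check mentioned above can be reproduced with a few lines of arithmetic. This is a minimal sketch, assuming the two score lists are already paired per trace (same order, same length). `pearson_r` is a hypothetical helper and the sample values are illustrative, not the audit's data.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equally long numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Illustrative paired scores: the two metrics disagree as often as they agree.
hallucination = [1.0, 0.0, 1.0, 0.0]
correctness = [1.0, 1.0, 0.0, 0.0]
print(f"r = {pearson_r(hallucination, correctness):.3f}")
```

A value near zero means that knowing one score says almost nothing about the other, which is why tracking only one of the two metrics can hide regressions in the other.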

Working Examples

Script to reproduce the hallucination analysis by filtering out pipeline failures from Langfuse scores.

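import os
from concurrent.futures import ThreadPoolExecutor

import httpx
import pandas as pd

# Langfuse base URL and basic-auth credentials are read from the environment.
BASE = os.environ["LANGFUSE_BASE_URL"].rstrip("/")
AUTH = (os.environ["LANGFUSE_PUBLIC_KEY"], os.environ["LANGFUSE_SECRET_KEY"])

def paginate(client, path, params=None):
    """Yield every item from a paginated list endpoint."""
    params = dict(params or {})
    params.setdefault("limit", 100)
    page = 1
    while True:
        params["page"] = page
        resp = client.get(f"{BASE}{path}", params=params)
        resp.raise_for_status()  # fail loudly instead of parsing an error body
        j = resp.json()
        yield from j.get("data", [])
        if page >= j.get("meta", {}).get("totalPages", 1):
            break
        page += 1

def fetch_obs(obs_id):
    """Fetch one observation; return None if it cannot be retrieved."""
    # Each worker thread opens its own short-lived client.
    with httpx.Client(auth=AUTH, timeout=30) as client:
        r = client.get(f"{BASE}/api/public/observations/{obs_id}")
        return r.json() if r.status_code == 200 else None

with httpx.Client(auth=AUTH, timeout=60) as c:
    scores = list(paginate(c, "/api/public/scores"))

# Keep only Hallucination scores that are attached to an observation.
H = [s for s in scores if s["name"] == "Hallucination" and s.get("observationId")]

# Fetch each distinct observation once, in parallel.
obs_ids = sorted({s["observationId"] for s in H})
with ThreadPoolExecutor(max_workers=8) as ex:
    obs_by_id = dict(zip(obs_ids, ex.map(fetch_obs, obs_ids)))

rows = []
for s in H:
    o = obs_by_id.get(s["observationId"])
    if not o:
        continue
    rows.append({
        "score": s["value"],
        "model": o.get("model"),
        "level": o.get("level"),
        # A dict output with no completion means the request envelope
        # was logged as output, i.e. the model never answered.
        "is_pipeline_failure": (
            isinstance(o.get("output"), dict) and
            o["output"].get("completion") is None
        ),
    })

if not rows:
    raise SystemExit("No Hallucination scores with a retrievable observation.")

df = pd.DataFrame(rows)
genuine = df[~df["is_pipeline_failure"]]
print(f"Raw mean: {df['score'].mean():.3f}")
print(f"Filtered: {genuine['score'].mean():.3f}")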
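To see where the flagged scores concentrate, the collected rows can also be split by observation level. The sketch below is one way to do that with the standard library only. It assumes rows shaped like the `rows` list in the script above; `mean_score_by_level` is a hypothetical helper and the sample rows are illustrative, not the audit's data.

```python
from collections import defaultdict


def mean_score_by_level(rows):
    """Group score rows by observation level and return {level: (count, mean)}."""
    buckets = defaultdict(list)
    for row in rows:
        # Observations without a level are grouped under a placeholder key.
        buckets[row.get("level") or "UNKNOWN"].append(float(row["score"]))
    return {level: (len(vals), sum(vals) / len(vals)) for level, vals in buckets.items()}


# Illustrative rows only.
rows = [
    {"score": 1.0, "level": "ERROR"},
    {"score": 1.0, "level": "ERROR"},
    {"score": 1.0, "level": "DEFAULT"},
    {"score": 0.0, "level": "DEFAULT"},
]
summary = mean_score_by_level(rows)
for level, (count, mean) in sorted(summary.items()):
    print(f"{level}: n={count}, mean={mean:.3f}")
```

If most of the high scores sit in the ERROR bucket, the headline rate is describing the pipeline more than the model.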

Practical Applications

Use Case: Routing structured-summary tasks to 70B+ models while using smaller models like nemotron-nano-9b-v2 for simple classification to avoid ‘Prompt Echo’. Pitfall: Using sub-30B models for multi-step procedural instructions results in ‘instruction skipping’.
Use Case: Implementing a ‘plan_then_execute’ wrapper to force models to enumerate steps before execution. Pitfall: Relying on a single judge metric like Hallucination can hide regressions in Correctness.
Use Case: Updating tool runners to never return ‘success: true’ on non-zero exit codes. Pitfall: Permissive runners cause models to interpret malformed commands as successful, leading to misinterpreted tool outputs.
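The tool-runner rule above can be sketched as a thin wrapper around `subprocess.run`. This is a minimal illustration, not the author's runner: `run_tool` and the shape of the returned dict are assumptions, chosen so that `success` is derived from the exit code and never set independently.

```python
import subprocess
import sys


def run_tool(argv, timeout=30):
    """Run a command and report success only when it exits with code 0."""
    try:
        proc = subprocess.run(argv, capture_output=True, text=True, timeout=timeout)
    except (OSError, subprocess.TimeoutExpired) as exc:
        # The command could not start or did not finish: never a success.
        return {"success": False, "exit_code": None, "stdout": "", "stderr": str(exc)}
    return {
        "success": proc.returncode == 0,
        "exit_code": proc.returncode,
        "stdout": proc.stdout,
        "stderr": proc.stderr,
    }


# Usage: one command that succeeds and one that exits with a non-zero code.
ok = run_tool([sys.executable, "-c", "print('done')"])
bad = run_tool([sys.executable, "-c", "import sys; sys.exit(3)"])
print(ok["success"], bad["success"], bad["exit_code"])
```

Returning the exit code and stderr alongside the flag gives the model something concrete to react to when a command is malformed, instead of an unconditional success message.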

References:

https://dev.to/jmolinasoler/your-llm-as-a-judge-sees-86-hallucinations-42-are-your-pipeline-16ja
