Why LLM Agents Fail Silently and How to Debug Them: Token Budgets, Schema Drift, and Swallowed Exceptions

Mudassir Khan identifies three root causes of silent failures in LLM agents: token budget exhaustion, tool schema drift, and unhandled exceptions. A single bad tool call in a 10-step chain can silently poison every step that follows.

Why This Matters

LLMs are designed to always return something: they won’t throw a ValueError when they run out of context or when a tool schema changes. The result is an agent that completes without an exception yet produces wrong or incomplete output, so it appears to work until you inspect the results closely. That gap between expected reliability and actual silent failure makes these bugs among the nastiest in AI engineering, especially in multistep chains, where one bad tool call poisons every step that follows.

Key Insights

Token budget exhaustion: When max_tokens is hit mid-tool-call, OpenAI returns an empty choices array with status 200, silently breaking the loop unless finish_reason is checked (source: Khan, 2026).
Tool schema drift: Renamed fields or removed parameters cause the LLM to generate invalid arguments, which LangGraph’s StateGraph swallows and replaces with None instead of raising an error (source: Khan, 2026).
Distributed tracing with OpenTelemetry: Per-step spans surface failures immediately by logging finish_reason, completion_tokens, and input count, making failures queryable in Honeycomb or Jaeger (source: Khan, 2026).
Pydantic validation as a firewall: Placing a schema validation step after every tool call converts silent None propagation into loud ValidationError exceptions, catching mismatches at the boundary (source: Khan, 2026).
Agent watchdog dead man’s switch: A threaded heartbeat mechanism raises RuntimeError if the loop goes silent for longer than a configurable timeout (e.g., 90 seconds), serving as a last-resort alert (source: Khan, 2026).
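
The schema drift insight can be made concrete with a small pre-dispatch check. The sketch below compares model-generated tool arguments against the tool's declared parameters and fails loudly on a mismatch. It uses only the standard library. The names check_tool_arguments and GET_USER_TOOL, the get_user tool, and its fields are hypothetical and chosen for illustration; they do not come from Khan's article or from any framework API. The tool definition only loosely follows the shape of an OpenAI-style function schema.

```python
import json

# Hypothetical tool definition after a rename: the field used to be "id".
GET_USER_TOOL = {
    "name": "get_user",
    "parameters": {
        "type": "object",
        "properties": {"user_id": {"type": "string"}},
        "required": ["user_id"],
    },
}


def check_tool_arguments(tool: dict, raw_arguments: str) -> dict:
    """Parse model-generated JSON arguments and raise on suspected drift."""
    try:
        args = json.loads(raw_arguments)
    except json.JSONDecodeError as e:
        # Truncated output (for example after a token budget hit) lands here.
        raise ValueError(f"{tool['name']}: arguments are not valid JSON: {e}") from e
    if not isinstance(args, dict):
        raise ValueError(f"{tool['name']}: arguments must be a JSON object")

    params = tool["parameters"]
    missing = [k for k in params.get("required", []) if k not in args]
    unknown = [k for k in args if k not in params.get("properties", {})]
    if missing or unknown:
        raise ValueError(
            f"{tool['name']}: schema drift suspected, "
            f"missing={missing}, unknown={unknown}"
        )
    return args
```

Calling check_tool_arguments(GET_USER_TOOL, '{"id": "u1"}') raises a ValueError that names both the missing user_id field and the stale id field, so the mismatch surfaces before the tool runs rather than as a None several steps later.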

Working Examples

Python implementation of distributed tracing for LLM agent steps using OpenTelemetry, logging token budget hits and empty responses as errors.

from openai import OpenAI
from opentelemetry import trace

client = OpenAI()  # reads OPENAI_API_KEY from the environment
tracer = trace.get_tracer("agent.loop")

def run_agent_step(step_name: str, messages: list, tools: list):
    with tracer.start_as_current_span(step_name) as span:
        span.set_attribute("step.input_message_count", len(messages))
        span.set_attribute("step.tool_count", len(tools))

        response = client.chat.completions.create(
            model="gpt-4o",
            messages=messages,
            tools=tools,
        )

        finish_reason = response.choices[0].finish_reason if response.choices else "empty"
        span.set_attribute("step.finish_reason", finish_reason)
        span.set_attribute("step.completion_tokens", response.usage.completion_tokens)

        if finish_reason == "length" or not response.choices:
            span.set_status(trace.StatusCode.ERROR, "token budget hit or empty response")
            raise RuntimeError(f"Step {step_name} hit token budget before completing")

        return response
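
The finish_reason check in the step above could be pulled into a small helper so that every step classifies its outcome the same way. This is a sketch under assumptions: triage_finish_reason is a hypothetical name and the returned messages are illustrative. The values "stop", "tool_calls", "length", and "content_filter" are those documented for the OpenAI Chat Completions API; "empty" is the placeholder the tracing example uses when no choices come back. Other providers may report different values.

```python
from typing import Optional, Tuple


def triage_finish_reason(finish_reason: Optional[str]) -> Tuple[bool, str]:
    """Return (ok, detail) for a single agent step."""
    if finish_reason is None or finish_reason == "empty":
        return False, "no choices returned"
    if finish_reason == "stop":
        return True, "model finished normally"
    if finish_reason == "tool_calls":
        return True, "model requested a tool call"
    if finish_reason == "length":
        return False, "token budget hit before the output was complete"
    if finish_reason == "content_filter":
        return False, "output withheld by a content filter"
    # Unrecognised values are treated as failures so they are never silent.
    return False, f"unrecognised finish_reason: {finish_reason!r}"


ok, detail = triage_finish_reason("length")
if not ok:
    print(f"step failed: {detail}")
```

Treating unknown values as failures is a deliberate choice here: a new or provider-specific finish_reason then shows up as a visible error instead of passing through unnoticed.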

Pydantic validation step after tool calls to catch schema drift and mismatches as loud exceptions instead of silent None propagation.

from pydantic import BaseModel, ValidationError

class UserProfile(BaseModel):
    user_id: str
    email: str
    role: str  # "admin" | "viewer" | "editor"

def validate_tool_output(raw: dict) -> UserProfile:
    try:
        return UserProfile(**raw)
    except ValidationError as e:
        # Loud failure here is intentional — better than a silent one later
        raise RuntimeError(f"Tool output failed schema validation: {e}") from e
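
As an illustration of what the validation step catches, the sketch below feeds a drifted payload through the same pattern. It assumes Pydantic is installed. StrictUserProfile, validate_strict, and the payload are hypothetical, and the Literal type on role is one possible way to enforce the allowed values that the original model lists only in a comment.

```python
from typing import Literal

from pydantic import BaseModel, ValidationError


class StrictUserProfile(BaseModel):
    user_id: str
    email: str
    role: Literal["admin", "viewer", "editor"]  # enforced, not just documented


def validate_strict(raw: dict) -> StrictUserProfile:
    try:
        return StrictUserProfile(**raw)
    except ValidationError as e:
        raise RuntimeError(f"Tool output failed schema validation: {e}") from e


# Hypothetical drifted payload: the upstream API renamed "email".
drifted = {"user_id": "u-123", "email_address": "a@example.com", "role": "admin"}

failed_fields = []
try:
    validate_strict(drifted)
except RuntimeError as err:
    # The original ValidationError is preserved as the exception's cause.
    failed_fields = [e["loc"][0] for e in err.__cause__.errors()]

print(failed_fields)
```

The renamed field is ignored as an unknown extra by default, so the error reports the missing email field rather than the new name. A role outside the three allowed values fails in the same way.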

Dead man’s switch implementation to detect silent loops in long-running agents by requiring periodic heartbeats.

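import threading
import time

class AgentWatchdog:
    def __init__(self, timeout_seconds: int = 60, on_timeout=None):
        self.timeout = timeout_seconds
        self.last_heartbeat = time.monotonic()
        self.timed_out = threading.Event()
        self._on_timeout = on_timeout  # optional alert callback (e.g. page or log)
        self._stop = threading.Event()

    def heartbeat(self):
        """Call this after every successful agent step."""
        if self.timed_out.is_set():
            raise RuntimeError("Agent watchdog timeout: loop went silent")
        self.last_heartbeat = time.monotonic()

    def start(self):
        def _watch():
            # An exception raised in this thread would never reach the agent
            # loop, so record the timeout and alert through the callback.
            while not self._stop.wait(5):
                if time.monotonic() - self.last_heartbeat > self.timeout:
                    self.timed_out.set()
                    if self._on_timeout is not None:
                        self._on_timeout()
                    return
        threading.Thread(target=_watch, daemon=True).start()

    def stop(self):
        self._stop.set()

# Usage (assumes agent_steps yields (step_name, messages, tools) tuples)
watchdog = AgentWatchdog(
    timeout_seconds=90,
    on_timeout=lambda: print("ALERT: agent loop went silent"),
)
watchdog.start()
try:
    for step_name, messages, tools in agent_steps:
        result = run_agent_step(step_name, messages, tools)
        watchdog.heartbeat()  # prove we're alive after each step
finally:
    watchdog.stop()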
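
One caveat applies to any threaded watchdog: in Python, an exception raised inside a background thread does not propagate to the thread running the agent loop. The standard-library sketch below demonstrates this and shows a shared threading.Event as one way to hand the timeout back to the main thread. The function and variable names are hypothetical, and assigning threading.excepthook (Python 3.8+) replaces the hook for the whole process, which is done here only to record what happens.

```python
import threading

seen_by_hook = []


def _record(args):
    # Called for exceptions that escape a thread's target function.
    seen_by_hook.append(args.exc_type)


threading.excepthook = _record  # process-wide; for demonstration only


def _watch_and_raise():
    raise RuntimeError("watchdog timeout")


reached_main = False
try:
    t = threading.Thread(target=_watch_and_raise, daemon=True)
    t.start()
    t.join()
except RuntimeError:
    reached_main = True  # never executed: the error stays in the thread

# Alternative: signal the timeout through state the main thread can read.
timed_out = threading.Event()


def _watch_and_flag():
    timed_out.set()


t2 = threading.Thread(target=_watch_and_flag, daemon=True)
t2.start()
t2.join()

print(reached_main, seen_by_hook, timed_out.is_set())
```

In practice the watchdog thread would also call an alerting hook directly, because a loop that is truly hung never gets the chance to check the flag.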

Practical Applications

Distributed tracing for agent observability: per-step OpenTelemetry spans remove the need to reconstruct failures from scattered logs; pitfall: relying only on console logging misses empty choices arrays returned with status 200.
Pydantic validation for external API tool calls: catches schema drift at the boundary before stale data contaminates the LLM’s next prompt; pitfall: assuming tool output always matches the expected schema leads to silent None propagation through downstream steps.
Dead man’s switch for long-running loops: detects loops that go quiet for minutes or hours by requiring a heartbeat after each step; pitfall: omitting liveness checks lets a hung agent appear to be running, with nothing to raise an alert.
finish_reason logging as immediate triage: checking this field first reveals whether token budget (

References:

https://dev.to/mudassirworks/why-llm-agents-fail-silently-and-how-to-debug-them-251l
