Why LLM Agents Fail Silently and How to Debug Them: Token Budgets, Schema Drift, and Swallowed Exceptions
These articles are AI-generated summaries. Please check the original sources for full details.
Mudassir Khan identifies three root causes of silent failures in LLM agents: token budget exhaustion, tool schema drift, and unhandled exceptions. A single bad tool call in a 10-step chain can silently poison every step that follows.
Why This Matters
LLMs are designed to always return something: they won't throw a ValueError when they run out of context or when a tool schema changes. The result is an agent that completes without an exception yet produces wrong or incomplete output, so it appears to work until someone inspects the results. That gap between expected and actual behavior makes these bugs among the nastiest in AI engineering, especially in multistep chains where one bad tool call poisons every step that follows.
Key Insights
- Token budget exhaustion: When max_tokens is hit mid-tool-call, OpenAI returns an empty choices array with status 200, so the loop breaks silently unless the code checks finish_reason and guards against an empty choices array (source: Khan, 2026).
- Tool schema drift: Renamed fields or removed parameters cause the LLM to generate invalid arguments, which LangGraph’s StateGraph swallows and replaces with None instead of raising an error (source: Khan, 2026).
- Distributed tracing with OpenTelemetry: Per-step spans surface failures immediately by recording finish_reason, completion_tokens, and the input message count as span attributes, making failures queryable in Honeycomb or Jaeger (source: Khan, 2026).
- Pydantic validation as a firewall: Placing a schema validation step after every tool call converts silent None propagation into loud ValidationError exceptions, catching mismatches at the boundary (source: Khan, 2026).
- Agent watchdog dead man’s switch: A threaded heartbeat mechanism raises RuntimeError if the loop goes silent for longer than a configurable timeout (e.g., 90 seconds), serving as a last-resort alert (source: Khan, 2026).
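The finish_reason check from the first insight can be sketched as a small triage function. This is a minimal sketch, not the author's code: the name classify_step, the failure labels, and the plain-dict response shape are assumptions made for illustration, and the set of "healthy" finish reasons should be adjusted to the API actually in use.

```python
from typing import Any, Optional

# Finish reasons treated as a normal end of generation (an assumption for this sketch).
HEALTHY_FINISH_REASONS = ("stop", "tool_calls", "function_call")


def classify_step(response: dict[str, Any]) -> Optional[str]:
    """Return a failure label for a chat-completion-style dict, or None if it looks healthy."""
    choices = response.get("choices") or []
    if not choices:
        # Nothing to read finish_reason from: treat as a failure, not as "done".
        return "empty_choices"
    finish_reason = choices[0].get("finish_reason")
    if finish_reason == "length":
        return "token_budget_exhausted"
    if finish_reason not in HEALTHY_FINISH_REASONS:
        return f"unexpected_finish_reason:{finish_reason}"
    return None


# Hypothetical responses, shaped like the JSON body of a chat completion.
print(classify_step({"choices": []}))
print(classify_step({"choices": [{"finish_reason": "length"}]}))
print(classify_step({"choices": [{"finish_reason": "tool_calls"}]}))
```

Returning a label instead of raising keeps the function easy to log or attach to a span; the caller decides whether a given label should abort the chain.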
Working Examples
Python implementation of distributed tracing for LLM agent steps using OpenTelemetry, logging token budget hits and empty responses as errors.
from openai import OpenAI
from opentelemetry import trace

client = OpenAI()  # assumed client setup; reads OPENAI_API_KEY from the environment
tracer = trace.get_tracer("agent.loop")

def run_agent_step(step_name: str, messages: list, tools: list):
    with tracer.start_as_current_span(step_name) as span:
        span.set_attribute("step.input_message_count", len(messages))
        span.set_attribute("step.tool_count", len(tools))
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=messages,
            tools=tools,
        )
        # Guard against an empty choices array before reading finish_reason
        finish_reason = response.choices[0].finish_reason if response.choices else "empty"
        span.set_attribute("step.finish_reason", finish_reason)
        # usage may be missing on some responses, so check before reading it
        if response.usage is not None:
            span.set_attribute("step.completion_tokens", response.usage.completion_tokens)
        if finish_reason == "length" or not response.choices:
            span.set_status(trace.Status(trace.StatusCode.ERROR, "token budget hit or empty response"))
            raise RuntimeError(f"Step {step_name} hit token budget or returned no choices")
        return response
Pydantic validation step after tool calls to catch schema drift and mismatches as loud exceptions instead of silent None propagation.
from pydantic import BaseModel, ValidationError

class UserProfile(BaseModel):
    user_id: str
    email: str
    role: str  # "admin" | "viewer" | "editor"

def validate_tool_output(raw: dict) -> UserProfile:
    try:
        return UserProfile(**raw)
    except ValidationError as e:
        # Loud failure here is intentional: better than a silent one later
        raise RuntimeError(f"Tool output failed schema validation: {e}") from e
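A short sketch of what schema drift looks like at this boundary may help. The names StrictUserProfile and failed_fields and both sample payloads are hypothetical; the stricter Literal type for role is an assumption that goes one step beyond the model above, which accepts any string.

```python
from typing import Literal

from pydantic import BaseModel, ValidationError


class StrictUserProfile(BaseModel):
    user_id: str
    email: str
    # Literal rejects roles outside the known set instead of passing them downstream.
    role: Literal["admin", "viewer", "editor"]


def failed_fields(raw: dict) -> list[str]:
    """Return the names of fields that fail validation (empty list if the payload is valid)."""
    try:
        StrictUserProfile(**raw)
    except ValidationError as e:
        return sorted({str(err["loc"][0]) for err in e.errors()})
    return []


# Payload matching the expected schema.
good = {"user_id": "u1", "email": "a@example.com", "role": "admin"}
# Drifted payload: "email" was renamed upstream and a new role value appeared.
drifted = {"user_id": "u1", "email_address": "a@example.com", "role": "owner"}

print(failed_fields(good))
print(failed_fields(drifted))
```

Reporting which fields failed, not only that validation failed, makes it quicker to tell a renamed field from an unexpected value when reading logs.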
Dead man’s switch implementation to detect silent loops in long-running agents by requiring periodic heartbeats.
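import logging
import threading
import time

class AgentWatchdog:
    def __init__(self, timeout_seconds: int = 60, on_timeout=None):
        self.timeout = timeout_seconds
        self.last_heartbeat = time.monotonic()  # monotonic clock is immune to wall-clock changes
        self.timed_out = threading.Event()
        self._stop = threading.Event()
        # An exception raised inside the watcher thread never reaches the agent loop,
        # so the timeout is reported through a callback (default: a critical log line).
        self._on_timeout = on_timeout or (
            lambda: logging.critical("Agent watchdog timeout: loop went silent")
        )

    def heartbeat(self):
        """Call this after every successful agent step."""
        if self.timed_out.is_set():
            raise RuntimeError("Agent watchdog timeout: loop went silent")
        self.last_heartbeat = time.monotonic()

    def start(self):
        def _watch():
            # wait() is the 5-second poll interval and returns early once stop() is called
            while not self._stop.wait(5):
                if time.monotonic() - self.last_heartbeat > self.timeout:
                    self.timed_out.set()
                    self._on_timeout()
                    return
        threading.Thread(target=_watch, daemon=True).start()

    def stop(self):
        self._stop.set()

# Usage (assumes agent_steps yields (step_name, messages, tools) tuples)
watchdog = AgentWatchdog(timeout_seconds=90)
watchdog.start()
for step_name, messages, tools in agent_steps:
    result = run_agent_step(step_name, messages, tools)
    watchdog.heartbeat()  # prove we're alive after each step
watchdog.stop()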
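One caveat applies to any thread-based watchdog in Python: an exception raised inside a background thread is not re-raised in the thread running the agent loop. The sketch below, with hypothetical names throughout, shows the exception being routed to threading.excepthook while the main thread carries on unaware.

```python
import threading

captured = []


def _record(args):
    # threading.excepthook receives exceptions that escape a thread's target function.
    captured.append(args.exc_type.__name__)


original_hook = threading.excepthook
threading.excepthook = _record


def _watch():
    # Stand-in for a watchdog thread that detects a timeout and raises.
    raise RuntimeError("watchdog timeout")


reached_main = False
try:
    watcher = threading.Thread(target=_watch)
    watcher.start()
    watcher.join()
except RuntimeError:
    reached_main = True  # never executed: the exception stays in the watcher thread
finally:
    threading.excepthook = original_hook

print(reached_main, captured)
```

For the alert to be actionable, the watcher therefore has to signal through something the rest of the system observes, such as a shared threading.Event, a callback that pages someone, or a process-level exit.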
Practical Applications
- Distributed tracing for agent observability: one OpenTelemetry span per step removes the need to reconstruct failures from scattered logs. Pitfall: relying on console logging alone misses empty choices arrays returned with status 200.
- Pydantic validation for external API tool calls: catches schema drift at the boundary before stale data contaminates the LLM's next prompt. Pitfall: assuming tool output always matches the expected schema leads to silent None propagation through downstream steps.
- Dead man's switch for long-running loops: detects loops that go quiet for minutes or hours by requiring a heartbeat after each step. Pitfall: without liveness checks, a hung agent appears to be running and nothing raises an alert.
- finish_reason logging as immediate triage: checking this field first reveals whether token budget (