Google AI Releases Auto-Diagnose: LLM-Based System for Automated Integration Test Debugging
These articles are AI-generated summaries. Please check the original sources for full details.
Google AI Releases Auto-Diagnose: A Large Language Model (LLM)-Based System to Diagnose Integration Test Failures at Scale
Google researchers have introduced Auto-Diagnose, an LLM-powered system designed to automate the diagnosis of complex integration test failures. In a manual evaluation of 71 real-world failures across 39 teams, the tool identified the correct root cause in 90.14% of cases (64 of 71).
Why This Matters
Integration tests represent a significant debugging tax because failures often surface as generic symptoms like timeouts, while the actual error is buried deep within disparate component logs. At Google, a survey of 116 developers revealed that 38.4% of these failures take more than an hour to diagnose, and 8.9% take over a day, whereas unit tests rarely exceed an hour for diagnosis. Auto-Diagnose addresses this by aggregating logs across data centers and processes into a single timestamped stream for LLM analysis.
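The aggregation step can be pictured as a k-way merge of per-component logs by timestamp. The following is a minimal sketch of that idea, not Auto-Diagnose's actual implementation: the `LogLine` type, the source labels, and the sample messages are all hypothetical, and each input stream is assumed to be sorted by time already.

```python
import heapq
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class LogLine:
    timestamp: datetime  # when the line was emitted (UTC)
    source: str          # hypothetical label, e.g. "test_driver" or "backend@dc-a"
    message: str


def merge_streams(*streams):
    """Merge per-component logs into a single timestamp-ordered list.

    heapq.merge performs a lazy k-way merge and requires that every
    input stream is already sorted by the same key.
    """
    return list(heapq.merge(*streams, key=lambda line: line.timestamp))


def ts(second):
    # Helper used only to build the example data.
    return datetime(2025, 1, 1, 12, 0, second, tzinfo=timezone.utc)


# Hypothetical logs: the driver only sees a timeout, while the real
# error sits in a backend log from a different process.
driver = [
    LogLine(ts(0), "test_driver", "starting test"),
    LogLine(ts(30), "test_driver", "FAILED: deadline exceeded"),
]
backend = [
    LogLine(ts(5), "backend@dc-a", "config load failed: missing field"),
    LogLine(ts(6), "backend@dc-a", "serving with empty config"),
]

merged = merge_streams(driver, backend)
for line in merged:
    print(line.timestamp.isoformat(), f"[{line.source}]", line.message)
```

In the merged stream the backend's configuration error appears between the driver's start and its timeout, which is the ordering a reader (or a model) needs to connect the symptom to its cause.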
Key Insights
- Auto-Diagnose achieved a 90.14% root-cause accuracy rate using Gemini 2.5 Flash without any fine-tuning, relying instead on sophisticated prompt engineering and a temperature of 0.1 for near-deterministic results.
- The system operates with a p50 latency of 56 seconds, fast enough that findings reach authors before they lose context on their code changes; 22,962 distinct developers have received them.
- A survey of 6,059 developers at Google (EngSat) identified integration test failures as one of the top five productivity complaints across the organization.
- The system uses hard negative constraints in its prompts, forcing the model to report ‘more information is needed’ rather than hallucinating when logs are incomplete.
- Out of 517 feedback reports, 84.3% were ‘Please fix’ requests from reviewers, and Auto-Diagnose ranks #14 in helpfulness out of 370 internal tools at Google.
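The prompt-level safeguards described above (a low temperature plus a hard negative constraint) might look roughly like the sketch below. This is an illustration under stated assumptions, not the prompt Google uses: the template wording, the `GENERATION_CONFIG` layout, and the helper names are hypothetical, and no model is actually called.

```python
# All names, wording, and the config layout here are hypothetical.
NEED_MORE_INFO = "more information is needed"

PROMPT_TEMPLATE = """You are diagnosing an integration test failure.
Use only the log lines provided below.

Hard constraints:
- Do NOT guess. Every claim must cite a log line that appears below.
- If the logs do not contain enough evidence to name a root cause,
  reply with exactly: {sentinel}

Logs:
{logs}
"""

# A low temperature for near-deterministic output, matching the 0.1
# value reported in the article.
GENERATION_CONFIG = {"temperature": 0.1}


def build_prompt(log_lines):
    """Fill the template with the refusal sentinel and the joined logs."""
    return PROMPT_TEMPLATE.format(sentinel=NEED_MORE_INFO, logs="\n".join(log_lines))


def is_refusal(model_reply):
    """Return True when the reply contains the refusal sentinel.

    The match is case-insensitive and tolerant of surrounding text.
    """
    return NEED_MORE_INFO in model_reply.strip().lower()


prompt = build_prompt(["12:00:05 [backend] config load failed: missing field"])
print(prompt)
```

Checking for the sentinel in code lets a caller treat a refusal as a signal that logs are missing, rather than posting it as if it were a diagnosis.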
Practical Applications
- Use Case: Automated code review comments in Google’s Critique system provide markdown findings with clickable log links, allowing authors to act on root causes immediately.
- Pitfall: Relying on test driver logs alone often masks the true error; Auto-Diagnose mitigates this by joining SUT component logs at level INFO and above into a unified stream.
- Pitfall: Incomplete infrastructure logging can cause diagnostic failure; Auto-Diagnose’s refusal to guess has helped surface real infrastructure bugs in logging pipelines.
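To make the first two items above concrete, here is a small sketch that keeps only records at INFO and above and renders a finding as markdown with clickable log links. The record layout, the comment heading, and the `logs.example.com` URLs are assumptions made for illustration; the article does not describe the real Critique comment format.

```python
import logging


def keep_info_and_above(records):
    """Drop DEBUG-level records; keep INFO, WARNING, ERROR, and CRITICAL."""
    return [r for r in records if r["level"] >= logging.INFO]


def render_finding(summary, evidence):
    """Render a markdown comment with one clickable link per evidence line."""
    lines = ["**Auto-Diagnose finding**", "", summary, "", "Evidence:"]
    for r in evidence:
        lines.append(f"- [{r['component']}]({r['url']}): `{r['message']}`")
    return "\n".join(lines)


# Hypothetical records from a system-under-test component and the driver.
records = [
    {"level": logging.DEBUG, "component": "backend",
     "url": "https://logs.example.com/backend#L41", "message": "cache warm"},
    {"level": logging.ERROR, "component": "backend",
     "url": "https://logs.example.com/backend#L42",
     "message": "config load failed: missing field"},
    {"level": logging.INFO, "component": "test_driver",
     "url": "https://logs.example.com/driver#L7",
     "message": "FAILED: deadline exceeded"},
]

kept = keep_info_and_above(records)
comment = render_finding(
    "The backend failed to load its config, so the test timed out.", kept
)
print(comment)
```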