Google AI Releases Auto-Diagnose: LLM-Based System for Automated Integration Test Debugging
These articles are AI-generated summaries. Please check the original sources for full details.
Google AI Releases Auto-Diagnose: A Large Language Model (LLM)-Based System to Diagnose Integration Test Failures at Scale
Google researchers have introduced Auto-Diagnose, an LLM-powered system designed to automate the diagnosis of complex integration test failures. In a manual evaluation of 71 real-world failures across 39 teams, the tool correctly identified the root cause 90.14% of the time.
Why This Matters
Integration tests represent a significant debugging tax because failures often surface as generic symptoms like timeouts, while the actual error is buried deep within disparate component logs. At Google, a survey of 116 developers revealed that 38.4% of these failures take more than an hour to diagnose, and 8.9% take over a day, whereas unit tests rarely exceed an hour for diagnosis. Auto-Diagnose addresses this by aggregating logs across data centers and processes into a single timestamped stream for LLM analysis.
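The aggregation step can be pictured as a merge of per-component logs that are each already in time order. The following is a minimal sketch, not Google's implementation: the `LogLine` type, the `merge_logs` helper, and the sample log lines are all hypothetical and exist only for illustration.

```python
import heapq
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class LogLine:
    """One log record; the field names are assumptions for this sketch."""
    timestamp: datetime
    component: str
    message: str


def merge_logs(*streams):
    """Merge per-component streams (each sorted by time) into one ordered list."""
    return list(heapq.merge(*streams, key=lambda line: line.timestamp))


# Hypothetical sample data: the driver only sees a timeout,
# while the actual error sits in a backend log.
driver = [
    LogLine(datetime(2025, 1, 1, 12, 0, 0), "test_driver", "starting test"),
    LogLine(datetime(2025, 1, 1, 12, 0, 30), "test_driver", "timeout waiting for response"),
]
backend = [
    LogLine(datetime(2025, 1, 1, 12, 0, 5), "backend", "failed to load config"),
]

merged = merge_logs(driver, backend)
for line in merged:
    print(line.timestamp.isoformat(), line.component, line.message)
```

In the merged view the backend error appears between the test start and the timeout, which is the ordering a reader (or a model) needs to connect the symptom to its cause.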
Key Insights
- Auto-Diagnose achieved a 90.14% root-cause accuracy rate using Gemini 2.5 Flash without any fine-tuning, relying instead on sophisticated prompt engineering and a temperature of 0.1 for near-deterministic results.
- The system operates with a p50 latency of 56 seconds, fast enough for findings to arrive before developers lose context on their code changes; 22,962 distinct developers have received its findings.
- A survey of 6,059 developers at Google (EngSat) identified integration test failures as one of the top five productivity complaints across the organization.
- The system uses hard negative constraints in its prompts, forcing the model to report ‘more information is needed’ rather than hallucinating when logs are incomplete.
- Of 517 feedback reports, 84.3% were ‘Please fix’ requests from reviewers, and Auto-Diagnose ranks #14 in helpfulness among 370 internal tools at Google.
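The prompting approach described above (a low temperature plus a hard negative constraint) might look roughly like the sketch below. The prompt wording, the `build_prompt` and `is_refusal` helpers, and the constant names are assumptions made for illustration; the actual Auto-Diagnose prompt is not reproduced in this summary, and no model API is called here.

```python
# Temperature value reported in the article; how it is passed to the model
# depends on the client library and is not shown here.
TEMPERATURE = 0.1

# Phrase the model is told to return when evidence is missing (from the article).
NEED_MORE_INFO = "more information is needed"


def build_prompt(log_stream: str) -> str:
    """Assemble a diagnosis prompt with an explicit refusal instruction (hypothetical wording)."""
    return (
        "You are diagnosing an integration test failure.\n"
        "Identify the root cause using only the logs below.\n"
        "If the logs do not contain enough evidence, reply exactly: "
        f"{NEED_MORE_INFO}\n"
        "Do not guess.\n\n"
        f"LOGS:\n{log_stream}"
    )


def is_refusal(response: str) -> bool:
    """Detect the refusal phrase so incomplete-log cases can be routed separately."""
    return NEED_MORE_INFO in response.lower()
```

Detecting the refusal phrase explicitly lets a pipeline treat "not enough logs" as its own outcome instead of mixing it with diagnoses.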
Practical Applications
- Use Case: Automated code review comments in Google’s Critique system provide markdown findings with clickable log links, allowing authors to act on root causes immediately.
- Pitfall: Relying on test driver logs alone often masks the true error; Auto-Diagnose mitigates this by joining SUT component logs at level INFO and above into a unified stream.
- Pitfall: Incomplete infrastructure logging can cause diagnostic failure; Auto-Diagnose’s refusal to guess has helped surface real infrastructure bugs in logging pipelines.
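The first pitfall can be illustrated with a small sketch that filters system-under-test (SUT) logs to level INFO and above and joins them with the test driver's logs. The tuple layout, helper names, and sample records are hypothetical, and leaving the driver logs unfiltered is an assumption of this sketch rather than a detail confirmed by the source.

```python
import logging

# Map level names to the numeric values defined by Python's logging module.
LEVELS = {
    "DEBUG": logging.DEBUG,
    "INFO": logging.INFO,
    "WARNING": logging.WARNING,
    "ERROR": logging.ERROR,
}


def info_and_above(records):
    """Keep records whose level is INFO or higher."""
    return [r for r in records if LEVELS[r[1]] >= logging.INFO]


def unified_stream(driver_records, sut_records):
    """Join driver logs with filtered SUT logs, ordered by timestamp."""
    return sorted(driver_records + info_and_above(sut_records), key=lambda r: r[0])


# Hypothetical records: (seconds since test start, level, component, message)
driver_records = [
    (0, "INFO", "test_driver", "starting test"),
    (30, "ERROR", "test_driver", "timeout waiting for response"),
]
sut_records = [
    (4, "DEBUG", "backend", "reading config file"),
    (5, "ERROR", "backend", "failed to load config"),
]

stream = unified_stream(driver_records, sut_records)
```

Read alone, the driver records show only a timeout; the unified stream places the backend error ahead of it while dropping the DEBUG noise.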