AutoAgent: Automating AI Agent Optimization and Harness Engineering

These articles are AI-generated summaries. Please check the original sources for full details.

Meet ‘AutoAgent’: The Open-Source Library That Lets an AI Engineer and Optimize Its Own Agent Harness Overnight

Developed by Kevin Gu at thirdlayer.inc, AutoAgent is an open-source library designed to automate the manual iteration of agent system prompts and tools. In a single 24-hour run, the system achieved a #1 ranking on SpreadsheetBench with a score of 96.5%.

Why This Matters

Agent engineering has traditionally relied on a tedious manual loop: engineers read benchmark failure traces, then tweak system prompts and tool definitions by hand. AutoAgent instead treats the whole agent harness, including orchestration and routing logic, as an optimization surface for a meta-agent, which hill-climbs on benchmark scores and, in the reported results, outperforms human-crafted configurations. Automating the diagnosis and remediation of agent failures addresses the scalability limits of manual engineering.

Key Insights

  • By autonomously iterating on agent configurations, AutoAgent scored 55.1% on TerminalBench, reported as the highest score recorded for GPT-5 (2026).
  • The system utilizes a ‘ratchet loop’ inspired by Andrej Karpathy’s autoresearch, applying propose-train-evaluate cycles to agent scaffolding rather than model weights.
  • A ‘meta-agent’ manages a single agent.py file, rewriting tool definitions and routing logic based on performance data recorded in a results.tsv experiment log.
  • The library integrates with the Harbor format, using Docker containers and LLM-as-judge verifiers to provide consistent scoring for complex, non-deterministic tasks.
  • Experiments suggest a ‘model empathy’ effect, in which a Claude-based meta-agent optimizes Claude-based sub-agents more effectively than GPT-based ones.
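
The ratchet loop and experiment log described above can be illustrated with a minimal Python sketch. This is not AutoAgent's code: the function names (ratchet_loop, toy_propose, toy_evaluate), the dictionary-based harness, and the log columns are all hypothetical stand-ins. In the real system the proposal step would be the meta-agent rewriting agent.py and the evaluation step would be a benchmark run; here both are replaced by toy functions so the keep-only-improvements logic is visible.

```python
import csv
import io
import random


def ratchet_loop(initial, propose, evaluate, steps, log_file):
    """Hill-climb on a score, keeping a candidate only if it beats the best so far.

    Every attempt is appended to a tab-separated log (hypothetical columns).
    """
    writer = csv.writer(log_file, delimiter="\t", lineterminator="\n")
    writer.writerow(["step", "score", "kept"])

    best, best_score = initial, evaluate(initial)
    writer.writerow([0, f"{best_score:.3f}", "baseline"])

    for step in range(1, steps + 1):
        candidate = propose(best)          # stand-in for the meta-agent's edit
        score = evaluate(candidate)        # stand-in for a benchmark run
        kept = score > best_score          # the ratchet: never accept a regression
        writer.writerow([step, f"{score:.3f}", "yes" if kept else "no"])
        if kept:
            best, best_score = candidate, score
    return best, best_score


def toy_propose(harness, rng):
    """Hypothetical proposal step: nudge one harness setting up or down."""
    new_value = max(0, harness["max_retries"] + rng.choice([-1, 1]))
    return {"max_retries": new_value}


def toy_evaluate(harness):
    """Hypothetical score in [0, 1]; peaks when max_retries == 3."""
    return max(0.0, 1.0 - abs(harness["max_retries"] - 3) / 10)


def demo():
    rng = random.Random(0)
    log = io.StringIO()  # a real run might open a results.tsv file instead
    best, score = ratchet_loop(
        {"max_retries": 0}, lambda h: toy_propose(h, rng), toy_evaluate, 20, log
    )
    print(best, score)
    print(log.getvalue())


if __name__ == "__main__":
    demo()
```

Because a candidate replaces the current best only when its score is strictly higher, the best score can never decrease across iterations, which is the property the "ratchet" name refers to. The log records rejected attempts as well, so the full history of experiments stays inspectable.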

Practical Applications

  • Spreadsheet Automation: AutoAgent optimized an agent to 96.5% accuracy on SpreadsheetBench. A common pitfall is relying on manual prompt-tuning alone, which can miss edge cases that autonomous iteration surfaces.
  • Terminal Task Execution: Using the Harbor adapter, AutoAgent reached a 55.1% score on TerminalBench. An anti-pattern to avoid is hard-coding tool routing, which often produces brittle agents that fail in complex CLI environments.
