OpenAI GPT-5.5: First Fully Retrained Agentic Model Hits 82.7% on Terminal-Bench

These articles are AI-generated summaries. Please check the original sources for full details.

OpenAI Releases GPT-5.5, a Fully Retrained Agentic Model That Scores 82.7% on Terminal-Bench 2.0 and 84.9% on GDPval

OpenAI has launched GPT-5.5, its first fully retrained base model since GPT-4.5, designed specifically for autonomous multi-step computer tasks. The model scores 82.7% on Terminal-Bench 2.0, outperforming Claude Opus 4.7 and Gemini 3.1 Pro on complex command-line workflows.

Why This Matters

Traditional LLMs often stall at handoff points and need constant human re-prompting to complete complex workflows. GPT-5.5 addresses this bottleneck by reasoning across long contexts and autonomously using tools such as web browsers and code executors, a shift from passive assistants to active agents. Although per-token pricing has doubled compared with GPT-5.4, OpenAI claims that better token efficiency in Codex tasks offsets the higher rates, because the model needs fewer total tokens to reach a successful outcome.
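The pricing trade-off above comes down to simple arithmetic: total cost is price per token times tokens used, so a doubled price is offset only if the task finishes in less than half the tokens. The sketch below illustrates that break-even point. The function names and the 0.4 token ratio are hypothetical illustrations, not figures published by OpenAI, and the only number taken from the article is the 2x price multiplier.

```python
def break_even_token_ratio(price_multiplier: float) -> float:
    """Return the largest fraction of the old token count the new model
    may use while costing no more than the old model for the same task."""
    if price_multiplier <= 0:
        raise ValueError("price_multiplier must be positive")
    return 1.0 / price_multiplier


def relative_cost(price_multiplier: float, token_ratio: float) -> float:
    """Return new cost divided by old cost.

    token_ratio is new tokens divided by old tokens for the same task.
    A result below 1.0 means the new model is cheaper overall.
    """
    if token_ratio < 0:
        raise ValueError("token_ratio must be non-negative")
    return price_multiplier * token_ratio


# The article states per-token pricing doubled, so the multiplier is 2.0.
PRICE_MULTIPLIER = 2.0

# Hypothetical assumption: the new model needs 40% of the old token count.
HYPOTHETICAL_TOKEN_RATIO = 0.4

if __name__ == "__main__":
    print("break-even token ratio:", break_even_token_ratio(PRICE_MULTIPLIER))
    print("relative cost:", relative_cost(PRICE_MULTIPLIER, HYPOTHETICAL_TOKEN_RATIO))
```

Under these assumptions, any workload where GPT-5.5 uses more than half the tokens GPT-5.4 did would cost more, not less, so the claim is worth checking against your own task logs.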

Key Insights

  • GPT-5.5 resolves 58.6% of real-world GitHub issues end-to-end on SWE-Bench Pro (2026)
  • The model achieves 82.7% on Terminal-Bench 2.0, well ahead of Claude Opus 4.7’s 69.4%
  • The GPT-5.5 Pro variant scores 90.1% on BrowseComp, outperforming Gemini 3.1 Pro at tracking down obscure web information
  • On OSWorld-Verified, the model demonstrates a 78.7% success rate in autonomously operating real computer environments
  • In Codex, GPT-5.5 completes complex engineering tasks with fewer total tokens than GPT-5.4

Practical Applications

  • Terminal Orchestration: ML engineers can use GPT-5.5 to automate pipeline debugging and script execution, the kind of command-line workflow that Terminal-Bench 2.0 measures. Pitfall: Relying on autonomous execution without verifying environment state changes can lead to inconsistent deployments.
  • Software Engineering: Developers can use Codex with GPT-5.5 for large-scale refactors and GitHub issue resolution (58.6% on SWE-Bench Pro). Pitfall: Applying the model to codebases with high technical debt, where the shape of the system is poorly defined.
  • Web-based Knowledge Work: Researchers can use GPT-5.5 Pro to retrieve hard-to-find data (90.1% on BrowseComp). Pitfall: API costs for Pro models can be high if workflows are not optimized for token efficiency.
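
One way to guard against the first pitfall above (unverified environment state changes) is to snapshot the working directory before and after an agent runs, then review the difference. The following is a minimal sketch of that idea using only the Python standard library. The helper names `snapshot` and `diff_snapshots` are hypothetical, and the sketch covers file contents only, not environment variables, installed packages, or running services.

```python
import hashlib
from pathlib import Path


def snapshot(root):
    """Map each file's path (relative to root) to the SHA-256 of its contents."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def diff_snapshots(before, after):
    """Report which files were added, removed, or changed between two snapshots."""
    return {
        "added": sorted(set(after) - set(before)),
        "removed": sorted(set(before) - set(after)),
        "changed": sorted(
            path for path in set(before) & set(after) if before[path] != after[path]
        ),
    }
```

A typical use would be to call `snapshot` on the project directory, let the agent run its commands, call `snapshot` again, and have a human or a CI check review the output of `diff_snapshots` before anything is deployed.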
