OpenAI GPT-5.5: First Fully Retrained Agentic Model Hits 82.7% on Terminal-Bench
These articles are AI-generated summaries. Please check the original sources for full details.
OpenAI Releases GPT-5.5, a Fully Retrained Agentic Model That Scores 82.7% on Terminal-Bench 2.0 and 84.9% on GDPval
OpenAI has launched GPT-5.5, its first fully retrained base model since GPT-4.5, designed for autonomous multi-step computer tasks. The model scores 82.7% on Terminal-Bench 2.0, outperforming Claude Opus 4.7 and Gemini 3.1 Pro in complex command-line workflows.
Why This Matters
Traditional LLMs often stall at handoff points and need repeated human re-prompting to finish complex workflows. GPT-5.5 targets this bottleneck by reasoning across long contexts and autonomously using tools such as web browsers and code executors, a shift from passive assistants to active agents. Per-token pricing has doubled compared to GPT-5.4, but OpenAI claims that better token efficiency in Codex tasks offsets the higher rates because fewer total tokens are needed to reach a successful outcome.
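The pricing claim comes down to simple arithmetic: total cost is the per-token price times the number of tokens used, so a doubled price is only offset if a task uses fewer than half the tokens it did before. The sketch below works through that break-even point. The function names and the sample token ratios are hypothetical and chosen for illustration; the only figure taken from the article is the 2x price multiplier.

```python
def break_even_token_ratio(price_multiplier: float) -> float:
    """Return the new/old token ratio at which total cost is unchanged."""
    if price_multiplier <= 0:
        raise ValueError("price_multiplier must be positive")
    return 1.0 / price_multiplier


def relative_cost(price_multiplier: float, token_ratio: float) -> float:
    """Return new cost divided by old cost (1.0 means the same cost)."""
    return price_multiplier * token_ratio


# From the article: per-token pricing doubled relative to GPT-5.4.
PRICE_MULTIPLIER = 2.0

print(f"break-even token ratio: {break_even_token_ratio(PRICE_MULTIPLIER):.2f}")

# Hypothetical token ratios (new tokens / old tokens), not reported figures.
for ratio in (0.4, 0.5, 0.7):
    cost = relative_cost(PRICE_MULTIPLIER, ratio)
    print(f"token ratio {ratio:.1f} -> relative cost {cost:.2f}")
```

Under these assumptions, a task that needs 40% of the previous token count ends up cheaper overall, while one that needs 70% costs more than before despite using fewer tokens.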
Key Insights
- GPT-5.5 resolves 58.6% of real-world GitHub issues end-to-end on SWE-Bench Pro (2026)
- The model achieves 82.7% on Terminal-Bench 2.0, 13.3 percentage points ahead of Claude Opus 4.7’s score of 69.4%
- The GPT-5.5 Pro variant scores 90.1% on BrowseComp, outperforming Gemini 3.1 Pro in tracking down obscure web information
- On OSWorld-Verified, the model demonstrates a 78.7% success rate in autonomously operating real computer environments
- Codex token efficiency allows GPT-5.5 to complete complex engineering tasks with fewer total tokens than GPT-5.4
Practical Applications
- Terminal Orchestration: ML engineers using GPT-5.5 to automate pipeline debugging and script execution, the kind of command-line work that Terminal-Bench 2.0 measures. Pitfall: Relying on autonomous execution without verifying changes to environment state can lead to inconsistent deployments.
- Software Engineering: Developers using Codex with GPT-5.5 for large-scale refactors and GitHub issue resolution (58.6% on SWE-Bench Pro). Pitfall: Results may degrade on codebases with high technical debt, where the shape of the system is poorly defined.
- Web-based Knowledge Work: Researchers using GPT-5.5 Pro to retrieve hard-to-find data (90.1% on BrowseComp). Pitfall: API costs for Pro models can run high if workflows are not optimized for token efficiency.