OpenAI GPT-5.5: First Fully Retrained Agentic Model Hits 82.7% on Terminal-Bench

OpenAI Releases GPT-5.5, a Fully Retrained Agentic Model That Scores 82.7% on Terminal-Bench 2.0 and 84.9% on GDPval

OpenAI has launched GPT-5.5, its first fully retrained base model since GPT-4.5, designed for autonomous multi-step computer tasks. The model scores 82.7% on Terminal-Bench 2.0, outperforming Claude Opus 4.7 and Gemini 3.1 Pro on complex command-line workflows.

Why This Matters

Traditional LLMs often stall at handoff points and need repeated human re-prompting to finish complex workflows. GPT-5.5 addresses this bottleneck by reasoning across long contexts and autonomously using tools such as web browsers and code executors, a shift from passive assistant to active agent. Per-token pricing has doubled compared to GPT-5.4, but OpenAI claims that better token efficiency in Codex tasks offsets the higher rates because fewer total tokens are needed to reach a successful outcome.
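The pricing trade-off can be checked with simple arithmetic. The sketch below assumes a single blended per-token price and ignores any difference between input and output token rates. The function names and the token ratios in the loop are hypothetical. Only the 2x price multiplier comes from the article.

```python
def relative_cost(price_multiplier: float, token_ratio: float) -> float:
    """Cost of the new model relative to the old one for the same task.

    price_multiplier: new per-token price / old per-token price
    token_ratio: tokens used by the new model / tokens used by the old model
    """
    return price_multiplier * token_ratio


def break_even_token_ratio(price_multiplier: float) -> float:
    """Largest token ratio at which the new model is no more expensive."""
    return 1.0 / price_multiplier


PRICE_MULTIPLIER = 2.0  # from the article: per-token pricing doubled

# Hypothetical token ratios, for illustration only (not measured values).
for ratio in (0.4, 0.5, 0.7):
    cost = relative_cost(PRICE_MULTIPLIER, ratio)
    print(f"token ratio {ratio:.1f} -> relative cost {cost:.2f}x")
```

Under these assumptions, a doubled price is fully offset only when a task consumes at most half the tokens it did on GPT-5.4. The article does not state the actual token savings, so whether a given workload breaks even has to be measured.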

Key Insights

GPT-5.5 resolves 58.6% of real-world GitHub issues end-to-end on SWE-Bench Pro (2026)
The model achieves 82.7% on Terminal-Bench 2.0, 13.3 percentage points ahead of Claude Opus 4.7’s 69.4%
The GPT-5.5 Pro variant scores 90.1% on BrowseComp, outperforming Gemini 3.1 Pro at tracking down obscure web information
On OSWorld-Verified, the model reaches a 78.7% success rate when autonomously operating real computer environments
According to OpenAI, GPT-5.5 completes complex engineering tasks in Codex with fewer total tokens than GPT-5.4

Practical Applications

Terminal Orchestration: ML engineers can use GPT-5.5 to automate pipeline debugging and script execution, the kind of command-line work that Terminal-Bench 2.0 measures. Pitfall: Relying on autonomous execution without verifying environment state changes can lead to inconsistent deployments.
Software Engineering: Developers can use Codex with GPT-5.5 for large-scale refactors and GitHub issue resolution (58.6% on SWE-Bench Pro). Pitfall: Results may degrade on codebases with high technical debt, where the shape of the system is poorly defined.
Web-based Knowledge Work: Researchers can use GPT-5.5 Pro to retrieve hard-to-find information, the task BrowseComp measures (90.1%). Pitfall: API costs for Pro models can run high if workflows are not optimized for token efficiency.
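The first pitfall above, unverified environment state, can be guarded against with a check after each autonomous step. The following is a minimal sketch, not part of any OpenAI tooling. The helper `run_and_verify` and the marker-file step are hypothetical stand-ins for a real agent action and a real post-condition.

```python
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import Callable, Sequence


def run_and_verify(command: Sequence[str], check_state: Callable[[], bool]) -> bool:
    """Run a command, then confirm the expected state change actually happened.

    Returns True only if the command exits with status 0 AND the
    caller-supplied check_state() passes.
    """
    result = subprocess.run(command, capture_output=True, text=True)
    if result.returncode != 0:
        return False
    return check_state()


# Usage: a stand-in "agent step" that writes a marker file (hypothetical).
with tempfile.TemporaryDirectory() as tmp:
    marker = Path(tmp) / "deployed.txt"
    step = [sys.executable, "-c", f"open({str(marker)!r}, 'w').write('ok')"]
    verified = run_and_verify(
        step,
        lambda: marker.exists() and marker.read_text() == "ok",
    )
    print("step verified:", verified)
```

A zero exit code alone does not show that a deployment reached the intended state, so the post-condition is checked separately from the command's own status.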

References:

https://www.marktechpost.com/2026/04/23/openai-releases-gpt-5-5-a-fully-retrained-agentic-model-that-scores-82-7-on-terminal-bench-2-0-and-84-9-on-gdpval/
