Inside OpenAI’s in-house data agent

OpenAI has built a bespoke internal AI data agent, powered by GPT-5 and Codex, that explores and reasons over its data platform, which holds over 600 petabytes of data across 70,000 datasets. The agent cuts the time to insight for employees from days to minutes.

Why This Matters

Ideal data analysis assumes clean, well-documented data and analysts with deep contextual knowledge. In reality, data is often messy and poorly documented, and understanding the relationships between tables and their pitfalls takes significant effort. At OpenAI’s scale (3.5k+ internal users and 70k datasets), the cost of inefficient data access and analysis quickly becomes substantial and hinders data-driven decision-making.

Key Insights

Over 600 petabytes: The volume of data managed by OpenAI’s data platform, spread across 70,000 datasets.
Context is King: The agent relies on multiple layers of context – metadata, query inference, curated descriptions, code-level definitions, Slack/Google Docs integration, and a learning memory system – to ensure accurate results.
Evals API for Quality Control: OpenAI uses its Evals API to systematically evaluate the agent’s performance with curated question-answer pairs and automated SQL comparison, preventing regressions and ensuring reliability.
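
The evaluation idea above (curated question-answer pairs plus automated SQL comparison) can be illustrated with a small sketch. This is not OpenAI’s Evals API or the agent’s actual harness; it is a minimal stand-in, assuming that two queries count as equivalent when they return the same rows regardless of order. The in-memory SQLite table, the CASES list, and the fake_agent function are all hypothetical and exist only for illustration.

```python
import sqlite3
from collections import Counter


def run_query(conn, sql):
    """Run a query and return its rows as a multiset, so row order is ignored."""
    return Counter(conn.execute(sql).fetchall())


def results_match(conn, generated_sql, reference_sql):
    """True when both queries return the same rows (order-insensitive)."""
    try:
        return run_query(conn, generated_sql) == run_query(conn, reference_sql)
    except sqlite3.Error:
        # A query that fails to execute counts as a failed case.
        return False


def evaluate(conn, cases, generate_sql):
    """Pass rate of generate_sql over curated (question, reference_sql) pairs."""
    passed = sum(results_match(conn, generate_sql(q), ref) for q, ref in cases)
    return passed / len(cases)


def build_demo_db():
    """Hypothetical toy table standing in for a real dataset."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE signups (day TEXT, plan TEXT, users INTEGER)")
    conn.executemany(
        "INSERT INTO signups VALUES (?, ?, ?)",
        [("2025-01-01", "free", 10), ("2025-01-01", "pro", 3), ("2025-01-02", "free", 7)],
    )
    return conn


# Curated question / reference-SQL pairs (illustrative only).
CASES = [
    ("How many users signed up in total?",
     "SELECT SUM(users) FROM signups"),
    ("How many users signed up per plan?",
     "SELECT plan, SUM(users) FROM signups GROUP BY plan"),
]


def fake_agent(question):
    """Hypothetical stand-in for the agent: returns canned SQL per question."""
    if "per plan" in question:
        # Same rows as the reference, different order: should still pass.
        return "SELECT plan, SUM(users) FROM signups GROUP BY plan ORDER BY plan DESC"
    # Deliberately wrong: counts rows instead of summing users.
    return "SELECT COUNT(*) FROM signups"


if __name__ == "__main__":
    conn = build_demo_db()
    print(f"pass rate: {evaluate(conn, CASES, fake_agent):.2f}")
```

Comparing result sets rather than SQL text lets differently written but equivalent queries pass, while a rerun after each change to the agent would surface a regression as a drop in the pass rate.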

Practical Applications

OpenAI Internal Teams: Engineering, Data Science, Go-To-Market, Finance, and Research teams use the agent for high-impact data questions, such as evaluating product launches and understanding business health.
Pitfall: Overly prescriptive prompting can hinder the agent’s ability to reason effectively; allowing GPT-5 to choose the execution path leads to more robust results.

References:

https://openai.com/index/inside-our-in-house-data-agent/
