OpenAI Releases gpt-oss-safeguard: Open-Weight Safety Reasoning Models for Custom Policy Enforcement

OpenAI has released a research preview of gpt-oss-safeguard, two open-weight safety reasoning models designed to classify content using developer-defined policies at inference time. These models, gpt-oss-safeguard-120b (117B parameters, 5.1B active) and gpt-oss-safeguard-20b (21B parameters, 3.6B active), are fine-tuned from the gpt-oss base model and licensed under Apache 2.0. They are available on Hugging Face for local deployment, enabling platforms to enforce custom safety policies dynamically without retraining.

Policy-Conditioned Safety: A Paradigm Shift

Traditional Limitations: Conventional moderation models rely on fixed policies, requiring retraining or replacement when policies change.
gpt-oss-safeguard Innovation:
- Accepts developer-authored policies as inputs alongside user content.
- Uses step-by-step reasoning to determine policy violations, making it adaptable to evolving or domain-specific harms (e.g., fraud, biology, self-harm).
- Transforms safety into a prompt and evaluation task, improving flexibility for fast-changing threats.

Alignment with OpenAI’s Internal Systems

Internal Safety Reasoner: The models replicate the architecture used in OpenAI’s production systems (e.g., GPT 5, ChatGPT Agent, Sora 2).
Defense-in-Depth Strategy:
- First Layer: Small, high-recall classifiers filter traffic.
- Second Layer: Uncertain or sensitive content is escalated to gpt-oss-safeguard for deeper analysis.
- Compute Allocation: Up to 16% of total compute in recent launches was dedicated to safety reasoning.

Model Sizes and Hardware Fit

gpt-oss-safeguard-120b:
- 117B parameters, 5.1B active.
- Optimized for 80GB H100-class GPUs.
gpt-oss-safeguard-20b:
- 21B parameters, 3.6B active.
- Targets 16GB GPU setups for lower latency.
Harmony Response Format: Prompts must follow this structure; deviations degrade performance.

Evaluation Results

Internal Multi-Policy Tests:
- Outperformed gpt-5-thinking and gpt-oss baselines in multi-policy accuracy (applying multiple policies simultaneously).
- Slight edge over OpenAI’s internal Safety Reasoner on the 2022 moderation dataset, though the margin is not statistically significant.
ToxicChat Benchmark:
- OpenAI’s internal Safety Reasoner led, with gpt-oss-safeguard in close second place, indicating readiness for real-world moderation.

Recommended Deployment Pattern

Layered Moderation Pipeline:
- Use fast, high-recall classifiers for all traffic.
- Route uncertain/sensitive content to gpt-oss-safeguard.
- For real-time needs, run the reasoner asynchronously.
External Resources: Combine with tools like ROOST for auditability and custom taxonomy support.

Key Takeaways

Custom Policy Flexibility: Developers can update policies without retraining models.
Production-Ready Sizing: Both models are optimized for real-world hardware (e.g., H100 GPUs, 16GB setups).
Competitive Performance: Matches or slightly outperforms internal Safety Reasoners on benchmarks, though margins are modest.
License and Accessibility: Apache 2.0 license permits commercial use; models are open-weight and available on Hugging Face.

Editorial Insights

OpenAI’s release democratizes its internal safety infrastructure, allowing external platforms to enforce custom taxonomies and audit reasoning chains. The models’ alignment with production-grade performance (e.g., 16% compute allocation) validates their practicality for real-world deployment.

Reference Link

On This Page

OpenAI Releases gpt-oss-safeguard: Open-Weight Safety Reasoning Models for Custom Policy Enforcement