Unlocking Agentic RL Training for GPT-OSS: A Practical Retrospective
These articles are AI-generated summaries. Please check the original sources for full details.
Challenges of GPT-OSS RL Training
Agentic reinforcement learning (RL) trains policies by actively collecting on-policy data through interaction with an environment and optimizes for multi-step decision-making, a departure from traditional single-turn RL and offline preference methods. LinkedIn uses this approach to build AI agents for professional applications, which require robust reasoning, interaction with structured services, and adaptation to evolving user intent. However, initial attempts to apply agentic RL to the GPT-OSS model were unstable: KL divergence and entropy exploded and rewards failed to increase, highlighting the difficulty of adapting cutting-edge LLMs to this training paradigm.
Key Insights
- MoE Log-Probability Mismatch: verl performs a dual forward pass, and with the Mixture of Experts (MoE) architecture of GPT-OSS the two passes can route tokens to different experts. The resulting log-probabilities differ, so the importance sampling ratio deviates from 1, violating the core on-policy assumption.
- Training-Inference Discrepancy: Differences between training (FSDP) and inference (vLLM, SGLang) execution, particularly in attention mechanisms, introduced instability and hindered convergence.
- FlashAttention v3 & Sequence Parallelism: Combining attention sink support in FlashAttention v3 with sequence parallelism drastically improved memory efficiency and enabled training with long context windows essential for multi-step agents.
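To make the first insight concrete, the sketch below computes per-token importance sampling ratios from two sets of log-probabilities. It is a minimal illustration in plain Python, not verl's implementation: the function names `importance_ratios` and `max_ratio_deviation` and all sample numbers are hypothetical. When both passes produce identical log-probabilities, every ratio is exactly 1; a perturbation such as one caused by different expert routing shows up as a deviation.

```python
import math

def importance_ratios(new_logprobs, old_logprobs):
    """Per-token ratio pi_new(a|s) / pi_old(a|s), computed from log-probabilities."""
    if len(new_logprobs) != len(old_logprobs):
        raise ValueError("log-prob sequences must have the same length")
    return [math.exp(n - o) for n, o in zip(new_logprobs, old_logprobs)]

def max_ratio_deviation(ratios):
    """Largest absolute distance of any ratio from the on-policy value of 1."""
    return max(abs(r - 1.0) for r in ratios)

# Hypothetical per-token log-probs from a first forward pass.
old = [-0.50, -1.20, -0.05, -2.30]
# Same weights, same routing: the second pass reproduces them exactly.
same = list(old)
# Same weights, different expert routing (illustrative perturbation only).
rerouted = [-0.48, -1.35, -0.05, -2.10]

on_policy = importance_ratios(same, old)
mismatched = importance_ratios(rerouted, old)
print(max_ratio_deviation(on_policy))   # 0.0
print(max_ratio_deviation(mismatched))  # clearly nonzero
```

Logging a statistic like this per batch is one way to spot the mismatch: in a strictly on-policy step, any nonzero deviation means the two passes disagree, not that the policy has been updated.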
Working Example
# Simplified example illustrating attention with sinks
import math
import torch

def attention_with_sink(Q, K, V, sink_param):
    """
    Calculates attention with a learnable sink parameter.
    Q: [B, H, N_q, d]; K: [B, H, N_k, d]; V: [B, H, N_k, d_v].
    sink_param: [H], one sink logit per head (assumed shape).
    """
    B, H, N_q, _ = Q.shape
    scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(Q.shape[-1])  # [B, H, N_q, N_k]
    # torch.cat does not broadcast, so expand the sink to one column per query row
    sink = sink_param.to(scores.dtype).view(1, H, 1, 1).expand(B, H, N_q, 1)
    combined = torch.cat([scores, sink], dim=-1)  # [B, H, N_q, N_k+1]
    probs = torch.softmax(combined, dim=-1)  # Σ_j P_ij + P_sink = 1
    probs_content = probs[..., :-1]  # Drop sink component
    output = torch.matmul(probs_content, V)  # [B, H, N_q, d_v]
    return output
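A small numeric check can make the normalization in the example above concrete. The following plain-Python sketch applies the same softmax-with-sink to a single query row; the function name `softmax_with_sink` and the input values are illustrative assumptions. It shows that the content probabilities sum to less than 1, with the remainder absorbed by the sink.

```python
import math

def softmax_with_sink(scores, sink_logit):
    """Softmax over one query row's scores plus one extra sink logit.

    Returns (content_probs, sink_prob). The sink has no value vector,
    so only content_probs would be multiplied with V.
    """
    logits = list(scores) + [sink_logit]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    return probs[:-1], probs[-1]

content, sink = softmax_with_sink([2.0, 1.0, 0.0], sink_logit=1.0)
print(sum(content) + sink)  # 1.0 up to floating-point rounding
print(sum(content) < 1.0)   # True: the sink absorbs part of the mass
```

Because the sink column is dropped before the product with V, a query with no strongly matching key can assign most of its mass to the sink and contribute a correspondingly small output.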
Practical Applications
- LinkedIn Recruiter: An agentic system using GPT-OSS to refine search queries, coordinate with data sources, and present tailored candidate recommendations to recruiters.
- Numerical Precision: Using the bf16 format can cause memory blow-ups during the FSDP forward pass because MoE experts are materialized repeatedly, so this path requires careful optimization.