Unlocking Agentic RL Training for GPT-OSS: A Practical Retrospective

Challenges of GPT-OSS RL Training

Agentic reinforcement learning (RL) trains policies by actively collecting on-policy data through interaction with an environment and optimizes for multi-step decision-making, a departure from traditional single-turn RL or offline preference methods. LinkedIn uses this approach to build AI agents for professional applications, which require robust reasoning, interaction with structured services, and adaptation to evolving user intent. However, initial attempts to apply agentic RL to the GPT-OSS model were unstable: KL divergence and entropy exploded while rewards failed to increase, highlighting the challenges of adapting cutting-edge LLMs to this training paradigm.
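
The instability signals named above (KL divergence and entropy) can be monitored per token during training. The following is a minimal sketch under stated assumptions: the distributions are made-up toy values over a three-token vocabulary, and the function names are illustrative, not verl's logging API.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a categorical distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def kl_divergence(p, q):
    """KL(p || q) in nats; assumes q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

# Toy next-token distributions over a 3-token vocabulary (made-up numbers).
reference = [0.70, 0.20, 0.10]       # reference policy
stable_policy = [0.68, 0.22, 0.10]   # small, healthy update
drifted_policy = [0.34, 0.33, 0.33]  # policy collapsing toward uniform

kl_stable = kl_divergence(stable_policy, reference)
kl_drifted = kl_divergence(drifted_policy, reference)

print(f"KL stable:  {kl_stable:.4f}, entropy: {entropy(stable_policy):.4f}")
print(f"KL drifted: {kl_drifted:.4f}, entropy: {entropy(drifted_policy):.4f}")
```

In a healthy run both quantities change gradually. A policy that drifts toward near-uniform outputs, as in the second case, shows KL and entropy rising together, which matches the symptom pattern described above.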

Key Insights

MoE Log-Probability Mismatch: verl performs a dual forward pass, and with the Mixture of Experts (MoE) architecture of GPT-OSS the two passes can route tokens to different experts. The resulting log-probabilities differ, so the importance sampling ratio deviates from 1, violating the core on-policy assumption.
Training-Inference Discrepancy: Differences between training execution (FSDP) and inference execution (vLLM, SGLang), particularly in the attention mechanisms, introduced instability and hindered convergence.
FlashAttention v3 & Sequence Parallelism: Combining attention sink support in FlashAttention v3 with sequence parallelism drastically improved memory efficiency and enabled training with the long context windows that multi-step agents need.
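
To make the first insight concrete, here is a toy sketch of how a routing flip between two forward passes pushes the importance sampling ratio away from 1. Everything in it is an assumption for illustration: the two-expert layer, the top-1 router, and all numeric values are hypothetical and do not reflect GPT-OSS or verl internals.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical toy MoE layer: each expert produces different vocabulary logits.
EXPERT_LOGITS = [
    [2.0, 0.5, 0.1],  # expert 0
    [0.3, 1.8, 0.2],  # expert 1
]

def token_log_prob(router_logits, token_id):
    """Top-1 routing: pick the highest-scoring expert, then score the token."""
    expert = max(range(len(router_logits)), key=lambda i: router_logits[i])
    return math.log(softmax(EXPERT_LOGITS[expert])[token_id])

token_id = 0
# Same token, two forward passes. The router logits differ only by tiny
# numerical noise (made-up values), but that is enough to flip the expert.
pass_1_router = [0.5001, 0.5000]
pass_2_router = [0.5000, 0.5001]

old_log_prob = token_log_prob(pass_1_router, token_id)
new_log_prob = token_log_prob(pass_2_router, token_id)
ratio = math.exp(new_log_prob - old_log_prob)

# If both terms come from the same pass, the ratio is exactly 1.
ratio_reused = math.exp(old_log_prob - old_log_prob)

print(f"ratio across two passes: {ratio:.3f}")
print(f"ratio with one pass reused: {ratio_reused:.3f}")
```

The weights and the token are identical in both passes, yet the ratio lands far from 1 because a different expert scored the token. Reusing one pass's log-probabilities is shown only as a sanity check on the on-policy assumption; this section does not state which remedy was adopted.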

Working Example

# Simplified example illustrating attention with sinks
import math

import torch

def attention_with_sink(Q, K, V, sink_param):
  """
  Calculates attention with a learnable sink parameter.

  Q: [B, H, N_q, d], K: [B, H, N_k, d], V: [B, H, N_k, d_v].
  sink_param: one learnable logit per head, assumed shape [H].
  """
  B, H, N_q, d = Q.shape
  scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d) # [B, H, N_q, N_k]
  # torch.cat does not broadcast, so expand the sink logit to [B, H, N_q, 1] first
  sink = sink_param.to(scores.dtype).view(1, H, 1, 1).expand(B, H, N_q, 1)
  combined = torch.cat([scores, sink], dim=-1) # [B, H, N_q, N_k+1]
  probs = torch.softmax(combined, dim=-1) # Σ_j P_ij + P_sink = 1
  probs_content = probs[..., :-1] # Drop sink component
  output = torch.matmul(probs_content, V) # [B, H, N_q, d_v]
  return output
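
The effect of the sink is easiest to see on a single query row. The sketch below is a dependency-free restatement of the same softmax-with-sink step, with made-up scores and a hypothetical helper name. It checks two properties: the content weights sum to less than 1 when the sink absorbs mass, and a very negative sink logit recovers ordinary softmax attention.

```python
import math

def sink_attention_weights(scores, sink_logit):
    """Softmax over content scores plus one sink logit; return content weights only."""
    m = max(max(scores), sink_logit)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    denom = sum(exps) + math.exp(sink_logit - m)
    return [e / denom for e in exps]

# Made-up scaled dot products for one query against three keys.
scores = [1.0, 0.0, -1.0]

with_sink = sink_attention_weights(scores, sink_logit=1.0)
no_sink = sink_attention_weights(scores, sink_logit=-1e9)

# Probability mass absorbed by the sink and therefore not applied to V.
sink_mass = 1.0 - sum(with_sink)

print("content weights with sink:", [round(w, 4) for w in with_sink])
print("mass absorbed by sink:", round(sink_mass, 4))
print("content weights without sink:", [round(w, 4) for w in no_sink])
```

Because the sink competes in the softmax but contributes no value vector, a head can attend to "nothing" instead of being forced to spread all of its weight across the content tokens.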

Practical Applications

LinkedIn Recruiter: An agentic system using GPT-OSS to refine search queries, coordinate with data sources, and present tailored candidate recommendations to recruiters.
Numerical Precision: Using the bf16 format can cause memory blow-ups during the FSDP forward pass because MoE experts are materialized repeatedly, so this path requires careful optimization.
