Unlocking Agentic RL Training for GPT-OSS: A Practical Retrospective

These articles are AI-generated summaries. Please check the original sources for full details.

Challenges of GPT-OSS RL Training

Agentic reinforcement learning (RL) trains a policy by collecting on-policy data through interaction with an environment and optimizing for multi-step decision-making. This differs from traditional single-turn RL and from offline preference methods. LinkedIn uses this approach to build AI agents for professional applications, which require robust reasoning, interaction with structured services, and adaptation to evolving user intent. Initial attempts to apply agentic RL to the GPT-OSS model were unstable: KL divergence and entropy exploded while rewards failed to increase, which shows how hard it can be to adapt cutting-edge LLMs to this training paradigm.

Key Insights

  • MoE Log-Probability Mismatch: verl computes log-probabilities in two separate forward passes. With the Mixture of Experts (MoE) architecture of GPT-OSS, the two passes could route tokens to different experts, so the importance sampling ratio deviated from 1 even on freshly sampled data, violating the core on-policy assumption.
  • Training-Inference Discrepancy: Differences between training (FSDP) and inference (vLLM, SGLang) execution, particularly in attention mechanisms, introduced instability and hindered convergence.
  • FlashAttention v3 & Sequence Parallelism: Combining attention sink support in FlashAttention v3 with sequence parallelism drastically improved memory efficiency and enabled training with long context windows essential for multi-step agents.
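
The log-probability mismatch in the first insight can be made concrete with a small sketch. The importance sampling ratio for a token is exp(new log-prob minus old log-prob), so it equals 1 only when both forward passes assign the same log-probability to the sampled token. The function names and the log-probability values below are hypothetical and chosen for illustration; they are not taken from verl or from any GPT-OSS run.

```python
import math

def importance_ratios(new_logprobs, old_logprobs):
    """Per-token importance sampling ratio pi_new / pi_old = exp(new - old)."""
    return [math.exp(n - o) for n, o in zip(new_logprobs, old_logprobs)]

def max_ratio_deviation(new_logprobs, old_logprobs):
    """Largest |ratio - 1| over the sequence; 0.0 means the two passes agree."""
    return max(abs(r - 1.0) for r in importance_ratios(new_logprobs, old_logprobs))

# Hypothetical per-token log-probabilities of the same sampled tokens.
pass_a = [-0.50, -1.20, -0.10]        # first forward pass
pass_b_same = [-0.50, -1.20, -0.10]   # second pass, identical expert routing
pass_b_drift = [-0.45, -1.60, -0.10]  # second pass, different routing (made-up values)

print(max_ratio_deviation(pass_b_same, pass_a))   # 0.0: the on-policy assumption holds
print(max_ratio_deviation(pass_b_drift, pass_a))  # about 0.33: the ratio drifts away from 1
```

A check like this, run on data that was just sampled from the current policy, is one way to notice that the ratio is not 1 before the update step has changed any weights.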

Working Example

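# Simplified example illustrating attention with sinks (no causal mask)
import math
import torch

def attention_with_sink(Q, K, V, sink_param):
  """
  Calculates attention with a learnable sink parameter.

  Q: [B, H, N_q, d], K: [B, H, N_k, d], V: [B, H, N_k, d_v].
  sink_param: assumed shape [H], one learnable sink logit per head.
  """
  scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(Q.shape[-1]) # [B, H, N_q, N_k]
  # Broadcast each head's sink logit to every batch element and query position.
  sink = sink_param.to(scores.dtype).view(1, -1, 1, 1).expand(scores.shape[0], -1, scores.shape[2], 1) # [B, H, N_q, 1]
  combined = torch.cat([scores, sink], dim=-1) # [B, H, N_q, N_k+1]
  probs = torch.softmax(combined, dim=-1) # Σ_j P_ij + P_sink = 1
  probs_content = probs[..., :-1] # Drop sink component; content weights now sum to less than 1
  output = torch.matmul(probs_content, V) # [B, H, N_q, d_v]
  return output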
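
To see what the sink does numerically, here is a minimal standard-library sketch for a single query row. The helper name `softmax_with_sink` and the input values are assumptions made for illustration. The sink takes part in the softmax normalization but has no value vector, so the content weights sum to less than 1.

```python
import math

def softmax_with_sink(scores, sink_logit):
    """Softmax over content scores plus one extra sink logit.

    Returns (content_probs, sink_prob). Only content_probs would be
    multiplied with V, because the sink has no value vector.
    """
    logits = list(scores) + [sink_logit]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    return probs[:-1], probs[-1]

# Two content keys and a sink with equal logits: each gets one third of the mass.
content, sink = softmax_with_sink([1.0, 1.0], sink_logit=1.0)
print(sum(content), sink)

# A very negative sink logit recovers the ordinary softmax over the content keys.
content_plain, sink_plain = softmax_with_sink([1.0, 1.0], sink_logit=-1e9)
print(content_plain, sink_plain)
```

In the first call the content weights sum to two thirds, so the attention output is scaled down compared with ordinary softmax attention. A kernel that ignores the sink would normalize over the content keys alone, which is one way an attention implementation in training can diverge from the one used at inference.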

Practical Applications

  • LinkedIn Recruiter: An agentic system using GPT-OSS to refine search queries, coordinate with data sources, and present tailored candidate recommendations to recruiters.
  • Numerical Precision: Using the bf16 format can cause memory blow-ups during the FSDP forward pass because MoE experts are materialized repeatedly, so this path requires careful optimization.
