Mastering LLM Post-Training: A Practical Guide to SFT, DPO, and GRPO with TRL
A Coding Guide on LLM Post-Training with TRL from Supervised Fine Tuning to DPO and GRPO Reasoning
The TRL (Transformer Reinforcement Learning) library provides a unified ecosystem for aligning large language models through iterative post-training stages. This guide demonstrates how to execute the full alignment pipeline on a single 16GB T4 GPU using LoRA and efficient memory management.
Why This Matters
Base models have broad knowledge, but they often lack the conversational structure and reasoning behavior that production use requires. This walkthrough addresses the gap between raw pre-trained weights and aligned assistants by implementing verifiable reward functions and preference optimization, showing that the full SFT, DPO, and GRPO pipeline can run on limited hardware without large infrastructure costs.
Key Insights
- LoRA (Low-Rank Adaptation) enables fine-tuning models like Qwen2.5-0.5B on hardware with limited VRAM by targeting projection layers such as q_proj and v_proj.
- Supervised Fine-Tuning (SFT) using TRL’s SFTTrainer establishes baseline conversational behavior by training on instruction datasets like Capybara.
- Direct Preference Optimization (DPO) simplifies alignment by optimizing the policy directly on chosen/rejected pairs, which removes the need to train and maintain a separate reward model.
- Group Relative Policy Optimization (GRPO) improves reasoning by generating multiple responses per prompt and scoring each one relative to the others in its group, using rewards based on verifiable outcomes.
- Deterministic reward functions, such as regex-based math correctness checks, allow models to improve reasoning behavior without human-in-the-loop feedback.
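To make the LoRA point in the first insight concrete, here is a back-of-the-envelope sketch of how few trainable parameters a low-rank adapter adds to a single projection matrix. The layer size and rank are hypothetical values chosen for illustration; they are not taken from Qwen2.5-0.5B or from the original tutorial's configuration.

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one (d_out x d_in) weight matrix.

    LoRA keeps the original matrix W frozen and learns an update B @ A,
    where A has shape (rank, d_in) and B has shape (d_out, rank).
    """
    return rank * (d_in + d_out)


# Hypothetical sizes for illustration only (assumption, not a real model config).
HIDDEN = 1024
RANK = 8

full = HIDDEN * HIDDEN                          # parameters in the frozen matrix
added = lora_param_count(HIDDEN, HIDDEN, RANK)  # parameters in the adapter

print(f"full matrix:  {full:,} params")
print(f"LoRA adapter: {added:,} params ({added / full:.2%} of the full matrix)")
```

Because only the adapter receives gradients and optimizer state, the memory cost of training scales with the small number rather than the large one, which is what makes a single 16GB GPU plausible for this pipeline.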
Working Examples
Implementation of GRPO with a verifiable math correctness reward function using TRL.
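import re
from trl import GRPOTrainer, GRPOConfig

# MODEL_NAME, BF16_OK, LORA_CFG, and grpo_ds are assumed to be defined earlier in the pipeline.

def correctness_reward(completions, **kwargs):
    # Extra dataset columns (here "answer") are passed to the reward function as keyword arguments.
    answers = kwargs["answer"]
    rewards = []
    for c, gold in zip(completions, answers):
        # Treat the last integer in the completion as the model's final answer.
        nums = re.findall(r"-?\d+", c)
        rewards.append(1.0 if nums and nums[-1] == str(gold).strip() else 0.0)
    return rewards

grpo_args = GRPOConfig(
    output_dir="./grpo_out",
    learning_rate=1e-5,
    per_device_train_batch_size=2,
    # Completions sampled per prompt. TRL requires the effective batch size to be
    # divisible by this value, so adjust the batch size or gradient accumulation if it raises an error.
    num_generations=4,
    bf16=BF16_OK,
    max_steps=15,
)

grpo_trainer = GRPOTrainer(
    model=MODEL_NAME,
    args=grpo_args,
    train_dataset=grpo_ds,
    reward_funcs=[correctness_reward],
    peft_config=LORA_CFG,
)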
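The "group relative" part of GRPO can be sketched in a few lines: the rewards for the completions sampled from one prompt are normalized against that group's own mean and spread. This is a simplified illustration in plain Python, not TRL's internal code; the function name and the `eps` value are assumptions, and the exact normalization differs between implementations.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-4):
    """Normalize the rewards of one prompt's completions against the group itself.

    Completions that beat the group average get a positive advantage and the
    rest get a negative one. eps (an assumed value) avoids division by zero
    when every completion receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# One prompt with four sampled completions, only the first of which was correct.
group = [1.0, 0.0, 0.0, 0.0]
adv = group_relative_advantages(group)
print(adv)

# If every completion gets the same reward, there is no learning signal.
print(group_relative_advantages([1.0, 1.0, 1.0, 1.0]))
```

One consequence worth noting: a binary reward such as `correctness_reward` only produces a gradient signal for prompts where the sampled completions disagree, which is why sampling several generations per prompt matters.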
Configuring DPOTrainer to align model outputs with human preference data. The beta parameter controls how far the trained policy may diverge from the reference model.
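from trl import DPOTrainer, DPOConfig

# MODEL_NAME, BF16_OK, LORA_CFG, and dpo_ds are assumed to be defined earlier in the pipeline.
# dpo_ds is expected to hold preference pairs (chosen and rejected responses).

dpo_args = DPOConfig(
    output_dir="./dpo_out",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,  # effective batch size of 4 on one GPU
    learning_rate=5e-6,
    beta=0.1,  # higher values keep the policy closer to the reference model
    bf16=BF16_OK,
)

dpo_trainer = DPOTrainer(
    model=MODEL_NAME,
    args=dpo_args,
    train_dataset=dpo_ds,
    peft_config=LORA_CFG,
)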
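To show what `beta` does in the objective, here is a minimal plain-Python sketch of the standard sigmoid DPO loss for a single preference pair. The function name and the log-probability values are made up for illustration; in practice the trainer computes these quantities from the policy and reference models over batches of token sequences.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Sigmoid DPO loss for one (chosen, rejected) pair of sequence log-probabilities."""
    # How much more (or less) likely each response is under the policy than the reference.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)) written as a numerically stable softplus(-margin).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Policy identical to the reference: the loss is log(2), the starting point of training.
print(dpo_loss(-15.0, -15.0, -15.0, -15.0))

# Policy has moved toward the chosen response and away from the rejected one: lower loss.
print(dpo_loss(-10.0, -20.0, -15.0, -15.0))
```

The loss depends only on how the policy's preferences differ from the reference model's, so no separate reward model is needed; `beta` scales how strongly that difference is rewarded or penalized.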
Practical Applications
- Mathematical reasoning engines: Using GRPO to verify answers via regex-based correctness rewards while applying brevity rewards to discourage verbose ‘rambling’ anti-patterns.
- Instruction following: Implementing SFT and DPO to transform raw base models into chat-ready assistants that adhere to specific formatting requirements and human preferences.
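The brevity reward mentioned for mathematical reasoning could look something like the sketch below. The function name, the `max_words` threshold, and the linear penalty are all assumptions made for illustration, not the original tutorial's definition. It follows the same calling convention as `correctness_reward`: it takes a list of completion strings and returns one float per completion.

```python
def brevity_reward(completions, max_words=60, **kwargs):
    """Hypothetical brevity reward: full credit up to max_words, then a linear decay.

    A completion with 2 * max_words words or more receives 0.0.
    """
    rewards = []
    for text in completions:
        n_words = len(text.split())
        if n_words <= max_words:
            rewards.append(1.0)
        else:
            overflow = (n_words - max_words) / max_words
            rewards.append(max(0.0, 1.0 - overflow))
    return rewards


short = "The answer is 42"
medium = " ".join(["word"] * 90)
rambling = " ".join(["word"] * 120)
print(brevity_reward([short, medium, rambling]))
```

A function like this could be listed alongside `correctness_reward` in `reward_funcs` so that correct but long-winded completions score lower than correct and concise ones; how the two signals should be weighted is a tuning decision.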