Mastering LLM Post-Training: A Practical Guide to SFT, DPO, and GRPO with TRL

These articles are AI-generated summaries. Please check the original sources for full details.

A Coding Guide on LLM Post-Training with TRL from Supervised Fine Tuning to DPO and GRPO Reasoning

The TRL (Transformer Reinforcement Learning) library provides a unified ecosystem for aligning large language models through iterative post-training stages. This guide demonstrates how to execute the full alignment pipeline on a single 16GB T4 GPU using LoRA and efficient memory management.

Why This Matters

Base models have broad knowledge, but they often lack the conversational structure and reasoning behavior that production use requires. This walkthrough covers the gap between raw pre-trained weights and an aligned assistant by implementing preference optimization and verifiable reward functions, and it shows that these alignment methods can be run on limited hardware without large infrastructure costs.

Key Insights

  • LoRA (Low-Rank Adaptation) enables fine-tuning models like Qwen2.5-0.5B on hardware with limited VRAM by targeting projection layers such as q_proj and v_proj.
  • Supervised Fine-Tuning (SFT) using TRL’s SFTTrainer establishes baseline conversational behavior by training on instruction datasets like Capybara.
  • Direct Preference Optimization (DPO) simplifies alignment by optimizing the policy directly on chosen/rejected pairs, which removes the need to train and maintain a separate reward model.
  • Group Relative Policy Optimization (GRPO) improves reasoning by sampling several responses per prompt and scoring each one relative to the others in its group, using rewards based on verifiable outcomes.
  • Deterministic reward functions, such as regex-based math correctness checks, allow models to improve reasoning behavior without human-in-the-loop feedback.
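
To see why restricting LoRA to q_proj and v_proj keeps the trainable footprint small, the arithmetic can be sketched as below. The rank and the layer shapes are assumptions made for illustration; the source does not state them, and the shapes should be checked against the actual model config before relying on the numbers.

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one d_in x d_out linear layer.

    LoRA learns two low-rank factors, A (rank x d_in) and B (d_out x rank),
    while the original weight matrix stays frozen.
    """
    return rank * (d_in + d_out)


# Assumed values for illustration only; verify against the real model config.
RANK = 16
NUM_LAYERS = 24
HIDDEN = 896   # assumed input width of q_proj and v_proj
Q_OUT = 896    # assumed q_proj output width
V_OUT = 128    # assumed v_proj output width (grouped-query attention)

per_layer = (
    lora_param_count(HIDDEN, Q_OUT, RANK)
    + lora_param_count(HIDDEN, V_OUT, RANK)
)
total_trainable = per_layer * NUM_LAYERS
print(f"LoRA trainable parameters: {total_trainable:,}")
```

Under these assumptions the adapters add roughly one million trainable parameters, a small fraction of a 0.5B-parameter model, so gradients and optimizer state are only needed for that small set.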

Working Examples

Implementation of GRPO with a verifiable math correctness reward function using TRL.

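import re

from trl import GRPOTrainer, GRPOConfig

# MODEL_NAME, BF16_OK, LORA_CFG, and grpo_ds are assumed to be defined
# elsewhere in the pipeline; this excerpt does not show them.

def correctness_reward(completions, **kwargs):
    # Extra dataset columns arrive as keyword arguments; "answer" holds the gold label.
    # Assumes plain-text completions (standard, non-conversational prompt format).
    answers = kwargs["answer"]
    rewards = []
    for c, gold in zip(completions, answers):
        # Treat the last integer in the completion as the model's final answer.
        nums = re.findall(r"-?\d+", c)
        # Compare as strings so integer-typed gold labels also match.
        rewards.append(1.0 if nums and nums[-1] == str(gold).strip() else 0.0)
    return rewards

# Note: TRL expects the effective batch size to be divisible by num_generations.
# Depending on the TRL version, batch size 2 with 4 generations may need
# gradient accumulation or a larger per-device batch.
grpo_args = GRPOConfig(
    output_dir="./grpo_out",
    learning_rate=1e-5,
    per_device_train_batch_size=2,
    num_generations=4,  # completions sampled per prompt
    bf16=BF16_OK,
    max_steps=15,
)
grpo_trainer = GRPOTrainer(
    model=MODEL_NAME,
    args=grpo_args,
    train_dataset=grpo_ds,
    reward_funcs=[correctness_reward],
    peft_config=LORA_CFG,
)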
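
The "group relative" part of GRPO can be illustrated in isolation. The sketch below normalizes the rewards of one prompt's completions against their own group mean and spread. It is a simplified illustration, not TRL's internal code; the function name, the epsilon value, and the use of the population standard deviation are assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-4):
    """Normalize the rewards of one prompt's completions against their group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# One prompt, four sampled completions (num_generations=4), one of them correct.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 0.0])
print(advantages)

# If every completion earns the same reward, the group carries no learning signal.
flat = group_relative_advantages([0.0, 0.0, 0.0, 0.0])
print(flat)
```

The correct completion ends up with a positive advantage and the incorrect ones with negative advantages, so the update favors whichever samples beat their siblings. A group in which all rewards are equal yields zero advantages, which is why a binary reward needs prompts the model solves only some of the time.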

Configuring the DPOTrainer to align model outputs with human preference data. The beta parameter controls how strongly the policy is held close to the reference model: higher values penalize divergence more.

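from trl import DPOTrainer, DPOConfig

# MODEL_NAME, BF16_OK, LORA_CFG, and dpo_ds are assumed to be defined
# elsewhere in the pipeline; dpo_ds must provide chosen/rejected pairs.
dpo_args = DPOConfig(
    output_dir="./dpo_out",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,  # effective batch size of 4 per device
    learning_rate=5e-6,
    beta=0.1,  # strength of the pull toward the reference model
    bf16=BF16_OK,
)
# With a peft_config and no explicit ref_model, the frozen base weights
# (adapters disabled) act as the reference model.
dpo_trainer = DPOTrainer(
    model=MODEL_NAME,
    args=dpo_args,
    train_dataset=dpo_ds,
    peft_config=LORA_CFG,
)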
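
To make the role of beta concrete, the standard DPO objective for a single chosen/rejected pair can be written out by hand. This is a minimal sketch of the published loss formula, not TRL's implementation; the function name and the log-probability values are hypothetical.

```python
import math


def dpo_pair_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair, given summed sequence log-probabilities."""
    # How much more the policy favors each response than the reference does.
    chosen_logratio = pi_chosen - ref_chosen
    rejected_logratio = pi_rejected - ref_rejected
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)), written as log(1 + exp(-margin)).
    return math.log1p(math.exp(-margin))


# Policy identical to the reference: the margin is 0, so the loss is log(2).
loss_at_start = dpo_pair_loss(-12.0, -15.0, -12.0, -15.0)
print(loss_at_start)

# Policy has shifted probability toward the chosen response: the loss drops.
loss_after_shift = dpo_pair_loss(-10.0, -17.0, -12.0, -15.0)
print(loss_after_shift)
```

Because only log-ratios against the reference enter the loss, the policy is rewarded for widening the chosen/rejected gap relative to where it started, and beta scales how quickly that gap saturates the sigmoid.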

Practical Applications

  • Mathematical reasoning engines: Using GRPO to verify answers via regex-based correctness rewards while applying brevity rewards to discourage verbose ‘rambling’ anti-patterns.
  • Instruction following: Implementing SFT and DPO to transform raw base models into chat-ready assistants that adhere to specific formatting requirements and human preferences.
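
A brevity reward of the kind mentioned above might look like the following sketch. The source does not show its brevity reward, so the function name, the word limit, and the bonus size are all hypothetical; the signature follows the same `completions, **kwargs` convention as the correctness reward and assumes plain-text completions.

```python
def brevity_reward(completions, max_words=150, max_bonus=0.2, **kwargs):
    """Give a small bonus that shrinks linearly as a completion gets longer."""
    rewards = []
    for text in completions:
        n_words = len(text.split())
        # Full bonus for an empty completion, zero bonus at or beyond max_words.
        fraction_used = min(n_words / max_words, 1.0)
        rewards.append(max_bonus * (1.0 - fraction_used))
    return rewards


short = "The answer is 42"
rambling = "Let me think. " * 100 + "The answer is 42"
scores = brevity_reward([short, rambling])
print(scores)
```

Keeping the maximum bonus well below the 1.0 awarded for a correct answer means correctness still dominates the total reward. In TRL, such a function could be passed alongside the correctness check in the `reward_funcs` list.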
