Mastering LLM Post-Training: A Practical Guide to SFT, DPO, and GRPO with TRL

A Coding Guide on LLM Post-Training with TRL from Supervised Fine Tuning to DPO and GRPO Reasoning

The TRL (Transformer Reinforcement Learning) library provides a unified ecosystem for aligning large language models through iterative post-training stages. This guide demonstrates how to execute the full alignment pipeline on a single 16GB T4 GPU using LoRA and efficient memory management.

Why This Matters

Base models have broad knowledge, but they often lack the conversational structure and reasoning behavior that production environments require. This walkthrough addresses the gap between raw pre-trained weights and aligned assistants by implementing verifiable reward functions and preference optimization, and it shows that the full alignment pipeline can run on limited hardware without large infrastructure costs.

Key Insights

LoRA (Low-Rank Adaptation) enables fine-tuning models like Qwen2.5-0.5B on hardware with limited VRAM by targeting projection layers such as q_proj and v_proj.
Supervised Fine-Tuning (SFT) using TRL’s SFTTrainer establishes baseline conversational behavior by training on instruction datasets like Capybara.
Direct Preference Optimization (DPO) simplifies alignment by optimizing the policy directly on chosen/rejected pairs, so no separate reward model has to be trained or maintained.
Group Relative Policy Optimization (GRPO) improves model reasoning by generating multiple responses per prompt and rewarding each one relative to the others in its group, based on verifiable outcomes.
Deterministic reward functions, such as regex-based math correctness checks, allow models to improve reasoning behavior without human-in-the-loop feedback.
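
To make the LoRA point concrete, the arithmetic below counts the trainable parameters that rank-r adapters add to q_proj and v_proj. It is a minimal sketch: the rank of 16 is an assumed value, and the layer shapes are assumptions meant to resemble a small model such as Qwen2.5-0.5B (check the model's config.json for the real dimensions). Neither comes from the original guide.

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters LoRA adds to one linear layer.

    LoRA learns two small matrices, A (rank x d_in) and B (d_out x rank),
    so the count is rank * (d_in + d_out).
    """
    return rank * (d_in + d_out)


# Assumed shapes for illustration only (not taken from the guide).
HIDDEN = 896   # input width of q_proj and v_proj, output width of q_proj
KV_DIM = 128   # output width of v_proj under grouped-query attention
LAYERS = 24    # number of transformer blocks
RANK = 16      # assumed LoRA rank

per_layer = lora_params(HIDDEN, HIDDEN, RANK) + lora_params(HIDDEN, KV_DIM, RANK)
total = per_layer * LAYERS
print(f"LoRA parameters per layer: {per_layer:,}")
print(f"LoRA parameters in total:  {total:,}")
```

Under these assumptions the adapters add roughly one million trainable parameters, a small fraction of a model with about half a billion weights, which is why optimizer state and gradients fit in limited VRAM.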

Working Examples

Implementation of GRPO with a verifiable math correctness reward function using TRL.

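import re

from trl import GRPOTrainer, GRPOConfig

def correctness_reward(completions, **kwargs):
    # TRL passes extra dataset columns as keyword arguments; "answer" holds the gold answers.
    answers = kwargs["answer"]
    rewards = []
    for c, gold in zip(completions, answers):
        # Treat the last integer in the completion as the model's final answer.
        nums = re.findall(r"-?\d+", c)
        # Compare as strings so integer-typed gold answers still match.
        rewards.append(1.0 if nums and nums[-1] == str(gold).strip() else 0.0)
    return rewards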

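# BF16_OK, MODEL_NAME, grpo_ds, and LORA_CFG are assumed to be defined earlier in the guide.
grpo_args = GRPOConfig(
    output_dir="./grpo_out",
    learning_rate=1e-5,
    per_device_train_batch_size=2,
    # num_generations is the group size: completions sampled per prompt.
    # TRL checks that the effective batch size is divisible by this value,
    # so adjust the batch settings if your TRL version rejects this combination.
    num_generations=4,
    bf16=BF16_OK,
    max_steps=15,
)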
grpo_trainer = GRPOTrainer(
    model=MODEL_NAME,
    args=grpo_args,
    train_dataset=grpo_ds,
    reward_funcs=[correctness_reward],
    peft_config=LORA_CFG
)
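
The "relative rewards" idea behind GRPO can be illustrated without a GPU. The sketch below normalizes the rewards within one prompt's group of completions by subtracting the group mean and dividing by the group standard deviation, which is the commonly described form of the group-relative advantage. The function name, the epsilon value, and the use of the population standard deviation are illustrative assumptions; TRL's internal computation may differ in its details.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-4):
    """Score each completion relative to the other completions for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division finite when every completion earns the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four completions for one prompt: two correct (1.0) and two incorrect (0.0).
mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(mixed)

# If every completion is correct, no completion stands out from the group.
uniform = group_relative_advantages([1.0, 1.0, 1.0, 1.0])
print(uniform)
```

Correct completions receive positive advantages and incorrect ones negative advantages. When all completions in a group earn the same reward, every advantage is zero and that prompt contributes no learning signal, which is one reason a reward that separates good and bad answers matters.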

Configuring the DPOTrainer to align model outputs with human preference data, using the beta parameter to control how far the policy may diverge from the reference model.

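from trl import DPOTrainer, DPOConfig

# BF16_OK, MODEL_NAME, dpo_ds, and LORA_CFG are assumed to be defined earlier in the guide.
dpo_args = DPOConfig(
    output_dir="./dpo_out",
    per_device_train_batch_size=1,
    # Accumulate 4 micro-batches for an effective batch size of 4 per device.
    gradient_accumulation_steps=4,
    learning_rate=5e-6,
    # Higher beta keeps the policy closer to the reference model.
    beta=0.1,
    bf16=BF16_OK,
)
dpo_trainer = DPOTrainer(
    model=MODEL_NAME,
    args=dpo_args,
    # The dataset provides the chosen/rejected preference pairs.
    train_dataset=dpo_ds,
    peft_config=LORA_CFG,
)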
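
To show what beta does, the sketch below evaluates the standard DPO loss for a single preference pair from four sequence log-probabilities. The function name and the example numbers are hypothetical; in practice the trainer computes these log-probabilities from the policy and the reference model.

```python
import math


def dpo_loss(pol_chosen, pol_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one pair, given sequence log-probabilities.

    The margin measures how much more the policy prefers the chosen answer
    over the rejected one, compared with the reference model.
    """
    margin = (pol_chosen - ref_chosen) - (pol_rejected - ref_rejected)
    # -log(sigmoid(x)) is written as log(1 + exp(-x)).
    return math.log1p(math.exp(-beta * margin))


# Policy identical to the reference: margin is 0 and the loss equals log(2).
neutral = dpo_loss(-12.0, -15.0, -12.0, -15.0)

# Policy raises the chosen answer and lowers the rejected one: loss falls.
improved = dpo_loss(-10.0, -17.0, -12.0, -15.0)

# The same shift with a larger beta is weighted more heavily.
improved_high_beta = dpo_loss(-10.0, -17.0, -12.0, -15.0, beta=0.5)

print(neutral, improved, improved_high_beta)
```

Because beta scales the margin, a larger value makes each unit of divergence from the reference model count for more in the loss, while a smaller value requires a larger shift to reach the same loss.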

Practical Applications

Mathematical reasoning engines: Using GRPO to verify answers via regex-based correctness rewards while applying brevity rewards to discourage verbose ‘rambling’ anti-patterns.
Instruction following: Implementing SFT and DPO to transform raw base models into chat-ready assistants that adhere to specific formatting requirements and human preferences.
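
The brevity reward mentioned above is not shown in this summary, so the function below is a hypothetical version. It assumes completions arrive as plain strings, and the 60-word budget and linear penalty are illustrative choices rather than values from the original guide.

```python
def brevity_reward(completions, max_words=60, **kwargs):
    """Return 0.0 for completions within the word budget, down to -1.0 for long ones."""
    rewards = []
    for text in completions:
        n_words = len(text.split())
        if n_words <= max_words:
            rewards.append(0.0)
        else:
            # Penalty grows linearly with the overflow and is capped at -1.0.
            overflow = n_words - max_words
            rewards.append(-min(1.0, overflow / max_words))
    return rewards


short = "The answer is 42"
rambling = " ".join(["word"] * 90)
scores = brevity_reward([short, rambling])
print(scores)
```

A function with this shape could be listed next to a correctness reward in reward_funcs, so that a correct but rambling answer scores lower than a correct and concise one.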

References:

https://www.marktechpost.com/2026/05/01/a-coding-guide-on-llm-post-training-with-trl-from-supervised-fine-tuning-to-dpo-and-grpo-reasoning/
