Kimi’s K2 Open-Source LLM Achieves 71.3% on SWE-Bench Verified
These articles are AI-generated summaries. Please check the original sources for full details.
Kimi’s K2 Open-Source Language Model Supports Dynamic Resource Availability and New Optimizer
Kimi released K2, a Mixture-of-Experts (MoE) large language model with 1.04 trillion total parameters and 32 billion activated parameters, trained on 15.5 trillion tokens. The release includes MuonClip, a new optimizer that reportedly eliminates loss spikes during pre-training.
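To make the gap between total and activated parameters concrete, here is a minimal sketch of top-k expert routing, the usual mechanism behind MoE sparsity. The expert count (8), k=2, and the router scores are illustrative assumptions, not K2's actual configuration; only the 32 billion and 1.04 trillion figures come from the text above.

```python
import math

def top_k_route(router_logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    exps = [math.exp(router_logits[i]) for i in top]
    total = sum(exps)
    # Softmax over the selected experts only, so their weights sum to 1
    return [(i, e / total) for i, e in zip(top, exps)]

# Hypothetical router scores for one token over 8 experts
routes = top_k_route([0.1, 2.0, -1.0, 0.5, 1.5, 0.0, -0.3, 0.2], k=2)
# Only the two selected experts run for this token; the other six are skipped.

# Share of parameters active per token, using the figures quoted above
active_fraction = 32e9 / 1.04e12  # roughly 3%
```

Under these assumptions each token touches only a small slice of the network, which is why a model with over a trillion parameters can run with the per-token compute of a much smaller one.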
Why This Matters
Training very large LLMs is prone to instabilities such as loss spikes, which can demand extensive manual tuning and sometimes lead to failed training runs that reportedly cost millions in compute. Kimi’s MuonClip optimizer aims to address these instabilities, making the training of extremely large models like K2 more reliable and scalable and pushing the boundaries of open-source LLM capabilities.
Key Insights
- K2 Performance: Achieved 44.9% on Humanity’s Last Exam (HLE) with tools, 60.2% on BrowseComp, and 71.3% on SWE-Bench Verified.
- MuonClip Optimizer: Builds on the Muon optimizer with a QK-clip technique to stabilize training.
- Dynamic Resource Availability: Kimi designed a parallelism strategy that allows training on any number of nodes that is a multiple of 32, accommodating fluctuating compute resources.
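The QK-clip idea mentioned above can be sketched in a few lines. This is a minimal illustration, assuming QK-clip rescales the query and key projection weights whenever the largest attention logit exceeds a threshold; the threshold value, the tiny matrices, and all function names are hypothetical, and the real optimizer likely works per attention head and inside the training step.

```python
import math

def project(x, w):
    """Multiply a row vector x (length d_in) by a matrix w (d_in x d_out)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]

def max_attention_logit(inputs, w_q, w_k):
    """Largest raw query-key dot product over all token pairs."""
    queries = [project(x, w_q) for x in inputs]
    keys = [project(x, w_k) for x in inputs]
    return max(sum(a * b for a, b in zip(q, k)) for q in queries for k in keys)

def qk_clip(w_q, w_k, max_logit, tau):
    """Shrink W_q and W_k by sqrt(tau / max_logit) each when max_logit exceeds tau."""
    if max_logit <= tau:
        return w_q, w_k  # logits already within bounds; leave weights untouched
    scale = math.sqrt(tau / max_logit)
    return ([[v * scale for v in row] for row in w_q],
            [[v * scale for v in row] for row in w_k])

# Hypothetical 2x2 projection weights and two token embeddings
w_q = [[4.0, 0.0], [0.0, 4.0]]
w_k = [[4.0, 0.0], [0.0, 4.0]]
tokens = [[1.0, 2.0], [3.0, 1.0]]
tau = 100.0  # illustrative threshold, not a value from the source

before = max_attention_logit(tokens, w_q, w_k)
w_q, w_k = qk_clip(w_q, w_k, before, tau)
after = max_attention_logit(tokens, w_q, w_k)
```

Because both projections shrink by the square root of the ratio, every logit shrinks by the full ratio, so the largest logit lands back at the threshold while the relative ordering of attention scores is preserved.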
Working Example
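# Example of Quantization-Aware Training (QAT)
# (Conceptual: PyTorch's default eager-mode QAT simulates INT8, not INT4; actual implementation details are complex)
import torch
import torch.nn as nn
# Stand-in for a pre-trained model (a real checkpoint would be loaded here)
model = nn.Sequential(
    torch.quantization.QuantStub(),
    nn.Linear(16, 16),
    nn.ReLU(),
    torch.quantization.DeQuantStub(),
)
model.train()
# Enable QAT: use the QAT qconfig and prepare_qat, not the post-training variants
model.qconfig = torch.quantization.get_default_qat_qconfig('fbgemm')
torch.quantization.prepare_qat(model, inplace=True)
# Perform training with QAT
# ... training loop ...
# Convert to quantized model
model.eval()
torch.quantization.convert(model, inplace=True)
# Deploy quantized model for inference
# ... inference code ...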
Practical Applications
- Use Case: Kimi uses K2 for agentic tasks, where it can sustain 200-300 sequential tool calls with long-horizon planning.
- Pitfall: Quantization improves inference speed but can degrade quality for models that generate long outputs; Kimi addressed this with Quantization-Aware Training (QAT).
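The rounding error behind this pitfall is easy to see in isolation. The sketch below applies symmetric per-tensor INT4 "fake quantization" (quantize, then dequantize) to a few weights; the scheme, the function name, and the sample values are assumptions for illustration, since the source does not describe K2's actual quantization layout.

```python
def fake_quantize_int4(weights):
    """Round weights to a symmetric 4-bit grid and map them back to floats."""
    # Largest magnitude maps to code 7; fall back to 1.0 if all weights are zero
    scale = (max(abs(w) for w in weights) / 7) or 1.0
    # INT4 codes live in [-8, 7]
    codes = [max(-8, min(7, round(w / scale))) for w in weights]
    dequantized = [c * scale for c in codes]
    return codes, dequantized, scale

weights = [0.7, -0.32, 0.11, 0.02]  # hypothetical weight values
codes, dequantized, scale = fake_quantize_int4(weights)
errors = [abs(w - d) for w, d in zip(weights, dequantized)]
```

QAT typically inserts this quantize-then-dequantize step into the forward pass during training, so the model learns weights that tolerate the rounding instead of meeting it for the first time at deployment. Small per-weight errors like these can compound over long generations, which is consistent with the pitfall described above.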