AutoKernel: Automating GPU Kernel Optimization with LLM Agent Loops
These articles are AI-generated summaries. Please check the original sources for full details.
RightNow AI Releases AutoKernel: An Open-Source Framework that Applies an Autonomous Agent Loop to GPU Kernel Optimization for Arbitrary PyTorch Models
RightNow AI has released AutoKernel, an open-source framework that automates GPU kernel optimization through an autonomous LLM agent loop. The system can execute 300 to 400 experiments in a single overnight 10-hour run without human intervention.
Why This Matters
GPU kernel engineering is a high-skill bottleneck: choosing tile sizes, managing register pressure, and getting warp synchronization right takes years to master. According to KernelBench, frontier LLMs match PyTorch baselines in fewer than 20% of cases with one-shot generation. AutoKernel targets this gap by mechanizing the iterative edit-benchmark-verify cycle that expert engineers follow.
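The edit-benchmark-verify cycle can be pictured as a keep-or-revert loop. The following is a minimal sketch in plain Python, not AutoKernel's actual code: the names `agent_loop`, `propose`, `is_correct`, and `measure_ms` are hypothetical, and a made-up latency table stands in for a real GPU benchmark.

```python
from typing import Callable, List, Optional, Tuple, TypeVar

K = TypeVar("K")  # a kernel candidate: source text, a config, etc. (hypothetical)


def agent_loop(
    baseline: K,
    propose: Callable[[K, int], K],
    is_correct: Callable[[K], bool],
    measure_ms: Callable[[K], float],
    steps: int,
) -> Tuple[K, float, List[Tuple[K, Optional[float], str]]]:
    """Keep a candidate only if it passes verification and is faster than the best so far."""
    best, best_ms = baseline, measure_ms(baseline)
    log: List[Tuple[K, Optional[float], str]] = []
    for step in range(steps):
        cand = propose(best, step)  # edit
        if not is_correct(cand):  # verify before trusting any timing
            log.append((cand, None, "rejected"))
            continue
        ms = measure_ms(cand)  # benchmark
        if ms < best_ms:
            best, best_ms = cand, ms
            log.append((cand, ms, "kept"))
        else:
            log.append((cand, ms, "reverted"))
    return best, best_ms, log


# Toy stand-ins (assumptions): a "kernel" is just a block size, the latencies
# are invented, and block size 512 simulates a candidate that fails verification.
FAKE_LATENCY_MS = {32: 10.0, 64: 7.0, 128: 5.0, 256: 6.0, 512: 4.0}
TRIAL_BLOCKS = [64, 128, 256, 512]

best, best_ms, log = agent_loop(
    baseline=32,
    propose=lambda current, step: TRIAL_BLOCKS[step],
    is_correct=lambda block: block != 512,
    measure_ms=lambda block: FAKE_LATENCY_MS[block],
    steps=len(TRIAL_BLOCKS),
)
print(best, best_ms)
for entry in log:
    print(entry)
```

In this toy run the nominally fastest candidate (block size 512) is discarded because it fails verification, which is the point of gating every timing result on correctness. Checking correctness before benchmarking is a design choice of this sketch, not a documented detail of AutoKernel.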
Key Insights
- The KernelBench suite found that even frontier LLMs fail to match PyTorch baseline performance in over 80% of one-shot generation attempts.
- AutoKernel uses a five-stage correctness harness including numerical stability tests under adversarial inputs and determinism verification to catch race conditions.
- Optimization targets are prioritized using Amdahl’s Law: torch.profiler identifies the kernels that consume the largest share of total GPU runtime, and those are tackled first.
- A community-contributed Triton FP4 matmul kernel generated by the agent outperformed hand-optimized CUTLASS C++ code by up to 2.15x on H100 hardware.
- The system uses a 909-line instruction document (program.md) that encodes a six-tier optimization playbook, ranging from block size tuning to architecture-specific features such as the Tensor Memory Accelerator (TMA) on Hopper.
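The Amdahl's Law prioritization mentioned above can be made concrete with a short calculation. This is a sketch under stated assumptions: the function names and the per-kernel timings are hypothetical, standing in for numbers that a profiler run would provide.

```python
from typing import Dict, List, Tuple


def end_to_end_speedup(fraction: float, kernel_speedup: float) -> float:
    """Amdahl's Law: overall speedup when a part taking `fraction` of the
    runtime becomes `kernel_speedup` times faster."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be in [0, 1]")
    if kernel_speedup <= 0.0:
        raise ValueError("kernel_speedup must be positive")
    return 1.0 / ((1.0 - fraction) + fraction / kernel_speedup)


def rank_by_runtime_share(kernel_ms: Dict[str, float]) -> List[Tuple[str, float]]:
    """Sort kernels by their share of total measured GPU time, largest first."""
    total = sum(kernel_ms.values())
    shares = [(name, ms / total) for name, ms in kernel_ms.items()]
    return sorted(shares, key=lambda item: item[1], reverse=True)


# Hypothetical profile: GPU milliseconds per kernel (invented numbers).
profile_ms = {"matmul": 60.0, "rmsnorm": 25.0, "softmax": 10.0, "other": 5.0}

ranking = rank_by_runtime_share(profile_ms)
for name, share in ranking:
    overall = end_to_end_speedup(share, 2.0)
    print(f"{name}: {share:.0%} of runtime, 2x kernel -> {overall:.2f}x end to end")
```

With these invented numbers, doubling the speed of the kernel that takes 60% of the runtime yields about 1.43x end to end, while the same 2x on a kernel that takes 10% yields only about 1.05x. That asymmetry is why profiling the whole model first matters.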
Practical Applications
- Use case: Optimizing memory-bound Transformer kernels; AutoKernel reached 2,788 GB/s (83% of peak bandwidth) for RMSNorm on NVIDIA H100. Pitfall: Optimizing kernels in isolation, without profiling the full model, can yield negligible end-to-end performance gains.
- Use case: Multi-backend deployment; the orchestrator supports both Triton for rapid JIT iteration and CUDA C++ for low-level warp primitives. Pitfall: Relying solely on torch.compile max-autotune, which AutoKernel outperformed in 12 of 16 tested configurations.
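For memory-bound kernels such as RMSNorm, the bandwidth figure quoted above is the natural metric: bytes moved divided by elapsed time, compared against the hardware peak. The sketch below shows that arithmetic. The helper names, the tensor shape, the timing, and the peak value of about 3,350 GB/s (an approximate figure for H100 SXM HBM3) are assumptions for illustration, not details from the AutoKernel release.

```python
def rmsnorm_min_bytes(rows: int, hidden: int, bytes_per_elem: int) -> int:
    """Minimum traffic assumed here: read the input once, write the output
    once, and read the weight vector once."""
    activations = rows * hidden * bytes_per_elem
    weight = hidden * bytes_per_elem
    return 2 * activations + weight


def achieved_bandwidth_gbs(bytes_moved: int, seconds: float) -> float:
    """Effective bandwidth in GB/s (1 GB = 1e9 bytes)."""
    if seconds <= 0.0:
        raise ValueError("seconds must be positive")
    return bytes_moved / seconds / 1e9


def utilization(achieved_gbs: float, peak_gbs: float) -> float:
    """Fraction of peak memory bandwidth actually achieved."""
    return achieved_gbs / peak_gbs


H100_PEAK_GBS = 3350.0  # assumption: approximate H100 SXM peak; check your exact SKU

# Reported figure from the article, checked against the assumed peak.
print(f"{utilization(2788.0, H100_PEAK_GBS):.0%} of peak")

# Hypothetical measurement: an 8192 x 8192 bf16 RMSNorm timed at 100 microseconds.
traffic = rmsnorm_min_bytes(rows=8192, hidden=8192, bytes_per_elem=2)
print(f"{achieved_bandwidth_gbs(traffic, 100e-6):.0f} GB/s")
```

Under the assumed peak, 2,788 GB/s works out to roughly 83%, consistent with the figure reported for RMSNorm. A kernel that rereads its input would move more bytes than this minimum, so the same formula would overstate its efficiency unless the real traffic is counted.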