AutoKernel: Automating GPU Kernel Optimization with LLM Agent Loops

RightNow AI Releases AutoKernel: An Open-Source Framework that Applies an Autonomous Agent Loop to GPU Kernel Optimization for Arbitrary PyTorch Models

RightNow AI has released AutoKernel, an open-source framework that automates GPU kernel optimization through an autonomous LLM agent loop. The system can execute 300 to 400 experiments in a single overnight 10-hour run without human intervention.

Why This Matters

GPU kernel engineering is a high-skill bottleneck: balancing tile sizes, register pressure, and warp synchronization by hand takes years to master. According to KernelBench, frontier LLMs match PyTorch baselines in fewer than 20% of cases with one-shot generation. AutoKernel aims to close this gap by mechanizing the iterative edit-benchmark-verify cycle that expert engineers follow.
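The edit-benchmark-verify cycle can be illustrated with a minimal, pure-Python sketch. This is not AutoKernel's code: the names `agent_loop`, `reference_kernel`, and `wall_clock_seconds` are hypothetical, the "kernels" are ordinary Python functions, and the candidate list stands in for edits an LLM agent would propose. The point is only the control flow: verify correctness first, benchmark second, keep a change only if it is faster.

```python
import math
import time
from typing import Callable, List, Sequence, Tuple

Kernel = Callable[[Sequence[float]], float]


def reference_kernel(xs: Sequence[float]) -> float:
    """Trusted baseline (hypothetical): sum of squares."""
    return math.fsum(x * x for x in xs)


def wall_clock_seconds(fn: Kernel, xs: Sequence[float], repeats: int = 5) -> float:
    """Median wall-clock time of fn(xs) over a few repeats."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(xs)
        samples.append(time.perf_counter() - start)
    samples.sort()
    return samples[len(samples) // 2]


def agent_loop(
    baseline: Kernel,
    candidates: List[Tuple[str, Kernel]],
    xs: Sequence[float],
    measure: Callable[[Kernel, Sequence[float]], float] = wall_clock_seconds,
    rel_tol: float = 1e-9,
) -> Tuple[Kernel, List[Tuple[str, str]]]:
    """Try each candidate once; keep it only if it is correct and faster."""
    expected = reference_kernel(xs)
    best, best_cost = baseline, measure(baseline, xs)
    log: List[Tuple[str, str]] = []
    for name, candidate in candidates:
        # Verify: a fast but wrong kernel is rejected before it is timed.
        if not math.isclose(candidate(xs), expected, rel_tol=rel_tol):
            log.append((name, "rejected: wrong result"))
            continue
        # Benchmark: keep the edit only if it beats the current best.
        cost = measure(candidate, xs)
        if cost < best_cost:
            best, best_cost = candidate, cost
            log.append((name, "kept"))
        else:
            log.append((name, "reverted: not faster"))
    return best, log
```

Passing `measure` as a parameter is a design choice of this sketch: it lets the timing source be swapped for a GPU timer or, in tests, a deterministic stub.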

Key Insights

In the KernelBench suite, even frontier LLMs fell short of PyTorch baseline performance in more than 80% of one-shot generation attempts.
AutoKernel uses a five-stage correctness harness including numerical stability tests under adversarial inputs and determinism verification to catch race conditions.
Optimization targets are prioritized using Amdahl’s Law: torch.profiler identifies the kernels that consume the largest share of total GPU runtime, since those bound the achievable end-to-end speedup.
A community-contributed Triton FP4 matmul kernel generated by the agent outperformed hand-optimized CUTLASS C++ code by up to 2.15x on H100 hardware.
The system uses a 909-line instruction document (program.md) that encodes a six-tier optimization playbook, ranging from block size tuning to architecture-specific features such as TMA (Tensor Memory Accelerator) on Hopper.
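The Amdahl's Law prioritization can be made concrete with a short sketch. If a kernel accounts for a fraction p of total runtime and is sped up by a factor s, the end-to-end speedup is 1 / ((1 - p) + p / s). The helper names and the per-kernel timings below are hypothetical; in practice the timings would come from a profiler such as torch.profiler rather than being hard-coded.

```python
from typing import Dict, List, Tuple


def amdahl_speedup(fraction: float, kernel_speedup: float) -> float:
    """End-to-end speedup when `fraction` of runtime gets `kernel_speedup`x faster."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be in [0, 1]")
    if kernel_speedup <= 0.0:
        raise ValueError("kernel_speedup must be positive")
    return 1.0 / ((1.0 - fraction) + fraction / kernel_speedup)


def rank_kernels(kernel_times_ms: Dict[str, float]) -> List[Tuple[str, float]]:
    """Return (kernel, share of total runtime), largest share first."""
    total = sum(kernel_times_ms.values())
    shares = {name: t / total for name, t in kernel_times_ms.items()}
    return sorted(shares.items(), key=lambda item: item[1], reverse=True)


# Hypothetical profile: GPU milliseconds per kernel over one forward pass.
profile_ms = {"matmul": 60.0, "rmsnorm": 25.0, "softmax": 10.0, "gelu": 5.0}

for name, share in rank_kernels(profile_ms):
    gain = amdahl_speedup(share, kernel_speedup=2.0)
    print(f"{name}: {share:.0%} of runtime, 2x kernel speedup gives {gain:.2f}x overall")
```

With these made-up numbers, doubling the speed of the kernel holding 60% of runtime yields roughly a 1.43x overall gain, while doubling the 5% kernel yields under 1.03x, which is why the largest consumers are tuned first.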

Practical Applications

Use case: Optimizing memory-bound Transformer kernels; AutoKernel reached 2,788 GB/s (83% of peak bandwidth) for RMSNorm on NVIDIA H100. Pitfall: Optimizing kernels in isolation, without profiling the full model, can yield negligible end-to-end performance gains.
Use case: Multi-backend deployment; the orchestrator supports both Triton for rapid JIT iteration and CUDA C++ for low-level warp primitives. Pitfall: Relying solely on torch.compile in max-autotune mode, which AutoKernel outperformed in 12 of 16 tested configurations.
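For memory-bound kernels such as RMSNorm, the bandwidth figure above is the natural yardstick: achieved bandwidth is bytes moved divided by kernel time, compared against the device peak. The sketch below shows that arithmetic under a simplified traffic model (each input element read once, each output element written once, the weight vector read once). The function names, the tensor shape, and the timing are hypothetical; only the 2,788 GB/s and 83% figures come from the article.

```python
def rmsnorm_bytes(rows: int, hidden: int, dtype_bytes: int = 2) -> int:
    """Bytes moved under the simplified model: read x, write y, read weight once."""
    return (2 * rows * hidden + hidden) * dtype_bytes


def achieved_bandwidth_gbs(bytes_moved: float, seconds: float) -> float:
    """Achieved memory bandwidth in GB/s (1 GB = 1e9 bytes)."""
    if seconds <= 0.0:
        raise ValueError("seconds must be positive")
    return bytes_moved / seconds / 1e9


def implied_peak_gbs(achieved_gbs: float, fraction_of_peak: float) -> float:
    """Peak bandwidth implied by an achieved figure and its share of peak."""
    if not 0.0 < fraction_of_peak <= 1.0:
        raise ValueError("fraction_of_peak must be in (0, 1]")
    return achieved_gbs / fraction_of_peak


# Figures quoted in the article: 2,788 GB/s at 83% of peak.
peak = implied_peak_gbs(2788.0, 0.83)
print(f"implied peak bandwidth: {peak:.0f} GB/s")

# Hypothetical measurement: fp16 tensor of shape (8192, 4096), kernel time 48 microseconds.
moved = rmsnorm_bytes(rows=8192, hidden=4096, dtype_bytes=2)
bandwidth = achieved_bandwidth_gbs(moved, seconds=48e-6)
print(f"achieved: {bandwidth:.0f} GB/s ({bandwidth / peak:.0%} of implied peak)")
```

A kernel already near peak bandwidth has little headroom left, so this check also indicates when further tuning of that kernel is unlikely to pay off.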

References:

https://www.marktechpost.com/2026/04/06/rightnow-ai-releases-autokernel-an-open-source-framework-that-applies-an-autonomous-agent-loop-to-gpu-kernel-optimization-for-arbitrary-pytorch-models/
