AutoKernel: Automating GPU Kernel Optimization with LLM Agent Loops

These articles are AI-generated summaries. Please check the original sources for full details.

RightNow AI Releases AutoKernel: An Open-Source Framework that Applies an Autonomous Agent Loop to GPU Kernel Optimization for Arbitrary PyTorch Models

RightNow AI has released AutoKernel, an open-source framework that automates GPU kernel optimization through an autonomous LLM agent loop. The system can execute 300 to 400 experiments in a single overnight 10-hour run without human intervention.

Why This Matters

GPU kernel engineering is a high-skill bottleneck: hand-tuning tile sizes, managing register pressure, and getting warp synchronization right takes years to master. According to KernelBench, frontier LLMs match PyTorch baselines in fewer than 20% of cases when generating kernels in one shot. AutoKernel addresses this gap by mechanizing the iterative edit-benchmark-verify cycle that expert engineers use.
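The edit-benchmark-verify cycle can be pictured as a simple keep-the-best loop. The following is a minimal sketch, not AutoKernel's actual code: the function name optimize_kernel and the three callables (propose_edit, is_correct, benchmark_ms) are hypothetical stand-ins for the agent's edit step, the correctness harness, and the timing harness. Checking correctness before timing is also an assumption made here to avoid benchmarking wrong kernels.

```python
from typing import Callable, Tuple


def optimize_kernel(
    baseline_src: str,
    propose_edit: Callable[[str], str],    # hypothetical: agent proposes a modified kernel
    is_correct: Callable[[str], bool],     # hypothetical: correctness harness
    benchmark_ms: Callable[[str], float],  # hypothetical: timing harness, in milliseconds
    max_experiments: int,
) -> Tuple[str, float]:
    """Return the fastest kernel source that passed verification, with its time."""
    best_src = baseline_src
    best_ms = benchmark_ms(baseline_src)
    for _ in range(max_experiments):
        candidate = propose_edit(best_src)
        if not is_correct(candidate):
            continue  # a wrong kernel is rejected no matter how fast it is
        candidate_ms = benchmark_ms(candidate)
        if candidate_ms < best_ms:
            # Keep the improvement; otherwise the next edit starts from the old best.
            best_src, best_ms = candidate, candidate_ms
    return best_src, best_ms


if __name__ == "__main__":
    # Toy stand-in: the "kernel" is just a block size, and 128 is the sweet spot.
    src, ms = optimize_kernel(
        "16",
        propose_edit=lambda s: str(int(s) * 2),
        is_correct=lambda s: int(s) <= 1024,
        benchmark_ms=lambda s: abs(int(s) - 128) + 1.0,
        max_experiments=10,
    )
    print(src, ms)
```

For scale, 300 to 400 experiments in a 10-hour run works out to roughly 90 to 120 seconds per iteration of a loop like this.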

Key Insights

  • The KernelBench suite found that even frontier LLMs fail to match PyTorch baseline performance in over 80% of one-shot generation attempts.
  • AutoKernel uses a five-stage correctness harness including numerical stability tests under adversarial inputs and determinism verification to catch race conditions.
  • Optimization targets are prioritized using Amdahl’s Law: torch.profiler identifies the kernels that consume the largest share of total GPU runtime, and those are optimized first.
  • A community-contributed Triton FP4 matmul kernel generated by the agent outperformed hand-optimized CUTLASS C++ code by up to 2.15x on H100 hardware.
  • The system utilizes a 909-line instruction document (program.md) that encodes a six-tier optimization playbook ranging from block size tuning to architecture-specific TMA on Hopper.
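The Amdahl's Law prioritization mentioned above can be made concrete with a few lines of arithmetic. This is an illustrative sketch only: the function names and the profile numbers are hypothetical, and the per-kernel timings would in practice come from a profiler such as torch.profiler rather than a hand-written dictionary.

```python
from typing import Dict, List


def end_to_end_speedup(runtime_fraction: float, kernel_speedup: float) -> float:
    """Amdahl's Law: overall speedup when a kernel that takes `runtime_fraction`
    of total runtime becomes `kernel_speedup` times faster."""
    if not 0.0 <= runtime_fraction <= 1.0:
        raise ValueError("runtime_fraction must be between 0 and 1")
    if kernel_speedup <= 0.0:
        raise ValueError("kernel_speedup must be positive")
    return 1.0 / ((1.0 - runtime_fraction) + runtime_fraction / kernel_speedup)


def rank_targets(gpu_time_ms: Dict[str, float]) -> List[str]:
    """Order kernels by GPU time, largest first."""
    return sorted(gpu_time_ms, key=lambda name: gpu_time_ms[name], reverse=True)


if __name__ == "__main__":
    # Hypothetical profile: GPU milliseconds per kernel for one forward pass.
    profile = {"matmul": 60.0, "rmsnorm": 30.0, "softmax": 10.0}
    total = sum(profile.values())
    for name in rank_targets(profile):
        share = profile[name] / total
        # What a 2x faster version of this one kernel would buy end to end.
        print(f"{name}: {share:.0%} of runtime, 2x kernel -> "
              f"{end_to_end_speedup(share, 2.0):.2f}x overall")
```

The same doubling of kernel speed yields very different end-to-end gains depending on the kernel's share of runtime, which is why the profile drives the order of work.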

Practical Applications

  • Use case: Optimizing Transformer memory-bound kernels; AutoKernel reached 2,788 GB/s (83% of peak bandwidth) for RMSNorm on NVIDIA H100. Pitfall: Optimizing kernels in isolation without profiling the full model leads to negligible end-to-end performance gains.
  • Use case: Multi-backend deployment; the orchestrator supports both Triton for rapid JIT iteration and CUDA C++ for low-level warp primitives. Pitfall: Relying solely on torch.compile max-autotune, which AutoKernel outperformed in 12 of 16 tested configurations.
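For memory-bound kernels such as RMSNorm, the headline metric is achieved bandwidth as a fraction of the hardware peak. The sketch below shows that arithmetic under stated assumptions: the helper names are hypothetical, and the byte count and timing in the usage example are made up for illustration. Only the 2,788 GB/s and 83% figures come from the text above.

```python
def achieved_bandwidth_gbs(bytes_moved: float, elapsed_s: float) -> float:
    """Effective bandwidth in GB/s (1 GB = 1e9 bytes) for a memory-bound kernel."""
    if elapsed_s <= 0.0:
        raise ValueError("elapsed_s must be positive")
    return bytes_moved / elapsed_s / 1e9


def fraction_of_peak(achieved_gbs: float, peak_gbs: float) -> float:
    """Share of peak memory bandwidth that the kernel actually reached."""
    return achieved_gbs / peak_gbs


def implied_peak_gbs(achieved_gbs: float, fraction: float) -> float:
    """Peak bandwidth implied by an achieved figure and its stated fraction of peak."""
    return achieved_gbs / fraction


if __name__ == "__main__":
    # Figures quoted above: 2,788 GB/s reported as 83% of peak.
    peak = implied_peak_gbs(2788.0, 0.83)
    print(f"implied peak: {peak:.0f} GB/s")

    # Made-up measurement: a kernel that reads and writes 2 GB in total in 1 ms.
    bw = achieved_bandwidth_gbs(2e9, 1e-3)
    print(f"achieved: {bw:.0f} GB/s, {fraction_of_peak(bw, peak):.0%} of implied peak")
```

Measuring bytes moved against wall-clock time in this way gives a hardware-relative ceiling, which is a more stable target for a memory-bound kernel than a speedup over an arbitrary baseline.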
