Moonshot AI Releases FlashKDA: 2.22x Faster Prefill for Kimi Delta Attention

Moonshot AI Open-Sources FlashKDA: CUTLASS Kernels for Kimi Delta Attention with Variable-Length Batching and H20 Benchmarks

Moonshot AI has released FlashKDA, a high-performance CUTLASS-based kernel implementation of the Kimi Delta Attention (KDA) mechanism. The library delivers prefill speedups of 1.72x to 2.22x over the flash-linear-attention baseline on NVIDIA H20 GPUs.

Why This Matters

Standard softmax attention scales quadratically with sequence length, which makes long-context processing prohibitively expensive. Linear attention mechanisms like KDA scale linearly and, in the hybrid Kimi Linear architecture, cut KV cache usage by up to 75%. Reaching production-grade performance, however, requires highly optimized GPU kernels that exploit hardware features such as Tensor Cores. FlashKDA bridges this gap with an optimized CUTLASS implementation that supports variable-length batching, making generation at 1-million-token context lengths practical without the overhead of full attention.
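The 75% figure can be sanity-checked with simple layer counting. The sketch below assumes the 3:1 KDA-to-MLA layer ratio reported for Kimi Linear, and further assumes that only the full-attention (MLA) layers keep a cache that grows with every token, each at the same per-token cost, while KDA layers keep a fixed-size state. The function name is hypothetical and exists only for this illustration.

```python
def kv_cache_fraction(linear_layers: int, full_layers: int) -> float:
    """Fraction of an all-full-attention KV cache that a hybrid stack still needs.

    Assumptions (not taken from the FlashKDA source): only full-attention
    layers keep a per-token cache, every such layer costs the same per token,
    and linear-attention layers keep a fixed-size state instead.
    """
    total = linear_layers + full_layers
    if total <= 0:
        raise ValueError("need at least one layer")
    return full_layers / total


# 3 KDA layers for every 1 MLA layer
remaining = kv_cache_fraction(linear_layers=3, full_layers=1)
reduction = 1.0 - remaining
print(f"cache kept: {remaining:.0%}, reduction: {reduction:.0%}")
# prints "cache kept: 25%, reduction: 75%"
```

Under these assumptions, one layer in four still carries a growing cache, which matches the reported reduction.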

Key Insights

FlashKDA achieves up to a 2.22x prefill speedup over the flash-linear-attention baseline on NVIDIA H20 GPUs, measured on variable-length batches of uniform-length sequences, as of April 2026.
Kimi Delta Attention (KDA) refines Gated DeltaNet with channel-wise gating for more effective finite-state RNN memory use.
The Kimi Linear model uses a 3:1 KDA-to-MLA layer ratio to achieve up to 6x higher decoding throughput at 1 million context length.
The kernel is built on CUTLASS for SM90+ hardware, targeting NVIDIA Hopper GPUs such as the H100 and H20.
FlashKDA is auto-dispatched from the flash-linear-attention library, as tracked in GitHub PR #852.
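To make "channel-wise gating" and "finite-state RNN memory" concrete, here is a minimal single-head reference of one recurrent step, written in plain Python rather than as a GPU kernel. It assumes the update S_t = (I - beta * k k^T) Diag(alpha) S_{t-1} + beta * k v^T with readout o_t = S_t^T q_t, where alpha is a per-channel decay vector (a scalar gate would correspond to Gated DeltaNet). This form, the function name, and the argument layout are assumptions for illustration and are not taken from the FlashKDA source.

```python
def kda_step(S, q, k, v, alpha, beta):
    """One token of a gated delta-rule recurrence (illustrative sketch only).

    S     : d_k x d_v state matrix as a list of rows
    q, k  : length-d_k vectors; v: length-d_v vector
    alpha : length-d_k decay vector, one gate per key channel
    beta  : scalar write strength
    """
    d_k, d_v = len(S), len(S[0])
    # 1) Channel-wise decay: each key channel i forgets at its own rate.
    S = [[alpha[i] * S[i][j] for j in range(d_v)] for i in range(d_k)]
    # 2) Delta rule: read what the state currently predicts for key k,
    #    then write only the correction (v - prediction).
    pred = [sum(k[i] * S[i][j] for i in range(d_k)) for j in range(d_v)]
    S = [[S[i][j] + beta * k[i] * (v[j] - pred[j]) for j in range(d_v)]
         for i in range(d_k)]
    # 3) Read out with the query.
    o = [sum(q[i] * S[i][j] for i in range(d_k)) for j in range(d_v)]
    return S, o


S = [[0.0, 0.0], [0.0, 0.0]]      # state size never grows with sequence length
key = [1.0, 0.0]
no_decay = [1.0, 1.0]
S, o1 = kda_step(S, key, key, [2.0, 3.0], no_decay, 1.0)
S, o2 = kda_step(S, key, key, [5.0, 7.0], no_decay, 1.0)
print(o1, o2)  # [2.0, 3.0] [5.0, 7.0]
```

Writing a second value under the same key overwrites the first instead of accumulating, and the state stays d_k x d_v regardless of how many tokens are processed. That fixed-size state is why per-token cost does not grow with context length.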

Working Examples

To install FlashKDA and its submodule dependencies from source:

git clone https://github.com/MoonshotAI/FlashKDA.git flash-kda
cd flash-kda
git submodule update --init --recursive
pip install -v .
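Before building, it can help to confirm that the machine meets the requirements mentioned in this article (SM90+ hardware and CUDA 12.9+). The following is a minimal sketch: the helper name is hypothetical, and the optional PyTorch probe is an assumption about the environment, not a documented FlashKDA step. Note that torch.version.cuda reports the CUDA version PyTorch was built against, which may differ from the toolkit used to compile the kernels.

```python
def meets_requirements(compute_capability, cuda_version):
    """Return True if the GPU is SM90+ and the CUDA version is 12.9+.

    compute_capability: (major, minor) tuple such as (9, 0)
    cuda_version: string such as "12.9"
    """
    major_minor = tuple(int(part) for part in cuda_version.split(".")[:2])
    return tuple(compute_capability) >= (9, 0) and major_minor >= (12, 9)


if __name__ == "__main__":
    try:
        import torch  # optional; used only to query the local GPU
    except ImportError:
        torch = None

    if torch is not None and torch.cuda.is_available() and torch.version.cuda:
        cap = torch.cuda.get_device_capability()
        ok = meets_requirements(cap, torch.version.cuda)
        print(f"capability={cap}, cuda={torch.version.cuda}, ok={ok}")
    else:
        print("No CUDA-enabled PyTorch found; cannot check the GPU.")
```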

Practical Applications

The Kimi Linear model architecture uses KDA to reduce KV cache usage by up to 75% during long-sequence generation. Pitfall: The current kernel supports only head dimensions K=V=128; any other head dimension will fail.
Production inference systems use the cu_seqlens parameter to pack multiple variable-length requests into a single kernel call for high-throughput serving. Pitfall: The kernel requires SM90 (Hopper) or newer hardware and CUDA 12.9+; it will not run on older GPUs or older toolkits.
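The packing idea behind cu_seqlens can be sketched without a GPU. In the usual convention for variable-length kernels, cu_seqlens holds cumulative sequence lengths with a leading zero, so request i occupies positions cu_seqlens[i] to cu_seqlens[i+1] of the flattened batch. The helper names below are hypothetical, and the exact tensor type and device that FlashKDA expects are not covered here; real kernels typically take this as an integer tensor on the GPU.

```python
from itertools import accumulate


def build_cu_seqlens(seq_lens):
    """Cumulative sequence lengths with a leading 0, e.g. [3, 1, 2] -> [0, 3, 4, 6]."""
    if any(n <= 0 for n in seq_lens):
        raise ValueError("every sequence must contain at least one token")
    return [0, *accumulate(seq_lens)]


def pack(requests):
    """Concatenate per-request token lists into one flat list plus offsets."""
    cu_seqlens = build_cu_seqlens([len(r) for r in requests])
    flat = [token for request in requests for token in request]
    return flat, cu_seqlens


requests = [[11, 12, 13], [21], [31, 32]]
flat, cu_seqlens = pack(requests)
print(cu_seqlens)  # [0, 3, 4, 6]

# Recover the second request from the packed batch.
second = flat[cu_seqlens[1]:cu_seqlens[2]]
print(second)  # [21]
```

Packing this way avoids padding every request to the longest sequence, so no compute is spent on pad tokens.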
