Moonshot AI Releases FlashKDA: 2.22x Faster Prefill for Kimi Delta Attention
These articles are AI-generated summaries. Please check the original sources for full details.
Moonshot AI Open-Sources FlashKDA: CUTLASS Kernels for Kimi Delta Attention with Variable-Length Batching and H20 Benchmarks
Moonshot AI has released FlashKDA, a high-performance CUTLASS-based kernel implementation of the Kimi Delta Attention (KDA) mechanism. The library delivers prefill speedups of 1.72x to 2.22x over the flash-linear-attention baseline on NVIDIA H20 GPUs.
Why This Matters
Standard softmax attention has quadratic complexity in sequence length, which makes long-context processing prohibitively expensive. Linear attention mechanisms like KDA scale linearly instead, and the Kimi Linear model built on KDA reports a 75% reduction in KV cache usage. Reaching production-grade performance, however, requires highly optimized GPU kernels that exploit hardware features such as Tensor Cores. FlashKDA bridges this gap with an optimized CUTLASS implementation that supports variable-length batching, enabling efficient generation at a 1-million-token context length without the overhead of full attention.
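The 75% figure follows from simple layer counting once the 3:1 KDA-to-MLA ratio reported for Kimi Linear is taken into account. The sketch below is a back-of-the-envelope check, not a measurement: it assumes every MLA layer holds the same per-token cache, and that KDA layers keep only a fixed-size recurrent state that does not grow with sequence length. The function name is illustrative.

```python
def kv_cache_fraction(kda_layers: int, mla_layers: int) -> float:
    """Fraction of layers that still hold a KV cache growing with sequence length.

    Assumption: only the MLA (full-attention) layers keep a per-token cache;
    KDA layers keep a constant-size state, which is ignored here.
    """
    return mla_layers / (kda_layers + mla_layers)


# 3 KDA layers for every 1 MLA layer, as cited for Kimi Linear.
fraction_kept = kv_cache_fraction(kda_layers=3, mla_layers=1)
reduction = 1.0 - fraction_kept
print(f"KV cache kept: {fraction_kept:.0%}, reduction: {reduction:.0%}")
# prints "KV cache kept: 25%, reduction: 75%"
```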
Key Insights
- FlashKDA achieves up to a 2.22x prefill speedup over the flash-linear-attention baseline on NVIDIA H20 GPUs, reached in the uniform variable-length benchmark case, as of April 2026.
- Kimi Delta Attention (KDA) refines Gated DeltaNet with channel-wise gating for more effective finite-state RNN memory use.
- The Kimi Linear model uses a 3:1 KDA-to-MLA ratio to achieve 6x higher decoding throughput at 1 million context length.
- The kernel is built on CUTLASS for SM90+ hardware, specifically targeting NVIDIA Hopper architectures like H100 and H20.
- The flash-linear-attention library auto-dispatches to FlashKDA; the integration is tracked in GitHub PR #852.
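To make "channel-wise gating" concrete, here is a minimal single-head reference of a gated delta-rule recurrence in plain Python. It is a sketch under assumptions: the update form (per-channel decay of the state, followed by a delta-rule correction toward the new value) is one common way to write this family of recurrences and is not taken from the article, and it has nothing in common with FlashKDA's actual chunked CUTLASS kernel. The names `kda_step`, `alpha`, and `beta` are illustrative.

```python
def kda_step(S, q, k, v, alpha, beta):
    """One recurrent step on a d_k x d_v state matrix S (a list of rows).

    alpha: per-channel decay in [0, 1], length d_k (the channel-wise gate).
    beta:  scalar write strength in [0, 1].
    """
    d_k, d_v = len(k), len(v)
    # 1) Channel-wise decay: row i of the state is scaled by its own alpha[i].
    #    A scalar gate would scale every row by the same factor.
    S = [[alpha[i] * S[i][j] for j in range(d_v)] for i in range(d_k)]
    # 2) Delta rule: read what the decayed state currently returns for key k ...
    pred = [sum(k[i] * S[i][j] for i in range(d_k)) for j in range(d_v)]
    #    ... and move that prediction toward v by a fraction beta.
    S = [[S[i][j] + beta * k[i] * (v[j] - pred[j]) for j in range(d_v)]
         for i in range(d_k)]
    # 3) Output: query the updated state.
    o = [sum(q[i] * S[i][j] for i in range(d_k)) for j in range(d_v)]
    return S, o


S = [[0.0, 0.0], [0.0, 0.0]]
# Step 1: write v=[2, 3] under key channel 0, no decay.
S, o1 = kda_step(S, q=[1.0, 0.0], k=[1.0, 0.0], v=[2.0, 3.0],
                 alpha=[1.0, 1.0], beta=1.0)
# Step 2: decay only channel 0 by half while writing under key channel 1.
S, o2 = kda_step(S, q=[1.0, 0.0], k=[0.0, 1.0], v=[1.0, 1.0],
                 alpha=[0.5, 1.0], beta=1.0)
print(o1)  # [2.0, 3.0]
print(o2)  # [1.0, 1.5]
```

The second output shows the point of the per-channel gate: the memory stored under channel 0 has faded to half strength, while the value just written under channel 1 is kept intact in the same fixed-size state.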
Working Examples
Installation steps for the FlashKDA library and its dependencies.
git clone https://github.com/MoonshotAI/FlashKDA.git flash-kda && cd flash-kda && git submodule update --init --recursive && pip install -v .
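Before building, it can help to check the environment against the constraints cited in this article: SM90 or newer hardware, CUDA 12.9+, and head dimensions K=V=128. The helper below is a hypothetical preflight check written for illustration; it is not part of FlashKDA, and the thresholds are taken from the article rather than from the library's own build scripts.

```python
def parse_version(text: str) -> tuple:
    """Turn '12.9' or '12.9.1' into (12, 9) so versions compare numerically."""
    return tuple(int(part) for part in text.split(".")[:2])


def check_flashkda_requirements(compute_capability, cuda_version,
                                head_dim_k, head_dim_v):
    """Return a list of human-readable problems; an empty list means OK.

    Hypothetical helper: thresholds mirror the requirements quoted in the
    article (SM90+, CUDA 12.9+, K=V=128).
    """
    problems = []
    if tuple(compute_capability) < (9, 0):
        problems.append(f"GPU compute capability {compute_capability} is below SM90")
    if parse_version(cuda_version) < (12, 9):
        problems.append(f"CUDA {cuda_version} is older than 12.9")
    if head_dim_k != 128 or head_dim_v != 128:
        problems.append(f"head dims K={head_dim_k}, V={head_dim_v} are not K=V=128")
    return problems


ok = check_flashkda_requirements((9, 0), "12.9", 128, 128)
bad = check_flashkda_requirements((8, 0), "12.4", 64, 64)
print(ok)        # []
print(len(bad))  # 3
```

With PyTorch installed, the first two arguments could come from `torch.cuda.get_device_capability()` and `torch.version.cuda`.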
Practical Applications
- The Kimi Linear model architecture utilizes KDA to reduce KV cache usage by 75% during long-sequence generation. Pitfall: Attempting to use head dimensions other than K=V=128 will cause the current kernel to fail.
- Production inference systems use the cu_seqlens parameter to pack multiple variable-length requests into a single kernel call for high-throughput serving. Pitfall: The kernel has two separate requirements, SM90 (Hopper) or newer hardware and CUDA 12.9+; it will not run on older GPUs, and a newer CUDA toolkit does not lift the hardware requirement.
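The cu_seqlens packing mentioned above can be sketched without any GPU code. The example assumes the convention common to variable-length attention interfaces, where cu_seqlens holds cumulative sequence lengths with a leading 0, so sequence i occupies the half-open range [cu_seqlens[i], cu_seqlens[i+1]) of the packed buffer. The article does not show FlashKDA's call signature, so the kernel call itself is omitted and the helper names are illustrative.

```python
from itertools import accumulate


def build_cu_seqlens(seq_lens):
    """Cumulative sequence lengths with a leading 0 (length = batch + 1)."""
    return [0] + list(accumulate(seq_lens))


def pack_sequences(sequences):
    """Concatenate variable-length sequences into one flat buffer, no padding."""
    packed = [token for seq in sequences for token in seq]
    cu_seqlens = build_cu_seqlens([len(seq) for seq in sequences])
    return packed, cu_seqlens


def unpack_sequences(packed, cu_seqlens):
    """Recover the original sequences from the flat buffer and its offsets."""
    return [packed[start:end] for start, end in zip(cu_seqlens, cu_seqlens[1:])]


requests = [[1, 2, 3], [4], [5, 6]]  # three requests of lengths 3, 1, 2
packed, cu_seqlens = pack_sequences(requests)
print(packed)      # [1, 2, 3, 4, 5, 6]
print(cu_seqlens)  # [0, 3, 4, 6]
```

In a real serving stack the packed buffer and offsets would be GPU tensors (offsets are typically 32-bit integers), and the kernel would use the offsets to reset its recurrent state at each sequence boundary.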