FlashQLA: High-Performance Linear Attention Library for NVIDIA Hopper GPUs

Qwen Team Releases FlashQLA: a High-Performance Linear Attention Kernel Library That Achieves Up to 3× Speedup on NVIDIA Hopper GPUs

The Qwen Team has launched FlashQLA, a high-performance linear attention kernel library specifically optimized for the Gated Delta Network (GDN) attention mechanism. Built on the TileLang compiler framework, it achieves a 2–3× speedup on forward passes compared to traditional Triton-based kernels on NVIDIA H200 GPUs.

Why This Matters

Standard Transformer attention scales as O(n²) in sequence length n, which creates a bottleneck for long-context tasks such as processing extensive documents or codebases. Linear attention architectures like GDN reduce this to O(n), but software implementations often fail to fully exploit modern hardware features such as warpgroup-level Tensor Core operations and asynchronous data pipelines. FlashQLA addresses this by optimizing the GPU kernel layer, bridging the gap between mathematical efficiency and physical hardware utilization on NVIDIA’s Hopper architecture.
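To make the O(n²) versus O(n) gap concrete, here is a back-of-the-envelope sketch that counts multiply-accumulate operations (MACs) for a single attention head. It assumes one head dimension d shared by keys and values, and it ignores softmax, gating, causal masking, and memory traffic. The function names are illustrative and do not come from FlashQLA.

```python
def softmax_attention_macs(n: int, d: int) -> int:
    """Approximate MACs for one head of standard attention.

    n*n*d for the score matrix Q K^T, plus n*n*d for scores @ V.
    """
    return 2 * n * n * d


def linear_attention_macs(n: int, d: int) -> int:
    """Approximate MACs for one head of recurrent linear attention.

    Each token updates a d x d state (d*d MACs) and reads it out (d*d MACs).
    """
    return 2 * n * d * d


if __name__ == "__main__":
    d = 128  # assumed head dimension, chosen only for illustration
    for n in (1_024, 16_384, 262_144):
        ratio = softmax_attention_macs(n, d) / linear_attention_macs(n, d)
        print(f"n={n:>7}: quadratic/linear MAC ratio = {ratio:.0f}")
```

Under these simplifications the ratio is n / d, so the advantage of the linear form grows in direct proportion to context length. Whether that arithmetic advantage shows up as wall-clock speed depends on kernel quality, which is the gap FlashQLA targets.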

Key Insights

FlashQLA optimizes the Gated Delta Network (GDN) architecture used in the Qwen3.5 and Qwen3.6 model families.
The library achieves a 2–3× speedup in forward passes and a 2× speedup in backward passes over the Flash Linear Attention (FLA) Triton 0.5.0 baseline.
Gate-driven automatic intra-card context parallelism (CP) enables higher SM utilization in long-sequence and small-head-count scenarios without manual configuration.
Algebraic reformulation reduces computational overhead on Tensor Cores, CUDA Cores, and Special Function Units (SFU) without sacrificing numerical precision.
The implementation uses TileLang to facilitate warp specialization, allowing warpgroups to overlap data movement with Tensor Core matrix multiplications.
FlashQLA is released under the MIT License and requires SM90+ hardware, CUDA 12.8+, and PyTorch 2.8+.
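The article does not show FlashQLA's kernels or its algebraic reformulation. For orientation only, the sketch below is a naive token-by-token reference of the gated delta rule as it is commonly written in the Gated DeltaNet literature: S_t = α_t S_{t-1} (I − β_t k_t k_tᵀ) + β_t v_t k_tᵀ, with output o_t = S_t q_t. The function name and the plain-list representation are assumptions made for illustration, and details such as normalization or scaling in Qwen's GDN layers are not stated in the source.

```python
from typing import List, Sequence

Vector = Sequence[float]


def gated_delta_rule(
    q: Sequence[Vector],
    k: Sequence[Vector],
    v: Sequence[Vector],
    alpha: Sequence[float],
    beta: Sequence[float],
) -> List[List[float]]:
    """Naive sequential reference (hypothetical helper, not FlashQLA's API).

    q, k: n vectors of length d_k; v: n vectors of length d_v.
    alpha: per-token decay gates; beta: per-token write strengths.
    Cost is O(n * d_k * d_v), i.e. linear in sequence length n.
    """
    d_k, d_v = len(k[0]), len(v[0])
    # State S maps keys to values: S[i][j], i over d_v, j over d_k.
    S = [[0.0] * d_k for _ in range(d_v)]
    outputs = []
    for q_t, k_t, v_t, a_t, b_t in zip(q, k, v, alpha, beta):
        # Gate: decay the old memory, S <- alpha_t * S.
        S = [[a_t * s for s in row] for row in S]
        # What the decayed memory currently returns for key k_t: S k_t.
        pred = [sum(s * kj for s, kj in zip(row, k_t)) for row in S]
        # Delta-rule correction: S <- S + beta_t * (v_t - S k_t) k_t^T.
        for i in range(d_v):
            err = b_t * (v_t[i] - pred[i])
            for j in range(d_k):
                S[i][j] += err * k_t[j]
        # Read out with the query: o_t = S q_t.
        outputs.append([sum(s * qj for s, qj in zip(row, q_t)) for row in S])
    return outputs
```

With a unit-norm key and beta equal to 1, writing a new value under the same key replaces the old one, which is the "delta" behaviour that distinguishes this rule from purely additive linear attention. Production kernels process tokens in chunks rather than one at a time so that the work maps onto Tensor Core matrix multiplications.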

Practical Applications

Use case: Pretraining and long-context inference for hybrid Qwen models using GDN layers to scale sequence length efficiently. Pitfall: Using unoptimized Triton kernels on Hopper hardware results in suboptimal instruction scheduling and lower throughput.
Use case: Edge-side agentic inference where low-latency linear attention is critical for real-time responsiveness. Pitfall: FlashQLA will not run on GPUs older than SM90 (Hopper), because its kernels are tuned specifically for Hopper-class GPUs such as the H100 and H200.
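One way to avoid the hardware pitfall is to check the environment before importing any kernels. The sketch below encodes the stated requirements (SM90+, CUDA 12.8+, PyTorch 2.8+) as plain tuple comparisons. The helper names are hypothetical and are not part of FlashQLA, and the inputs are passed in explicitly so the check itself has no GPU dependency.

```python
import re
from typing import List, Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn '12.8' or '2.8.0+cu128' into a tuple of leading integers."""
    parts = []
    for piece in text.split("+")[0].split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def flashqla_requirement_problems(
    compute_capability: Tuple[int, int],
    cuda_version: str,
    torch_version: str,
) -> List[str]:
    """Return a list of unmet requirements; an empty list means all are met."""
    problems = []
    if tuple(compute_capability) < (9, 0):
        major, minor = compute_capability
        problems.append(f"GPU is SM{major}{minor}, need SM90 or newer")
    if parse_version(cuda_version) < (12, 8):
        problems.append(f"CUDA {cuda_version} found, need 12.8 or newer")
    if parse_version(torch_version) < (2, 8):
        problems.append(f"PyTorch {torch_version} found, need 2.8 or newer")
    return problems
```

With PyTorch installed, the three inputs could be taken from torch.cuda.get_device_capability(), torch.version.cuda, and torch.__version__. Note that torch.version.cuda is None on CPU-only builds, so a real script would need to handle that case before calling the check.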

References:

https://www.marktechpost.com/2026/04/29/qwen-team-releases-flashqla-a-high-performance-linear-attention-kernel-library-that-achieves-up-to-3x-speedup-on-nvidia-hopper-gpus/
