DeepSeek-V4: 1M-Token Contexts via Compressed Sparse Attention and Hybrid Architecture

DeepSeek AI Releases DeepSeek-V4: Compressed Sparse Attention and Heavily Compressed Attention Enable One-Million-Token Contexts

DeepSeek-AI has launched the DeepSeek-V4 series, which includes a 1.6T-parameter Mixture-of-Experts (MoE) model designed for million-token context windows. At the one-million-token scale, the architecture cuts KV cache size by 90% relative to DeepSeek-V3.2.

Why This Matters

In a standard Transformer, attention compute scales quadratically with sequence length and the KV cache grows linearly with it, so at a million tokens both the FLOPs and the cache memory become prohibitive for production serving. DeepSeek-V4 addresses this by replacing vanilla attention with a hybrid of Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA), and by adding manifold-constrained hyper-connections. The focus shifts from raw compute to efficient memory management and signal stability in trillion-parameter architectures.
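The scaling problem can be made concrete with back-of-the-envelope arithmetic. The sketch below uses the usual KV cache formula for a vanilla Transformer and counts causal query-key pairs. The layer count, KV head count, and head dimension are hypothetical values chosen for illustration; they are not DeepSeek-V3.2 or DeepSeek-V4 parameters.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Bytes needed to cache keys and values for one sequence in a vanilla Transformer."""
    # Factor 2: one key vector and one value vector per token, per layer, per KV head.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value


def attention_score_pairs(seq_len):
    """Query-key pairs scored by causal full attention, per head and per layer."""
    return seq_len * (seq_len + 1) // 2


# Hypothetical dense configuration, chosen only for illustration.
LAYERS, KV_HEADS, HEAD_DIM = 60, 8, 128

for tokens in (8_000, 128_000, 1_000_000):
    gb = kv_cache_bytes(tokens, LAYERS, KV_HEADS, HEAD_DIM) / 1e9
    pairs = attention_score_pairs(tokens)
    print(f"{tokens:>9} tokens: {gb:8.2f} GB KV cache, {pairs:.2e} score pairs")
```

Under these assumed dimensions, a single one-million-token sequence needs about 246 GB of 16-bit KV cache. Doubling the context doubles the cache but roughly quadruples the number of scored pairs.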

Key Insights

Hybrid CSA and HCA attention reduces DeepSeek-V4-Pro’s KV cache to 10% and inference FLOPs to 27% of DeepSeek-V3.2 at the one-million-token scale.
Manifold-Constrained Hyper-Connections (mHC) use the Sinkhorn-Knopp algorithm to bound spectral norms at 1, preventing signal amplification during trillion-parameter training.
The Muon optimizer replaces AdamW for core parameters, using Newton-Schulz iterations to orthogonalize gradient updates for faster convergence.
FP4 Quantization-Aware Training (QAT) is applied directly to MoE expert weights to reduce memory traffic and sampling latency during RL rollout.
On-Policy Distillation (OPD) replaces traditional mixed RL by distilling a unified student model from over ten specialized domain teacher models.
DeepSeek-V4-Pro-Max achieves a 3206 Codeforces rating, outperforming GPT-5.4-xHigh (3168) and Gemini-3.1-Pro-High (3052).
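
Two of the numerical routines named above can be sketched in a few lines of plain Python. The first function is a generic Sinkhorn-Knopp normalization, which scales a positive matrix toward a doubly stochastic one (every row and column sums to 1; such a matrix has a largest singular value of exactly 1). The second is the textbook cubic Newton-Schulz iteration for orthogonalizing a matrix, which may differ from the polynomial Muon uses in practice. The function names, the example matrices, and the iteration counts are assumptions made for illustration; how DeepSeek-V4 wires either routine into mHC or Muon is not described here.

```python
def sinkhorn_knopp(matrix, iterations=50):
    """Scale a square matrix with positive entries toward a doubly stochastic one."""
    m = [row[:] for row in matrix]
    n = len(m)
    for _ in range(iterations):
        # Row step: make every row sum to 1.
        m = [[v / sum(row) for v in row] for row in m]
        # Column step: make every column sum to 1.
        col_sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
        m = [[m[i][j] / col_sums[j] for j in range(n)] for i in range(n)]
    return m


def matmul(a, b):
    """Plain-Python matrix product, adequate for tiny examples."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    """Return the transpose of a list-of-lists matrix."""
    return [list(col) for col in zip(*a)]


def newton_schulz_orthogonalize(g, iterations=20):
    """Classical cubic Newton-Schulz iteration: X <- 1.5*X - 0.5*X*X^T*X."""
    # Dividing by the Frobenius norm puts every nonzero singular value in (0, 1],
    # which lies inside the iteration's convergence region.
    fro = sum(v * v for row in g for v in row) ** 0.5
    x = [[v / fro for v in row] for row in g]
    for _ in range(iterations):
        xxtx = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * a - 0.5 * b for a, b in zip(rx, rc)] for rx, rc in zip(x, xxtx)]
    return x


# Hypothetical inputs, chosen only to exercise the two routines.
mixing = sinkhorn_knopp([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 10.0]])
print([round(sum(row), 6) for row in mixing])        # row sums
print([round(sum(col), 6) for col in zip(*mixing)])  # column sums

update = newton_schulz_orthogonalize([[3.0, 1.0], [0.0, 2.0]])
print(matmul(update, transpose(update)))             # close to the identity
```

The Newton-Schulz loop uses only matrix multiplications, which is why this style of orthogonalization suits accelerator hardware better than an explicit SVD.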

Practical Applications

Software Engineering: Use DeepSeek-V4-Pro-Max for repository-level debugging; it scores 80.6% on SWE-bench Verified. Pitfall: ‘Think Max’ mode adds latency on trivial code fixes without a significant accuracy gain.
Long-Document Analysis: Process million-token inputs with DeepSeek-V4-Flash to minimize infrastructure costs. Pitfall: A misconfigured sliding-window parameter (n_win) may degrade local dependency modeling in dense text.
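
The n_win pitfall can be illustrated with a generic causal sliding window. This is a minimal sketch that assumes n_win counts the most recent tokens a query can see at full resolution, including itself. The helper names are hypothetical, and the exact semantics of n_win in DeepSeek-V4, along with how the window combines with compressed attention, are not described here.

```python
def local_window(query_pos, n_win):
    """Positions a causal sliding window of size n_win exposes to query_pos."""
    start = max(0, query_pos - n_win + 1)
    return range(start, query_pos + 1)


def in_local_window(query_pos, key_pos, n_win):
    """True if key_pos falls inside the local window of query_pos."""
    return key_pos in local_window(query_pos, n_win)


# A token at position 10 that depends on a token at position 5:
print(in_local_window(10, 5, n_win=4))   # window covers positions 7..10
print(in_local_window(10, 5, n_win=8))   # window covers positions 3..10
```

With the smaller window, the dependency five tokens back falls outside the local range, so under this assumption it would have to be recovered from whatever compressed representation covers the older context.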

References:

https://www.marktechpost.com/2026/04/24/deepseek-ai-releases-deepseek-v4-compressed-sparse-attention-and-heavily-compressed-attention-enable-one-million-token-contexts/
