DeepSeek-V4: 1M-Token Contexts via Compressed Sparse Attention and Hybrid Architecture

These articles are AI-generated summaries. Please check the original sources for full details.

DeepSeek AI Releases DeepSeek-V4: Compressed Sparse Attention and Heavily Compressed Attention Enable One-Million-Token Contexts

DeepSeek-AI has launched the DeepSeek-V4 series, featuring a 1.6T-parameter Mixture-of-Experts (MoE) model designed for million-token context windows. At the one-million-token scale, DeepSeek-V4-Pro’s KV cache is 90% smaller than that of DeepSeek-V3.2.

Why This Matters

Standard Transformer attention compute scales quadratically with sequence length, and the KV cache grows linearly with it, which makes million-token contexts prohibitively expensive in production. DeepSeek-V4 addresses this by replacing vanilla attention with a hybrid CSA/HCA mechanism and adding Manifold-Constrained Hyper-Connections (mHC). The focus shifts from raw compute to efficient memory management and signal stability in trillion-parameter architectures.
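
To make the scaling argument concrete, the following back-of-the-envelope sketch estimates KV cache size and attention FLOPs for vanilla attention. The model dimensions in `CFG` and the head counts are hypothetical placeholders, not the dimensions of DeepSeek-V4 or DeepSeek-V3.2, and the helper names are illustrative only.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes needed to cache keys and values for one sequence (vanilla attention)."""
    # Factor 2: one cached tensor for keys and one for values, per layer.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem


def attention_flops(seq_len, n_layers, n_heads, head_dim):
    """Rough FLOPs for QK^T plus the weighted sum over V, ignoring causal masking."""
    # Two (seq_len x seq_len x head_dim) matmuls per head, 2 FLOPs per multiply-add.
    return 2 * 2 * n_layers * n_heads * head_dim * seq_len * seq_len


# Hypothetical dense model; these are NOT DeepSeek-V4's or V3.2's dimensions.
CFG = {"n_layers": 60, "head_dim": 128}

for n in (8_000, 128_000, 1_000_000):
    cache_gib = kv_cache_bytes(n, n_kv_heads=8, **CFG) / 2**30
    pflops = attention_flops(n, n_heads=64, **CFG) / 1e15
    print(f"{n:>9} tokens: KV cache {cache_gib:8.2f} GiB, attention {pflops:10.2f} PFLOPs")
```

Doubling the context doubles the cache but quadruples the attention cost, which is why both memory and compute have to be attacked to reach one million tokens.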

Key Insights

  • Hybrid CSA and HCA attention reduces DeepSeek-V4-Pro’s KV cache to 10% and inference FLOPs to 27% of DeepSeek-V3.2 at the one-million-token scale.
  • Manifold-Constrained Hyper-Connections (mHC) use the Sinkhorn-Knopp algorithm to bound spectral norms at 1, preventing signal amplification during trillion-parameter training.
  • The Muon optimizer replaces AdamW for core parameters, using Newton-Schulz iterations to orthogonalize gradient updates for faster convergence.
  • FP4 Quantization-Aware Training (QAT) is applied directly to MoE expert weights to reduce memory traffic and sampling latency during RL rollout.
  • On-Policy Distillation (OPD) replaces traditional mixed RL by distilling a unified student model from over ten specialized domain teacher models.
  • DeepSeek-V4-Pro-Max achieves a 3206 Codeforces rating, outperforming GPT-5.4-xHigh (3168) and Gemini-3.1-Pro-High (3052).
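
The Sinkhorn-Knopp step mentioned for mHC can be sketched in a few lines. This is a minimal illustration, assuming the constrained object is a small square mixing matrix across residual streams; the summary does not give DeepSeek-V4's actual parameterization, and `raw`, `sinkhorn_knopp`, and the other names are hypothetical. Alternating row and column normalization pushes a positive matrix toward a doubly stochastic one, whose spectral norm is 1, so mixing cannot amplify the signal.

```python
import math


def sinkhorn_knopp(logits, n_iters=50):
    """Push exp(logits) toward a doubly stochastic matrix (rows and columns sum to 1)."""
    n = len(logits)
    m = [[math.exp(x) for x in row] for row in logits]  # strictly positive entries
    for _ in range(n_iters):
        # Normalize every row to sum to 1.
        m = [[x / sum(row) for x in row] for row in m]
        # Normalize every column to sum to 1.
        col = [sum(m[i][j] for i in range(n)) for j in range(n)]
        m = [[m[i][j] / col[j] for j in range(n)] for i in range(n)]
    return m


def apply(m, x):
    """Matrix-vector product."""
    return [sum(a * b for a, b in zip(row, x)) for row in m]


def norm(x):
    """Euclidean norm of a vector."""
    return math.sqrt(sum(v * v for v in x))


# Hypothetical unconstrained 4x4 mixing parameters.
raw = [[0.2, -1.0, 0.5, 1.0],
       [1.0, 0.3, -0.7, 0.0],
       [-0.4, 0.9, 1.0, -1.0],
       [0.0, 0.0, 1.0, 0.6]]
mix = sinkhorn_knopp(raw)
x = [1.0, -2.0, 3.0, 0.5]
print("row sums:", [round(sum(r), 6) for r in mix])
print("norm before / after mixing:", norm(x), norm(apply(mix, x)))
```

Because the projected matrix has non-negative entries with unit row and column sums, the norm after mixing never exceeds the norm before it, which is the stability property the bullet describes.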

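The Newton-Schulz orthogonalization used by Muon can be illustrated as follows. This sketch uses the classic cubic iteration rather than the tuned higher-order polynomial that Muon implementations typically use, and the summary does not state which coefficients or step count DeepSeek-V4 adopts. The function names and the example matrix are hypothetical.

```python
import math


def matmul(a, b):
    """Dense matrix product for lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    """Matrix transpose for lists of lists."""
    return [list(col) for col in zip(*a)]


def newton_schulz_orthogonalize(g, steps=15):
    """Approximate the nearest semi-orthogonal matrix to g (assumes rows <= cols)."""
    # Scale by the Frobenius norm so every singular value is at most 1.
    fro = math.sqrt(sum(v * v for row in g for v in row))
    x = [[v / fro for v in row] for row in g]
    for _ in range(steps):
        # Cubic iteration X <- 1.5 X - 0.5 (X X^T) X drives singular values toward 1.
        xxt_x = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * a - 0.5 * b for a, b in zip(r1, r2)] for r1, r2 in zip(x, xxt_x)]
    return x


# Hypothetical 2x3 gradient (or momentum) matrix.
grad = [[3.0, 1.0, 0.0],
        [1.0, 2.0, 1.0]]
update = newton_schulz_orthogonalize(grad)
print(matmul(update, transpose(update)))  # close to the 2x2 identity
```

The iteration needs only matrix multiplications, so it maps well onto accelerators; the resulting update keeps the gradient's directions but equalizes their magnitudes.
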
Practical Applications

  • Software Engineering: Use DeepSeek-V4-Pro-Max for repository-level debugging; it scores 80.6% on SWE-Verified. Pitfall: Using ‘Think Max’ mode for trivial code fixes increases latency without significant accuracy gains.
  • Long-Document Analysis: Process million-token inputs with DeepSeek-V4-Flash to minimize infrastructure costs. Pitfall: A misconfigured sliding window (n_win) may cause loss of local dependency modeling in dense text.
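
The sliding-window pitfall can be visualized with a small mask. This sketch assumes `n_win` is the number of most recent tokens (including the current one) that a query may attend to at full resolution; the summary does not define the parameter precisely or describe how DeepSeek-V4 combines the window with its compressed attention paths, so treat `sliding_window_mask` as a hypothetical helper.

```python
def sliding_window_mask(seq_len, n_win):
    """mask[i][j] is True when query i may attend to key j."""
    # Causal window: each token sees itself and the n_win - 1 tokens before it.
    return [[i - n_win < j <= i for j in range(seq_len)] for i in range(seq_len)]


mask = sliding_window_mask(seq_len=8, n_win=3)
for row in mask:
    print("".join("x" if ok else "." for ok in row))

# A dependency 5 tokens back falls outside a 3-token window...
print(sliding_window_mask(8, 3)[7][2])
# ...but inside a 6-token window.
print(sliding_window_mask(8, 6)[7][2])
```

A window that is too narrow for the text's typical dependency distance leaves those tokens visible only through whatever coarser mechanism sits outside the window.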
