Google Introduces TurboQuant: A New Compression Algorithm that Reduces LLM Key-Value Cache Memory by 6x and Delivers Up to 8x Speedup, All with Zero Accuracy Loss

Google Research has unveiled TurboQuant, a data-oblivious vector quantization framework designed to mitigate the memory wall in LLM inference. In the reported evaluation, Llama-3.1-8B-Instruct running with a 4x-compressed KV cache maintains 100% retrieval accuracy on the Needle-In-A-Haystack benchmark up to 104k tokens.

Why This Matters

LLM inference is increasingly constrained by the memory wall: as Key-Value (KV) cache sizes grow, data movement between High-Bandwidth Memory (HBM) and on-chip SRAM limits performance. Traditional vector quantization methods often require extensive offline preprocessing and data-dependent codebook training, which makes them ill-suited to the dynamic, real-time requirements of long-context workloads. TurboQuant addresses this with a hardware-compatible, data-oblivious algorithm that achieves near-optimal distortion rates without dataset-specific tuning. The result is lower memory use and faster transformer attention while inner-product estimates remain unbiased.
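
To make the memory pressure concrete, here is a back-of-the-envelope sketch of KV cache size. The layer count, KV-head count, head dimension, and fp16 storage are assumptions based on the published Llama-3.1-8B configuration rather than on the article, and "104k" is read here as 104,000 tokens.

```python
def kv_cache_bytes(tokens, layers=32, kv_heads=8, head_dim=128, bytes_per_value=2):
    """Bytes needed to cache keys and values for `tokens` positions.

    Defaults are assumed Llama-3.1-8B values (grouped-query attention, fp16).
    """
    # The leading factor 2 covers the separate key and value tensors.
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


tokens = 104_000  # assumption: "104k" interpreted as 104,000 tokens
full = kv_cache_bytes(tokens)
compressed = full / 4  # the 4x compression ratio quoted for the benchmark

print(f"per-token KV cache: {kv_cache_bytes(1) / 1024:.0f} KiB")
print(f"fp16 KV cache at {tokens} tokens: {full / 1e9:.2f} GB")
print(f"same cache at 4x compression: {compressed / 1e9:.2f} GB")
```

Under these assumptions a single long-context sequence occupies several gigabytes of HBM before compression, which is the traffic the attention kernel must stream on every decoding step.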

Key Insights

TurboQuant applies a random rotation to each input vector, which induces a concentrated Beta distribution on its coordinates and makes them nearly independent and identically distributed (i.i.d.) in high dimensions.
Because the coordinates are nearly i.i.d., quantization reduces to a continuous 1D k-means (Lloyd-Max) scalar quantization problem: the framework finds the MSE-minimizing codebook for a given bit-width and applies it to every coordinate.
A two-stage ‘TURBOQUANTprod’ approach combines MSE-optimal quantization with a 1-bit Quantized Johnson-Lindenstrauss (QJL) transform applied to the residual, which eliminates inner product bias in attention layers.
TurboQuant distortion is provably within a factor of 2.7 of the Shannon Lower Bound (SLB) across all bit-widths, reaching a 1.45 factor at b=1.
Indexing time for 1536-dimensional vectors is reduced from 239.75s using Product Quantization to 0.0013s with TurboQuant.
Under 4x compression, Llama-3.1-8B-Instruct maintained 100% retrieval accuracy on the Needle-In-A-Haystack benchmark up to 104k tokens.
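
The rotate-then-scalar-quantize idea from the first two insights can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the codebook is fitted by Monte Carlo 1D k-means on synthetic unit-sphere coordinates (the article describes solving the continuous problem), the vector norm is stored separately in full precision, and all function names are hypothetical.

```python
import numpy as np


def random_rotation(d, rng):
    """Random orthogonal matrix from the QR factorization of a Gaussian matrix."""
    q, r = np.linalg.qr(rng.standard_normal((d, d)))
    return q * np.sign(np.diag(r))  # sign fix so the rotation is uniformly distributed


def lloyd_max(samples, bits, iters=50):
    """1D k-means (Lloyd's algorithm) on scalar samples; returns sorted centers."""
    k = 2 ** bits
    centers = np.quantile(samples, (np.arange(k) + 0.5) / k)
    for _ in range(iters):
        edges = (centers[:-1] + centers[1:]) / 2      # nearest-center boundaries
        idx = np.searchsorted(edges, samples)         # cell index of each sample
        sums = np.bincount(idx, weights=samples, minlength=k)
        counts = np.bincount(idx, minlength=k)
        # Move each center to the mean of its cell; keep empty cells unchanged.
        centers = np.where(counts > 0, sums / np.maximum(counts, 1), centers)
    return centers


def quantize(x, rot, centers):
    """Rotate the normalized vector and store one codebook index per coordinate."""
    norm = np.linalg.norm(x)
    y = rot @ (x / norm)
    edges = (centers[:-1] + centers[1:]) / 2
    return np.searchsorted(edges, y).astype(np.uint8), norm


def dequantize(codes, norm, rot, centers):
    """Look up centers, undo the rotation, and restore the stored norm."""
    return norm * (rot.T @ centers[codes])


rng = np.random.default_rng(0)
d, bits = 128, 4

# Data-oblivious codebook: fitted on synthetic unit-sphere coordinates only.
g = rng.standard_normal((2000, d))
sphere_coords = (g / np.linalg.norm(g, axis=1, keepdims=True)).ravel()
centers = lloyd_max(sphere_coords, bits)

rot = random_rotation(d, rng)
x = 3.0 * rng.standard_normal(d)  # stand-in for a key or value vector
codes, norm = quantize(x, rot, centers)
x_rec = dequantize(codes, norm, rot, centers)
rel_err = float(np.sum((x - x_rec) ** 2) / np.sum(x ** 2))
print(f"relative squared error at {bits} bits: {rel_err:.4f}")
```

Note that the codebook never sees `x` or any dataset: it depends only on the dimension and the bit-width, which is why indexing needs no training pass.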

Practical Applications

Use case: Long-context LLM deployment (e.g., Llama-3.1) where the KV cache is compressed 5x to fit larger context windows in limited GPU memory. Pitfall: using MSE-only quantizers, which introduce a multiplicative bias in inner product estimates (2/π at 1 bit) and so degrade attention accuracy.
Use case: Vector database indexing, where TurboQuant cuts indexing time to milliseconds (0.0021s for d=3072) compared to hundreds of seconds for k-means training. Pitfall: relying on data-dependent PQ, which requires retraining whenever the underlying data distribution shifts.
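
The inner product bias named in the first pitfall can be checked numerically. The sketch below is an illustration in the spirit of the two-stage approach, not the published algorithm: it uses a 1-bit MSE stage with the Gaussian-approximation center sqrt(2/(πd)), a shared d×d Gaussian sketch matrix for the sign-based residual stage, and synthetic unit vectors as stand-ins for keys and queries. All variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 64, 2000


def unit_rows(a):
    return a / np.linalg.norm(a, axis=1, keepdims=True)


X = unit_rows(rng.standard_normal((n, d)))                        # synthetic "keys"
Y = unit_rows(X + 0.5 * unit_rows(rng.standard_normal((n, d))))  # correlated "queries"
true_ip = np.sum(X * Y, axis=1)

# Stage 1: random rotation followed by a 1-bit MSE scalar quantizer.
Q, Rm = np.linalg.qr(rng.standard_normal((d, d)))
rot = Q * np.sign(np.diag(Rm))
c = np.sqrt(2.0 / (np.pi * d))  # assumed center: E|z| for z ~ N(0, 1/d)
X_hat = (c * np.sign(X @ rot.T)) @ rot
mse_ip = np.sum(X_hat * Y, axis=1)

# Stage 2: 1-bit sign sketch of the residual (QJL-style), plus its norm.
resid = X - X_hat
S = rng.standard_normal((d, d))
signs = np.sign(resid @ S.T)
resid_norm = np.linalg.norm(resid, axis=1)
correction = np.sqrt(np.pi / 2) / d * resid_norm * np.sum((Y @ S.T) * signs, axis=1)
corrected_ip = mse_ip + correction


def slope(est, ref):
    """Least-squares slope through the origin of estimates against true values."""
    return float(np.sum(est * ref) / np.sum(ref * ref))


print(f"MSE-only slope:    {slope(mse_ip, true_ip):.3f} (theory: 2/pi = {2 / np.pi:.3f})")
print(f"with sign sketch:  {slope(corrected_ip, true_ip):.3f} (unbiased target: 1.000)")
```

A slope below 1 means attention logits are systematically shrunk; adding the residual sketch brings the slope back toward 1 at the cost of one extra bit per coordinate and one stored norm per vector.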

References:

https://www.marktechpost.com/2026/03/25/google-introduces-turboquant-a-new-compression-algorithm-that-reduces-llm-key-value-cache-memory-by-6x-and-delivers-up-to-8x-speedup-all-with-zero-accuracy-loss/
