Optimizing LLM Throughput: How Paged Attention Achieves 98.5% Memory Utilization
These articles are AI-generated summaries. Please check the original sources for full details.
Paged Attention in Large Language Models (LLMs)
Paged Attention introduces a virtual memory-inspired approach to manage the KV cache in Large Language Models. By breaking the cache into fixed-size pages of 16 tokens, it eliminates the need for contiguous memory reservation and drastically reduces fragmentation.
Why This Matters
In traditional LLM serving, GPU memory rather than compute is often the primary constraint, because systems pre-allocate a contiguous KV cache block sized for the maximum sequence length. The waste is structural: a single request might reserve 1024 MB but use only 250 MB, leaving 774 MB idle, which adds up to approximately 75 GB of wasted memory across 100 concurrent users.
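The arithmetic behind those figures can be checked with a few lines of Python. This is a minimal sketch: the variable names are illustrative, and the conversion assumes 1 GB = 1024 MB, which is what brings the total close to the quoted 75 GB.

```python
# Figures taken from the paragraph above; names are illustrative.
RESERVED_MB = 1024   # contiguous reservation per request
USED_MB = 250        # memory the request actually fills
USERS = 100          # concurrent requests

wasted_per_request_mb = RESERVED_MB - USED_MB
total_wasted_gb = wasted_per_request_mb * USERS / 1024   # assumes 1 GB = 1024 MB
utilisation_pct = USED_MB / RESERVED_MB * 100

print(f"wasted per request: {wasted_per_request_mb} MB")
print(f"wasted across {USERS} users: {total_wasted_gb:.1f} GB")
print(f"utilisation of reserved memory: {utilisation_pct:.1f}%")
```

Under these numbers each request uses roughly a quarter of what it reserves, which falls inside the 20-38% utilization range cited in the Key Insights below.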
Key Insights
- Naive systems utilize only 20-38% of allocated KV cache memory according to the original Paged Attention/vLLM paper.
- The KV cache for a GPT-style model with 32 layers, 32 attention heads, and 128-dimensional heads costs 512 KB per token in fp16 (2 × 32 × 32 × 128 × 2 bytes).
- The Copy-on-Write (CoW) mechanism lets N requests share a single system prompt’s memory. A 200-token prefix occupies 13 pages (104 MB), so storing it once instead of 10 times saves 936 MB across 10 requests.
- Paged Attention maintains a block table that maps logical page indices to physical page IDs, so a physical page is allocated only when a sequence actually needs it.
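The block table idea from the last point can be sketched as follows. This is a toy illustration under stated assumptions, not vLLM's implementation: the `BlockTable` class, its method names, and the free-page list are all hypothetical.

```python
PAGE_SIZE = 16  # tokens per page, as in the article


class BlockTable:
    """Hypothetical per-sequence table: logical page index -> physical page ID."""

    def __init__(self, free_pages):
        self.free_pages = free_pages  # shared list of free physical page IDs
        self.pages = []               # pages[i] = physical ID of logical page i
        self.num_tokens = 0

    def append_token(self):
        # A new physical page is taken only when the current page is full,
        # so no memory is reserved ahead of need.
        if self.num_tokens % PAGE_SIZE == 0:
            if not self.free_pages:
                raise MemoryError("no free pages")
            self.pages.append(self.free_pages.pop())
        self.num_tokens += 1

    def locate(self, token_idx):
        """Return (physical page ID, offset within page) for a token position."""
        if not 0 <= token_idx < self.num_tokens:
            raise IndexError(f"token {token_idx} out of range")
        logical_page, offset = divmod(token_idx, PAGE_SIZE)
        return self.pages[logical_page], offset


# Physical pages do not need to be contiguous or ordered.
free_pages = [7, 3, 9, 1]
table = BlockTable(free_pages)
for _ in range(20):          # 20 tokens need 2 pages
    table.append_token()

print(table.pages)           # physical IDs backing logical pages 0 and 1
print(table.locate(17))      # token 17 lives in logical page 1, offset 1
```

The sequence sees a contiguous logical address space, while the physical pages behind it can be scattered anywhere in the pool.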
Working Examples
Setting up the architectural constants and calculating KV cache memory requirements per token.
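```python
import math
import random
import numpy as np
from collections import defaultdict

# Model and cache configuration
NUM_LAYERS = 32
NUM_HEADS = 32
HEAD_DIM = 128
BYTES_FP16 = 2        # bytes per fp16 value
PAGE_SIZE = 16        # tokens per page
MAX_SEQ_LEN = 2048    # tokens reserved per request under contiguous allocation

# The leading factor of 2 covers the key and value tensors.
KV_BYTES_PER_TOKEN = 2 * NUM_LAYERS * NUM_HEADS * HEAD_DIM * BYTES_FP16  # 524,288 bytes = 512 KB
KV_MB_PER_TOKEN = KV_BYTES_PER_TOKEN / 1024 / 1024                        # 0.5 MB
```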
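To see why paging reduces waste, it helps to compare how much of the reserved space a sequence actually fills under each scheme. The sketch below assumes a 500-token sequence (250 MB at 0.5 MB per token); the helper names are illustrative and not part of the original code.

```python
import math

PAGE_SIZE = 16      # tokens per page
MAX_SEQ_LEN = 2048  # tokens reserved per request under contiguous allocation


def contiguous_utilisation(tokens_used, max_seq_len=MAX_SEQ_LEN):
    """Percent of a max-length reservation that holds real tokens."""
    return tokens_used / max_seq_len * 100


def paged_utilisation(tokens_used, page_size=PAGE_SIZE):
    """Percent of allocated pages that holds real tokens.

    Only the last page can be partially empty, so the waste is
    at most page_size - 1 token slots per sequence.
    """
    pages = math.ceil(tokens_used / page_size)
    return tokens_used / (pages * page_size) * 100


tokens = 500  # assumed example length
contig = contiguous_utilisation(tokens)
paged = paged_utilisation(tokens)
print(f"contiguous: {contig:.1f}%  paged: {paged:.1f}%")
```

With paging, the unused space is bounded by one partially filled page per sequence rather than by the gap between actual and maximum length.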
Implementation of a Page Pool to simulate physical GPU memory management with reference counting and CoW support.
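```python
class PagePool:
    """Simulated pool of physical KV cache pages with reference counting."""

    def __init__(self, total_pages):
        self.free = list(range(total_pages))  # IDs of unallocated pages
        self.total = total_pages
        self.ref_count = defaultdict(int)     # page ID -> number of owners

    def allocate(self):
        """Hand out one free page with a reference count of 1."""
        if not self.free:
            raise MemoryError("OOM -- no free pages")
        pid = self.free.pop(0)
        self.ref_count[pid] = 1
        return pid

    def release(self, pid):
        """Drop one reference; return the page to the pool when none remain."""
        if pid not in self.ref_count:
            raise ValueError(f"page {pid} is not allocated")
        self.ref_count[pid] -= 1
        if self.ref_count[pid] <= 0:
            self.free.append(pid)
            del self.ref_count[pid]

    def share(self, pid):
        """Register another owner of an already allocated page."""
        if pid not in self.ref_count:
            raise ValueError(f"page {pid} is not allocated")
        self.ref_count[pid] += 1

    def cow_copy(self, pid):
        """Copy-on-Write: take a private page, then drop the reference to the shared one."""
        new_pid = self.allocate()
        self.release(pid)
        return new_pid

    @property
    def utilisation(self):
        """Percentage of pages currently allocated."""
        return (self.total - len(self.free)) / self.total * 100
```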
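The 936 MB saving quoted in the Key Insights can be reproduced with a small standalone calculation that mimics reference counting on shared prefix pages. This is a sketch under the article's numbers (16-token pages, 0.5 MB per token, 10 requests, 200-token prefix); the `Counter`-based bookkeeping is an illustrative stand-in for the pool's reference counts.

```python
import math
from collections import Counter

PAGE_SIZE = 16
KV_MB_PER_TOKEN = 0.5   # 512 KB per token, from the constants above
NUM_REQUESTS = 10
PREFIX_TOKENS = 200

prefix_pages = math.ceil(PREFIX_TOKENS / PAGE_SIZE)   # pages per prefix copy

# Without sharing, every request stores its own copy of the prefix.
unshared_pages = NUM_REQUESTS * prefix_pages

# With sharing, one physical copy exists and each request adds a reference.
ref_count = Counter()
shared_page_ids = range(prefix_pages)   # illustrative physical page IDs
for _ in range(NUM_REQUESTS):
    for pid in shared_page_ids:
        ref_count[pid] += 1
shared_pages = len(ref_count)

saved_pages = unshared_pages - shared_pages
saved_mb = saved_pages * PAGE_SIZE * KV_MB_PER_TOKEN
print(f"pages saved: {saved_pages}, memory saved: {saved_mb:.0f} MB")
```

The saving is counted in whole pages, which is why the 200-token prefix is billed as 13 pages (208 token slots) per copy.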
Practical Applications
- vLLM serving systems use Paged Attention to increase throughput by 2-4x by fitting more concurrent requests into GPU memory. Pitfall: Naive contiguous allocation wastes about 75 GB for 100 concurrent users in the scenario described above.
- Multi-user chatbot deployments use Copy-on-Write for shared system prompts to reduce redundant memory overhead. Pitfall: Duplicate storage of identical prefixes leads to premature OOM walls before the GPU is computationally saturated.