Optimizing LLM Throughput: How Paged Attention Achieves 98.5% Memory Utilization

Paged Attention in Large Language Models (LLMs)

Paged Attention applies an idea borrowed from operating-system virtual memory to the KV cache of Large Language Models. The cache is split into fixed-size pages of 16 tokens each, so a request no longer needs one contiguous memory reservation and fragmentation drops sharply.
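As a rough illustration of why fixed-size pages limit fragmentation, the sketch below counts the pages a sequence needs and the token slots left unused. The helper names `pages_needed` and `wasted_slots` are hypothetical and not part of any library; the only figure taken from the text is the 16-token page size.

```python
import math

PAGE_SIZE = 16  # tokens per page, as stated in the text


def pages_needed(num_tokens: int, page_size: int = PAGE_SIZE) -> int:
    """Number of fixed-size pages required to hold num_tokens."""
    return math.ceil(num_tokens / page_size)


def wasted_slots(num_tokens: int, page_size: int = PAGE_SIZE) -> int:
    """Unused token slots; they can only occur in the last page."""
    return pages_needed(num_tokens, page_size) * page_size - num_tokens


for n in (1, 16, 200, 500):
    print(f"{n:>4} tokens -> {pages_needed(n):>2} pages, "
          f"{wasted_slots(n):>2} unused slots")
```

Under this scheme the unused space per sequence is always less than one page, regardless of how long the sequence grows.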

Why This Matters

In traditional LLM serving, GPU memory rather than compute is the primary constraint, because systems pre-allocate one contiguous block per request sized for the maximum sequence length. The waste is structural: a single request might reserve 1024 MB but use only 250 MB, which adds up to roughly 75 GB of wasted memory across 100 concurrent users.
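The arithmetic behind those figures can be reproduced with a few lines. This is a sketch that assumes the article's own constants (512 KB of KV cache per token and a 2048-token maximum sequence length); the 500-token actual length is an assumption derived from 250 MB divided by 0.5 MB per token.

```python
KV_MB_PER_TOKEN = 0.5   # 512 KB per token, from the article's constants
MAX_SEQ_LEN = 2048      # tokens reserved up front by a naive allocator
ACTUAL_TOKENS = 500     # assumed real length: 250 MB / 0.5 MB per token
NUM_USERS = 100

reserved_mb = MAX_SEQ_LEN * KV_MB_PER_TOKEN      # memory reserved per request
used_mb = ACTUAL_TOKENS * KV_MB_PER_TOKEN        # memory actually used per request
wasted_gb = (reserved_mb - used_mb) * NUM_USERS / 1024
utilisation = used_mb / reserved_mb * 100

print(f"reserved per request: {reserved_mb:.0f} MB")
print(f"used per request:     {used_mb:.0f} MB")
print(f"wasted across users:  {wasted_gb:.1f} GB")
print(f"utilisation:          {utilisation:.1f}%")
```

With these assumptions the waste comes to about 75.6 GB and utilisation to about 24%.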

Key Insights

Naive systems use only 20-38% of their allocated KV cache memory, according to the original Paged Attention/vLLM paper.
For a GPT-style model with 32 layers, 32 attention heads, and 128-dimensional heads, the KV cache costs 512 KB per token in fp16 (2 × 32 × 32 × 128 × 2 bytes).
Copy-on-Write (CoW) lets N requests share the physical pages of a single system prompt, saving 936 MB for 10 requests with a 200-token prefix (13 pages of 8 MB each, stored once instead of 10 times).
Paged Attention maintains a block table per request that maps logical page indices to physical page IDs, so a physical page is allocated only when a token actually needs it.
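The block table idea can be sketched in a few lines. The `BlockTable` class below is a hypothetical illustration, not vLLM's actual data structure: it maps each logical page index to a physical page ID and takes a new physical page from a shared free list only when the current page is full.

```python
PAGE_SIZE = 16  # tokens per page


class BlockTable:
    """Hypothetical per-request table: logical page index -> physical page ID."""

    def __init__(self, free_pages):
        self.free_pages = free_pages  # free list shared by all requests
        self.pages = []               # pages[i] = physical ID of logical page i
        self.num_tokens = 0

    def append_token(self):
        # Allocate a physical page only when the previous one is full.
        if self.num_tokens % PAGE_SIZE == 0:
            if not self.free_pages:
                raise MemoryError("no free pages")
            self.pages.append(self.free_pages.pop())
        self.num_tokens += 1

    def locate(self, token_idx):
        """Return (physical page ID, offset within the page) for a token."""
        if not 0 <= token_idx < self.num_tokens:
            raise IndexError(token_idx)
        return self.pages[token_idx // PAGE_SIZE], token_idx % PAGE_SIZE


free_pages = list(range(8))  # pretend the GPU has 8 physical pages
a = BlockTable(free_pages)
b = BlockTable(free_pages)

for _ in range(16):          # request A fills its first page
    a.append_token()
for _ in range(5):           # request B starts in between
    b.append_token()
for _ in range(4):           # request A grows into a second page
    a.append_token()

print(a.pages, b.pages)      # A's pages are not physically adjacent
print(a.locate(17))
```

Request A ends up on two non-adjacent physical pages, which is exactly what a contiguous allocator cannot do.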

Working Examples

Setting up the architectural constants and calculating KV cache memory requirements per token.

import math
import random
import numpy as np
from collections import defaultdict

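# Model shape (GPT-style) and cache layout
NUM_LAYERS = 32      # transformer layers
NUM_HEADS = 32       # attention heads per layer
HEAD_DIM = 128       # dimensions per head
BYTES_FP16 = 2       # bytes per fp16 value
PAGE_SIZE = 16       # tokens per KV cache page
MAX_SEQ_LEN = 2048   # tokens reserved per request by a naive allocator

# Keys and values (factor 2) for every layer, head, and head dimension
KV_BYTES_PER_TOKEN = 2 * NUM_LAYERS * NUM_HEADS * HEAD_DIM * BYTES_FP16  # 524,288 bytes = 512 KB
KV_MB_PER_TOKEN = KV_BYTES_PER_TOKEN / 1024 / 1024                       # 0.5 MB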

Implementation of a Page Pool to simulate physical GPU memory management with reference counting and CoW support.

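class PagePool:
    """Simulated pool of physical KV-cache pages with reference counting."""

    def __init__(self, total_pages):
        self.free = list(range(total_pages))  # IDs of unallocated physical pages
        self.total = total_pages
        self.ref_count = defaultdict(int)     # page ID -> number of requests using it

    def allocate(self):
        """Hand out one free page with a reference count of 1."""
        if not self.free:
            raise MemoryError("OOM -- no free pages")
        pid = self.free.pop(0)
        self.ref_count[pid] = 1
        return pid

    def release(self, pid):
        """Drop one reference; return the page to the free list at zero."""
        # Guard against double frees, which would put the same ID on the free list twice.
        if pid not in self.ref_count:
            raise ValueError(f"page {pid} is not allocated")
        self.ref_count[pid] -= 1
        if self.ref_count[pid] <= 0:
            self.free.append(pid)
            del self.ref_count[pid]

    def share(self, pid):
        """Add a reference so another request can reuse the same page."""
        if pid not in self.ref_count:
            raise ValueError(f"page {pid} is not allocated")
        self.ref_count[pid] += 1

    def cow_copy(self, pid):
        """Give the caller a private page in place of a shared one.

        Only the page bookkeeping is simulated; no KV data is copied.
        """
        new_pid = self.allocate()
        self.release(pid)
        return new_pid

    @property
    def utilisation(self):
        """Percentage of physical pages currently allocated."""
        return (self.total - len(self.free)) / self.total * 100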
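To show how sharing and Copy-on-Write fit together, here is a short usage sketch. `MiniPool` is a compact stand-in for the page pool so the snippet runs on its own, and the two plain lists used as block tables are an assumption made for illustration.

```python
from collections import defaultdict


class MiniPool:
    """Compact stand-in for a reference-counted page pool."""

    def __init__(self, total_pages):
        self.free = list(range(total_pages))
        self.ref_count = defaultdict(int)

    def allocate(self):
        pid = self.free.pop(0)
        self.ref_count[pid] = 1
        return pid

    def share(self, pid):
        self.ref_count[pid] += 1

    def release(self, pid):
        self.ref_count[pid] -= 1
        if self.ref_count[pid] == 0:
            self.free.append(pid)
            del self.ref_count[pid]

    def cow_copy(self, pid):
        new_pid = self.allocate()
        self.release(pid)
        return new_pid


pool = MiniPool(total_pages=8)

# Request A owns a two-page prompt prefix.
table_a = [pool.allocate(), pool.allocate()]

# Request B reuses the same physical pages instead of duplicating them.
table_b = list(table_a)
for pid in table_b:
    pool.share(pid)

# B is about to write into its last page, which is still shared, so it
# first switches to a private page (copying the KV data is omitted here).
last = table_b[-1]
if pool.ref_count[last] > 1:
    table_b[-1] = pool.cow_copy(last)

print(table_a, table_b, dict(pool.ref_count))
```

After the copy, the first page is still shared by both requests while each request holds its own last page.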

Practical Applications

vLLM serving systems use Paged Attention to increase throughput by 2-4x by fitting more concurrent requests into GPU memory. Pitfall: naive contiguous allocation wastes roughly 75 GB for 100 users in the scenario described under Why This Matters.
Multi-user chatbot deployments use Copy-on-Write to share the pages of a common system prompt and avoid redundant memory overhead. Pitfall: storing identical prefixes once per request leads to an OOM wall before the GPU is computationally saturated.
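The 936 MB saving quoted for a shared 200-token prefix can be checked with the sketch below. It assumes the article's constants (16-token pages, 512 KB per token) and counts whole pages, so the partially filled last page is treated as shared; in a real system that page would likely be copied on first write, which would make the realized saving somewhat lower.

```python
import math

PAGE_SIZE = 16          # tokens per page
KV_MB_PER_TOKEN = 0.5   # 512 KB per token
PREFIX_TOKENS = 200     # shared system prompt length
NUM_REQUESTS = 10

prefix_pages = math.ceil(PREFIX_TOKENS / PAGE_SIZE)   # pages holding the prefix
mb_per_page = PAGE_SIZE * KV_MB_PER_TOKEN             # size of one page
prefix_mb = prefix_pages * mb_per_page                # one copy of the prefix

# Without sharing every request stores its own copy; with sharing only one exists.
saved_mb = (NUM_REQUESTS - 1) * prefix_mb

print(f"{prefix_pages} pages x {mb_per_page:.0f} MB = {prefix_mb:.0f} MB per copy")
print(f"saved by sharing across {NUM_REQUESTS} requests: {saved_mb:.0f} MB")
```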

References:

https://www.marktechpost.com/2026/03/24/paged-attention-in-large-language-models-llms/
