Optimizing LLM Throughput: How Paged Attention Achieves 98.5% Memory Utilization


These articles are AI-generated summaries. Please check the original sources for full details.

Paged Attention in Large Language Models (LLMs)

Paged Attention introduces a virtual memory-inspired approach to manage the KV cache in Large Language Models. By breaking the cache into fixed-size pages of 16 tokens, it eliminates the need for contiguous memory reservation and drastically reduces fragmentation.

Why This Matters

In traditional LLM serving, GPU memory rather than compute is the primary constraint, because systems pre-allocate a contiguous block for each request sized to the maximum sequence length. The waste is structural: a single request might reserve 1024 MB but use only 250 MB, leaving about 774 MB idle. Across 100 concurrent users that adds up to approximately 75 GB of wasted memory.
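The arithmetic behind those figures can be reproduced in a few lines. This is a minimal sketch, assuming the 512 KB-per-token cost and the 2048-token maximum sequence length that appear later in this article; `TOKENS_USED = 500` is an assumption derived from 250 MB divided by 0.5 MB per token.

```python
# Illustrative arithmetic only; the figures come from the article's own example.
KV_MB_PER_TOKEN = 0.5   # 512 KB per token in fp16
MAX_SEQ_LEN = 2048      # tokens reserved up front per request
TOKENS_USED = 500       # assumption: 250 MB used / 0.5 MB per token
USERS = 100

reserved_mb = MAX_SEQ_LEN * KV_MB_PER_TOKEN          # 1024.0
used_mb = TOKENS_USED * KV_MB_PER_TOKEN              # 250.0
wasted_gb = (reserved_mb - used_mb) * USERS / 1024   # about 75.6
utilisation_pct = used_mb / reserved_mb * 100        # about 24.4

print(f"reserved per request: {reserved_mb:.0f} MB, used: {used_mb:.0f} MB")
print(f"wasted across {USERS} users: {wasted_gb:.1f} GB")
print(f"utilisation: {utilisation_pct:.1f}%")
```

Under these assumptions, utilisation comes out at roughly 24%.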

Key Insights

  • Naive systems utilize only 20-38% of allocated KV cache memory according to the original Paged Attention/vLLM paper.
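  • KV cache costs for a GPT-style model with 32 layers, 32 attention heads, and 128-dimensional heads total 512 KB per token in fp16 (2 × 32 × 32 × 128 × 2 bytes, where the leading 2 covers keys and values).
  • The Copy-on-Write (CoW) mechanism allows N requests to share a single system prompt’s memory. A 200-token prefix occupies 13 pages of 16 tokens (104 MB), so 10 requests that share one copy instead of holding ten save 9 × 104 MB = 936 MB.
  • Paged Attention maintains a block table to map logical page indices to physical page IDs, so physical pages are allocated only when a sequence actually needs them.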
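The block-table lookup described in the last point can be sketched as follows. This is a simplified illustration rather than vLLM's actual data structure; the function names and the example physical page IDs are hypothetical.

```python
PAGE_SIZE = 16  # tokens per page, as in the article


def pages_needed(num_tokens):
    """Number of pages required to hold num_tokens (ceiling division)."""
    return -(-num_tokens // PAGE_SIZE)


def locate(block_table, token_idx):
    """Map a token position to (physical page ID, slot within that page)."""
    logical_page, slot = divmod(token_idx, PAGE_SIZE)
    return block_table[logical_page], slot


# Hypothetical block table: logical pages 0, 1, 2 live in physical pages 7, 2, 9.
block_table = [7, 2, 9]

print(pages_needed(200))         # 13
print(locate(block_table, 20))   # (2, 4)
print(locate(block_table, 47))   # (9, 15)
```

Because the table holds arbitrary page IDs, the physical pages of one sequence do not have to be adjacent in memory.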

Working Examples

Setting up the architectural constants and calculating KV cache memory requirements per token.

import math
import random
import numpy as np
from collections import defaultdict

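NUM_LAYERS = 32      # transformer layers
NUM_HEADS = 32       # attention heads per layer
HEAD_DIM = 128       # dimension of each head
BYTES_FP16 = 2       # bytes per fp16 value
PAGE_SIZE = 16       # tokens per KV cache page
MAX_SEQ_LEN = 2048   # tokens reserved per request under contiguous allocation

# The leading 2 accounts for one key and one value vector per head, per layer.
KV_BYTES_PER_TOKEN = 2 * NUM_LAYERS * NUM_HEADS * HEAD_DIM * BYTES_FP16  # 524288 bytes = 512 KB
KV_MB_PER_TOKEN = KV_BYTES_PER_TOKEN / 1024 / 1024                       # 0.5 MB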
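To see what these constants imply, the sketch below compares the memory a request holds under contiguous reservation with what it holds when pages are handed out on demand. The constants are restated so the snippet runs on its own; the 500-token request length and the helper names are assumptions chosen for illustration.

```python
import math

PAGE_SIZE = 16
MAX_SEQ_LEN = 2048
KV_MB_PER_TOKEN = 0.5  # 512 KB per token


def contiguous_mb():
    """Memory reserved up front, regardless of how many tokens are produced."""
    return MAX_SEQ_LEN * KV_MB_PER_TOKEN


def paged_mb(num_tokens):
    """Memory held when pages are allocated only as tokens arrive."""
    pages = math.ceil(num_tokens / PAGE_SIZE)
    return pages * PAGE_SIZE * KV_MB_PER_TOKEN


tokens = 500  # hypothetical request length
paged = paged_mb(tokens)
in_page_utilisation = tokens * KV_MB_PER_TOKEN / paged * 100

print(contiguous_mb())                 # 1024.0
print(paged)                           # 256.0
print(round(in_page_utilisation, 1))   # 97.7
```

With paging, the only slack is the unfilled tail of the last page, which is at most 15 tokens per sequence.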

Implementation of a Page Pool to simulate physical GPU memory management with reference counting and CoW support.

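class PagePool:
    """Simulated pool of physical KV cache pages with reference counting."""

    def __init__(self, total_pages):
        self.free = list(range(total_pages))   # IDs of unallocated pages
        self.total = total_pages
        self.ref_count = defaultdict(int)      # page ID -> number of sequences using it

    def _check_allocated(self, pid):
        # Guard against double frees and sharing pages that were never allocated.
        if pid not in self.ref_count:
            raise ValueError(f"page {pid} is not allocated")

    def allocate(self):
        """Hand out one free page with a reference count of 1."""
        if not self.free:
            raise MemoryError("OOM -- no free pages")
        pid = self.free.pop(0)
        self.ref_count[pid] = 1
        return pid

    def release(self, pid):
        """Drop one reference; return the page to the free list at zero."""
        self._check_allocated(pid)
        self.ref_count[pid] -= 1
        if self.ref_count[pid] == 0:
            self.free.append(pid)
            del self.ref_count[pid]

    def share(self, pid):
        """Add a reference so another sequence can read the same page."""
        self._check_allocated(pid)
        self.ref_count[pid] += 1

    def cow_copy(self, pid):
        """Give the caller a private page in exchange for its reference to pid.

        Only page IDs are tracked here; a real implementation would also
        copy the page contents into the new page.
        """
        self._check_allocated(pid)
        new_pid = self.allocate()
        self.release(pid)
        return new_pid

    @property
    def utilisation(self):
        """Percentage of pages currently allocated."""
        return (self.total - len(self.free)) / self.total * 100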
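A usage sketch for shared prefixes follows. It uses a trimmed-down stand-in, `MiniPool` (a hypothetical name), so that the snippet runs on its own, and per-request block tables represented as plain lists. The numbers follow the article's 200-token, 10-request example; the helper `make_last_page_writable` is an assumption about how a caller might trigger the copy, not part of the original code.

```python
PAGE_SIZE = 16
MB_PER_PAGE = 8.0  # 16 tokens * 0.5 MB per token


class MiniPool:
    """Minimal reference-counted page pool (stand-in for PagePool)."""

    def __init__(self, total_pages):
        self.free = list(range(total_pages))
        self.ref_count = {}  # page ID -> number of block tables pointing at it

    def allocate(self):
        if not self.free:
            raise MemoryError("no free pages")
        pid = self.free.pop(0)
        self.ref_count[pid] = 1
        return pid

    def share(self, pid):
        self.ref_count[pid] += 1

    def release(self, pid):
        self.ref_count[pid] -= 1
        if self.ref_count[pid] == 0:
            del self.ref_count[pid]
            self.free.append(pid)

    def pages_in_use(self):
        return len(self.ref_count)


pool = MiniPool(total_pages=256)
prefix_pages = -(-200 // PAGE_SIZE)  # ceil(200 / 16) = 13

# The first request materialises the system prompt; nine more share its pages.
shared_prefix = [pool.allocate() for _ in range(prefix_pages)]
block_tables = [list(shared_prefix)]
for _ in range(9):
    for pid in shared_prefix:
        pool.share(pid)
    block_tables.append(list(shared_prefix))

pages_used_shared = pool.pages_in_use()                            # 13
saved_mb = (10 * prefix_pages - pages_used_shared) * MB_PER_PAGE   # 936.0


def make_last_page_writable(table):
    """Copy-on-write: swap in a private page if the last page is shared."""
    pid = table[-1]
    if pool.ref_count[pid] > 1:
        table[-1] = pool.allocate()
        pool.release(pid)
    return table[-1]


# Request 3 appends a token into the partly filled last prefix page.
private_pid = make_last_page_writable(block_tables[3])
pages_used_after_write = pool.pages_in_use()                       # 14

print(saved_mb, pages_used_shared, pages_used_after_write)
```

Only the page being written is duplicated; the other twelve prefix pages stay shared by all ten requests.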

Practical Applications

  • vLLM serving systems use Paged Attention to increase throughput by 2-4x by fitting more concurrent requests into GPU memory. Pitfall: naive contiguous allocation wastes roughly 75 GB across 100 users when each request reserves 1024 MB but uses only 250 MB.
  • Multi-user chatbot deployments use Copy-on-Write for shared system prompts to reduce redundant memory overhead. Pitfall: Duplicate storage of identical prefixes leads to premature OOM walls before the GPU is computationally saturated.
