Adapting Rotary Position Embeddings (RoPE) for Long Context Lengths

RoPE for Long Context Length

Rotary Position Embeddings (RoPE) are a popular technique for encoding token positions in sequence models. They work well at standard context lengths, but extending a model beyond roughly 8K tokens requires modifying RoPE to maintain performance. Llama 3, for example, achieves a context length of 131K tokens by scaling RoPE frequencies.

Traditional position embeddings struggle with long sequences, often leading to performance degradation or increased computational cost. RoPE’s reliance on relative positioning is advantageous, but naive extrapolation to very long sequences can still introduce instability and diminish the importance of local context. Scaling the RoPE frequencies addresses this by leaving the high-frequency components, which encode short-range order, unchanged while slowing the low-frequency components so that they span longer distances.
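
To make the distinction between high- and low-frequency components concrete, the sketch below lists the wavelength (in tokens) of each RoPE frequency pair and counts how many are longer than the training window. The head dimension of 128 and the helper name `rope_wavelengths` are assumptions chosen for illustration; the base of 10,000 and the window of 8192 match the example later in this article.

```python
import math

def rope_wavelengths(dim: int, base: float = 10_000.0) -> list[float]:
    """Wavelength in tokens of each RoPE frequency pair: 2*pi / theta_i."""
    # theta_i = base ** (-2i / dim), so the wavelength is 2*pi * base ** (2i / dim).
    return [2 * math.pi * base ** (2 * i / dim) for i in range(dim // 2)]

# Assumed values for illustration only.
wavelengths = rope_wavelengths(dim=128)
train_len = 8192

# Pairs whose wavelength exceeds the window never complete a full rotation in training.
long_pairs = sum(w > train_len for w in wavelengths)

print(f"shortest wavelength: {wavelengths[0]:.2f} tokens")
print(f"longest wavelength:  {wavelengths[-1]:.0f} tokens")
print(f"{long_pairs} of {len(wavelengths)} pairs are longer than {train_len} tokens")
```

The pairs with wavelengths longer than the window are the ones that reach unfamiliar rotation angles when the context is extended, which is why frequency scaling targets them and leaves the short-wavelength pairs alone.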

Key Insights

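RoPE Formula: RoPE uses rotation matrices to encode position. For a token at position $n$ and each index $i < \frac{d}{2}$, the rotated vector $\hat{X}$ is defined by $\hat{X}_{n,i} = X_{n,i} \cos(n\theta_i) - X_{n,\frac{d}{2}+i} \sin(n\theta_i)$ and $\hat{X}_{n,\frac{d}{2}+i} = X_{n,\frac{d}{2}+i} \cos(n\theta_i) + X_{n,i} \sin(n\theta_i)$, where $\theta_i = N^{-2i/d}$ and $N$ is the base (10,000 in the example below).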
Frequency Scaling: Models like Llama 3 adjust each RoPE frequency by comparing its wavelength to a base length of 8192 tokens, which improves stability for extended contexts.
Llama 3 Implementation: Llama 3 employs a scaling factor of 8 and smooth interpolation to modify RoPE frequencies, balancing short and long-range dependencies.
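
The scaling rule described above can be sketched for a single frequency in plain Python. The function name `scale_inv_freq` is hypothetical; the parameter names and default values (base length 8192, scale factor 8, low and high factors 1 and 4) mirror the example in this article. It is a sketch of the piecewise rule, not a reference implementation.

```python
import math

def scale_inv_freq(
    inv_freq: float,
    base_length: int = 8192,
    scale_factor: float = 8.0,
    low_factor: float = 1.0,
    high_factor: float = 4.0,
) -> float:
    """Apply the piecewise frequency scaling to one inverse frequency."""
    wavelen = 2 * math.pi / inv_freq
    if wavelen > base_length / low_factor:
        # Long wavelength (low frequency): slow down by the full scale factor.
        return inv_freq / scale_factor
    if wavelen < base_length / high_factor:
        # Short wavelength (high frequency): leave unchanged.
        return inv_freq
    # In between: blend linearly between the scaled and unscaled frequency.
    smooth = (base_length / wavelen - low_factor) / (high_factor - low_factor)
    return (1 - smooth) * inv_freq / scale_factor + smooth * inv_freq

# One wavelength from each regime: short, intermediate, long.
for wavelen in (100.0, 4096.0, 20_000.0):
    freq = 2 * math.pi / wavelen
    ratio = scale_inv_freq(freq) / freq
    print(f"wavelength {wavelen:>7.0f}: scaled/original = {ratio:.3f}")
```

The blend weight is 1 at the short-wavelength boundary and 0 at the long-wavelength boundary, so the scaled frequency changes continuously across the three regimes.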

Working Example

import torch
import torch.nn as nn
import math

def rotate_half(x: torch.Tensor) -> torch.Tensor:
    """Rotates half the hidden dims of the input."""
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

class RotaryPositionEncoding(nn.Module):
    """Rotary position encoding."""
    def __init__(self, dim: int, max_position_embeddings: int, base_length: int = 8192):
        super().__init__()
        self.dim = dim
        self.max_position_embeddings = max_position_embeddings
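        N = 10_000.0  # RoPE base
        scale_factor = 8.0  # how much to slow down low-frequency components
        low_factor, high_factor = 1.0, 4.0  # bounds of the interpolation band

        # Standard RoPE inverse frequencies, one per pair of dimensions.
        # Built on the CPU so they match `position` below; the registered
        # buffers follow the module when it is moved with .to(device).
        inv_freq = 1.0 / (N ** (torch.arange(0, dim, 2).float() / dim))
        wavelen = 2 * math.pi / inv_freq
        max_wavelen = base_length / low_factor
        min_wavelen = base_length / high_factor
        # Blend weight: 0 at max_wavelen (fully scaled), 1 at min_wavelen (unscaled).
        smooth_factor = (base_length / wavelen - low_factor) / (high_factor - low_factor)
        smoothed = (1 - smooth_factor) * inv_freq / scale_factor + smooth_factor * inv_freq
        # Long wavelengths are scaled, short ones are kept, the band in between is blended.
        inv_freq = torch.where(
            wavelen > max_wavelen,
            inv_freq / scale_factor,
            torch.where(wavelen < min_wavelen, inv_freq, smoothed),
        )
        # Duplicate so the frequencies line up with the two halves used by rotate_half.
        inv_freq = torch.cat((inv_freq, inv_freq), dim=-1)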

        position = torch.arange(max_position_embeddings).float()
        sinusoid_inp = torch.outer(position, inv_freq)
        self.register_buffer("cos", sinusoid_inp.cos())
        self.register_buffer("sin", sinusoid_inp.sin())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        batch_size, seq_len, num_heads, head_dim = x.shape
        dtype = x.dtype
        cos = self.cos.to(dtype)[:seq_len].view(1, seq_len, 1, -1)
        sin = self.sin.to(dtype)[:seq_len].view(1, seq_len, 1, -1)
        output = (x * cos) + (rotate_half(x) * sin)
        return output
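
The rotation applied in `forward` has a property that is easy to check by hand: the dot product between a rotated query and a rotated key depends only on the distance between their positions. The sketch below verifies this for a single pair of dimensions in plain Python. The helper names `rotate_pair` and `dot`, the frequency 0.1, and the sample vectors are all illustrative assumptions.

```python
import math

def rotate_pair(x1: float, x2: float, angle: float) -> tuple[float, float]:
    """Rotate (x1, x2) by `angle`, matching x * cos + rotate_half(x) * sin."""
    return (
        x1 * math.cos(angle) - x2 * math.sin(angle),
        x2 * math.cos(angle) + x1 * math.sin(angle),
    )

def dot(a: tuple[float, float], b: tuple[float, float]) -> float:
    return a[0] * b[0] + a[1] * b[1]

theta = 0.1  # illustrative frequency for one dimension pair
q, k = (1.0, 2.0), (0.5, -1.5)  # illustrative query and key values

# Same offset of 4 tokens at two very different absolute positions.
score_near = dot(rotate_pair(*q, 3 * theta), rotate_pair(*k, 7 * theta))
score_far = dot(rotate_pair(*q, 103 * theta), rotate_pair(*k, 107 * theta))

print(f"positions (3, 7):     {score_near:.6f}")
print(f"positions (103, 107): {score_far:.6f}")
```

The two scores agree up to floating-point error, since only the relative offset enters the attention score.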

Practical Applications

Large Language Models: Llama 3 utilizes scaled RoPE to process extremely long documents and conversations.
Pitfall: Using standard, unscaled RoPE on sequences much longer than those seen during training can degrade the positional signal, because the low-frequency components reach rotation angles the model never encountered, and this hurts performance.
