Adapting Rotary Position Embeddings (RoPE) for Long Context Lengths
These articles are AI-generated summaries. Please check the original sources for full details.
RoPE for Long Context Length
Rotary Position Embeddings (RoPE) are a popular technique for encoding token positions in sequence models. RoPE works well at standard context lengths, but models whose context exceeds 8K tokens need a modified version to maintain performance. Llama 3, for example, reaches a context length of 131K tokens (131,072, as of the 3.1 release) by scaling RoPE frequencies.
Traditional position embeddings struggle with long sequences, often leading to performance degradation or increased computational cost. RoPE's reliance on relative positioning is advantageous, but naive extrapolation to very long sequences can still introduce instability and diminish the importance of local context. Scaling the RoPE frequencies addresses this: the high-frequency components, which resolve short-range dependencies, are left unchanged, while the low-frequency components are slowed down so that they span the longer context.
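To make the idea of high and low frequencies concrete, the sketch below lists the wavelength of each RoPE frequency pair, that is, how many tokens it takes for the pair to complete one full rotation. The head dimension of 128 and the function name `rope_wavelengths` are assumptions chosen for illustration; the base of 10,000 and the length of 8192 match the values used elsewhere in this article.

```python
import math

def rope_wavelengths(dim: int, base: float = 10_000.0) -> list[float]:
    """Wavelength (in tokens) of each RoPE frequency pair.

    Pair i rotates by theta_i = base ** (-2i / dim) radians per token,
    so it completes one full turn every 2*pi / theta_i tokens.
    """
    return [2 * math.pi * base ** (2 * i / dim) for i in range(dim // 2)]

# Assumed values for illustration: head dimension 128, base 10_000,
# and a reference length of 8192 tokens.
wavelengths = rope_wavelengths(dim=128)
base_length = 8192

# Count the pairs that complete at least one full turn within base_length.
short = sum(w <= base_length for w in wavelengths)
long_ = len(wavelengths) - short

print(f"shortest wavelength: {wavelengths[0]:.2f} tokens")
print(f"longest wavelength:  {wavelengths[-1]:.0f} tokens")
print(f"{short} pairs complete a full turn within {base_length} tokens, {long_} do not")
```

Under these assumptions, the pairs that never complete a full turn within the reference length are the ones whose angles move into unfamiliar territory when the context grows, which is why frequency scaling targets them.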
Key Insights
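- RoPE Formula: RoPE rotates each pair of dimensions $(i, \frac{d}{2}+i)$ of the vector at position $n$ by the angle $n\theta_i$: $\hat{X}_{n,i} = X_{n,i} \cos(n\theta_i) - X_{n,\frac{d}{2}+i} \sin(n\theta_i)$ and $\hat{X}_{n,\frac{d}{2}+i} = X_{n,\frac{d}{2}+i} \cos(n\theta_i) + X_{n,i} \sin(n\theta_i)$, where $\theta_i = N^{-2i/d}$ for $i = 0, \dots, \frac{d}{2}-1$ and $N$ is the base (10,000 in the working example).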
- Frequency Scaling: Models like Llama 3 rescale each RoPE frequency according to how its wavelength compares with a base length (8192 tokens), which improves stability for extended contexts.
- Llama 3 Implementation: Llama 3 divides the lowest frequencies by a scaling factor of 8, leaves the highest frequencies unchanged, and smoothly interpolates between the two regimes, balancing short- and long-range dependencies.
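The scaling rule from the last two bullets can be sketched for a single frequency in plain Python. The function name `scale_inv_freq` is hypothetical; the constants (base length 8192, scaling factor 8, and interpolation bounds 1 and 4) are the ones used in the working example in this article, not values read from any official configuration.

```python
import math

def scale_inv_freq(
    inv_freq: float,
    base_length: int = 8192,
    scale_factor: float = 8.0,
    low_factor: float = 1.0,
    high_factor: float = 4.0,
) -> float:
    """Rescale one RoPE frequency based on its wavelength."""
    wavelen = 2 * math.pi / inv_freq
    if wavelen < base_length / high_factor:
        # Short wavelength (high frequency): keep unchanged.
        return inv_freq
    if wavelen > base_length / low_factor:
        # Long wavelength (low frequency): slow down by scale_factor.
        return inv_freq / scale_factor
    # In between: blend the scaled and unscaled frequency linearly.
    smooth = (base_length / wavelen - low_factor) / (high_factor - low_factor)
    return (1 - smooth) * inv_freq / scale_factor + smooth * inv_freq

# One wavelength from each regime (values picked for illustration).
for wavelen in (100, 4096, 20_000):
    freq = 2 * math.pi / wavelen
    print(f"wavelength {wavelen:>6}: scaled/original = {scale_inv_freq(freq) / freq:.4f}")
```

With these constants the three ratios come out at 1, roughly 5/12, and 1/8. The blend equals the unscaled frequency at a wavelength of 2048 and the fully scaled one at 8192, so the rule has no jumps at the regime boundaries.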
Working Example
import math

import torch
import torch.nn as nn


def rotate_half(x: torch.Tensor) -> torch.Tensor:
    """Rotates half the hidden dims of the input."""
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)


class RotaryPositionEncoding(nn.Module):
    """Rotary position encoding with Llama 3-style frequency scaling."""

    def __init__(self, dim: int, max_position_embeddings: int, base_length: int = 8192):
        super().__init__()
        self.dim = dim
        self.max_position_embeddings = max_position_embeddings
        N = 10_000.0  # RoPE base
        scale_factor = 8.0  # divisor applied to the lowest frequencies
        low_factor, high_factor = 1.0, 4.0  # bounds of the interpolation band
        # Built on the CPU; the registered buffers move with the module on .to(device).
        inv_freq = 1.0 / (N ** (torch.arange(0, dim, 2).float() / dim))
        wavelen = 2 * math.pi / inv_freq
        max_wavelen = base_length / low_factor
        min_wavelen = base_length / high_factor
        # smooth_factor falls from 1 at min_wavelen to 0 at max_wavelen.
        smooth_factor = (base_length / wavelen - low_factor) / (high_factor - low_factor)
        smoothed = (1 - smooth_factor) * inv_freq / scale_factor + smooth_factor * inv_freq
        # Long wavelengths are scaled, short ones kept, the band in between interpolated.
        inv_freq = torch.where(
            wavelen > max_wavelen,
            inv_freq / scale_factor,
            torch.where(wavelen < min_wavelen, inv_freq, smoothed),
        )
        # Duplicate so both halves of the head dimension share the same angles.
        inv_freq = torch.cat((inv_freq, inv_freq), dim=-1)
        position = torch.arange(max_position_embeddings).float()
        sinusoid_inp = torch.outer(position, inv_freq)
        self.register_buffer("cos", sinusoid_inp.cos())
        self.register_buffer("sin", sinusoid_inp.sin())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x has shape (batch_size, seq_len, num_heads, head_dim).
        batch_size, seq_len, num_heads, head_dim = x.shape
        dtype = x.dtype
        cos = self.cos.to(dtype)[:seq_len].view(1, seq_len, 1, -1)
        sin = self.sin.to(dtype)[:seq_len].view(1, seq_len, 1, -1)
        output = (x * cos) + (rotate_half(x) * sin)
        return output
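
The rotation applied in `forward` has the property that makes RoPE a relative encoding: the dot product between a rotated query and a rotated key depends only on the distance between their positions. The following is a minimal pure-Python check of that property, assuming the same half-split pairing as `rotate_half` and unscaled frequencies; the helper names `rope_rotate` and `dot` and the sample vectors are made up for illustration.

```python
import math

def rope_rotate(x: list[float], pos: int, base: float = 10_000.0) -> list[float]:
    """Apply the half-split RoPE rotation to one vector at position `pos`."""
    d = len(x)
    half = d // 2
    out = [0.0] * d
    for i in range(half):
        angle = pos * base ** (-2 * i / d)
        c, s = math.cos(angle), math.sin(angle)
        # Same pairing as rotate_half: dimension i is paired with half + i.
        out[i] = x[i] * c - x[half + i] * s
        out[half + i] = x[half + i] * c + x[i] * s
    return out

def dot(a: list[float], b: list[float]) -> float:
    return sum(p * q for p, q in zip(a, b))

# Arbitrary sample query and key vectors (illustrative values only).
q = [0.3, -1.2, 0.8, 0.5]
k = [1.0, 0.4, -0.7, 0.2]

# Both pairs of positions are 3 tokens apart, so the scores should agree.
score_a = dot(rope_rotate(q, 5), rope_rotate(k, 2))
score_b = dot(rope_rotate(q, 105), rope_rotate(k, 102))
print(abs(score_a - score_b) < 1e-9)
```

Frequency scaling changes the angles but not this structure, so the scaled encoding stays relative as well.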
Practical Applications
- Large Language Models: Llama 3 utilizes scaled RoPE to process extremely long documents and conversations.
- Pitfall: Using standard RoPE for very long sequences can lead to a loss of positional information, especially for tokens near the beginning of the sequence, impacting performance.