Architecture · advanced

Rotary Position Embedding (RoPE)

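RoPE encodes token position by rotating the query and key vectors in attention by angles proportional to their position. Because attention scores then depend on relative distance, RoPE-based models can be extended to sequences longer than those seen in training, usually with a scaling method such as position interpolation or YaRN.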

Published May 29, 2026

Explanation

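Instead of adding a position signal to the token embedding, RoPE rotates each pair of dimensions in the query and key vectors by an angle proportional to the token's position, using a different frequency for each pair. When a rotated query is dotted with a rotated key, the absolute positions cancel and only the relative distance between the two tokens remains in the attention score.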
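The rotation described above can be sketched in a few lines of NumPy. This is a minimal illustration, not any particular model's implementation: the function name rope_rotate, the interleaved pairing of dimensions, and the default base of 10,000 are assumptions made for the example. It checks that shifting both tokens by the same offset leaves the attention score unchanged.

```python
import numpy as np

def rope_rotate(x, position, base=10000.0):
    """Rotate consecutive pairs of x by angles proportional to position.

    Hypothetical helper for illustration; real models differ in how they
    pair up dimensions and in the base they choose.
    """
    x = np.asarray(x, dtype=float)
    d = x.shape[-1]
    assert d % 2 == 0, "RoPE needs an even number of dimensions"
    # One frequency per pair: base^(-2i/d) for i = 0 .. d/2 - 1
    inv_freq = base ** (-np.arange(0, d, 2) / d)
    angles = position * inv_freq
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    # Standard 2D rotation applied to each (even, odd) pair
    out[..., 0::2] = x_even * cos - x_odd * sin
    out[..., 1::2] = x_even * sin + x_odd * cos
    return out

rng = np.random.default_rng(0)
q = rng.standard_normal(8)
k = rng.standard_normal(8)

# Same relative distance (3 tokens) at two very different absolute positions
score_a = rope_rotate(q, 5) @ rope_rotate(k, 2)
score_b = rope_rotate(q, 105) @ rope_rotate(k, 102)
print(np.isclose(score_a, score_b))  # True
```

Because each pair is only rotated, the length of the vector is unchanged; position affects the direction of the query and key, not their magnitude.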

This relative-position behavior is a large part of why RoPE is widely used: a model trained on 8K-token contexts can be extended to 32K or 128K tokens with techniques such as YaRN or position interpolation, without retraining from scratch.
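Of the techniques named above, position interpolation is the simplest to sketch: positions are compressed linearly so that the longer sequence spans the same range of positions the model saw in training. The helper names and the 8K to 32K lengths below are assumptions chosen for illustration. YaRN rescales different frequencies by different amounts and is not shown here.

```python
import numpy as np

def rope_angles(positions, dim, base=10000.0):
    """Rotation angle for each (position, pair of dimensions)."""
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)
    return np.outer(positions, inv_freq)

def interpolate_positions(positions, train_len, target_len):
    """Linearly compress positions so target_len tokens span the trained range."""
    return np.asarray(positions, dtype=float) * (train_len / target_len)

train_len, target_len = 8192, 32768  # the 8K -> 32K case, for illustration only
positions = np.arange(target_len)
scaled = interpolate_positions(positions, train_len, target_len)

print(scaled.max())  # 8191.75, still inside the trained range [0, 8192)
angles = rope_angles(scaled, dim=64)
print(angles.shape)  # (32768, 32)
```

The trade-off is that neighbouring tokens now sit a quarter of a position apart, so in practice this kind of scaling is usually paired with some fine-tuning at the longer length.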

Llama, Mistral, Qwen, and many other modern open models use RoPE.

Examples

Llama 3 uses RoPE with a base frequency of 500,000, raised from the 10,000 used in the original RoFormer formulation.
Extending Llama's context from 8K to 128K with YaRN scaling of RoPE.
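To see why the base frequency matters for long contexts, the sketch below compares the wavelength of the slowest-rotating pair of dimensions under two bases. The head dimension of 128 and the two base values are assumptions for the example, and longest_wavelength is a hypothetical helper, not part of any library.

```python
import math

def longest_wavelength(dim, base):
    """Wavelength, in tokens, of the slowest-rotating pair of dimensions."""
    # Pair i rotates by base^(-2i/dim) radians per token; the last pair is i = dim/2 - 1.
    slowest = base ** (-(dim - 2) / dim)
    return 2 * math.pi / slowest

for base in (10_000.0, 500_000.0):
    print(f"base={base:>9.0f}  longest wavelength ~ {longest_wavelength(128, base):,.0f} tokens")
```

A larger base stretches the slowest rotations over many more tokens, so distant positions remain distinguishable across a longer sequence.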

Frequently asked

How is Rotary Position Embedding related to Positional Encoding?

Rotary Position Embedding is one form of positional encoding. Positional encoding tells the transformer where each token sits in the sequence; without it, "dog bites man" and "man bites dog" would look identical to the model. RoPE supplies that signal by rotating the query and key vectors rather than by adding a position vector to the token embedding.

Is Rotary Position Embedding considered advanced?

Rotary Position Embedding is generally considered advanced-level material in the AI and LLM space.

Positional Encoding · Architecture

Positional encoding tells the transformer where each token sits in the sequence. Without it, "dog bites man" and "man bites dog" would look identical to the model.

Context Window · Inference

The context window is the maximum number of tokens an LLM can consider in a single call — prompt plus generated output combined.

Transformer · Architecture

The transformer is the neural network architecture behind virtually every modern large language model. It uses self-attention to model relationships between all positions in a sequence in parallel.

Attention · Architecture

Attention is the mechanism a transformer uses to decide which earlier tokens matter most when producing each new one. It mixes information across positions by weighted sum.

Sources

RoFormer: Enhanced Transformer with Rotary Position Embedding (arXiv:2104.09864)