ModelTerms

Architecture · advanced

Rotary Position Embedding (RoPE)

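RoPE encodes token position by rotating the query and key vectors in attention by an angle proportional to their position. Because attention scores then depend on the relative offset between tokens, RoPE-based models can be extended to sequences longer than those they were trained on, typically with a scaling technique rather than retraining from scratch.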

Explanation

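Instead of adding a position signal to the token embedding, RoPE multiplies the query and key vectors by a rotation matrix whose angle depends on position. Each pair of dimensions is rotated at its own frequency, so a token at position m is rotated by m times that frequency. When a rotated query at position m is dotted with a rotated key at position n, the two rotations combine into a single rotation by (n - m), so the attention score depends on the relative distance between the two tokens rather than on their absolute positions.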
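The rotation described above can be sketched in a few lines of NumPy. This is a minimal illustration, not any particular model's implementation: the function names `rope_angles` and `apply_rope`, the interleaved pairing of dimensions, and the default base of 10,000 are assumptions made for the example.

```python
import numpy as np

def rope_angles(positions, head_dim, base=10000.0):
    """Rotation angles, shape (len(positions), head_dim // 2)."""
    # One frequency per pair of dimensions; later pairs rotate more slowly.
    inv_freq = base ** (-np.arange(0, head_dim, 2) / head_dim)
    return np.outer(positions, inv_freq)

def apply_rope(x, positions, base=10000.0):
    """Rotate each pair (x[2i], x[2i+1]) of every row by position * inv_freq[i]."""
    x = np.asarray(x, dtype=np.float64)
    positions = np.asarray(positions, dtype=np.float64)
    angles = rope_angles(positions, x.shape[-1], base)
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x_even * cos - x_odd * sin
    out[..., 1::2] = x_even * sin + x_odd * cos
    return out

# Hypothetical query and key vectors for a single attention head.
rng = np.random.default_rng(0)
q = rng.standard_normal(8)
k = rng.standard_normal(8)

def score(m, n):
    """Attention logit between the query at position m and the key at position n."""
    q_rot = apply_rope(q[None, :], [m])[0]
    k_rot = apply_rope(k[None, :], [n])[0]
    return float(q_rot @ k_rot)

# The same offset (n - m = 4) at different absolute positions gives the same score.
same_offset = np.isclose(score(3, 7), score(103, 107))
```

Here `same_offset` evaluates to true because shifting both positions by the same amount leaves the combined rotation unchanged. Real implementations usually precompute the cosine and sine tables once and apply them to batched tensors.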

This relative-position behavior is what makes context extension practical: a model trained on 8K-token contexts can be extended to 32K or 128K tokens with techniques such as YaRN or position interpolation, instead of being retrained from scratch.
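As a rough illustration of the simpler of those two techniques, linear position interpolation rescales position indices so that a longer sequence maps back into the range seen during training. The sketch below shows only that rescaling step; the function name `interpolated_positions` is hypothetical, and YaRN (which scales frequencies unevenly and adjusts attention scaling) is not covered here.

```python
import numpy as np

def interpolated_positions(seq_len, train_len):
    """Positions for a sequence of seq_len tokens, compressed into [0, train_len)."""
    positions = np.arange(seq_len, dtype=np.float64)
    # Only compress when the sequence is longer than the training context.
    scale = min(1.0, train_len / seq_len)
    return positions * scale

# A 32K-token sequence for a model trained on 8K tokens: each step advances
# the position by 0.25, so every rotation angle stays inside the trained range.
pos = interpolated_positions(32768, 8192)
```

The resulting fractional positions would be passed to the rotation step in place of the integer indices. The trade-off is that neighboring tokens end up closer together in angle than they were during training, which is one reason a fine-tuning pass is commonly used afterwards.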

Llama, Mistral, Qwen, and many other modern open models use RoPE.

Examples

  • Llama 3 uses RoPE with a configurable base frequency, raised to 500,000 from the 10,000 used in the original RoPE formulation.
  • YaRN scaling of RoPE can extend a Llama model's context from 8K to 128K tokens.
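To see why the base frequency matters for context length, it can help to look at the wavelength of each rotating pair, that is, how many positions pass before the pair completes a full turn. The sketch below is illustrative only: `rope_wavelengths` is a hypothetical helper, and the head dimension of 128 and the two base values are assumptions chosen for the comparison.

```python
import numpy as np

def rope_wavelengths(head_dim, base):
    """Number of positions per full rotation for each pair of dimensions."""
    inv_freq = base ** (-np.arange(0, head_dim, 2) / head_dim)
    return 2 * np.pi / inv_freq

small_base = rope_wavelengths(128, 10_000.0)
large_base = rope_wavelengths(128, 500_000.0)

# The fastest pair is unaffected by the base, while the slowest pair's
# wavelength grows with it, so distant positions remain distinguishable.
slowest_ratio = large_base[-1] / small_base[-1]
```

A larger base stretches the slow-rotating dimensions across more positions, which is the usual motivation for raising it when targeting longer contexts.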

Frequently asked

What is Rotary Position Embedding?

RoPE encodes token position by rotating the query and key vectors in attention by an angle proportional to their position, so that attention scores depend on the relative distance between tokens. This makes it practical to extend a model to sequences longer than those it was trained on.

What is an example of rotary position embedding?

Llama 3 uses RoPE with a configurable base frequency, raised to 500,000 from the 10,000 used in the original RoPE formulation.

How is Rotary Position Embedding related to Positional Encoding?

Rotary Position Embedding is one form of positional encoding. Positional encoding tells the transformer where each token sits in the sequence. Without it, "dog bites man" and "man bites dog" would look identical to the model. Where additive schemes add a position signal to the token embedding, RoPE instead rotates the query and key vectors inside attention.

Is Rotary Position Embedding considered advanced?

Yes. Rotary Position Embedding is generally considered advanced-level material, since it builds on attention, query and key vectors, and rotation matrices.
