Architecture · advanced
Positional Encoding
Positional encoding tells the transformer where each token sits in the sequence. Without it, "dog bites man" and "man bites dog" would look identical to the model.
Explanation
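Self-attention on its own is permutation-equivariant: shuffling the input tokens only shuffles the outputs in the same way, because the attention weight between any two tokens depends on their content and not on where they sit. Positional encodings break this symmetry by injecting position-specific signals, either by adding them to each token's embedding or by rotating the query and key vectors inside attention.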
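The symmetry described above can be checked on a toy example. The sketch below is a simplification, not a real transformer layer: it assumes a single attention head with identity query/key/value projections, and the two-dimensional "embeddings" are made-up values chosen for illustration.

```python
import math

def softmax(row):
    # Numerically stable softmax over one row of attention scores.
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x):
    """Single-head attention with identity Q/K/V projections (an assumption)."""
    d = len(x[0])
    out = []
    for q in x:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in x]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, x)) for j in range(d)])
    return out

# Hypothetical embeddings for "dog", "bites", "man"; no position signal added.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
perm = [2, 1, 0]  # reorder to "man bites dog"
shuffled = [tokens[i] for i in perm]

out = self_attention(tokens)
out_shuffled = self_attention(shuffled)

# Each token receives the same output vector wherever it sits in the sequence.
same = all(
    math.isclose(a, b, abs_tol=1e-12)
    for i, p in enumerate(perm)
    for a, b in zip(out_shuffled[i], out[p])
)
print(same)
```

The outputs are compared with a small tolerance because reordering the inputs changes the order of floating-point additions.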
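The original transformer used fixed sinusoidal encodings added to the input embeddings. Modern LLMs typically use Rotary Position Embedding (RoPE) or ALiBi (Attention with Linear Biases). Both inject relative position directly into the attention computation and tend to handle longer contexts better than the original sinusoidal scheme.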
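As a rough illustration of the fixed sinusoidal scheme, the sketch below builds the encoding table with sine on even dimensions and cosine on odd dimensions, then adds it to some embeddings. The function name, the tiny sizes, and the constant placeholder embeddings are assumptions made for the example.

```python
import math

def sinusoidal_encoding(seq_len, d_model, base=10000.0):
    """Return a seq_len x d_model table of fixed sinusoidal position encodings."""
    assert d_model % 2 == 0, "this sketch assumes an even embedding size"
    table = []
    for pos in range(seq_len):
        row = []
        for i in range(0, d_model, 2):
            # Each pair of dimensions uses a different wavelength.
            angle = pos / base ** (i / d_model)
            row.append(math.sin(angle))  # even dimension
            row.append(math.cos(angle))  # odd dimension
        table.append(row)
    return table

pe = sinusoidal_encoding(seq_len=4, d_model=8)

# Placeholder token embeddings (hypothetical values); the encoding is simply added.
embeddings = [[0.5] * 8 for _ in range(4)]
inputs = [[e + p for e, p in zip(emb, enc)] for emb, enc in zip(embeddings, pe)]
```

Because the table depends only on position and dimension, it can be computed once and reused for every sequence.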
Examples
- Adding a sine-wave pattern to each token's embedding, with the pattern determined by the token's position.
- Rotating query/key vectors by an angle proportional to position (RoPE).
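The rotation in the second example can be sketched in a few lines. This version pairs adjacent dimensions and rotates each pair by an angle proportional to the position. The pairing layout, the base of 10000, and the sample query/key values are assumptions for illustration, and real implementations may pair dimensions differently.

```python
import math

def rope(vec, pos, base=10000.0):
    """Rotate each adjacent pair of dimensions by an angle proportional to pos."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = pos / base ** (i / d)  # lower dimensions rotate faster
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Hypothetical query and key vectors.
q = [0.3, -1.2, 0.7, 0.5]
k = [1.1, 0.4, -0.6, 0.9]

score_near = dot(rope(q, 5), rope(k, 2))     # positions 5 and 2, offset 3
score_far = dot(rope(q, 105), rope(k, 102))  # positions 105 and 102, same offset
print(math.isclose(score_near, score_far, abs_tol=1e-9))
```

The two scores match because rotating both vectors leaves their dot product dependent only on the difference between the two positions, which is how RoPE encodes relative position.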
Frequently asked
What is Positional Encoding?
Positional encoding tells the transformer where each token sits in the sequence. Without it, "dog bites man" and "man bites dog" would look identical to the model.
What is an example of positional encoding?
Adding a sine-wave pattern to each token's embedding, with the pattern determined by the token's position.
How is Positional Encoding related to Rotary Position Embedding?
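Rotary Position Embedding (RoPE) is one specific positional encoding scheme. It encodes token position by rotating the query and key vectors in attention by an angle proportional to their position, so the attention score between two tokens depends on their relative offset. It can be extended to sequences longer than the model was trained on, though in practice this usually relies on an additional scaling step such as position interpolation.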
Is Positional Encoding considered advanced?
Positional Encoding is generally considered advanced-level material in the AI and LLM space.