Architecture · advanced

Positional Encoding

Positional encoding tells the transformer where each token sits in the sequence. Without it, "dog bites man" and "man bites dog" would look identical to the model.

Published May 29, 2026

Explanation

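Self-attention on its own is permutation-equivariant: shuffling the input tokens simply shuffles the outputs the same way, because the attention weight between any two tokens depends only on their content, not on where they sit. Positional encodings break this symmetry by adding position-specific signals to each token's embedding or, as in RoPE, by rotating the query and key vectors.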
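A minimal sketch of why position information is needed, using only the Python standard library. The toy two-dimensional embeddings and the identity query/key/value projections are assumptions made for illustration; a real transformer learns these. With no positional signal, each word receives the same output vector whether the sentence is "dog bites man" or "man bites dog".

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    # Single-head attention with identity Q/K/V projections (a simplification).
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * v[i] for w, v in zip(weights, tokens)) for i in range(d)])
    return out

# Made-up embeddings for the three words; no position information is included.
dog, bites, man = [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]

out_a = self_attention([dog, bites, man])  # "dog bites man"
out_b = self_attention([man, bites, dog])  # "man bites dog"

# Reversing out_b lines the words up again: dog, bites, man.
same_per_word = all(
    math.isclose(x, y, rel_tol=1e-9, abs_tol=1e-12)
    for row_a, row_b in zip(out_a, reversed(out_b))
    for x, y in zip(row_a, row_b)
)
print(same_per_word)
```

The outputs are only reordered, never changed, so nothing downstream can tell the two sentences apart unless position is injected somewhere.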

The original transformer used fixed sinusoidal encodings added to the input embeddings. Modern LLMs typically use Rotary Position Embedding (RoPE) or Attention with Linear Biases (ALiBi), both of which tend to handle longer contexts better than the original sinusoidal scheme.
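The fixed sinusoidal scheme can be sketched in a few lines. This follows the formula from the original transformer paper, where even dimensions use a sine and odd dimensions a cosine of position / 10000^(2i/d_model). The function name `sinusoidal_encoding` and the toy embedding values are illustrative choices, not part of any library.

```python
import math

def sinusoidal_encoding(position, d_model, base=10000.0):
    # Encoding vector for one position; d_model is assumed to be even.
    enc = []
    for i in range(0, d_model, 2):
        # i is the even dimension index, so i / d_model equals 2k / d_model.
        angle = position / (base ** (i / d_model))
        enc.append(math.sin(angle))  # even dimension
        enc.append(math.cos(angle))  # odd dimension
    return enc

# The encoding is added element-wise to the token embedding (toy values here).
embedding = [0.5, -0.1, 0.3, 0.8]
position = 3
with_position = [
    e + p for e, p in zip(embedding, sinusoidal_encoding(position, len(embedding)))
]
print(with_position)
```

Because the encoding is a fixed function of the position index, it needs no learned parameters and can be computed for any position.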

Examples

Adding a sine-wave pattern to each token by position.
Rotating query/key vectors by an angle proportional to position (RoPE).
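The rotation example can be sketched as follows, assuming the adjacent-pair layout and the base of 10000 used in the RoPE paper (some implementations pair dimensions differently). The helper names `rope` and `dot` and the sample vectors are made up for illustration. The check at the end shows the property that motivates the design: the query-key score depends only on the offset between the two positions.

```python
import math

def rope(vec, position, base=10000.0):
    # Rotate each adjacent pair of dimensions by an angle proportional to position.
    d = len(vec)  # assumed even
    out = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)  # per-pair frequency
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [0.3, -1.2, 0.7, 0.5]  # toy query vector
k = [1.1, 0.4, -0.6, 0.9]  # toy key vector

score_near = dot(rope(q, 5), rope(k, 2))      # positions 5 and 2, offset 3
score_far = dot(rope(q, 105), rope(k, 102))   # shifted by 100, offset still 3
depends_on_offset_only = math.isclose(score_near, score_far, abs_tol=1e-9)
print(depends_on_offset_only)
```

Unlike the additive sine-wave pattern, the rotation is applied to queries and keys inside attention rather than to the input embeddings, and it leaves vector lengths unchanged.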

Frequently asked

How is Positional Encoding related to Rotary Position Embedding?

Rotary Position Embedding (RoPE) is one specific form of positional encoding. It encodes token position by rotating the query and key vectors in attention by an angle proportional to their position, so that attention scores depend on the relative offset between tokens. It tends to handle sequences longer than those seen in training better than the original sinusoidal scheme.

Is Positional Encoding considered advanced?

Positional Encoding is generally considered advanced-level material in the AI and LLM space.

Rotary Position Embedding · Architecture

RoPE encodes token position by rotating the query and key vectors in attention by an angle proportional to their position. It tends to handle sequences longer than those seen in training better than fixed sinusoidal encodings.

Transformer · Architecture

The transformer is the neural network architecture behind virtually every modern large language model. It uses self-attention to model relationships between all positions in a sequence in parallel.

Attention · Architecture

Attention is the mechanism a transformer uses to decide which earlier tokens matter most when producing each new one. It mixes information across positions by weighted sum.

Context Window · Inference

The context window is the maximum number of tokens an LLM can consider in a single call — prompt plus generated output combined.

Sources

Attention Is All You Need (arXiv)