Architecture · intermediate
Attention (attention mechanism)
Attention is the mechanism a transformer uses to decide which earlier tokens matter most when producing each new one. It mixes information across positions by weighted sum.
Explanation
For each output position, attention computes three vectors per input token — query, key, and value — and uses the dot product of queries and keys to produce weights that say "how much should this output position care about this input position?" The output is the weighted sum of the value vectors.
Self-attention applies this within a single sequence; cross-attention applies it between two sequences (e.g., source language and target language in translation).
Attention scales quadratically with sequence length, which is why context windows used to be small and why innovations like FlashAttention and sparse attention exist.
Examples
- Translating "the bank by the river": attention helps "bank" attend more to "river" than to "money".
- Answering questions about a long document: attention surfaces the relevant passage.
Frequently asked
What is Attention?
Attention is the mechanism a transformer uses to decide which earlier tokens matter most when producing each new one. It mixes information across positions by weighted sum.
What is an example of attention?
Translating "the bank by the river": attention helps "bank" attend more to "river" than to "money".
How is Attention related to Self-Attention?
Attention and Self-Attention are both architecture concepts. Self-attention is attention applied within a single sequence: each token attends to every other token in the same input, including itself.
Is Attention considered intermediate?
Attention is generally considered intermediate-level material in the AI and LLM space.