Architecture · intermediate
Self-Attention
Self-attention is attention applied within a single sequence: each token attends to every other token in the same input, including itself.
Explanation
Self-attention is what lets a transformer turn "the dog that chased the cat was tired" into a representation where "was tired" can be linked to "the dog" directly, no matter how far apart they are.
In decoder-only models (like GPT), self-attention is causal: each token can only attend to earlier tokens, not future ones. This preserves the next-token-prediction setup. Encoder models like BERT use bidirectional self-attention.
Almost all the parameters in a transformer live in attention and the feed-forward layers that come after each attention block.
Examples
- In a sentence about a pronoun, self-attention links "it" to its antecedent.
- In code, self-attention lets a function reference a variable defined many lines earlier.
Frequently asked
What is Self-Attention?
Self-attention is attention applied within a single sequence: each token attends to every other token in the same input, including itself.
What is an example of self-attention?
In a sentence about a pronoun, self-attention links "it" to its antecedent.
How is Self-Attention related to Attention?
Self-Attention and Attention are both architecture concepts. Attention is the mechanism a transformer uses to decide which earlier tokens matter most when producing each new one. It mixes information across positions by weighted sum.
Is Self-Attention considered intermediate?
Self-Attention is generally considered intermediate-level material in the AI and LLM space.