Architecture · intermediate

Self-Attention

Self-attention is attention applied within a single sequence: each token attends to every token in the same input, including itself.

Published May 29, 2026

Explanation

Self-attention is what lets a transformer turn "the dog that chased the cat was tired" into a representation where "was tired" can be linked to "the dog" directly, no matter how far apart they are.
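To make the linking step concrete, here is a minimal sketch of single-head scaled dot-product self-attention in plain Python. It assumes the common formulation in which each token is projected into a query, a key, and a value; the projection matrices (identity matrices here) and the toy token vectors are hypothetical stand-ins for weights a real model would learn.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(matrix, vec):
    """Multiply a (rows x cols) matrix by a vector of length cols."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]

def self_attention(tokens, w_q, w_k, w_v):
    """Scaled dot-product self-attention over one sequence.

    Queries, keys, and values are all projected from the same tokens,
    which is what makes this *self*-attention.
    Returns (outputs, weights); weights[i][j] is how much token i
    attends to token j.
    """
    queries = [matvec(w_q, t) for t in tokens]
    keys = [matvec(w_k, t) for t in tokens]
    values = [matvec(w_v, t) for t in tokens]
    scale = math.sqrt(len(keys[0]))

    outputs, weights = [], []
    for q in queries:
        # Score this token's query against every key in the sequence.
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        row = softmax(scores)
        weights.append(row)
        # The output is the attention-weighted sum of the value vectors.
        outputs.append([
            sum(p * v[d] for p, v in zip(row, values))
            for d in range(len(values[0]))
        ])
    return outputs, weights

# Hypothetical toy inputs: three 2-dimensional tokens, identity projections.
identity = [[1.0, 0.0], [0.0, 1.0]]
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(tokens, identity, identity, identity)
```

Each row of `weights` sums to 1, and the distance between two positions plays no part in the score, which is why far-apart tokens can be linked as easily as adjacent ones.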

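In decoder-only models (like GPT), self-attention is causal: each token can attend only to itself and earlier tokens, never to future ones. This matches the next-token-prediction setup, where the model must not see the token it is asked to predict. Encoder models like BERT use bidirectional self-attention, in which every token can attend to every position.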
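Causal masking can be sketched as follows. In the usual formulation, position i sees positions 0 through i (itself and everything before it), and later positions receive zero weight. The function name and the all-zero score matrix are illustrative assumptions, not taken from any particular library.

```python
import math

def causal_attention_weights(scores):
    """Turn a square matrix of raw attention scores into causal weights.

    Row i is normalised over columns 0..i only; columns after i are
    forced to zero so no position can look at the future.
    """
    weights = []
    for i, row in enumerate(scores):
        visible = row[: i + 1]
        peak = max(visible)
        exps = [math.exp(s - peak) for s in visible]
        total = sum(exps)
        masked_tail = [0.0] * (len(row) - i - 1)
        weights.append([e / total for e in exps] + masked_tail)
    return weights

# Hypothetical scores: every pair of positions scores the same.
scores = [[0.0, 0.0, 0.0] for _ in range(3)]
weights = causal_attention_weights(scores)
```

With equal scores, the first position puts all of its weight on itself, the second splits its weight evenly over two positions, and the third over three. A bidirectional encoder would skip the masking step and normalise over the full row.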

Most of the parameters in a transformer live in the attention layers and the feed-forward layers that follow each attention block; the remainder sits mainly in the token embeddings.
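A rough count shows where the parameters sit. This sketch assumes a standard block with four d_model x d_model attention projections (query, key, value, output), a two-matrix feed-forward layer with a 4x hidden expansion, and a single embedding matrix; it ignores biases, layer norms, and positional embeddings. Both configurations are hypothetical, chosen only to show how the share changes with scale.

```python
def block_params(d_model, ffn_multiplier=4):
    """Weight count for one transformer block (biases and norms ignored)."""
    attention = 4 * d_model * d_model  # Q, K, V, and output projections
    feed_forward = 2 * d_model * (ffn_multiplier * d_model)  # up and down projections
    return attention + feed_forward

def block_share(d_model, layers, vocab_size, ffn_multiplier=4):
    """Fraction of weights in attention + feed-forward, versus embeddings."""
    blocks = layers * block_params(d_model, ffn_multiplier)
    embeddings = vocab_size * d_model
    return blocks / (blocks + embeddings)

# Hypothetical small and large configurations.
small = block_share(d_model=768, layers=12, vocab_size=50_000)
large = block_share(d_model=4096, layers=32, vocab_size=32_000)
```

Under these assumptions the block share is roughly two thirds for the small configuration and above 95% for the large one, because block weights grow with the square of d_model while embeddings grow only linearly.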

Examples

In a sentence that contains a pronoun, self-attention can link "it" to its antecedent.
In code, self-attention lets the model connect a use of a variable to its definition many lines earlier.

Frequently asked

How is Self-Attention related to Attention?

Attention is the mechanism a transformer uses to decide which earlier tokens matter most when producing each new one; it mixes information across positions by weighted sum. Self-attention is the case where the queries, keys, and values all come from the same sequence.

Attention · Architecture

Attention is the mechanism a transformer uses to decide which earlier tokens matter most when producing each new one. It mixes information across positions by weighted sum.

Multi-Head Attention · Architecture

Multi-head attention runs several attention operations in parallel, each with its own learned projection, then concatenates the results. This lets the model attend to different kinds of relationships at once.
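The parallel-heads idea can be sketched in plain Python. This is a simplified illustration: each head gets its own query, key, and value projection, the heads run independently, and their outputs are concatenated per position. The final output projection that real implementations usually apply after concatenation is omitted, and the projection matrices below are hypothetical toy values.

```python
import math

def matvec(matrix, vec):
    """Multiply a (rows x cols) matrix by a vector of length cols."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]

def attend(queries, keys, values):
    """Scaled dot-product attention for a single head."""
    scale = math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        outputs.append([
            sum((e / total) * v[d] for e, v in zip(exps, values))
            for d in range(len(values[0]))
        ])
    return outputs

def multi_head_self_attention(tokens, heads):
    """heads is a list of (w_q, w_k, w_v) tuples, one per head."""
    per_head = []
    for w_q, w_k, w_v in heads:
        q = [matvec(w_q, t) for t in tokens]
        k = [matvec(w_k, t) for t in tokens]
        v = [matvec(w_v, t) for t in tokens]
        per_head.append(attend(q, k, v))
    # Concatenate the head outputs position by position.
    return [sum((head[i] for head in per_head), []) for i in range(len(tokens))]

# Hypothetical projections: head A reads dims 0-1, head B reads dims 2-3.
first_half = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
second_half = [[0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
heads = [(first_half,) * 3, (second_half,) * 3]

tokens = [[1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 4.0]]
output = multi_head_self_attention(tokens, heads)
```

Because each head computes its own attention weights from its own projections, one head can track one kind of relationship while another tracks something else.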

Transformer · Architecture

The transformer is the neural network architecture behind virtually every modern large language model. It uses self-attention to model relationships between all positions in a sequence in parallel.

KV Cache · Architecture

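The KV cache stores the key and value vectors of all earlier tokens during generation so the model does not recompute them at every step. Its size grows linearly with sequence length and batch size, so at long contexts it becomes one of the main memory costs of LLM inference, alongside the model weights.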
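A minimal sketch of the idea, for a single head in a single layer: each decoding step appends one key and one value, then attends over everything cached so far. The class and function names are hypothetical, and the size formula assumes one key and one value vector per token, per layer, per key/value head, stored at 2 bytes per number (for example fp16).

```python
import math

class KVCache:
    """Keeps one key and one value vector per token seen so far."""

    def __init__(self):
        self.keys = []
        self.values = []

    def attend(self, query, key, value):
        """Cache the new token's key/value, then attend over the whole cache."""
        self.keys.append(key)
        self.values.append(value)
        scale = math.sqrt(len(key))
        scores = [sum(a * b for a, b in zip(query, k)) / scale for k in self.keys]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        return [
            sum((e / total) * v[d] for e, v in zip(exps, self.values))
            for d in range(len(value))
        ]

def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch=1, bytes_per_value=2):
    """Cache size in bytes; the leading 2 counts keys and values separately."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_value

cache = KVCache()
first = cache.attend(query=[1.0, 0.0], key=[1.0, 0.0], value=[5.0, 7.0])
second = cache.attend(query=[0.0, 1.0], key=[0.0, 1.0], value=[1.0, 1.0])

# Hypothetical configuration: 32 layers, 32 KV heads of size 128, 4096 tokens.
size = kv_cache_bytes(layers=32, kv_heads=32, head_dim=128, seq_len=4096)
```

Only the newest token's query, key, and value are computed at each step; the earlier keys and values are read back from the cache. For the hypothetical configuration above, `size` works out to 2 GiB for a single sequence.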

Sources

The Annotated Transformer