ModelTerms

Architecture · intermediate

Attention (attention mechanism)

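Attention is the mechanism a transformer uses to decide which other tokens matter most when computing each token's representation. In a text-generating model, that means deciding which earlier tokens matter most when producing each new one. It mixes information across positions by taking a weighted sum.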

Explanation

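Attention works with three kinds of vectors: queries, keys, and values, each produced by a learned projection of the token representations. For each output position, the dot product of that position's query with every input position's key gives a score that answers "how much should this output position care about this input position?" The scores are scaled and passed through a softmax so that they become weights summing to 1, and the output is the weighted sum of the value vectors.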
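
The computation described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production implementation: it uses the standard scaled dot-product form (scores divided by the square root of the key dimension, then a softmax), it skips the learned projections that would normally produce the queries, keys, and values, and the function names and toy numbers are invented for the example.

```python
import math

def softmax(scores):
    """Turn raw scores into non-negative weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # One score per input position: dot(q, k) / sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in keys]
        weights = softmax(scores)
        # The output is the weighted sum of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Toy inputs (illustrative numbers, not taken from a trained model).
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0]]
queries = [[4.0, 0.0]]  # points the same way as the first key

outputs, weights = attention(queries, keys, values)
print(weights[0])   # the first input position gets most of the weight
print(outputs[0])   # so the output is close to the first value vector
```

Because the query lines up with the first key, the first weight dominates and the output lands near the first value vector.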

Self-attention applies this within a single sequence; cross-attention applies it between two sequences (e.g., source language and target language in translation).
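
The difference between the two is only where the queries, keys, and values come from, which a small sketch can make concrete. The example below is a simplification under stated assumptions: the raw token vectors stand in for queries, keys, and values (real models apply learned projections first), and the `attend` helper and the `source` and `target` sequences are hypothetical.

```python
import math

def attend(queries, keys, values):
    """Return (outputs, weights) for scaled dot-product attention."""
    scale = math.sqrt(len(keys[0]))
    outputs, weights = [], []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        row = [e / total for e in exps]
        weights.append(row)
        outputs.append([sum(w * v[j] for w, v in zip(row, values))
                        for j in range(len(values[0]))])
    return outputs, weights

source = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # e.g. 3 source-language tokens
target = [[0.5, 0.5], [1.0, -1.0]]             # e.g. 2 target-language tokens

# Self-attention: queries, keys, and values all come from the same sequence.
self_out, self_w = attend(source, source, source)

# Cross-attention: queries come from one sequence, keys and values from the other.
cross_out, cross_w = attend(target, source, source)

print(len(self_w), len(self_w[0]))    # one weight row per source token, over source tokens
print(len(cross_w), len(cross_w[0]))  # one weight row per target token, over source tokens
```

In the cross-attention case the output has one row per target token, but every weight row spans the source tokens.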

Attention's compute and memory costs grow quadratically with sequence length, because every position is scored against every other position. This is why context windows used to be small and why innovations like FlashAttention and sparse attention exist.
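
A quick back-of-the-envelope calculation shows what quadratic growth means in practice. The sketch below counts the memory for one full score matrix (one attention head in one layer), assuming 2 bytes per score as with 16-bit floats. The helper name and the sequence lengths are illustrative assumptions.

```python
def score_matrix_bytes(seq_len, bytes_per_score=2):
    """Memory for one full seq_len x seq_len attention score matrix.

    bytes_per_score=2 assumes 16-bit floats (an assumption for illustration).
    """
    return seq_len * seq_len * bytes_per_score

for n in (1_000, 2_000, 4_000):
    print(n, score_matrix_bytes(n))
```

Each doubling of the sequence length quadruples the size of the score matrix.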

Examples

  • Translating "the bank by the river": attention lets "bank" draw heavily on "river", which points to the riverbank sense rather than the financial one.
  • Answering questions about a long document: attention surfaces the relevant passage.

Frequently asked

How is Attention related to Self-Attention?

Self-attention is a special case of attention: the queries, keys, and values all come from the same sequence, so each token attends to every token in that input, including itself.

Is Attention considered intermediate?

Attention is generally considered intermediate-level material in the AI and LLM space.

Self-Attention · Architecture

Self-attention is attention applied within a single sequence: each token attends to every token in the same input, including itself.

Multi-Head Attention · Architecture

Multi-head attention runs several attention operations in parallel, each with its own learned projection, then concatenates the results. This lets the model attend to different kinds of relationships at once.
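
As a rough sketch of this idea, the code below runs two heads over the same input, each with its own query, key, and value projection, and concatenates their outputs position by position. The random projection matrices stand in for learned weights, and the sizes (`d_model`, `d_head`, `n_heads`) and helper names are hypothetical. Full implementations typically also apply a learned output projection after concatenation, which is omitted here.

```python
import math
import random

def matmul(a, b):
    """Multiply an (n x m) matrix by an (m x p) matrix, both lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def attend(q, k, v):
    """Scaled dot-product attention; q, k, v are lists of row vectors."""
    scale = math.sqrt(len(k[0]))
    out = []
    for qi in q:
        scores = [sum(a * b for a, b in zip(qi, kj)) / scale for kj in k]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        out.append([sum(e / total * vj[c] for e, vj in zip(exps, v))
                    for c in range(len(v[0]))])
    return out

def multi_head_attention(x, heads):
    """heads is a list of (w_q, w_k, w_v) projection matrices, one triple per head."""
    per_head = [attend(matmul(x, w_q), matmul(x, w_k), matmul(x, w_v))
                for w_q, w_k, w_v in heads]
    # Concatenate the head outputs position by position.
    return [sum((head[i] for head in per_head), []) for i in range(len(x))]

rng = random.Random(0)  # fixed seed so the sketch is reproducible
d_model, d_head, n_heads = 4, 2, 2

def random_matrix(rows, cols):
    """Stand-in for learned weights (random values, for illustration only)."""
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

heads = [tuple(random_matrix(d_model, d_head) for _ in range(3))
         for _ in range(n_heads)]
x = random_matrix(3, d_model)  # 3 token vectors of width d_model

y = multi_head_attention(x, heads)
print(len(y), len(y[0]))  # 3 positions, each n_heads * d_head wide
```

Each head sees the same tokens through a different projection, so the heads can assign their weights differently before the results are joined.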

Transformer · Architecture

The transformer is the neural network architecture behind virtually every modern large language model. It uses self-attention to model relationships between all positions in a sequence in parallel.

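KV Cache · Architecture

The KV cache stores the key and value vectors of all earlier tokens during generation so the model does not recompute them at every step. Its size grows linearly with context length and batch size, so at long contexts it is often the largest memory cost of LLM inference after the model weights.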
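
The sketch below illustrates both halves of this entry under simplifying assumptions: a toy cache for a single layer and head that appends one key and one value per generated token, and a size estimate for a whole model. The `KVCache` class, the `kv_cache_bytes` helper, and the configuration numbers are hypothetical and chosen for round figures. They do not describe any specific model or library.

```python
import math

def attend_one(q, keys, values):
    """Attention output for a single query over all cached keys and values."""
    scale = math.sqrt(len(q))
    scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [sum(e / total * v[c] for e, v in zip(exps, values))
            for c in range(len(values[0]))]

class KVCache:
    """Keeps the keys and values of every token seen so far (one layer, one head)."""

    def __init__(self):
        self.keys, self.values = [], []

    def step(self, q, k, v):
        # Only the new token's key and value are added; earlier ones are reused.
        self.keys.append(k)
        self.values.append(v)
        return attend_one(q, self.keys, self.values)

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_value=2):
    """Estimated cache size; bytes_per_value=2 assumes 16-bit floats."""
    # Factor of 2: one key vector and one value vector per token, layer, and head.
    return (2 * n_layers * n_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)

cache = KVCache()
steps = [([1.0, 0.0], [1.0, 0.0], [1.0, 2.0]),
         ([0.0, 1.0], [0.0, 1.0], [3.0, 4.0]),
         ([1.0, 1.0], [1.0, 1.0], [5.0, 6.0])]
outs = [cache.step(q, k, v) for q, k, v in steps]
print(len(cache.keys))  # one cached key per generated token

# Hypothetical configuration chosen for round numbers, not a specific model.
size = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=4096)
print(size / 2**30)  # 2.0 (GiB)
```

The estimate is linear in `seq_len` and `batch_size`, which is why long contexts and large batches make the cache expensive.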

FlashAttention · Architecture

FlashAttention is an algorithm that computes exact attention faster and with much less memory. It tiles the computation so that each block fits in fast on-chip GPU SRAM, and it never writes the full attention matrix to the slower HBM.
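
The reason tiling can still give an exact result is a running ("online") softmax: each block of keys and values is folded into a running maximum, a running normalizer, and a running weighted sum. The plain-Python sketch below shows only that numerical idea for a single query. It does not model GPU memory, it is not the actual kernel, and the function names and toy data are assumptions made for illustration.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attention_reference(q, keys, values):
    """Ordinary attention for one query: all scores are held at once."""
    scale = math.sqrt(len(q))
    scores = [dot(q, k) / scale for k in keys]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [sum(e / total * v[c] for e, v in zip(exps, values))
            for c in range(len(values[0]))]

def attention_tiled(q, keys, values, block_size=2):
    """Same result, but only block_size scores are held at any time."""
    scale = math.sqrt(len(q))
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), block_size):
        k_blk = keys[start:start + block_size]
        v_blk = values[start:start + block_size]
        scores = [dot(q, k) / scale for k in k_blk]
        new_max = max(running_max, max(scores))
        # Rescale everything accumulated so far to the new running maximum.
        correction = math.exp(running_max - new_max)
        exps = [math.exp(s - new_max) for s in scores]
        running_sum = running_sum * correction + sum(exps)
        acc = [a * correction + sum(e * v[c] for e, v in zip(exps, v_blk))
               for c, a in enumerate(acc)]
        running_max = new_max
    return [a / running_sum for a in acc]

# Toy data (illustrative numbers only).
q = [0.3, -1.2]
keys = [[1.0, 0.0], [0.5, 2.0], [-1.0, 1.0], [2.0, -0.5], [0.0, 0.0]]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-1.0, 3.0], [4.0, -2.0]]

ref = attention_reference(q, keys, values)
tiled = attention_tiled(q, keys, values, block_size=2)
print(all(abs(a - b) < 1e-9 for a, b in zip(ref, tiled)))
```

The tiled version matches the reference up to floating-point rounding for any block size, because the correction factor re-expresses earlier partial sums relative to the latest maximum.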
