Architecture · advanced

Multi-Head Attention

Multi-head attention runs several attention operations in parallel, each with its own learned projection, then concatenates the results. This lets the model attend to different kinds of relationships at once.

Published May 29, 2026

Explanation

One attention head might learn to track subject-verb agreement, another coreference, and another topical similarity. Multi-head attention bundles several such heads together (typically between 8 and 96 per layer) so the model can capture diverse linguistic and semantic patterns in a single layer.
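For reference, the bundling can be written out using the notation of the paper listed under Sources. Here h is the number of heads, d_k is the per-head key dimension, and the W matrices are the learned projections; this is the standard formulation, and individual models may deviate from it in details.

```latex
\begin{aligned}
\mathrm{Attention}(Q, K, V) &= \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V \\
\mathrm{head}_i &= \mathrm{Attention}\!\left(Q W_i^{Q},\; K W_i^{K},\; V W_i^{V}\right) \\
\mathrm{MultiHead}(Q, K, V) &= \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\, W^{O}
\end{aligned}
```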

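The total compute is similar to that of a single attention block operating at the full model dimension, because each head works in a smaller subspace: with h heads and model dimension d_model, each head typically uses d_model / h dimensions. The heads' roles are not pre-specified; each one learns what it is useful for during training.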
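A minimal NumPy sketch of the idea, assuming self-attention over a single unbatched sequence with no masking, biases, or dropout. The function name `multi_head_attention`, the sizes, and the random weights are illustrative choices, not taken from any particular model. The same four d_model x d_model weight matrices are reused for 1 head and for 8 heads, which is one way to see why the head count does not change the projection cost.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads):
    """x: (seq_len, d_model); every weight matrix: (d_model, d_model)."""
    seq_len, d_model = x.shape
    assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
    d_head = d_model // num_heads  # each head works in a smaller subspace

    def split_heads(t):
        # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return t.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    q = split_heads(x @ w_q)
    k = split_heads(x @ w_k)
    v = split_heads(x @ w_v)

    # One (seq_len, seq_len) attention pattern per head.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)
    weights = softmax(scores, axis=-1)
    heads = weights @ v  # (num_heads, seq_len, d_head)

    # Concatenate the heads back to (seq_len, d_model), then mix them with w_o.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ w_o, weights

# Illustrative sizes and random weights (assumptions for the demo only).
rng = np.random.default_rng(0)
seq_len, d_model = 5, 64
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v, w_o = (
    rng.normal(size=(d_model, d_model)) / np.sqrt(d_model) for _ in range(4)
)

out_8, weights_8 = multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads=8)
out_1, weights_1 = multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads=1)
print(out_8.shape, weights_8.shape, weights_1.shape)
```

With 8 heads the layer produces eight separate attention patterns over the sequence instead of one, while the output shape and the number of weights stay the same.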

Examples

GPT-3 (175B) uses 96 heads per attention layer, each of dimension 128; the original Transformer base model uses 8 heads of dimension 64.
Different heads tend to specialize: some on syntax, some on facts, some on copying.

Frequently asked

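What is Multi-Head Attention?

Multi-head attention runs several attention operations in parallel, each with its own learned projection, then concatenates the results.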

What is an example of multi-head attention?

GPT-3 (175B) uses 96 heads per attention layer, each of dimension 128.

How is Multi-Head Attention related to Attention?

Multi-Head Attention is attention run several times in parallel: each head applies the same attention mechanism to its own learned projection of the input. Attention is the mechanism a transformer uses to decide which earlier tokens matter most when producing each new one. It mixes information across positions by weighted sum.

Is Multi-Head Attention considered advanced?

Multi-Head Attention is generally considered advanced-level material in the AI and LLM space.

Attention · Architecture

Attention is the mechanism a transformer uses to decide which earlier tokens matter most when producing each new one. It mixes information across positions by weighted sum.
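The "earlier tokens" and "weighted sum" parts of this definition can be illustrated with a small NumPy sketch of scaled dot-product attention with a causal mask. The function name `causal_attention` and the random inputs are hypothetical, and the sketch leaves out learned projections and batching.

```python
import numpy as np

def causal_attention(q, k, v):
    """q, k, v: (seq_len, d_k). Position i may only attend to positions j <= i."""
    seq_len, d_k = q.shape
    scores = q @ k.T / np.sqrt(d_k)

    # True above the diagonal marks future positions, which are masked out.
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)

    # Row-wise softmax; the diagonal is never masked, so each row max is finite.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)

    # Each output row is a weighted sum of the value rows it may see.
    return weights @ v, weights

# Random inputs for illustration only.
rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(4, 8)) for _ in range(3))
out, weights = causal_attention(q, k, v)
print(np.round(weights, 2))  # lower-triangular; each row sums to 1
```

The first position can only see itself, so its output is exactly its own value vector; later positions blend the values of everything up to and including themselves.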

Self-Attention · Architecture

Self-attention is attention applied within a single sequence: each token attends to every other token in the same input, including itself.

Transformer · Architecture

The transformer is the neural network architecture behind virtually every modern large language model. It uses self-attention to model relationships between all positions in a sequence in parallel.

Sources

Attention Is All You Need (arXiv)