Architecture · advanced
Multi-Head Attention
Multi-head attention runs several attention operations in parallel, each with its own learned projection, then concatenates the results. This lets the model attend to different kinds of relationships at once.
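For reference, the definition above is usually written as follows. This is the standard formulation from the original Transformer paper; the symbols (h heads, per-head projections W_i^Q, W_i^K, W_i^V, key dimension d_k) are that paper's notation rather than anything defined in this entry, and the final output projection W^O is part of that formulation even though it is not mentioned above.

```latex
\begin{aligned}
\mathrm{Attention}(Q, K, V) &= \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V \\
\mathrm{head}_i &= \mathrm{Attention}\!\left(Q W_i^{Q},\; K W_i^{K},\; V W_i^{V}\right), \quad i = 1, \dots, h \\
\mathrm{MultiHead}(Q, K, V) &= \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\, W^{O}
\end{aligned}
```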
Explanation
One attention head might learn to track subject-verb agreement; another might track coreference; another might track topical similarity. Multi-head attention bundles many such heads together, typically somewhere between 8 and 96 per layer, so the model can capture diverse linguistic and semantic patterns in a single layer.
The total compute is similar to that of a single attention block of full width, because each head works in a smaller subspace: with h heads and model dimension d_model, each head typically uses d_model / h dimensions. The "heads" are not pre-specified; each one learns what it's useful for during training.
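The split-and-concatenate step can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation: the function name `multi_head_attention`, the random weights, and the toy sizes are all assumptions made for the example, and it omits the causal mask, biases, and dropout that real models typically add.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads):
    """Self-attention over x of shape (seq_len, d_model), split into heads."""
    seq_len, d_model = x.shape
    assert d_model % num_heads == 0, "d_model must divide evenly into heads"
    d_head = d_model // num_heads

    def split_heads(t):
        # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return t.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    q = split_heads(x @ w_q)
    k = split_heads(x @ w_k)
    v = split_heads(x @ w_v)

    # Each head computes its own (seq_len, seq_len) attention pattern.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)
    weights = softmax(scores, axis=-1)
    heads = weights @ v  # (num_heads, seq_len, d_head)

    # Concatenate the heads back to (seq_len, d_model), then mix them.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ w_o, weights

# Toy sizes chosen for illustration only.
rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 5, 16, 4
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v, w_o = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4))

out, weights = multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads)
print(out.shape)      # (5, 16)
print(weights.shape)  # (4, 5, 5)
```

Note that the projection matrices are the same size regardless of `num_heads`; changing the head count only changes how the projected vectors are sliced, which is why the cost stays close to that of one full-width attention block.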
Examples
- GPT-3 175B uses 96 heads per attention layer, each of dimension 128; OpenAI has not published the corresponding figure for GPT-4.
- Different heads tend to specialize: some on syntax, some on facts, some on copying.
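The relationship between head count and head size is simple arithmetic, sketched below. The two configurations are the published ones for the original Transformer base model (d_model 512, 8 heads) and GPT-3 175B (d_model 12288, 96 heads); the helper names `head_dim` and `qkv_projection_params` are hypothetical, introduced only for this example.

```python
def head_dim(d_model, num_heads):
    """Per-head dimension when d_model is split evenly across heads."""
    if d_model % num_heads != 0:
        raise ValueError("d_model must be divisible by num_heads")
    return d_model // num_heads

def qkv_projection_params(d_model, num_heads):
    """Weights in the query, key, and value projections (biases ignored)."""
    d_head = head_dim(d_model, num_heads)
    # Each head owns three (d_model x d_head) matrices: one each for Q, K, V.
    return num_heads * 3 * d_model * d_head

configs = {
    "Transformer base": (512, 8),
    "GPT-3 175B": (12288, 96),
}

for name, (d_model, num_heads) in configs.items():
    per_head = head_dim(d_model, num_heads)
    total = qkv_projection_params(d_model, num_heads)
    # The total matches a single full-width head: 3 * d_model^2.
    same_as_single_head = total == qkv_projection_params(d_model, 1)
    print(name, per_head, total, same_as_single_head)
```

Under these assumptions the parameter count of the projections does not depend on the number of heads, which is the arithmetic behind the claim that multi-head attention costs about the same as one large attention block.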
Frequently asked
What is Multi-Head Attention?
Multi-head attention runs several attention operations in parallel, each with its own learned projection, then concatenates the results. This lets the model attend to different kinds of relationships at once.
What is an example of multi-head attention?
GPT-3 175B uses 96 heads per attention layer, each of dimension 128. OpenAI has not published the corresponding figure for GPT-4.
How is Multi-Head Attention related to Attention?
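Multi-head attention is built directly on attention: it runs the same mechanism several times in parallel, each time on a different learned projection of the input. Attention is the mechanism a transformer uses to decide which earlier tokens matter most when producing each new one. It mixes information across positions by taking a weighted sum of their value vectors.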
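The weighted sum performed by a single head can be sketched in plain Python for one query attending over three earlier positions. The function name `attend` and the toy vectors are assumptions made for illustration, and the scaled dot-product scoring follows the standard Transformer formulation rather than anything specified in this entry.

```python
import math

def attend(query, keys, values):
    """Single-head attention for one query over a list of earlier positions."""
    scale = math.sqrt(len(query))
    # Score each earlier position by its scaled dot product with the query.
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]

    # Softmax turns the scores into weights that sum to 1.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # The output is the weighted sum of the value vectors.
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, output

keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
weights, output = attend([1.0, 0.0], keys, values)
print(weights)
print(output)
```

In this toy case the first and third positions align with the query equally well, so they receive equal weight and the second position receives less. Multi-head attention repeats this computation once per head, with each head seeing its own projection of the queries, keys, and values.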
Is Multi-Head Attention considered advanced?
Multi-Head Attention is generally considered advanced-level material in the AI and LLM space.