Comparison
Attention vs Multi-Head Attention
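Attention and Multi-Head Attention are both common AI/LLM terms, and they are closely related: attention is the core mechanism, and multi-head attention is the form in which transformers usually apply it. Here is a quick side-by-side.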
When you would reach for Attention
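Attention comes up when the question is about how a model relates tokens to one another: which positions influence a given token, and by how much.
Example: in "the bank by the river", attention lets "bank" put more weight on "river", which steers the model toward the riverbank sense rather than the financial one.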
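The weighted sum behind this example can be sketched as scaled dot-product attention, the form used in transformers. This is a minimal illustration, not any particular model's implementation: the two-dimensional key, value, and query vectors below are hand-picked assumptions chosen so that "bank" lines up with "river"; in a real model they come from learned projections.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max before exponentiating for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    """Scaled dot-product attention.

    q: (n_queries, d_k), k: (n_keys, d_k), v: (n_keys, d_v)
    Returns the mixed values and the attention weights.
    """
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)     # similarity of each query to each key
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ v, weights         # weighted sum of the value vectors

# Hypothetical toy vectors, one key per token (not learned embeddings).
tokens = ["the", "bank", "by", "river"]
keys = np.array([[0.1, 0.0],
                 [0.5, 0.5],
                 [0.0, 0.1],
                 [1.0, 0.0]])
values = keys.copy()                    # reuse the keys as values for simplicity
q_bank = np.array([[2.0, 0.0]])         # hand-picked query for "bank"

out, weights = attention(q_bank, keys, values)
for token, w in zip(tokens, weights[0]):
    print(f"{token:>5}: {w:.2f}")
```

With these vectors the largest weight falls on "river", so the output for "bank" is dominated by the value vector of "river".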
When you would reach for Multi-Head Attention
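Multi-Head Attention comes up when the question is about how an attention layer is organized: how many heads it has, how wide each head is, and what the different heads track.
Example: the largest GPT-3 model (175B parameters) uses 96 heads per attention layer. OpenAI has not published the corresponding figure for GPT-4.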
Frequently asked
What is the difference between Attention and Multi-Head Attention?
Attention is the mechanism a transformer uses to decide which other tokens matter most when computing the representation at each position. In a decoder-only language model, that means the earlier tokens when producing each new one. It mixes information across positions by taking a weighted sum.
Multi-head attention runs several attention operations in parallel, each with its own learned projections, then concatenates the results. This lets the model attend to different kinds of relationships at once.
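The parallel-heads idea can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions rather than a reference implementation: the weights are random instead of learned, the function and parameter names are made up for the example, no causal mask is applied, and the final output projection `w_o` follows the standard transformer formulation even though it is not mentioned in the text above.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max before exponentiating for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, w_q, w_k, w_v, w_o, n_heads):
    """Multi-head self-attention without masking.

    x: (seq_len, d_model); w_q, w_k, w_v, w_o: (d_model, d_model).
    Each head uses its own d_head-wide slice of the projections.
    """
    seq_len, d_model = x.shape
    assert d_model % n_heads == 0, "d_model must be divisible by n_heads"
    d_head = d_model // n_heads

    def split_heads(t):
        # (seq_len, d_model) -> (n_heads, seq_len, d_head)
        return t.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)

    q = split_heads(x @ w_q)
    k = split_heads(x @ w_k)
    v = split_heads(x @ w_v)

    # One attention operation per head, computed in a single batched matmul.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)  # (n_heads, seq_len, seq_len)
    weights = softmax(scores, axis=-1)
    heads = weights @ v                                  # (n_heads, seq_len, d_head)

    # Concatenate the heads back into d_model, then mix them with w_o.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ w_o, weights

rng = np.random.default_rng(0)
seq_len, d_model, n_heads = 5, 8, 2  # toy sizes chosen for the example
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v, w_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))

out, weights = multi_head_attention(x, w_q, w_k, w_v, w_o, n_heads)
print(out.shape)      # (5, 8)
print(weights.shape)  # (2, 5, 5)
```

Each head produces its own `seq_len` by `seq_len` weight matrix, which is what allows different heads to focus on different relationships between the same tokens.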
When should I use Attention vs Multi-Head Attention?
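Attention is the right concept when you are talking about the mechanism itself: how tokens are weighted and combined. Multi-Head Attention applies when you are talking about how a transformer layer runs that mechanism through several parallel heads. In practice, the attention layers in a transformer are almost always multi-head.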
Are Attention and Multi-Head Attention the same thing?
No. Both are architecture concepts, but Multi-Head Attention is a specific arrangement of Attention: several attention operations run side by side in the same layer. Attention with a single head is the simplest case.