Comparison

Multi-Head Attention vs Self-Attention

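Multi-Head Attention and Self-Attention are closely related terms that answer different questions. Self-attention describes what attends to what: the queries, keys, and values all come from the same sequence. Multi-head attention describes how the attention computation is organized: as several parallel heads. Here is a quick side-by-side.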

When you would reach for Multi-Head Attention

Multi-Head Attention comes up when the question is about how an attention layer is built: how many heads it has, how wide each head is, and how the heads' outputs are combined.

For example, the largest GPT-3 model (175B parameters) uses 96 heads per attention layer. The head count of GPT-4 has not been publicly disclosed.
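To make the head-count figure concrete, here is a minimal sketch of how head count relates to per-head width. It assumes the common convention that the model width is split evenly across heads; the helper name head_dim is hypothetical, and the numbers 12288 and 96 are the published GPT-3 175B configuration.

```python
def head_dim(d_model: int, n_heads: int) -> int:
    """Per-head width, assuming d_model is split evenly across heads."""
    if d_model % n_heads != 0:
        raise ValueError("d_model must be divisible by n_heads")
    return d_model // n_heads


# Published GPT-3 175B configuration: width 12288, 96 heads per layer.
print(head_dim(12288, 96))  # 128
```

Adding heads at a fixed model width makes each head narrower, so head count is a trade-off rather than a free improvement.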

When you would reach for Self-Attention

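Self-Attention comes up when the question is about where the attention inputs come from: the queries, keys, and values are all computed from the same sequence, so tokens in that sequence exchange information with one another.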

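For example, in the sentence "The animal didn't cross the street because it was too tired," self-attention can let the representation of "it" draw on "animal," its antecedent.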

Frequently asked

What is the difference between Multi-Head Attention and Self-Attention?

Multi-Head Attention: Multi-head attention runs several attention operations in parallel, each with its own learned projections, then concatenates the results and mixes them with a final output projection. This lets the model attend to different kinds of relationships at once.

Self-Attention: Self-attention is attention applied within a single sequence: each token attends to every token in the same input, including itself.
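The two definitions can be sketched together in a few lines. This is a minimal illustration in plain Python lists, not a reference implementation: the function names, the random weights, and the tiny sizes are all assumptions chosen for readability, and real models use learned weights, batching, and masking. Note that multi_head_attention only becomes self-attention when the same sequence is passed for both arguments.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights


def random_matrix(rows, cols, rng):
    """Hypothetical stand-in for learned weights or input embeddings."""
    return [[rng.gauss(0.0, 1.0) / math.sqrt(rows) for _ in range(cols)]
            for _ in range(rows)]


def multi_head_attention(x_q, x_kv, heads, w_o):
    """One attention op per head, concatenated, then mixed by w_o.

    heads is a list of (w_q, w_k, w_v) projection triples, one per head.
    """
    per_head = []
    for w_q, w_k, w_v in heads:
        out, _ = attention(matmul(x_q, w_q), matmul(x_kv, w_k), matmul(x_kv, w_v))
        per_head.append(out)
    # Concatenate the head outputs along the feature axis, token by token.
    concat = [sum((h[i] for h in per_head), []) for i in range(len(x_q))]
    return matmul(concat, w_o)


rng = random.Random(0)
seq_len, d_model, n_heads = 4, 8, 2
d_head = d_model // n_heads

x = random_matrix(seq_len, d_model, rng)  # one sequence of 4 token vectors
heads = [tuple(random_matrix(d_model, d_head, rng) for _ in range(3))
         for _ in range(n_heads)]
w_o = random_matrix(d_model, d_model, rng)

# Self-attention: queries, keys, and values all come from the same sequence x.
y_self = multi_head_attention(x, x, heads, w_o)

# Cross-attention: same multi-head machinery, but keys/values come from another sequence.
memory = random_matrix(6, d_model, rng)
y_cross = multi_head_attention(x, memory, heads, w_o)

print(len(y_self), len(y_self[0]))
print(len(y_cross), len(y_cross[0]))
```

Both calls return one d_model-wide vector per query token. The only difference between the self-attention and cross-attention calls is the second argument, which is why the two terms describe separate choices rather than competing techniques.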

When should I use Multi-Head Attention vs Self-Attention?

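It is rarely an either/or choice, because the two are usually combined. Multi-Head Attention is the right concept when you are reasoning about the structure of an attention layer, such as the number of heads and the width of each head. Self-Attention is the right concept when you are reasoning about which tokens exchange information, namely tokens within the same sequence.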

Are Multi-Head Attention and Self-Attention the same thing?

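No. Self-Attention specifies where the queries, keys, and values come from (the same sequence); Multi-Head Attention specifies how the computation is split into parallel heads. The two are independent choices that are usually combined: a Transformer layer typically uses multi-head self-attention, and multi-head attention can also be applied as cross-attention between two different sequences.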