ModelTerms

Learning path · 22 min · intermediate

The attention family

Self-attention to FlashAttention — how the transformer's core operation evolved.

Attention is the operation that made transformers work. This path takes you through the basic mechanism, the multi-head variant, the memory cost of generation (the KV cache), the engineering advances (FlashAttention, sliding-window attention) that made long-context models possible, and the positional encoding (RoPE) that tells attention where each token sits.

  1. Attention (attention mechanism)

    Why this step: The base operation: weighted mixing of token representations.

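    Attention is the mechanism a transformer uses to decide which earlier tokens matter most when producing each new one. It scores every position against the current one, normalizes the scores into weights with a softmax, and mixes information across positions as a weighted sum of their value vectors.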

    Architecture · intermediate
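As a rough illustration of the weighted mixing described in step 1, here is a minimal pure-Python sketch of scaled dot-product attention for a single query. The function names (`softmax`, `attend`) and the toy vectors are assumptions made for this example and do not come from any particular library.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    scale = math.sqrt(len(query))
    # One score per position: how well this key matches the query.
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    # Weighted sum of the value vectors, one output dimension at a time.
    mixed = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, mixed

# Toy data (hypothetical): three positions with 2-dimensional keys and values.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
weights, mixed = attend([2.0, 0.0], keys, values)
```

The query points along the first dimension, so the first and third keys receive equal weight and the second receives the least.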
  2. Self-Attention

    Why this step: Attention applied within a single sequence. The transformer's core trick.

    Self-attention is attention applied within a single sequence: each token attends to every token in the same input, including itself. In decoder-only models a causal mask restricts this to the token itself and the tokens before it.

    Architecture · intermediate
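In matrix form, self-attention as described in step 2 is usually written as below. The symbols are assumptions introduced for this example: X is the matrix of token representations for one sequence, W_Q, W_K and W_V are learned projection matrices, and d_k is the key dimension.

```latex
Q = X W_Q, \qquad K = X W_K, \qquad V = X W_V,
\qquad
\mathrm{SelfAttention}(X) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
```

Queries, keys, and values all come from the same X, which is what makes the operation "self" attention.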
  3. Multi-Head Attention

    Why this step: Many attention operations in parallel, each free to learn a different kind of relationship.

    Multi-head attention runs several attention operations in parallel, each with its own learned query, key, and value projections, then concatenates the results and passes them through a final output projection. This lets the model attend to different kinds of relationships at once.

    Architecture · advanced
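The split-attend-concatenate pattern from step 3 can be sketched as follows. This is a simplification under stated assumptions: slicing each vector into equal chunks stands in for the per-head learned projections, the final output projection is omitted, and every name and value here is hypothetical.

```python
import math

def attend(query, keys, values):
    """Scaled dot-product attention for one query; returns the mixed value vector."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [sum(e / total * v[i] for e, v in zip(exps, values))
            for i in range(len(values[0]))]

def split_heads(vector, num_heads):
    """Cut one vector into equal slices (a stand-in for learned per-head projections)."""
    size = len(vector) // num_heads
    return [vector[h * size:(h + 1) * size] for h in range(num_heads)]

def multi_head_attend(query, keys, values, num_heads):
    """Run attention independently per head, then concatenate the head outputs."""
    q_heads = split_heads(query, num_heads)
    k_heads = [split_heads(k, num_heads) for k in keys]
    v_heads = [split_heads(v, num_heads) for v in values]
    output = []
    for h in range(num_heads):
        head_out = attend(q_heads[h],
                          [k[h] for k in k_heads],
                          [v[h] for v in v_heads])
        output.extend(head_out)  # concatenation of head results
    return output

# Toy sequence (hypothetical): three tokens with 4-dimensional representations.
tokens = [[1.0, 0.0, 0.0, 2.0], [0.0, 1.0, 2.0, 0.0], [1.0, 1.0, 1.0, 1.0]]
out = multi_head_attend(tokens[2], tokens, tokens, num_heads=2)
```

Each head sees only its own slice, so the two heads can weight the three tokens differently before their outputs are joined back into one 4-dimensional vector.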
  4. KV Cache (key-value cache)

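    Why this step: Often the dominant memory cost during generation. Understanding it is central to understanding inference.

    The KV cache stores the key and value vectors of all earlier tokens during generation so the model does not recompute them at every step. Its size grows linearly with context length and batch size, so beyond the model weights it is usually the main memory cost of LLM inference at long contexts.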

    Architecture · advanced
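To make the memory cost in step 4 concrete, the arithmetic can be sketched as a small function. The helper name `kv_cache_bytes` and the model shape used below are hypothetical and chosen only for illustration. The formula assumes one key vector and one value vector per layer, per KV head, per cached token.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_value=2):
    """Bytes needed to hold keys and values for every layer and cached token."""
    # The leading 2 accounts for storing both a key and a value vector.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token * seq_len * batch_size

# Hypothetical model: 32 layers, 32 KV heads of dimension 128, 16-bit values.
size = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128, seq_len=4096)
print(size / 2**30, "GiB")  # 2.0 GiB
```

Under these assumptions every cached token costs 0.5 MiB, so doubling the context length or the batch size doubles the cache.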
  5. FlashAttention

    Why this step: The breakthrough algorithm that made attention fast and memory-cheap.

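    FlashAttention is an algorithm that computes exact attention faster and with much less memory. It tiles the computation so that each block fits in fast on-chip GPU SRAM, and it never writes the full attention matrix to slower HBM, which cuts attention's extra memory from quadratic to linear in sequence length.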

    Architecture · advanced
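Step 5's claim that tiling still yields exact attention rests on an online softmax: keep a running maximum and running sum, and rescale them as each block arrives. The sketch below illustrates only that idea in plain Python for a single query. It is not the GPU kernel, and the function names and toy data are assumptions made for this example.

```python
import math

def attend_full(query, keys, values):
    """Reference: score every key at once, then softmax and mix."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [sum(e * v[i] for e, v in zip(exps, values)) / total
            for i in range(len(values[0]))]

def attend_tiled(query, keys, values, block_size):
    """Same result, but visiting keys and values one block at a time."""
    scale = math.sqrt(len(query))
    running_max = -math.inf
    running_sum = 0.0
    acc = [0.0] * len(values[0])  # unnormalized weighted sum of values
    for start in range(0, len(keys), block_size):
        k_block = keys[start:start + block_size]
        v_block = values[start:start + block_size]
        scores = [sum(q * k for q, k in zip(query, key)) / scale for key in k_block]
        new_max = max(running_max, max(scores))
        # Rescale everything accumulated so far to the new reference maximum.
        correction = math.exp(running_max - new_max)
        running_sum *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, v_block):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * x for a, x in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]

# Toy data (hypothetical): seven positions, processed three at a time.
keys = [[0.1 * i, 1.0 - 0.2 * i] for i in range(7)]
values = [[float(i), float(i * i)] for i in range(7)]
full = attend_full([1.0, 2.0], keys, values)
tiled = attend_tiled([1.0, 2.0], keys, values, block_size=3)
```

The tiled version never holds more than `block_size` scores at once, yet `full` and `tiled` agree up to floating-point rounding.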
  6. Sliding-Window Attention (SWA)

    Why this step: How modern models trade global attention for linear-cost scaling.

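    Sliding-window attention limits each token to attending to only the most recent W tokens (e.g. W = 4,096), so attention cost grows as O(n·W), which is linear in sequence length n for a fixed W, rather than O(n²). Mistral 7B uses it, and Gemma 2 interleaves it with global-attention layers.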

    Architecture · advanced
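The window rule in step 6 can be sketched in a few lines. One assumption to flag: here the window of W positions includes the current token, and real implementations differ on that off-by-one detail. The helper names are hypothetical.

```python
def window_positions(i, window):
    """Positions token i may attend to: itself and the window-1 tokens before it."""
    return list(range(max(0, i - window + 1), i + 1))

def scored_pairs(seq_len, window=None):
    """Count query-key pairs scored under causal attention, optionally windowed."""
    if window is None:
        window = seq_len  # full causal attention: every earlier token is visible
    return sum(len(window_positions(i, window)) for i in range(seq_len))

print(window_positions(5, window=3))                # [3, 4, 5]
print(scored_pairs(8), scored_pairs(8, window=3))   # 36 21
```

With full causal attention the pair count grows roughly with the square of the sequence length. With a fixed window it grows by at most W per added token.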
  7. Rotary Position Embedding (RoPE)

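    Why this step: How attention learns to care about token position, and why this matters for long context.

    RoPE encodes token position by rotating the query and key vectors in attention by an angle proportional to their position. The rotations make the query-key score depend only on the relative offset between two tokens, which is why RoPE pairs well with context-extension techniques such as position interpolation. On its own, it extrapolates only modestly beyond the sequence lengths the model was trained on.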

    Architecture · advanced
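The rotation in step 7 can be checked numerically. The sketch below assumes the interleaved pairing of dimensions (0 with 1, 2 with 3, and so on) and a frequency base of 10000, and some implementations pair dimensions differently. The function names and vectors are hypothetical.

```python
import math

def rope(vector, position, base=10000.0):
    """Rotate each consecutive pair of dimensions by position * frequency."""
    dim = len(vector)
    rotated = []
    for i in range(0, dim, 2):
        # Lower dimensions rotate quickly, higher dimensions slowly.
        theta = position * base ** (-i / dim)
        x, y = vector[i], vector[i + 1]
        rotated.append(x * math.cos(theta) - y * math.sin(theta))
        rotated.append(x * math.sin(theta) + y * math.cos(theta))
    return rotated

def dot(a, b):
    """Plain dot product, as used for attention scores."""
    return sum(x * y for x, y in zip(a, b))

q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.0]
near = dot(rope(q, 5), rope(k, 3))      # query at 5, key at 3: offset 2
far = dot(rope(q, 105), rope(k, 103))   # query at 105, key at 103: offset 2
```

Both scores use an offset of 2, so `near` and `far` match up to floating-point rounding: the score carries relative position, not absolute position.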
