Architecture · advanced
Sliding-Window Attention (SWA)
Sliding-window attention restricts each token to attending to only the most recent W tokens (e.g. 4096), so attention cost scales as O(n·W), which is linear in sequence length n for a fixed W. Mistral 7B and Gemma 2 use it.
Explanation
Vanilla self-attention compares every token with every other token, costing O(n²) in sequence length n. Sliding-window attention caps each token's view at a fixed W, say 4096, so the cost becomes O(n·W), which grows linearly in n.
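The cost difference can be made concrete by counting which query/key pairs are allowed. The following is a minimal sketch, not any particular model's implementation: the function names, the toy sizes (n = 8, W = 3), and the convention that the window is causal and includes the current token are all assumptions chosen for illustration.

```python
def sliding_window_mask(n: int, window: int) -> list[list[bool]]:
    """mask[i][j] is True when query position i may attend to key position j.

    Assumed convention: causal (j <= i), and the window of `window`
    positions includes the current token i itself.
    """
    return [[(i - window < j <= i) for j in range(n)] for i in range(n)]


def count_pairs(mask: list[list[bool]]) -> int:
    """Number of (query, key) pairs that attention actually has to score."""
    return sum(sum(row) for row in mask)


n, window = 8, 3
mask = sliding_window_mask(n, window)

# Print the banded mask: 'x' = allowed, '.' = masked out.
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))

full_causal_pairs = n * (n + 1) // 2   # grows quadratically with n
windowed_pairs = count_pairs(mask)     # grows like n * window
print(full_causal_pairs, windowed_pairs)
```

For n = 8 and W = 3 this gives 21 allowed pairs under the window versus 36 for full causal attention. Doubling n roughly doubles the windowed count, while the full causal count roughly quadruples.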
The trade-off: a token cannot directly attend to anything more than W tokens back. In practice, stacking layers extends the effective receptive field by roughly W per layer, because each layer can relay information one more window back. This recovers much of the long-range capability while keeping cost low. SWA is often combined with global attention on a few special tokens.
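The per-layer growth of the receptive field can be checked with simple arithmetic. This sketch assumes the window includes the current token, so each layer reaches at most W - 1 positions further back; the function names and the depth of 32 layers are illustrative assumptions, not values taken from the text above.

```python
def earliest_reachable(position: int, window: int, layers: int) -> int:
    """Oldest input position that can influence `position` after `layers` SWA layers."""
    oldest = position
    for _ in range(layers):
        # One layer lets information hop back at most (window - 1) positions.
        oldest = max(0, oldest - (window - 1))
    return oldest


def receptive_field(window: int, layers: int) -> int:
    """Upper bound on how many positions one output token can depend on."""
    return layers * (window - 1) + 1


# Toy case: window of 3, two layers, looking back from position 10.
print(earliest_reachable(10, window=3, layers=2))

# Illustrative large case: W = 4096 with an assumed depth of 32 layers.
print(receptive_field(window=4096, layers=32))
```

With W = 4096 and 32 layers the bound is 131,041 positions, which is where the "roughly W per layer" rule of thumb comes from. This is a theoretical reach only: information from distant tokens has to survive many relays, so it is not equivalent to attending to them directly.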
Mistral 7B uses SWA with W=4096. Gemma 2 and several other efficient open models use variants.
Examples
- Mistral 7B: sliding window of 4096 tokens. Its paper lists a context length of 8192, while the released configuration accepts positions up to 32K.
- Gemma 2: alternates sliding-window layers with full-attention layers.
Frequently asked
What is Sliding-Window Attention?
Sliding-window attention restricts each token to attending to only the most recent W tokens (e.g. 4096), so attention cost scales as O(n·W), which is linear in sequence length n for a fixed W. Mistral 7B and Gemma 2 use it.
What is an example of sliding-window attention?
Mistral 7B: sliding window of 4096 tokens. Its paper lists a context length of 8192, while the released configuration accepts positions up to 32K.
How is Sliding-Window Attention related to Attention?
Sliding-Window Attention is a restricted form of Attention. Attention is the mechanism a transformer uses to decide which earlier tokens matter most when producing each new one; it mixes information across positions by a weighted sum. Sliding-window attention keeps that mechanism unchanged but only lets each token weigh the most recent W positions.
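The relationship can be sketched in a few lines: the weighted sum is ordinary softmax attention, and the only change is the range of positions it runs over. This is a toy, single-head version over scalar features with hypothetical names (`swa_output`, `q`, `k`, `v`), assuming a causal window that includes the current token; real implementations work on vectors and batched tensors.

```python
import math


def swa_output(q: list[float], k: list[float], v: list[float], window: int) -> list[float]:
    """Causal softmax attention over scalar features, restricted to a sliding window."""
    outputs = []
    for i in range(len(q)):
        # Only the most recent `window` positions (including i) are visible.
        start = max(0, i - window + 1)
        visible = range(start, i + 1)
        logits = [q[i] * k[j] for j in visible]
        # Numerically stable softmax over the visible positions only.
        peak = max(logits)
        exps = [math.exp(x - peak) for x in logits]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted sum of the visible values.
        outputs.append(sum(w * v[j] for w, j in zip(weights, visible)))
    return outputs


# With all-zero queries the weights are uniform, so each output is simply
# the mean of the last `window` values.
values = [1.0, 2.0, 3.0, 4.0]
print(swa_output([0.0] * 4, [1.0] * 4, values, window=2))
```

Setting `window` to the full sequence length turns this back into plain causal attention, which is the sense in which SWA is a special case rather than a different mechanism.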
Is Sliding-Window Attention considered advanced?
Sliding-Window Attention is generally considered advanced-level material in the AI and LLM space.