ModelTerms

Architecture · advanced

Sliding-Window Attention (SWA)

Sliding-window attention limits each token to attending to only the most recent W tokens (e.g. 4K), so attention cost grows linearly with sequence length instead of quadratically. Mistral 7B and recent Gemma models use it.

Explanation

Vanilla self-attention compares every token to every other token, costing O(n²) for a sequence of n tokens. Sliding-window attention caps the number of tokens each position attends to at a fixed W (say, 4096), so cost is O(n·W), which grows linearly in n.
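To make the cost difference concrete, here is a minimal sketch in plain Python that builds a causal sliding-window mask and counts the allowed query-key pairs. The function names are hypothetical, and the convention that the window of W positions includes the current token is an assumption; real implementations differ on that detail.

```python
def sliding_window_mask(n: int, window: int) -> list[list[bool]]:
    """mask[i][j] is True when query position i may attend to key position j."""
    # Causal: j <= i. Windowed: j is among the `window` most recent positions,
    # counting position i itself (an assumed convention, not a standard).
    return [[(i - window < j <= i) for j in range(n)] for i in range(n)]


def attended_pairs(mask: list[list[bool]]) -> int:
    """Number of (query, key) pairs the mask allows, a proxy for attention cost."""
    return sum(sum(row) for row in mask)


n, window = 8, 3
mask = sliding_window_mask(n, window)

# Print the mask: 'x' marks an allowed pair, '.' a masked one.
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))

full_causal = n * (n + 1) // 2  # pairs allowed by ordinary causal attention
print(attended_pairs(mask), "pairs with the window,", full_causal, "without")
```

With the window, each row holds at most W allowed entries, so the pair count grows like n·W rather than n².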

The trade-off: within a single layer, a token cannot directly attend to anything more than W positions back. In practice, models stack layers so the effective receptive field grows by roughly W per layer, recovering much of the long-range capability while keeping cost low. Some designs also add global attention on a few special tokens.
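The growth of the receptive field across layers can be checked with a small simulation. This is a sketch under the same assumed convention as before (the window includes the current token, so each layer reaches back W-1 further positions); `reachable_span` is a hypothetical helper, not part of any library.

```python
def reachable_span(n: int, window: int, layers: int) -> int:
    """Count the input positions that can influence the last token after `layers` layers."""
    # reach[j] is True when position j (at the current depth) feeds the last token.
    reach = [False] * n
    reach[n - 1] = True
    for _ in range(layers):
        new_reach = [False] * n
        for i in range(n):
            if reach[i]:
                # Position i reads positions i-window+1 .. i of the layer below.
                for j in range(max(0, i - window + 1), i + 1):
                    new_reach[j] = True
        reach = new_reach
    return sum(reach)


for layers in (1, 2, 3, 50):
    print(layers, "layers ->", reachable_span(n=64, window=4, layers=layers), "positions")
```

Under this convention the span is layers × (W - 1) + 1 until it covers the whole sequence. By the same rough arithmetic, 32 layers with W = 4096 give a theoretical span of about 32 × 4096 = 131,072 tokens, though how well information survives that many hops is a separate question.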

Mistral 7B uses SWA with W=4096. Gemma 2 and Gemma 3 interleave sliding-window layers with full-attention layers, and other efficient open models use similar variants.

Examples

  • Mistral 7B: sliding window of 4096 tokens; stacked over its 32 layers, this gives a theoretical attention span of about 131K tokens.
  • Gemma 2 and Gemma 3: sliding-window layers interleaved with full-attention layers.

Frequently asked

How is Sliding-Window Attention related to Attention?

Sliding-window attention is a restricted form of attention. Attention is the mechanism a transformer uses to decide which earlier tokens matter most when producing each new one, mixing information across positions by weighted sum. Sliding-window attention computes the same weighted sum, but only over the most recent W positions.

Is Sliding-Window Attention considered advanced?

Sliding-Window Attention is generally considered advanced-level material in the AI and LLM space.

Attention · Architecture

Attention is the mechanism a transformer uses to decide which earlier tokens matter most when producing each new one. It mixes information across positions by weighted sum.

FlashAttention · Architecture

FlashAttention is an algorithm that computes exact attention faster and with much less memory by tiling the computation to fit in GPU SRAM, which minimizes reads and writes to the slower HBM.

Transformer · Architecture

The transformer is the neural network architecture behind virtually every modern large language model. It uses self-attention to model relationships between all positions in a sequence in parallel.

Context Window · Inference

The context window is the maximum number of tokens an LLM can consider in a single call — prompt plus generated output combined.

Mamba · Architecture

Mamba is a state-space model architecture that replaces transformer attention with selective state updates. It scales linearly with sequence length and matches transformer quality on many tasks.
