ModelTerms

Architecture · advanced

FlashAttention

FlashAttention is an algorithm that computes exact attention faster and with much less memory by tiling the computation so that intermediate results stay in fast on-chip GPU SRAM instead of being written to and read back from slower HBM.

Explanation

Standard attention materializes the full N-by-N attention matrix in memory, which dominates the cost for long sequences. FlashAttention avoids ever holding the full matrix: it processes small blocks of queries and keys at a time, accumulating the softmax incrementally in fast on-chip memory.
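The incremental softmax described above can be sketched in plain Python. This is a minimal illustration, not the actual kernel: the function names and the `block` parameter are made up for this example, queries are handled one at a time for simplicity, and the real implementation is a fused GPU kernel that also tiles over queries. The point is only that a running maximum, a running denominator, and an unnormalized accumulator are enough to reproduce the full-row softmax.

```python
import math

def naive_attention(Q, K, V):
    """Reference: build the full row of scores for each query, then softmax."""
    d, dv = len(Q[0]), len(V[0])
    out = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in K]
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out.append([sum(w * v[j] for w, v in zip(weights, V)) / total
                    for j in range(dv)])
    return out

def tiled_attention(Q, K, V, block=2):
    """Visit keys/values block by block, never holding a full row of scores."""
    d, dv = len(Q[0]), len(V[0])
    out = []
    for q in Q:
        run_max = -math.inf   # largest score seen so far
        run_sum = 0.0         # softmax denominator, relative to run_max
        acc = [0.0] * dv      # unnormalized weighted sum of values
        for start in range(0, len(K), block):
            k_blk = K[start:start + block]
            v_blk = V[start:start + block]
            scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
                      for k in k_blk]
            new_max = max(run_max, max(scores))
            # Rescale earlier partial results to the new maximum.
            # On the first block this is exp(-inf), which is 0.0.
            scale = math.exp(run_max - new_max)
            run_sum *= scale
            acc = [a * scale for a in acc]
            for s, v in zip(scores, v_blk):
                w = math.exp(s - new_max)
                run_sum += w
                acc = [a + w * vj for a, vj in zip(acc, v)]
            run_max = new_max
        out.append([a / run_sum for a in acc])
    return out

# Toy inputs (arbitrary numbers chosen for illustration).
Q = [[0.1, 0.2], [0.3, -0.4], [1.0, 0.5]]
K = [[0.5, 0.1], [-0.2, 0.7], [0.9, -0.3], [0.0, 0.4], [2.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [2.0, -1.0], [-1.0, 3.0]]

ref = naive_attention(Q, K, V)
got = tiled_attention(Q, K, V, block=2)
max_diff = max(abs(a - b) for r, g in zip(ref, got) for a, b in zip(r, g))
print("max abs difference:", max_diff)
```

The two functions agree up to floating-point rounding for any block size, which is why the tiled form counts as exact rather than approximate.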

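The result is the same as standard attention up to floating-point rounding (no approximation), but it typically runs 2-4x faster and uses far less memory, since the extra memory grows linearly with sequence length instead of quadratically. FlashAttention-2 and -3 further specialize the kernel for newer GPU architectures.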
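To get a feel for the memory at stake, the size of the N-by-N score matrix can be worked out directly. The helper below is a back-of-the-envelope sketch; the function name, the 64-head count, the fp16 element size, and the 128-by-128 tile are assumptions chosen for illustration, not values from any particular model or kernel.

```python
def score_matrix_bytes(seq_len, num_heads=1, batch=1, bytes_per_element=2):
    """Bytes needed to hold full seq_len x seq_len attention scores."""
    return batch * num_heads * seq_len * seq_len * bytes_per_element

MIB = 2 ** 20
KIB = 2 ** 10

# One head at 8K context in fp16 (2 bytes per element).
print(score_matrix_bytes(8192) / MIB)                # 128.0 MiB

# The same for a hypothetical 64 heads, per layer, per sequence.
print(score_matrix_bytes(8192, num_heads=64) / MIB)  # 8192.0 MiB (8 GiB)

# A single hypothetical 128 x 128 tile of scores.
print(score_matrix_bytes(128) / KIB)                 # 32.0 KiB
```

A tile of that size is small enough to live in on-chip memory, while the full matrix is not.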

Most modern training and inference frameworks (PyTorch, vLLM, TGI, llama.cpp) ship FlashAttention or a FlashAttention-style kernel, and many select it by default when the hardware and data types allow.

Examples

  • Training a 70B model on 8K context that would not fit with standard attention.
  • Doubling tokens-per-second on H100 inference without changing the model or its outputs.

Frequently asked

How is FlashAttention related to Attention?

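FlashAttention is a faster, more memory-efficient way to compute attention, not a different mechanism: it produces the same output with a different memory access pattern. Attention itself is the mechanism a transformer uses to decide which earlier tokens matter most when producing each new one. It mixes information across positions by weighted sum.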

Is FlashAttention considered advanced?

FlashAttention is generally considered advanced-level material in the AI and LLM space.

Attention · Architecture

Attention is the mechanism a transformer uses to decide which earlier tokens matter most when producing each new one. It mixes information across positions by weighted sum.

KV Cache · Architecture

The KV cache stores the key and value vectors of all earlier tokens during generation so the model does not recompute them at every step. It is often the largest memory cost of LLM inference after the model weights, and it grows with both context length and batch size.
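The size of a KV cache follows from simple arithmetic: two tensors (keys and values) per layer, each holding one vector per KV head per token. The sketch below assumes a hypothetical model configuration (32 layers, 32 KV heads, head dimension 128, fp16); the function name and all numbers are illustrative.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch=1, bytes_per_element=2):
    """Bytes for cached keys and values across all layers."""
    per_token = 2 * num_layers * num_kv_heads * head_dim  # 2 = keys + values
    return per_token * seq_len * batch * bytes_per_element

GIB = 2 ** 30

# Hypothetical configuration, one sequence of 4096 tokens in fp16.
print(kv_cache_bytes(32, 32, 128, 4096) / GIB)            # 2.0 GiB

# The same configuration with a batch of 16 sequences.
print(kv_cache_bytes(32, 32, 128, 4096, batch=16) / GIB)  # 32.0 GiB
```

Because the total scales linearly with both sequence length and batch size, serving many long requests at once is what makes the cache expensive.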

GPU · Infrastructure

GPUs are the parallel processors that train and run nearly every modern AI model. Their throughput on matrix multiplication is what makes deep learning practical.

Inference · Inference

Inference is what happens when you actually run a trained model on new input. For LLMs that means generating tokens one at a time, with sampling and a KV cache.

vLLM · Infrastructure

vLLM is an open-source high-throughput LLM serving engine. Its PagedAttention KV cache manager is the reason it dramatically outperforms naive serving setups.

Sliding-Window Attention · Architecture

Sliding-window attention limits each token to attending only to the most recent W tokens (e.g. 4K), so attention cost grows linearly with sequence length (proportional to N·W rather than N²). Models such as Mistral 7B and Gemma 2 use it.
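A sliding window amounts to a banded causal mask. The sketch below builds one for a tiny sequence and counts the allowed query-key pairs; the function names are made up for this example, and it assumes the window of W tokens includes the current token, a convention that varies between implementations.

```python
def sliding_window_mask(seq_len, window):
    """mask[i][j] is True when query i may attend to key j."""
    return [[0 <= i - j < window for j in range(seq_len)]
            for i in range(seq_len)]

def causal_mask(seq_len):
    """Full causal mask: each query sees itself and every earlier key."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]

def count_pairs(mask):
    return sum(sum(row) for row in mask)

mask = sliding_window_mask(6, 3)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))

print(count_pairs(mask))            # 15
print(count_pairs(causal_mask(6)))  # 21
```

For long sequences the windowed count approaches N·W, while the full causal count grows as roughly N²/2.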

Continuous Batching · Inference

Continuous batching lets new requests join an in-flight batch on the next decode step rather than waiting for the current batch to finish, dramatically raising GPU utilization on variable-length workloads.
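The scheduling difference can be illustrated with a toy step counter. This is a simplified simulation under stated assumptions, not how any particular serving engine is implemented: every request needs a known number of decode steps (at least one), each step produces one token per in-flight request, and prefill cost is ignored. The function names are hypothetical.

```python
from collections import deque

def static_batching_steps(lengths, batch_size):
    """Each batch runs until its longest request finishes."""
    return sum(max(lengths[i:i + batch_size])
               for i in range(0, len(lengths), batch_size))

def continuous_batching_steps(lengths, batch_size):
    """Free slots are refilled from the queue before every decode step."""
    queue = deque(lengths)
    active = []  # remaining tokens for each in-flight request
    steps = 0
    while queue or active:
        while queue and len(active) < batch_size:
            active.append(queue.popleft())
        steps += 1
        # Decode one token per request and drop the ones that finished.
        active = [r - 1 for r in active if r - 1 > 0]
    return steps

lengths = [8, 2, 2, 2]  # tokens to generate per request
print(static_batching_steps(lengths, batch_size=2))      # 10
print(continuous_batching_steps(lengths, batch_size=2))  # 8
```

In the static case the short request in the first batch leaves its slot idle for six steps, while the continuous scheduler hands that slot to the next request as soon as it is free.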
