Architecture · advanced
FlashAttention
FlashAttention is an algorithm that computes exact attention faster and with much less memory by tiling the computation so that intermediate results stay in fast on-chip GPU SRAM instead of being written to and read back from slower high-bandwidth memory (HBM).
Explanation
Standard attention materializes the full N-by-N attention matrix in memory, which dominates the cost for long sequences. FlashAttention avoids ever holding the full matrix: it processes small blocks of queries and keys at a time, accumulating the softmax incrementally in fast on-chip memory.
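The incremental softmax can be illustrated with a minimal pure-Python sketch. It is not the actual GPU kernel: the function names, block sizes, and single-head layout are assumptions chosen for readability. It only shows that keeping a running maximum, a running denominator, and a running weighted sum per query is enough to process one small tile of scores at a time.

```python
import math
import random

def naive_attention(Q, K, V):
    """Reference version: builds a full row of N scores for every query."""
    scale = 1.0 / math.sqrt(len(Q[0]))
    out = []
    for q in Q:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in K]
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out.append([sum(w * v[j] for w, v in zip(weights, V)) / total
                    for j in range(len(V[0]))])
    return out

def tiled_attention(Q, K, V, block_q=2, block_k=3):
    """Same output, but only a block_q-by-block_k tile of scores is live at once."""
    dv = len(V[0])
    scale = 1.0 / math.sqrt(len(Q[0]))
    out = []
    for qs in range(0, len(Q), block_q):
        q_blk = Q[qs:qs + block_q]
        # Running state per query: max score, softmax denominator, weighted sum of V.
        run_max = [-math.inf] * len(q_blk)
        run_sum = [0.0] * len(q_blk)
        acc = [[0.0] * dv for _ in q_blk]
        for ks in range(0, len(K), block_k):
            k_blk, v_blk = K[ks:ks + block_k], V[ks:ks + block_k]
            for i, q in enumerate(q_blk):
                scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_blk]
                new_max = max(run_max[i], max(scores))
                # Rescale everything accumulated under the previous maximum.
                correction = math.exp(run_max[i] - new_max)
                weights = [math.exp(s - new_max) for s in scores]
                run_sum[i] = run_sum[i] * correction + sum(weights)
                acc[i] = [a * correction + sum(w * v[j] for w, v in zip(weights, v_blk))
                          for j, a in enumerate(acc[i])]
                run_max[i] = new_max
        # Normalize once, after all key blocks have been seen.
        out.extend([a / run_sum[i] for a in acc[i]] for i in range(len(q_blk)))
    return out

# Small random example (sizes are arbitrary).
random.seed(0)
N, d = 7, 4
Q = [[random.gauss(0, 1) for _ in range(d)] for _ in range(N)]
K = [[random.gauss(0, 1) for _ in range(d)] for _ in range(N)]
V = [[random.gauss(0, 1) for _ in range(d)] for _ in range(N)]

ref = naive_attention(Q, K, V)
tiled = tiled_attention(Q, K, V)
max_diff = max(abs(a - b) for r, t in zip(ref, tiled) for a, b in zip(r, t))
print("largest difference between the two versions:", max_diff)
```

The two versions agree up to floating-point rounding. In a real kernel the tiles would be sized to fit in SRAM and the inner loops would run as matrix multiplications on the GPU.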
The result is identical to standard attention (no approximation), but it typically runs 2-4x faster and uses far less memory, because the extra memory grows linearly with sequence length rather than quadratically. FlashAttention-2 and -3 further tune the kernel for newer GPU architectures.
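To give a sense of scale, the size of the N-by-N score matrix that standard attention would materialize can be estimated with simple arithmetic. This is a rough sketch: the helper name, the 2-byte (16-bit) element size, and the 64-head figure are illustrative assumptions, and the estimate covers one sequence in one layer only.

```python
def score_matrix_bytes(seq_len, num_heads=1, bytes_per_element=2):
    """Bytes needed to hold the full seq_len-by-seq_len score matrix for each head."""
    return num_heads * seq_len * seq_len * bytes_per_element

for n in (2048, 8192, 32768):
    one_head = score_matrix_bytes(n) / 2**20                 # MiB for a single head
    all_heads = score_matrix_bytes(n, num_heads=64) / 2**30  # GiB, assuming 64 heads
    print(f"N={n}: {one_head:.0f} MiB per head, {all_heads:.1f} GiB for 64 heads")
```

Under these assumptions, an 8,192-token sequence needs 128 MiB per head, or 8 GiB across 64 heads, and quadrupling the sequence length multiplies that by 16. A tiled kernel only ever holds one small block of scores at a time.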
Many modern training and inference frameworks (PyTorch, vLLM, TGI, llama.cpp) ship FlashAttention or FlashAttention-style kernels, and several enable them by default on supported hardware.
Examples
- Training a 70B model on 8K context that would not fit with standard attention.
- Increasing tokens-per-second on H100 inference without changing the model's outputs.
Frequently asked
What is FlashAttention?
FlashAttention is an algorithm that computes exact attention faster and with much less memory by tiling the computation so that intermediate results stay in fast on-chip GPU SRAM instead of being written to and read back from slower high-bandwidth memory (HBM).
What is an example of FlashAttention?
Training a 70B model on 8K context that would not fit with standard attention.
How is FlashAttention related to Attention?
FlashAttention is a faster, more memory-efficient way to compute attention, not a different mechanism: it produces the same output as standard attention. Attention itself is the mechanism a transformer uses to decide which earlier tokens matter most when producing each new one. It mixes information across positions by taking a weighted sum.
Is FlashAttention considered advanced?
FlashAttention is generally considered advanced-level material in the AI and LLM space.