ModelTerms

Comparison

FlashAttention vs Inference

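FlashAttention and Inference are both common AI/LLM terms, but they sit at different levels: FlashAttention is a specific algorithm for computing attention, while inference is the general process of running a trained model. Here is a quick side-by-side.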

When you would reach for FlashAttention

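FlashAttention comes up when the question is fundamentally about architecture, specifically how the attention layer is computed and how much GPU memory and time it takes on long sequences.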

Example: training a 70B model on 8K context, where the full attention score matrices would not fit in GPU memory with standard attention.
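To see why long context strains standard attention, here is a rough back-of-the-envelope calculation of the memory needed just to hold the full score matrix for one layer. The head count and 2-byte element size are illustrative assumptions, not figures from any specific 70B model, and `attention_scores_bytes` is a hypothetical helper.

```python
def attention_scores_bytes(seq_len, num_heads, batch_size=1, bytes_per_element=2):
    """Bytes needed to store the full seq_len x seq_len score matrix for one layer."""
    return batch_size * num_heads * seq_len * seq_len * bytes_per_element


# Assumed configuration: 8K context, 64 attention heads, 16-bit scores.
seq_len = 8192
num_heads = 64

full_matrix = attention_scores_bytes(seq_len, num_heads)
print(f"{full_matrix / 2**30:.1f} GiB per layer per sequence")  # prints "8.0 GiB per layer per sequence"
```

The cost grows with the square of the sequence length, so doubling the context quadruples this number. FlashAttention avoids materializing the matrix at all, which is why it matters most at long context.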

When you would reach for Inference

Inference comes up when the question is fundamentally about running a trained model to produce outputs: latency, throughput, sampling settings, and serving cost.

A ChatGPT response: one inference call per turn, which internally runs one forward pass for each generated token.
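A single inference call like this can be sketched as a loop that produces one token per step. The sketch below uses a toy stand-in for the model: `toy_logits`, its 5-token vocabulary, and the list used as a "cache" are all hypothetical, chosen only to show the shape of prefill, sampling, and cache growth.

```python
import math
import random


def toy_logits(token, cache):
    """Hypothetical stand-in for a forward pass on ONE new token.

    A real model would append this token's keys and values to the cache and
    attend over all of them; here we only record the token.
    """
    cache.append(token)
    vocab_size = 5
    return [math.sin(len(cache) * (i + 1) + token) for i in range(vocab_size)]


def sample(logits, temperature, rng):
    """Temperature sampling: softmax over scaled logits, then a random draw."""
    scaled = [value / temperature for value in logits]
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


def generate(prompt_tokens, max_new_tokens, temperature=1.0, seed=0):
    if not prompt_tokens:
        raise ValueError("prompt_tokens must not be empty")
    rng = random.Random(seed)
    cache = []
    # Prefill: feed the prompt through the model, filling the cache.
    for token in prompt_tokens:
        logits = toy_logits(token, cache)
    generated = []
    # Decode: each step processes only the newest token and reuses the cache.
    for _ in range(max_new_tokens):
        next_token = sample(logits, temperature, rng)
        generated.append(next_token)
        logits = toy_logits(next_token, cache)
    return generated, len(cache)


tokens, cache_len = generate([1, 2, 3], max_new_tokens=4)
print(len(tokens), cache_len)  # prints "4 7"
```

The cache ends up holding one entry per prompt token plus one per generated token, which is why long conversations cost more memory at inference time.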

Frequently asked

What is the difference between FlashAttention and Inference?

FlashAttention: FlashAttention is an algorithm that computes exact attention faster and with much less memory. It tiles the computation so each block fits in on-chip GPU SRAM, instead of writing the full attention matrix to the slower HBM and reading it back.

Inference: Inference is what happens when you run a trained model on new input. For LLMs that means generating tokens one at a time, sampling each next token and reusing a KV cache so earlier tokens are not recomputed.
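The tiling idea can be illustrated in plain Python for a single query vector. This is only a sketch of the running-softmax trick that makes block-by-block processing exact; it is not the actual GPU kernel, which also tiles over queries and manages SRAM explicitly. The function names and `block_size` are hypothetical.

```python
import math


def naive_attention(q, keys, values):
    """Reference: compute every score first, then softmax and weighted sum."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total for j in range(dim)]


def tiled_attention(q, keys, values, block_size=2):
    """Same result, visiting keys/values one block at a time.

    Only a running max, a running normaliser, and a running output vector
    are kept, so the full list of scores is never stored.
    """
    scale = 1.0 / math.sqrt(len(q))
    running_max = -math.inf
    running_sum = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), block_size):
        k_block = keys[start:start + block_size]
        v_block = values[start:start + block_size]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_block]
        new_max = max(running_max, max(scores))
        # Rescale earlier partial results to the new max (0.0 on the first block).
        correction = math.exp(running_max - new_max)
        running_sum *= correction
        acc = [a * correction for a in acc]
        for score, v in zip(scores, v_block):
            weight = math.exp(score - new_max)
            running_sum += weight
            acc = [a + weight * vj for a, vj in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]


q = [1.0, 0.5]
keys = [[0.1, 0.2], [0.9, -0.3], [0.4, 0.4], [-0.5, 1.0], [0.0, 0.7]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [2.0, -1.0], [0.3, 0.3]]
ref = naive_attention(q, keys, values)
out = tiled_attention(q, keys, values, block_size=2)
print(all(abs(a - b) < 1e-9 for a, b in zip(ref, out)))  # prints "True"
```

Because the rescaling step keeps the partial sums consistent, the blockwise version matches the reference up to floating-point rounding, which is what "exact attention" means here.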

When should I use FlashAttention vs Inference?

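FlashAttention is the right concept when you are focused on how the attention layer is computed, in particular its memory use and speed on long sequences. Inference applies when you are focused on running a trained model to produce outputs, for example latency, throughput, or sampling settings.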

Are FlashAttention and Inference the same thing?

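No. FlashAttention is a technique for computing attention inside the model; inference is the process of running a trained model. They are related, since FlashAttention-style kernels can be used during both training and inference, but they address different parts of the AI stack.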