Inference is what happens when you actually run a trained model on new input. For LLMs that means generating tokens one at a time, with sampling and a KV cache.
Explanation
Training is the phase in which the model's weights change. Inference is the phase in which they do not: the weights are frozen and the model only predicts. Most day-to-day use of AI is inference. Every API call, chat turn, and agent step triggers at least one inference call.
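To make "frozen weights, one token at a time" concrete, here is a minimal sketch of a generation loop with temperature sampling. The vocabulary, the `FROZEN_LOGITS` table, and the `generate` function are all hypothetical stand-ins: a real LLM computes its logits from the entire context with a neural network, not from a lookup on the previous token.

```python
import math
import random

# Hypothetical toy "model": a fixed table of next-token logits keyed by the
# previous token. Nothing in this table is modified during generation.
VOCAB = ["<eos>", "the", "cat", "sat"]
FROZEN_LOGITS = {
    "the": [0.1, 0.0, 2.0, 0.5],
    "cat": [0.5, 0.0, 0.0, 2.0],
    "sat": [2.0, 0.3, 0.0, 0.0],
}

def softmax(logits, temperature=1.0):
    """Turn logits into probabilities; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def generate(prompt, max_new_tokens=8, temperature=1.0, seed=0):
    """Append sampled tokens one at a time until <eos> or the token budget."""
    rng = random.Random(seed)
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        probs = softmax(FROZEN_LOGITS[tokens[-1]], temperature)
        next_token = rng.choices(VOCAB, weights=probs, k=1)[0]
        if next_token == "<eos>":
            break
        tokens.append(next_token)
    return tokens

print(generate(["the"], temperature=0.01))
```

With a temperature this low, sampling in this toy setup effectively picks the most likely token at every step. Raising the temperature makes the output more varied. The loop reads `FROZEN_LOGITS` but never writes to it, which is the inference side of the training/inference split described above.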
Inference is dominated by two costs: GPU compute (for the matrix multiplications) and GPU memory (for the weights and the KV cache). Techniques such as FlashAttention, quantization, speculative decoding, and paged KV caches all target one or both of those costs.
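The memory side of that cost can be estimated with back-of-the-envelope arithmetic. The sketch below uses a hypothetical configuration (7 billion parameters, 32 layers, 32 key/value heads, head dimension 128). These numbers are assumptions chosen for illustration, not taken from any particular model, and the estimate ignores activations and framework overhead.

```python
GIB = 1024 ** 3

def weight_bytes(num_params, bytes_per_param):
    """Memory needed just to hold the model weights."""
    return num_params * bytes_per_param

def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch, bytes_per_value):
    """Memory for cached keys and values.

    The factor of 2 covers one key vector and one value vector
    per token, per layer, per head.
    """
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_value

# Hypothetical model configuration (illustrative assumption).
cfg = dict(layers=32, kv_heads=32, head_dim=128)

fp16_weights = weight_bytes(7_000_000_000, 2)    # 16-bit weights
int4_weights = weight_bytes(7_000_000_000, 0.5)  # 4-bit quantized weights
one_chat = kv_cache_bytes(**cfg, seq_len=4096, batch=1, bytes_per_value=2)
eight_chats = kv_cache_bytes(**cfg, seq_len=4096, batch=8, bytes_per_value=2)

print(f"fp16 weights:          {fp16_weights / GIB:.1f} GiB")
print(f"int4 weights:          {int4_weights / GIB:.1f} GiB")
print(f"KV cache, 1 x 4096:    {one_chat / GIB:.1f} GiB")
print(f"KV cache, 8 x 4096:    {eight_chats / GIB:.1f} GiB")
```

Under these assumptions the weights are a fixed cost, while the KV cache scales linearly with both context length and the number of concurrent sequences. That is why quantization (smaller weights) and paged KV caches (less wasted cache memory) are aimed at the memory half of the budget.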
Examples
A ChatGPT response: one inference call per turn.
A batch job summarizing 10K documents: 10K inference calls.
Frequently asked
What is inference?
Inference is what happens when you actually run a trained model on new input. For LLMs that means generating tokens one at a time, with sampling and a KV cache.
What is an example of inference?
A ChatGPT response: one inference call per turn.
How is inference related to the KV cache?
The KV cache is a core part of LLM inference. During generation it stores the key and value vectors of all earlier tokens so the model does not recompute them at every step. Apart from the weights themselves, it is the main memory cost of LLM inference, and it grows with context length and batch size.
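As a rough illustration of what "does not recompute them" means, the sketch below runs single-head attention with and without a cache and checks that both give the same result. The `project` function is a hypothetical stand-in for learned projection matrices, and `KVCache`, `decode_step`, and `decode_without_cache` are illustrative names, not a real library's API.

```python
import math

def attend(query, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    peak = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total
            for i in range(dim)]

def project(x, factor):
    # Hypothetical stand-in for a learned projection matrix.
    return [factor * xi for xi in x]

class KVCache:
    """Holds the key and value vectors of every token seen so far."""
    def __init__(self):
        self.keys = []
        self.values = []

    def append(self, key, value):
        self.keys.append(key)
        self.values.append(value)

def decode_step(x, cache):
    """Compute K and V for the new token only, then attend over the cache."""
    q, k, v = project(x, 1.0), project(x, 0.5), project(x, 2.0)
    cache.append(k, v)
    return attend(q, cache.keys, cache.values)

def decode_without_cache(xs):
    """Recompute K and V for every token at this step (the wasteful path)."""
    keys = [project(x, 0.5) for x in xs]
    values = [project(x, 2.0) for x in xs]
    return attend(project(xs[-1], 1.0), keys, values)

xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
cache = KVCache()
cached_outputs = [decode_step(x, cache) for x in xs]
recomputed = decode_without_cache(xs)
print(all(math.isclose(a, b) for a, b in zip(cached_outputs[-1], recomputed)))
```

The cached path does a constant amount of projection work per new token, at the price of keeping every earlier key and value in memory. That trade is the reason the cache grows with sequence length.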
Is inference considered a beginner topic?
Inference is generally considered beginner-level material in the AI and LLM space.