ModelTerms

Inference · beginner

Inference

Inference is what happens when you actually run a trained model on new input. For LLMs that means generating tokens one at a time, with sampling and a KV cache.
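
As a rough illustration of "one token at a time, with sampling", the loop below runs a toy generator. `toy_logits` is a hypothetical stand-in for a trained model (it is not a real LLM), and `generate` and `sample_next` are names invented for this sketch.

```python
import math
import random

def toy_logits(tokens, vocab_size=8):
    # Hypothetical stand-in for a trained model: deterministic
    # next-token logits derived from the token history.
    rng = random.Random(sum(tokens) * 31 + len(tokens))
    return [rng.gauss(0.0, 1.0) for _ in range(vocab_size)]

def sample_next(logits, temperature, rng):
    # Temperature 0 means greedy decoding: always take the top logit.
    if temperature == 0.0:
        return max(range(len(logits)), key=logits.__getitem__)
    # Otherwise draw from the softmax of the temperature-scaled logits.
    m = max(logits)
    weights = [math.exp((x - m) / temperature) for x in logits]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

def generate(prompt, max_new_tokens, temperature=1.0, seed=0):
    rng = random.Random(seed)
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        logits = toy_logits(tokens)  # one forward pass per new token
        tokens.append(sample_next(logits, temperature, rng))
    return tokens
```

The weights (here, the fixed rule inside `toy_logits`) never change during the loop; only the token list grows.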

Explanation

Training is the part where the model's weights change. Inference is the part where they do not: the weights are frozen and the model only predicts. Most of the AI economy now is inference, because every API call, chat session, and agent step is an inference call.

Inference cost is dominated by two resources: GPU compute (for the matrix multiplications) and GPU memory (for the weights plus the KV cache). Techniques such as FlashAttention, quantization, speculative decoding, and paged KV caching each target one or both of those costs.
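
To make the memory side concrete, here is a back-of-the-envelope sketch. The model shape used in the usage note (32 layers, 32 KV heads, head dimension 128, 16-bit values, 7 billion parameters) is a hypothetical configuration chosen for round numbers, not a description of any specific model.

```python
def weight_bytes(n_params, bytes_per_param=2):
    # Memory needed just to hold the weights (2 bytes each for 16-bit floats).
    return n_params * bytes_per_param

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2):
    # The leading 2 counts one key vector plus one value vector
    # per token, per layer, per KV head.
    return (2 * n_layers * n_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)
```

Under those assumptions, `kv_cache_bytes(32, 32, 128, 4096)` is exactly 2 GiB for a single 4096-token sequence, on top of `weight_bytes(7_000_000_000)`, which is 14 GB. The cache term scales linearly with both sequence length and batch size, which is why long contexts and large batches are memory-bound.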

Examples

  • A ChatGPT response: one inference call per turn.
  • A batch job summarizing 10K documents: 10K inference calls.

Frequently asked

How is Inference related to KV Cache?

The KV cache is an inference-time optimization. It stores the key and value vectors of all earlier tokens during generation so the model does not recompute them at every step. After the weights themselves, it is the main memory cost of LLM inference, and it grows with context length and batch size.

Is Inference considered beginner?

Inference is generally considered beginner-level material in the AI and LLM space.

KV Cache · Architecture

The KV cache stores the key and value vectors of all earlier tokens during generation so the model does not recompute them at every step. After the weights themselves, it is the main memory cost of LLM inference.
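
A minimal single-head sketch of the idea, in plain Python: with a cache, each decode step projects only the new token into a key and a value, while the uncached version re-projects the whole history. The names `KVCache`, `decode_step`, and `decode_step_no_cache` and the tiny weight matrices are assumptions for illustration only.

```python
import math

def matvec(w, x):
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

def attend(q, keys, values):
    # Scaled dot-product attention for a single query vector.
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q)) for k in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    dim = len(values[0])
    return [sum(e * v[j] for e, v in zip(exps, values)) / total for j in range(dim)]

class KVCache:
    def __init__(self):
        self.keys = []
        self.values = []

def decode_step(x, wq, wk, wv, cache):
    # Only the new token's key and value are computed;
    # all earlier ones are read back from the cache.
    cache.keys.append(matvec(wk, x))
    cache.values.append(matvec(wv, x))
    return attend(matvec(wq, x), cache.keys, cache.values)

def decode_step_no_cache(xs, wq, wk, wv):
    # Without a cache, every step re-projects the entire history.
    keys = [matvec(wk, x) for x in xs]
    values = [matvec(wv, x) for x in xs]
    return attend(matvec(wq, xs[-1]), keys, values)
```

Both functions return the same attention output for the newest token; the cached version trades memory (the growing `keys` and `values` lists) for less recomputation.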

FlashAttention · Architecture

FlashAttention is an algorithm that computes exact attention faster and with much less memory by carefully tiling the computation to fit in GPU SRAM rather than going to HBM.
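
The tiling relies on an "online softmax": attention can be computed block by block while carrying only a running maximum, a running denominator, and a running output. The sketch below shows that arithmetic for one query in plain Python. It is not the GPU kernel, models no SRAM or HBM, and the function names are invented for this example.

```python
import math

def _scores(q, keys):
    return [sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q)) for k in keys]

def naive_attention(q, keys, values):
    # Materializes every score before normalizing.
    scores = _scores(q, keys)
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [sum(e * v[j] for e, v in zip(exps, values)) / total
            for j in range(len(values[0]))]

def tiled_attention(q, keys, values, block_size=2):
    m = float("-inf")              # running max of scores seen so far
    denom = 0.0                    # running softmax denominator
    acc = [0.0] * len(values[0])   # running unnormalized output
    for start in range(0, len(keys), block_size):
        block_scores = _scores(q, keys[start:start + block_size])
        m_new = max(m, max(block_scores))
        # Rescale earlier partial results to the new running max.
        rescale = math.exp(m - m_new)
        denom *= rescale
        acc = [a * rescale for a in acc]
        for s, v in zip(block_scores, values[start:start + block_size]):
            p = math.exp(s - m_new)
            denom += p
            acc = [a + p * vj for a, vj in zip(acc, v)]
        m = m_new
    return [a / denom for a in acc]
```

Because each block is folded into the running state and then discarded, the full score vector never has to exist at once, yet the result matches `naive_attention` up to floating-point rounding. That is the sense in which the attention is exact.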

Speculative Decoding · Inference

Speculative decoding speeds up generation by having a small "draft" model propose several tokens, then verifying them in a single batched call to the big model.
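
A sketch of the greedy variant, with two hypothetical toy "models" (`target_next` and `draft_next` are invented stand-ins, not real networks). Production systems typically verify sampled drafts with a rejection-sampling rule so the output distribution matches the big model; this example only covers the simpler greedy case.

```python
def target_next(tokens):
    # Hypothetical "big" model: deterministic greedy next token.
    return (sum(tokens) * 7 + len(tokens)) % 10

def draft_next(tokens):
    # Hypothetical "small" model: agrees with the target except
    # when the history length is a multiple of 3.
    t = target_next(tokens)
    return t if len(tokens) % 3 else (t + 1) % 10

def speculative_step(tokens, k):
    # 1) The draft model proposes k tokens autoregressively.
    proposal = []
    for _ in range(k):
        proposal.append(draft_next(tokens + proposal))
    # 2) The target scores all k+1 positions. In a real system this
    #    is one batched forward pass rather than a Python loop.
    verified = [target_next(tokens + proposal[:i]) for i in range(k + 1)]
    # 3) Keep the matching prefix, then take the target's own token at
    #    the first mismatch (or as a bonus token if all k matched).
    accepted = []
    for i in range(k):
        if proposal[i] != verified[i]:
            break
        accepted.append(proposal[i])
    accepted.append(verified[len(accepted)])
    return accepted

def generate_speculative(prompt, n_new, k=4):
    tokens = list(prompt)
    while len(tokens) - len(prompt) < n_new:
        tokens.extend(speculative_step(tokens, k))
    return tokens[:len(prompt) + n_new]

def generate_target_only(prompt, n_new):
    tokens = list(prompt)
    for _ in range(n_new):
        tokens.append(target_next(tokens))
    return tokens
```

In this greedy setting the speculative output is identical to what the target model would produce alone; the gain is that each target call can confirm several tokens instead of one.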

vLLM · Infrastructure

vLLM is an open-source, high-throughput LLM serving engine. Its PagedAttention KV cache manager, which allocates cache memory in fixed-size blocks rather than one contiguous region per request, is a main reason it outperforms naive serving setups.
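
The paging idea can be sketched with a toy block allocator. This is not vLLM's implementation or API; `PagedKVAllocator` and its methods are hypothetical names that only illustrate mapping each sequence to a list of fixed-size physical blocks.

```python
class PagedKVAllocator:
    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # physical block ids
        self.block_tables = {}  # seq_id -> list of physical block ids
        self.seq_lens = {}      # seq_id -> tokens stored so far

    def append_token(self, seq_id):
        n = self.seq_lens.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        # Grab a new block only when the sequence has none yet
        # or its last block is full.
        if n % self.block_size == 0:
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = n + 1
        # Where this token's K/V would live: (physical block, offset).
        return table[-1], n % self.block_size

    def free(self, seq_id):
        # A finished sequence returns all of its blocks to the pool.
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)
```

Because a sequence wastes at most one partially filled block, memory does not have to be reserved up front for the maximum possible length, so more requests fit on the GPU at once.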

Quantization · Infrastructure

Quantization reduces model weights from 16- or 32-bit floats to lower-precision types (INT8, INT4) so the model needs less memory and runs faster, usually with minor quality loss.
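
One simple scheme, shown here as a sketch, is symmetric per-tensor INT8 quantization with an absolute-maximum scale. Real toolchains use more elaborate variants (per-channel or per-group scales, calibration data); the function names below are illustrative.

```python
def quantize_int8(weights):
    # One scale for the whole tensor, chosen so the largest
    # magnitude maps to 127.
    scale = max(abs(w) for w in weights) / 127.0
    if scale == 0.0:
        scale = 1.0  # all-zero tensor: any scale works
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    # Recover approximate float weights from the stored integers.
    return [v * scale for v in q]
```

Each weight is stored as one byte plus a shared scale, and the round trip changes any weight by at most half a quantization step (`scale / 2`), which is the source of the usually minor quality loss.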

Reasoning Model · Architecture

A reasoning model spends extra compute thinking step by step before answering. OpenAI o1/o3, DeepSeek R1, and Anthropic's Claude models with extended thinking enabled are examples.

Test-Time Compute · Prompting

Test-time compute is the extra reasoning, sampling, or search a model can do at inference time to improve quality — more thinking tokens, more candidate answers, or verifier-guided search.
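
The "more candidate answers" strategy is often called best-of-N. A minimal sketch follows, where `propose` stands in for one sampled model answer and `score` for a verifier; both are hypothetical callables supplied by the caller.

```python
import random

def best_of_n(propose, score, n, seed=0):
    # Spend more inference compute (n samples) and keep the
    # candidate the verifier likes best.
    rng = random.Random(seed)
    candidates = [propose(rng) for _ in range(n)]
    return max(candidates, key=score)

# Toy task: guess an integer whose square is 1369.
def propose_guess(rng):
    return rng.randint(1, 100)

def score_guess(x):
    return -abs(x * x - 1369)
```

With the same seed, `best_of_n(propose_guess, score_guess, 64)` can never score worse than `best_of_n(propose_guess, score_guess, 1)`, because the larger run includes the smaller run's only candidate. Quality rises with N only as far as the verifier is reliable.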

Prompt Caching · Inference

Prompt caching stores the KV-cache state of a long prefix (system prompt, large document, tool definitions) so subsequent calls that reuse it skip the prefill compute for that prefix. This cuts time to first token (TTFT) and cost, by 50-90% in favorable cases.
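
A toy sketch of the bookkeeping: the cache is keyed by the exact prefix tokens, and a counter tracks how many tokens actually go through prefill. `PrefixCache` is a hypothetical class, and the stored "state" is a plain list standing in for real per-layer key and value tensors.

```python
class PrefixCache:
    def __init__(self):
        self.store = {}  # prefix tokens (as a tuple) -> saved state
        self.prefill_tokens_computed = 0

    def _prefill(self, tokens):
        # Stand-in for the expensive forward pass over prompt tokens.
        self.prefill_tokens_computed += len(tokens)
        return list(tokens)

    def run(self, prefix, suffix):
        key = tuple(prefix)
        if key not in self.store:
            # Cache miss: pay for the prefix once and save its state.
            self.store[key] = self._prefill(prefix)
        # Hit or miss, only the new suffix is prefilled from here on.
        return self.store[key] + self._prefill(suffix)
```

With a 1,000-token shared prefix and 10-token questions, the first call prefills 1,010 tokens and each later call prefills only 10. Any change to the prefix produces a different key, so the prefix must match exactly to get a hit.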
