ModelTerms

Architecture · advanced

KV Cache (key-value cache)

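The KV cache stores the key and value vectors of all earlier tokens during generation so the model does not recompute them at every step. Alongside the model weights, it is usually the largest memory cost of LLM inference, and the part that grows with context length and batch size.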

Explanation

When generating the 1,000th token, a naive implementation would recompute the keys and values for all 999 previous tokens before running attention. The KV cache keeps the per-layer K and V tensors from earlier steps, so only the new token needs fresh K/V computation. Each decode step then costs work that is linear in the current sequence length rather than quadratic.
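The idea can be sketched for a single attention head with NumPy. This is a minimal illustration, not how any particular framework implements it: the names `KVCache`, `decode_with_cache`, and `decode_without_cache` are hypothetical, and the "model" is just three random projection matrices.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def attend(q, K, V):
    """Attention output for one query vector q over keys K and values V."""
    scores = K @ q / np.sqrt(q.shape[-1])  # one score per earlier token
    return softmax(scores) @ V             # weighted sum of value vectors

class KVCache:
    """Hypothetical single-head, single-layer cache: one K row and one V row per token."""
    def __init__(self, d):
        self.K = np.empty((0, d))
        self.V = np.empty((0, d))

    def append(self, k, v):
        self.K = np.vstack([self.K, k])
        self.V = np.vstack([self.V, v])

def decode_with_cache(X, Wq, Wk, Wv):
    # Each step projects only the newest token and reuses cached K/V.
    cache = KVCache(Wk.shape[1])
    outputs = []
    for x in X:
        cache.append(x @ Wk, x @ Wv)
        outputs.append(attend(x @ Wq, cache.K, cache.V))
    return np.stack(outputs)

def decode_without_cache(X, Wq, Wk, Wv):
    # Each step re-projects every token seen so far: redundant work.
    outputs = []
    for t in range(1, len(X) + 1):
        K = X[:t] @ Wk
        V = X[:t] @ Wv
        outputs.append(attend(X[t - 1] @ Wq, K, V))
    return np.stack(outputs)

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))                        # 6 token embeddings, width 4
Wq, Wk, Wv = (rng.normal(size=(4, 4)) for _ in range(3))
same = np.allclose(decode_with_cache(X, Wq, Wk, Wv),
                   decode_without_cache(X, Wq, Wk, Wv))
```

Both functions produce the same outputs; the cached version simply avoids re-projecting old tokens at every step. A real implementation would preallocate the cache rather than grow it with `vstack`.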

The cost is memory: each cached token holds a K and a V vector per key-value head per layer. For a 70B model serving long contexts or many concurrent requests, the KV cache can exceed 50 GB. Techniques such as Grouped-Query Attention (fewer KV heads), KV cache quantization (fewer bytes per value), and paged attention as used in vLLM (less wasted allocation) attack this cost directly.
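A back-of-the-envelope estimate makes the memory cost concrete. The sketch below assumes a dense transformer whose cache holds one K and one V vector per KV head per layer; the function name `kv_cache_bytes` and the example shape (80 layers, head dimension 128, 64 or 8 KV heads) are illustrative assumptions for a 70B-class model, not figures taken from this entry.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2):
    """Estimate KV cache size in bytes.

    The leading factor of 2 counts both K and V.
    bytes_per_elem=2 corresponds to fp16/bf16; use 1 for an 8-bit cache.
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)

GIB = 2 ** 30

# Assumed 70B-class shape: 80 layers, head_dim 128, 32K-token context.
full_mha = kv_cache_bytes(80, 64, 128, 32_768)   # one KV head per query head
with_gqa = kv_cache_bytes(80, 8, 128, 32_768)    # 8 shared KV heads (GQA)
gqa_int8 = kv_cache_bytes(80, 8, 128, 32_768, bytes_per_elem=1)

print(full_mha / GIB, with_gqa / GIB, gqa_int8 / GIB)  # 80.0 10.0 5.0
```

Under these assumptions a single 32K-token sequence needs about 80 GiB with one KV head per query head, 10 GiB with grouped-query attention, and 5 GiB if the cache is also stored in 8 bits. Batch size multiplies all three figures.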

Examples

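  • Generating a 4K-token response: the KV cache grows by one entry per layer for each generated token, reaching 4K entries per layer on top of the prompt's entries.
  • vLLM uses PagedAttention to manage the KV cache in fixed-size blocks, much as an operating system manages virtual memory in pages.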
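The virtual-memory analogy can be illustrated with a toy block allocator. This is a simplified sketch of the general idea only: the class `ToyPagedKVAllocator` and its methods are hypothetical and are not vLLM's API. It tracks only the bookkeeping (which physical block holds which tokens), not the tensors themselves.

```python
class ToyPagedKVAllocator:
    """Toy bookkeeping for a paged KV cache (hypothetical, not vLLM's API).

    Physical memory is split into fixed-size blocks. Each sequence owns a
    block table mapping its logical token positions to physical blocks, so a
    sequence reserves memory one block at a time instead of its maximum
    possible length up front.
    """

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # seq_id -> list of physical block ids
        self.lengths = {}       # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        """Reserve a slot for one more token; return (block_id, offset)."""
        n = self.lengths.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        if n % self.block_size == 0:  # first token, or last block is full
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop())
        self.lengths[seq_id] = n + 1
        return table[-1], n % self.block_size

    def free(self, seq_id):
        """Return all of a finished sequence's blocks to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

alloc = ToyPagedKVAllocator(num_blocks=4, block_size=16)
for _ in range(33):                       # 33 tokens need 3 blocks of 16
    alloc.append_token("request-a")
blocks_used = len(alloc.block_tables["request-a"])
alloc.free("request-a")                   # blocks become reusable immediately
```

Because allocation happens per block, at most one partially filled block per sequence is wasted, and freed blocks can be handed to any other request.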

Frequently asked

How is KV Cache related to Attention?

Attention is the mechanism a transformer uses to decide which earlier tokens matter most when producing each new one, mixing information across positions by weighted sum. The KV cache stores the keys and values that attention reads for those earlier tokens, so each new token's query can attend to them without recomputing them.

Is KV Cache considered advanced?

KV Cache is generally considered advanced-level material in the AI and LLM space.

Attention · Architecture

Attention is the mechanism a transformer uses to decide which earlier tokens matter most when producing each new one. It mixes information across positions by weighted sum.

Inference · Inference

Inference is what happens when you actually run a trained model on new input. For LLMs that means generating tokens one at a time, with sampling and a KV cache.

Context Window · Inference

The context window is the maximum number of tokens an LLM can consider in a single call: prompt plus generated output combined.

vLLM · Infrastructure

vLLM is an open-source high-throughput LLM serving engine. Its PagedAttention KV cache manager is a major reason it outperforms naive serving setups.

FlashAttention · Architecture

FlashAttention is an algorithm that computes exact attention faster and with much less memory by carefully tiling the computation to fit in GPU SRAM rather than going to HBM.

Prompt Caching · Inference

Prompt caching stores the KV-cache state of a long prefix (system prompt, large document, tool definitions) so subsequent calls that reuse it skip the prefill compute. Depending on the provider and how much of the prompt is reused, this can cut time to first token (TTFT) and cost by 50-90%.

Continuous Batching · Inference

Continuous batching lets new requests join an in-flight batch on the next decode step rather than waiting for the current batch to finish, dramatically raising GPU utilization on variable-length workloads.
