Architecture · advanced
KV Cache (key-value cache)
The KV cache stores the key and value vectors of all earlier tokens during generation so the model does not recompute them at every step. At long context lengths or large batch sizes it is often the largest memory cost of LLM inference after the model weights themselves.
Explanation
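When generating the 1,000th token, a naive implementation would recompute the key and value projections for all 999 previous tokens in every layer. The KV cache keeps the per-layer K and V tensors from earlier steps, so only the new token needs fresh K/V computation. This is valid because, under causal masking, an earlier token's K and V do not depend on any later token. Each decoding step then costs time linear in the current sequence length rather than quadratic.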
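A minimal sketch of the idea for a single attention head in a single layer, using only the Python standard library. The names `KVCache`, `decode_step_cached`, and `decode_step_naive` are hypothetical, and the random weights stand in for a trained model. It checks that attending over cached K/V gives the same output as recomputing K and V for every earlier token.

```python
import math
import random

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def attend(q, keys, values):
    """Scaled dot-product attention for one query over all keys/values."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total
            for i in range(dim)]

class KVCache:
    """Keys and values for one head in one layer (hypothetical helper)."""
    def __init__(self):
        self.keys = []
        self.values = []

def decode_step_cached(x_new, Wq, Wk, Wv, cache):
    """Project only the new token, append its K/V, attend over the cache."""
    cache.keys.append(matvec(Wk, x_new))
    cache.values.append(matvec(Wv, x_new))
    return attend(matvec(Wq, x_new), cache.keys, cache.values)

def decode_step_naive(xs, Wq, Wk, Wv):
    """Recompute K and V for every token seen so far, then attend."""
    keys = [matvec(Wk, x) for x in xs]
    values = [matvec(Wv, x) for x in xs]
    return attend(matvec(Wq, xs[-1]), keys, values)

def random_matrix(rows, cols):
    return [[random.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]

random.seed(0)
d = 4  # toy hidden size, an assumption for illustration
Wq, Wk, Wv = random_matrix(d, d), random_matrix(d, d), random_matrix(d, d)
tokens = random_matrix(6, d)  # six toy token embeddings

cache = KVCache()
for t in range(len(tokens)):
    cached = decode_step_cached(tokens[t], Wq, Wk, Wv, cache)
    naive = decode_step_naive(tokens[: t + 1], Wq, Wk, Wv)
    # Both paths must agree; only the amount of recomputation differs.
    assert all(abs(a - b) < 1e-9 for a, b in zip(cached, naive))
```

In the cached path each step performs one K and one V projection. In the naive path step t performs t of each, which is where the quadratic total comes from.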
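The cost is memory: each cached token holds a K and a V vector per key-value head per layer, so the cache grows linearly with context length and batch size. For a 70B model with a long context, the KV cache can exceed 50 GB, depending on the number of key-value heads, the numeric precision, and the batch size. Grouped-Query Attention and KV cache quantization shrink the per-token footprint, while paged attention (vLLM) reduces the memory wasted on fragmentation and over-reservation.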
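The size can be estimated with simple arithmetic. The sketch below uses a hypothetical helper, `kv_cache_bytes`, and assumed shapes for a 70B-class model (80 layers, head dimension 128, 16-bit values, 32,768 tokens). These numbers are illustrative, not taken from any specific model card.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_value=2):
    """Estimate KV cache size in bytes.

    The leading factor of 2 accounts for storing both K and V.
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)

GIB = 1024 ** 3

# Assumed shapes: 64 KV heads models full multi-head attention,
# 8 KV heads models Grouped-Query Attention.
mha = kv_cache_bytes(num_layers=80, num_kv_heads=64, head_dim=128, seq_len=32768)
gqa = kv_cache_bytes(num_layers=80, num_kv_heads=8, head_dim=128, seq_len=32768)

print(f"multi-head:    {mha / GIB:.0f} GiB")
print(f"grouped-query: {gqa / GIB:.0f} GiB")
```

Under these assumptions a single 32K-token sequence needs 80 GiB with 64 key-value heads and 10 GiB with 8, which shows why reducing the number of key-value heads is such an effective lever.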
Examples
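- Generating a 4K-token response: the KV cache grows to 4K entries per layer, in addition to the entries for the prompt.
- vLLM uses PagedAttention to manage the KV cache like virtual memory, storing it in fixed-size blocks that need not be contiguous.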
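The virtual-memory analogy can be illustrated with a toy block table. This is a sketch of the general idea only, not vLLM's implementation or API: the class `PagedKVAllocator` and its methods are hypothetical, and it tracks only where each token's K/V would live, not the tensors themselves.

```python
class PagedKVAllocator:
    """Toy block table mapping each sequence's logical blocks to physical ones."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # seq_id -> list of physical block ids
        self.seq_lens = {}      # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        """Reserve a slot for one more token; return (physical_block, offset)."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.seq_lens.get(seq_id, 0)
        if length % self.block_size == 0:
            # The last block is full (or none exists yet): take a free one.
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop(0))
        self.seq_lens[seq_id] = length + 1
        return table[-1], length % self.block_size

    def free(self, seq_id):
        """Return all of a finished sequence's blocks to the free list."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)

alloc = PagedKVAllocator(num_blocks=4, block_size=2)

# Two sequences decode in interleaved order, as in a batched server.
slots = [(seq, alloc.append_token(seq)) for seq in ["a", "b", "a", "a", "b"]]

table_a = list(alloc.block_tables["a"])  # physical blocks backing sequence "a"
table_b = list(alloc.block_tables["b"])

alloc.free("a")  # sequence "a" finishes; its blocks become reusable
```

Sequence "a" ends up in physical blocks that are not adjacent, and memory is reserved one block at a time rather than for the maximum possible length up front.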
Frequently asked
What is the KV cache?
The KV cache stores the key and value vectors of all earlier tokens during generation so the model does not recompute them at every step. At long context lengths or large batch sizes it is often the largest memory cost of LLM inference after the model weights themselves.
What is an example of a KV cache in use?
Generating a 4K-token response: the KV cache grows to 4K entries per layer, in addition to the entries for the prompt.
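How is the KV cache related to attention?
The KV cache exists to serve attention. Attention is the mechanism a transformer uses to decide which earlier tokens matter most when producing each new one: it compares the new token's query against the keys of earlier tokens and mixes their values by weighted sum. The KV cache stores those keys and values so they are computed once and reused at every later step.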
Is KV Cache considered advanced?
KV Cache is generally considered advanced-level material in the AI and LLM space.