Inference · intermediate
Prompt Caching (prefix caching)
Prompt caching stores the KV-cache state of a long prefix (system prompt, large document, tool definitions) so subsequent calls that reuse it skip the prefill compute — cutting TTFT and cost by 50-90%.
Explanation
Every LLM call processes the full prompt through the model before the first output token. For long static prefixes — a 50K-token reference doc, a thousand-line system prompt, a large tool schema — this prefill is the bulk of the work.
Prompt caching marks a prefix as cacheable. On the first call, the provider computes and stores the KV cache for that prefix. On subsequent calls reusing the same prefix, the cache is loaded directly and only the new suffix has to be processed.
OpenAI, Anthropic, Google Gemini, and AWS Bedrock all support some flavor as of 2024-2025. Pricing typically: cache writes cost ~25-50% more than uncached input; cache hits cost ~10% of uncached input. TTFT drops similarly.
Examples
- A long-context RAG app caches the system prompt + few-shot examples; per-call latency drops from 6s to 1.5s, cost drops ~80%.
- An IDE coding assistant caches the open file as the prefix for many consecutive completions.
When to use prompt caching
Whenever a prefix is reused across calls and exceeds ~1K tokens. The break-even point is low; the upside is large.
Frequently asked
What is Prompt Caching?
Prompt caching stores the KV-cache state of a long prefix (system prompt, large document, tool definitions) so subsequent calls that reuse it skip the prefill compute — cutting TTFT and cost by 50-90%.
What is an example of prompt caching?
A long-context RAG app caches the system prompt + few-shot examples; per-call latency drops from 6s to 1.5s, cost drops ~80%.
How is Prompt Caching related to KV Cache?
Prompt Caching and KV Cache are both inference concepts. The KV cache stores the key and value vectors of all earlier tokens during generation so the model does not recompute them at every step. It is the main memory cost of LLM inference.
When should I use prompt caching?
Whenever a prefix is reused across calls and exceeds ~1K tokens. The break-even point is low; the upside is large.
Is Prompt Caching considered intermediate?
Prompt Caching is generally considered intermediate-level material in the AI and LLM space.