Inference · intermediate

Prompt Caching (prefix caching)

Prompt caching stores the KV-cache state of a long prefix (system prompt, large document, tool definitions) so subsequent calls that reuse it skip the prefill compute — cutting TTFT and cost by 50-90%.

Published May 31, 2026

Explanation

Every LLM call processes the full prompt through the model before the first output token. For long static prefixes — a 50K-token reference doc, a thousand-line system prompt, a large tool schema — this prefill is the bulk of the work.

Prompt caching marks a prefix as cacheable. On the first call, the provider computes and stores the KV cache for that prefix. On subsequent calls reusing the same prefix, the cache is loaded directly and only the new suffix has to be processed.

OpenAI, Anthropic, Google Gemini, and AWS Bedrock all support some flavor as of 2024-2025. Pricing typically: cache writes cost ~25-50% more than uncached input; cache hits cost ~10% of uncached input. TTFT drops similarly.

Examples

A long-context RAG app caches the system prompt + few-shot examples; per-call latency drops from 6s to 1.5s, cost drops ~80%.
An IDE coding assistant caches the open file as the prefix for many consecutive completions.

When to use prompt caching

Whenever a prefix is reused across calls and exceeds ~1K tokens. The break-even point is low; the upside is large.

Frequently asked

What is Prompt Caching?

What is an example of prompt caching?

A long-context RAG app caches the system prompt + few-shot examples; per-call latency drops from 6s to 1.5s, cost drops ~80%.

How is Prompt Caching related to KV Cache?

Prompt Caching and KV Cache are both inference concepts. The KV cache stores the key and value vectors of all earlier tokens during generation so the model does not recompute them at every step. It is the main memory cost of LLM inference.

When should I use prompt caching?

Whenever a prefix is reused across calls and exceeds ~1K tokens. The break-even point is low; the upside is large.

Is Prompt Caching considered intermediate?

Prompt Caching is generally considered intermediate-level material in the AI and LLM space.

KV CacheArchitecture

The KV cache stores the key and value vectors of all earlier tokens during generation so the model does not recompute them at every step. It is the main memory cost of LLM inference.

Time to First TokenInference

Time to first token (TTFT) is how long it takes from sending a request until the first response token arrives. The user-perceived latency metric for streaming chat.

Context WindowInference

The context window is the maximum number of tokens an LLM can consider in a single call — prompt plus generated output combined.

InferenceInference

Inference is what happens when you actually run a trained model on new input. For LLMs that means generating tokens one at a time, with sampling and a KV cache.

Long-Context ModelInference

A long-context model accepts very long inputs — 100K+ tokens, in some cases millions. Claude (200K), GPT-4o (128K), and Gemini 1.5 Pro (1M+) are current examples.

Prompt Caching (prefix caching)

Explanation

Examples

When to use prompt caching

Frequently asked

What is Prompt Caching?

What is an example of prompt caching?

How is Prompt Caching related to KV Cache?

When should I use prompt caching?

Is Prompt Caching considered intermediate?

Side-by-side comparisons

Sources

Explanation

Examples

When to use prompt caching

Frequently asked

What is Prompt Caching?

What is an example of prompt caching?

How is Prompt Caching related to KV Cache?

When should I use prompt caching?

Is Prompt Caching considered intermediate?

Related terms

Side-by-side comparisons

Sources