
Inference · beginner

Long-Context Model

A long-context model accepts very long inputs — 100K+ tokens, in some cases millions. Claude (200K), GPT-4o (128K), and Gemini 1.5 Pro (1M+) are current examples.

Explanation

Until about 2022, context windows of 2K-8K tokens were standard. RoPE scaling, FlashAttention, KV-cache optimizations, and sparse attention variants made possible context windows that fit whole books, code repositories, or hours of meeting transcripts.

Longer contexts let you skip retrieval ("just paste the docs"), reason over a whole codebase, and run agents with deep histories. Under per-token pricing, input cost scales linearly with prompt length: a 1M-token prompt is billed as 1M input tokens. Latency grows with prompt length as well.
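To make the linear cost concrete, here is a minimal sketch of the arithmetic. The price of $3.00 per million input tokens is a placeholder assumption, not any provider's actual rate, and `input_cost_usd` is a hypothetical helper.

```python
def input_cost_usd(prompt_tokens: int, usd_per_million_tokens: float) -> float:
    """Input-side cost of one call, assuming flat per-token pricing."""
    return prompt_tokens / 1_000_000 * usd_per_million_tokens


# Placeholder price, chosen for illustration only (an assumption).
PRICE_PER_MILLION = 3.00

for tokens in (8_000, 200_000, 1_000_000):
    cost = input_cost_usd(tokens, PRICE_PER_MILLION)
    print(f"{tokens:>9,} tokens -> ${cost:.3f}")
```

Because the cost is paid on every call, an agent loop that resends a long prompt each turn multiplies this figure by the number of turns.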

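Quality can degrade as the context grows: models tend to use information at the start and end of a long prompt more reliably than information in the middle (the "lost in the middle" effect). Benchmarks such as needle-in-a-haystack and RULER measure how reliably a model uses the full window.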
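A needle-in-a-haystack test can be sketched in a few lines: hide one fact at a chosen depth in filler text, then ask for it. Everything here is illustrative. `ask_model` stands for whatever function calls your model, and `toy_model` is a stand-in (not a real LLM) that only "reads" the start and end of the prompt, so the example runs without an API.

```python
from typing import Callable


def build_haystack_prompt(filler: str, needle: str, depth: float, n_filler: int) -> str:
    """Place `needle` at a relative depth (0.0 = start, 1.0 = end) among filler sentences."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0 and 1")
    sentences = [filler] * n_filler
    sentences.insert(round(depth * n_filler), needle)
    return " ".join(sentences)


def recall_by_depth(ask_model: Callable[[str], str], depths: list[float]) -> dict[float, bool]:
    """For each depth, check whether the model's answer contains the hidden fact."""
    needle = "The secret code is 7421."
    question = "\n\nWhat is the secret code? Answer with the number only."
    results = {}
    for depth in depths:
        haystack = build_haystack_prompt("The sky was grey that day.", needle, depth, n_filler=1000)
        results[depth] = "7421" in ask_model(haystack + question)
    return results


def toy_model(prompt: str) -> str:
    """Stand-in for a model that only attends to the first and last 2,000 characters."""
    visible = prompt[:2000] + prompt[-2000:]
    return "7421" if "7421" in visible else "unknown"


print(recall_by_depth(toy_model, [0.0, 0.5, 1.0]))
```

With the toy stand-in, the fact is found at depths 0.0 and 1.0 and missed at 0.5. A real evaluation would sweep both depth and total prompt length and repeat each cell several times.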

Examples

  • Claude Sonnet: 200K-token context — about 500 pages.
  • Gemini 1.5 Pro: 1M-2M tokens, enough for an hour or more of video sampled as frames.

When to use a long-context model

Use one when the inputs genuinely need to be read together and chunking plus retrieval would lose the connections between them.

Frequently asked

How is a long-context model related to the context window?

A long-context model is a model with an unusually large context window. The context window is the maximum number of tokens an LLM can consider in a single call: prompt plus generated output combined.

Context Window · Inference

The context window is the maximum number of tokens an LLM can consider in a single call — prompt plus generated output combined.
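Because prompt and output share one window, a long prompt leaves less room for the answer. A minimal sketch of that budget, with a hypothetical helper name and example numbers (some APIs also impose a separate, smaller cap on output tokens, which this ignores):

```python
def max_output_tokens(context_window: int, prompt_tokens: int) -> int:
    """Tokens left for generation once the prompt is counted against the window."""
    if prompt_tokens > context_window:
        raise ValueError("prompt alone exceeds the context window")
    return context_window - prompt_tokens


# Example numbers only: a 200K window with a 150K-token prompt.
print(max_output_tokens(200_000, 150_000))  # 50000
```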

Rotary Position Embedding · Architecture

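RoPE encodes token position by rotating the query and key vectors in attention by an angle proportional to their position, so attention scores depend on the relative offset between tokens. On its own it does not reliably extrapolate far beyond the trained length; scaling methods such as position interpolation extend it to longer sequences.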
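The rotation can be sketched in plain Python. This version rotates consecutive pairs of dimensions and uses the commonly cited base of 10000; real implementations differ in how they pair dimensions and operate on batched tensors, so treat the function and its defaults as illustrative assumptions.

```python
import math


def rope(vec: list[float], position: int, base: float = 10000.0) -> list[float]:
    """Rotate each pair (x0, x1), (x2, x3), ... by an angle proportional to `position`."""
    d = len(vec)
    if d % 2:
        raise ValueError("dimension must be even")
    out = []
    for i in range(0, d, 2):
        theta = base ** (-i / d)      # per-pair frequency; later pairs rotate more slowly
        angle = position * theta
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out


def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


q = [0.3, -1.2, 0.8, 0.5]
k = [1.0, 0.4, -0.7, 0.2]

# Both pairs of positions are 3 apart, so the query-key score should match.
score_near = dot(rope(q, 5), rope(k, 2))
score_far = dot(rope(q, 105), rope(k, 102))
print(math.isclose(score_near, score_far, rel_tol=1e-9, abs_tol=1e-9))
```

The check illustrates the key property: the query-key score depends on the distance between the two positions, not on where they sit in the sequence.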

KV Cache · Architecture

The KV cache stores the key and value vectors of all earlier tokens during generation so the model does not recompute them at every step. It grows linearly with sequence length and, at long contexts, is often the main memory cost of LLM inference.
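The size of the cache follows from simple arithmetic: one key and one value vector per token, per layer, per key/value head. The configuration below (32 layers, 8 KV heads, head dimension 128, 16-bit values) is a hypothetical example, not the specification of any particular model.

```python
def kv_cache_bytes(
    seq_len: int,
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    bytes_per_value: int = 2,   # 2 bytes = fp16 / bf16
    batch_size: int = 1,
) -> int:
    """Memory for cached keys and values; the factor 2 covers keys plus values."""
    return 2 * batch_size * seq_len * n_layers * n_kv_heads * head_dim * bytes_per_value


# Hypothetical model configuration, for illustration only.
config = dict(n_layers=32, n_kv_heads=8, head_dim=128)

for seq_len in (8_192, 131_072, 1_048_576):
    gib = kv_cache_bytes(seq_len, **config) / 2**30
    print(f"{seq_len:>9,} tokens -> {gib:.0f} GiB")
```

Under these assumptions the cache takes 1 GiB at 8,192 tokens, 16 GiB at 131,072 tokens, and 128 GiB at 1,048,576 tokens, per sequence in the batch.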

Retrieval-Augmented Generation · Agents & Tools

RAG retrieves relevant documents from a corpus at query time and includes them in the prompt, letting an LLM answer with up-to-date, source-cited, private information without retraining.
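The retrieve-then-prompt flow can be sketched with a toy retriever. Word overlap stands in for the embedding search a real system would use, and the function names, corpus, and prompt wording are all illustrative assumptions.

```python
def tokenize(text: str) -> set[str]:
    """Lowercased words with surrounding punctuation stripped."""
    return {w.strip(".,?!").lower() for w in text.split()}


def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    """Rank documents by word overlap with the query (stand-in for vector search)."""
    q = tokenize(query)
    ranked = sorted(corpus, key=lambda doc: len(q & tokenize(doc)), reverse=True)
    return ranked[:k]


def build_prompt(query: str, corpus: list[str], k: int = 2) -> str:
    """Put the top-k documents in the prompt as numbered sources."""
    sources = "\n".join(f"[{i + 1}] {doc}" for i, doc in enumerate(retrieve(query, corpus, k)))
    return f"Answer using only these sources:\n{sources}\n\nQuestion: {query}"


corpus = [
    "The KV cache stores keys and values of earlier tokens.",
    "RoPE rotates query and key vectors by a position-dependent angle.",
    "FlashAttention tiles the attention computation to fit in SRAM.",
]
print(build_prompt("How does RoPE encode position?", corpus, k=1))
```

Only the selected documents reach the model, which is the trade-off against a long-context approach that would paste the whole corpus.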

FlashAttention · Architecture

FlashAttention is an algorithm that computes exact attention faster and with much less memory by tiling the computation so each block fits in fast on-chip GPU SRAM, rather than repeatedly reading from and writing to slower HBM.
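Tiling works because softmax can be computed incrementally with a running maximum and running sum. The sketch below shows that idea for a single query in plain Python. It is a CPU illustration of the block-wise "online softmax" bookkeeping, not the GPU kernel itself. It omits the usual 1/sqrt(d) score scaling for brevity, and the function names are illustrative.

```python
import math


def attention_reference(q, keys, values):
    """Plain softmax attention for one query: all scores are materialized at once."""
    scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total for j in range(dim)]


def attention_tiled(q, keys, values, block_size=2):
    """Same result, visiting keys and values one block at a time."""
    running_max = -math.inf
    running_sum = 0.0
    acc = [0.0] * len(values[0])           # unnormalized weighted sum of values
    for start in range(0, len(keys), block_size):
        k_blk = keys[start:start + block_size]
        v_blk = values[start:start + block_size]
        scores = [sum(a * b for a, b in zip(q, k)) for k in k_blk]
        new_max = max(running_max, max(scores))
        rescale = math.exp(running_max - new_max)   # correct earlier blocks for the new max
        running_sum *= rescale
        acc = [a * rescale for a in acc]
        for s, v in zip(scores, v_blk):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * x for a, x in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]


q = [0.5, -1.0]
keys = [[1.0, 0.0], [0.0, 1.0], [2.0, -1.0], [-1.0, 0.5], [0.3, 0.3]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]

ref = attention_reference(q, keys, values)
tiled = attention_tiled(q, keys, values, block_size=2)
print(all(math.isclose(a, b, rel_tol=1e-9) for a, b in zip(ref, tiled)))
```

The tiled version never holds more than one block of scores at a time, which is what lets the real algorithm keep its working set in SRAM while still producing exact attention.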
