ModelTerms

Inference · beginner

Context Window (context length)

The context window is the maximum number of tokens an LLM can consider in a single call — prompt plus generated output combined.
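As a minimal sketch of that shared budget, the Python below checks whether a prompt plus a requested output length fits within a window. The function names and example numbers are illustrative assumptions, not any provider's API.

```python
def fits_in_context(prompt_tokens: int, max_output_tokens: int, context_window: int) -> bool:
    """True if the prompt and the requested output together fit in the window."""
    return prompt_tokens + max_output_tokens <= context_window


def remaining_output_budget(prompt_tokens: int, context_window: int) -> int:
    """Tokens left for generated output once the prompt is counted (never negative)."""
    return max(0, context_window - prompt_tokens)


if __name__ == "__main__":
    window = 128_000  # hypothetical 128K-token window
    print(fits_in_context(120_000, 4_000, window))
    print(fits_in_context(126_000, 4_000, window))
    print(remaining_output_budget(126_000, window))
```

The point of the second function is that a long prompt silently shrinks how much the model can generate, even when the prompt itself still fits.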

Explanation

Early LLMs had context windows of 2K-4K tokens. Modern frontier models offer 128K (GPT-4o) or 200K (Claude Sonnet), up to 1M+ (Gemini 1.5 Pro, Claude with extended context). Larger windows can hold whole books, large codebases, or long conversation histories.

Two practical caveats: longer context costs more (API pricing scales linearly with input tokens) and can degrade quality (the "lost in the middle" effect: models often pay less attention to information placed in the middle of a long input than to information near its start or end).
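The linear cost relationship can be sketched as simple arithmetic. The price used here is a made-up placeholder, not a real provider's rate, and the function name is hypothetical.

```python
def input_cost_usd(input_tokens: int, usd_per_million_tokens: float) -> float:
    """Input-side cost under per-token pricing: proportional to token count."""
    return input_tokens / 1_000_000 * usd_per_million_tokens


if __name__ == "__main__":
    price = 2.50  # hypothetical price per million input tokens
    for tokens in (8_000, 64_000, 128_000):
        print(tokens, round(input_cost_usd(tokens, price), 4))
```

Doubling the input doubles the input-side cost, which is why sending an entire document on every call adds up quickly.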

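Techniques for working with longer contexts include RoPE scaling, which stretches position encodings beyond the trained length; FlashAttention, which makes attention over long sequences faster and more memory-efficient; and retrieval-augmented setups, which fetch only the relevant chunks instead of enlarging the window.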
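For the retrieval-augmented option, one simple way to stay inside a token budget is to pack the highest-scoring chunks greedily. This is a sketch under assumptions: the `(score, token_count, text)` tuple layout and the `pack_chunks` name are hypothetical, and real systems may rank and pack differently.

```python
from typing import List, Tuple

Chunk = Tuple[float, int, str]  # (relevance score, token count, text)


def pack_chunks(chunks: List[Chunk], budget_tokens: int) -> Tuple[List[str], int]:
    """Greedily keep the most relevant chunks that still fit in the token budget."""
    selected: List[str] = []
    used = 0
    # Visit chunks from most to least relevant.
    for score, n_tokens, text in sorted(chunks, key=lambda c: c[0], reverse=True):
        if used + n_tokens <= budget_tokens:
            selected.append(text)
            used += n_tokens
    return selected, used


if __name__ == "__main__":
    candidates = [(0.9, 500, "a"), (0.8, 700, "b"), (0.5, 300, "c")]
    print(pack_chunks(candidates, 1000))
```

A chunk that does not fit is skipped rather than ending the loop, so a smaller, less relevant chunk can still use the leftover budget.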

Examples

  • GPT-4o: 128K context.
  • Claude Sonnet: 200K context.
  • Gemini 1.5 Pro: 1M+ context.

Frequently asked

How is Context Window related to Token?

Context Window and Token are both inference concepts. The context window is measured in tokens, so its limit applies to token counts rather than words or characters.

Token · Inference

A token is the basic unit an LLM reads and writes — usually a word piece (3-4 characters). LLMs are priced and sized by tokens, not words.
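Because limits are counted in tokens rather than words, a character-based estimate is sometimes used for quick sizing. The sketch below assumes roughly 4 characters per token, in line with the figure above; the true count depends on the model's tokenizer and on the language of the text, so treat the result as an approximation only. The function name is hypothetical.

```python
import math


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate from character count; not a substitute for a real tokenizer."""
    if chars_per_token <= 0:
        raise ValueError("chars_per_token must be positive")
    return math.ceil(len(text) / chars_per_token)


if __name__ == "__main__":
    sample = "The context window is measured in tokens, not words."
    print(len(sample), estimate_tokens(sample))
```

For anything that must not overflow the window, counting with the model's actual tokenizer is the safer choice.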

Tokenization · Inference

Tokenization is the process of splitting raw text into the discrete tokens an LLM consumes. Most modern LLMs use a learned byte-pair-encoding (BPE) tokenizer.

Retrieval-Augmented Generation · Agents & Tools

RAG retrieves relevant documents from a corpus at query time and includes them in the prompt, letting an LLM answer with up-to-date, source-cited, or private information without retraining.

KV Cache · Architecture

The KV cache stores the key and value vectors of all earlier tokens during generation so the model does not recompute them at every step. Its size grows linearly with context length, and at long contexts it is often the largest memory cost of inference after the model weights.
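The memory footprint can be estimated from the model's shape: one key vector and one value vector per layer, per KV head, per token. The configuration below (32 layers, 8 KV heads, head dimension 128, 16-bit values) is a hypothetical example, not the published specification of any particular model.

```python
def kv_cache_bytes(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    seq_len: int,
    bytes_per_value: int = 2,  # 2 bytes for fp16/bf16
    batch_size: int = 1,
) -> int:
    """Bytes needed to cache keys and values for every token in the sequence."""
    # The leading factor of 2 covers one key vector plus one value vector.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value * batch_size


if __name__ == "__main__":
    size = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=128_000)
    print(size, "bytes =", size / 2**30, "GiB")
```

Under these assumed numbers a single 128K-token sequence needs about 15.6 GiB of cache, and the figure scales linearly with sequence length.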

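Rotary Position Embedding · Architecture

RoPE encodes token position by rotating the query and key vectors in attention by an angle proportional to their position, so attention scores depend on relative position. On its own it extrapolates poorly beyond the trained length; scaling methods such as position interpolation are typically used to extend it to longer sequences.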
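The rotation can be sketched in plain Python. This version rotates adjacent pairs of dimensions and uses the common base of 10000; some implementations pair dimensions differently, so take it as an illustration of the idea rather than a drop-in match for any specific library. The helper names are hypothetical.

```python
import math
from typing import List


def rope(vec: List[float], pos: int, base: float = 10000.0) -> List[float]:
    """Rotate each adjacent pair (vec[i], vec[i+1]) by an angle proportional to pos."""
    d = len(vec)
    if d % 2 != 0:
        raise ValueError("vector length must be even")
    out: List[float] = []
    for i in range(0, d, 2):
        # Each pair gets its own frequency; later pairs rotate more slowly.
        angle = pos * base ** (-i / d)
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


if __name__ == "__main__":
    q = [0.3, -1.2, 0.7, 0.5]
    k = [1.0, 0.4, -0.6, 0.9]
    # Both pairs of positions are 2 apart, so the two scores should agree.
    print(dot(rope(q, 5), rope(k, 3)))
    print(dot(rope(q, 12), rope(k, 10)))
```

Because both vectors are rotated, the query-key dot product depends only on the distance between the two positions, which is the property the two printed scores illustrate.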

Long-Context Model · Inference

A long-context model accepts very long inputs — 100K+ tokens, in some cases millions. Claude (200K), GPT-4o (128K), and Gemini 1.5 Pro (1M+) are current examples.
