Evaluation · intermediate
Perplexity
Perplexity measures how "surprised" a language model is by held-out text. Lower is better. It is the standard intrinsic metric for next-token prediction.
Explanation
Mathematically, perplexity is the exponential of the average cross-entropy loss per token, with the loss measured in nats (natural logarithm). Concretely: a model with perplexity 8 on some text is, on average, as uncertain as if it were choosing uniformly among 8 equally likely next tokens at each step.
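The definition above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `perplexity` and the input format (a list of natural-log probabilities that the model assigned to each actual next token) are assumptions made for the example.

```python
import math

def perplexity(token_log_probs):
    """Perplexity from per-token natural-log probabilities.

    token_log_probs: log p(actual next token) for each position
    (hypothetical input format chosen for this sketch).
    """
    if not token_log_probs:
        raise ValueError("need at least one token")
    # Average cross-entropy (negative log-likelihood) per token, in nats.
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    # Perplexity is the exponential of that average.
    return math.exp(avg_nll)

# A model that spreads probability uniformly over 8 candidates
# assigns log(1/8) to every token, so its perplexity is about 8.
uniform_over_8 = [math.log(1 / 8)] * 5
print(perplexity(uniform_over_8))
```

If the loss were logged in bits (base-2 logarithm) instead of nats, the exponentiation would need base 2 rather than `math.exp`.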
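Perplexity is useful for comparing models that share a tokenizer and an evaluation corpus, and for tracking a single model during training. It cannot be compared directly across models with different tokenizers, because the per-token average depends on how the text is split into tokens. A low perplexity also does not guarantee a useful chat model; it only means the model fits the statistics of the text well.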
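To see why differing tokenizers break the comparison, consider the sketch below. All numbers are hypothetical and chosen only for illustration: two models assign the same total log-likelihood to the same text, but split it into different numbers of tokens. The helper names are assumptions, and bits per byte is shown as one common tokenizer-independent normalization, not as something this entry prescribes.

```python
import math

def per_token_perplexity(total_nll_nats, num_tokens):
    # exp of the average negative log-likelihood per token
    return math.exp(total_nll_nats / num_tokens)

def bits_per_byte(total_nll_nats, num_bytes):
    # Convert nats to bits, then normalize by text length in bytes,
    # which does not depend on the tokenizer.
    return total_nll_nats / (num_bytes * math.log(2))

# Hypothetical values: same text, same total negative log-likelihood.
total_nll = 600.0   # nats, summed over the whole text
num_bytes = 1000    # length of the text in bytes

ppl_a = per_token_perplexity(total_nll, 250)  # tokenizer A: 250 tokens, about 11.0
ppl_b = per_token_perplexity(total_nll, 200)  # tokenizer B: 200 tokens, about 20.1

print(ppl_a, ppl_b)                         # different per-token perplexities
print(bits_per_byte(total_nll, num_bytes))  # identical for both models
```

Both models fit the text equally well in total, yet the one with fewer, longer tokens reports a higher per-token perplexity.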
Examples
- On the same WikiText split with the same tokenizer, perplexity 12 is much better than perplexity 30.
- Watching validation perplexity drop during pretraining as the model converges.
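The second example can be sketched as follows, assuming a training run logs mean validation cross-entropy in nats at each checkpoint. The loss values are made up for illustration, and `loss_to_perplexity` is a hypothetical helper name.

```python
import math

def loss_to_perplexity(mean_loss_nats):
    # Perplexity is exp(mean cross-entropy per token), with the loss in nats.
    return math.exp(mean_loss_nats)

# Hypothetical validation losses at successive checkpoints.
val_losses = [3.40, 2.90, 2.48]
perplexities = [loss_to_perplexity(loss) for loss in val_losses]

for step, (loss, ppl) in enumerate(zip(val_losses, perplexities)):
    print(f"checkpoint {step}: loss={loss:.2f} perplexity={ppl:.1f}")
```

With these made-up losses, perplexity falls from roughly 30 to roughly 12, matching the scale of the first example.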
Frequently asked
How is Perplexity related to Benchmark?
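Perplexity and Benchmark are both evaluation concepts, but they measure different things. Perplexity is an intrinsic metric computed from a model's next-token probabilities on held-out text. A benchmark is a standardized test that scores models on a fixed task, letting you compare them on equal footing. MMLU, HumanEval, and HELM are common examples.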
Is Perplexity considered intermediate?
Perplexity is generally considered intermediate-level material in the AI and LLM space.