Evaluation · intermediate

Perplexity

Perplexity measures how "surprised" a language model is by held-out text. Lower is better. It is the natural intrinsic eval for next-token prediction.

Published May 29, 2026

Explanation

Mathematically, perplexity is the exponential of the average cross-entropy loss per token. Concretely: if a model has perplexity 8 on some text, it is on average as confused as if it had to choose uniformly among 8 next tokens at each step.
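As a minimal sketch of that definition, the snippet below computes perplexity from a list of per-token log-probabilities. The function name `perplexity` and the input format (natural-log probabilities the model assigned to each observed token) are assumptions made for illustration, not part of any particular library.

```python
import math

def perplexity(token_log_probs):
    """Perplexity from natural-log probabilities of the observed tokens.

    token_log_probs: one log p(token | context) value per token.
    """
    if not token_log_probs:
        raise ValueError("need at least one token")
    # Average cross-entropy (negative log-likelihood) per token, in nats.
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    # Perplexity is the exponential of that average.
    return math.exp(avg_nll)

# A model that gives probability 1/8 to every observed token behaves like a
# uniform choice among 8 options, so its perplexity is close to 8.0.
uniform_8 = [math.log(1 / 8)] * 5
print(perplexity(uniform_8))
```

If the loss is reported in bits (log base 2) rather than nats, the same idea applies with 2 raised to the average instead of the natural exponential.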

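Perplexity is useful for comparing models that share a tokenizer and an evaluation corpus, and for tracking progress during training. It cannot be compared directly across models with different tokenizers, because the average is taken per token and each tokenizer splits the same text into a different number of tokens. A low perplexity also does not guarantee a useful chat model; it only means the model fits the statistics of the text well.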
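The tokenizer caveat can be illustrated with a small sketch. All numbers here are hypothetical: two models are assumed to assign the same total negative log-likelihood to the same text, but their tokenizers split it into different numbers of tokens. The helper names are made up for this example, and bits per byte is shown only as one possible normalization that does not depend on token count.

```python
import math

def per_token_perplexity(total_nll_nats, num_tokens):
    # Exponential of the average negative log-likelihood per token.
    return math.exp(total_nll_nats / num_tokens)

def bits_per_byte(total_nll_nats, num_bytes):
    # Convert nats to bits and normalize by the text's byte length,
    # which is the same no matter how the text was tokenized.
    return total_nll_nats / (num_bytes * math.log(2))

total_nll = 2000.0   # hypothetical total NLL (nats) both models assign to the text
num_bytes = 4000     # hypothetical length of the text in bytes

ppl_a = per_token_perplexity(total_nll, 1000)  # tokenizer A: 1000 tokens
ppl_b = per_token_perplexity(total_nll, 800)   # tokenizer B: 800 tokens

print(f"tokenizer A perplexity: {ppl_a:.2f}")
print(f"tokenizer B perplexity: {ppl_b:.2f}")
print(f"bits per byte (both):   {bits_per_byte(total_nll, num_bytes):.3f}")
```

Both models fit the text equally well in this setup, yet the one with fewer tokens reports a higher per-token perplexity, which is why raw perplexities across tokenizers are not comparable.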

Examples

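A model with perplexity 12 on WikiText predicts that text much better than a model with perplexity 30, provided both use the same tokenizer and test split.
Perplexity on held-out text falls during pretraining as the model converges, so it is a common progress signal to watch.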
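To make the second example concrete, here is a small sketch that converts a training loss log into perplexities. The steps and loss values are invented for illustration and do not come from any real training run; the losses are assumed to be mean cross-entropy per token in nats.

```python
import math

# Hypothetical (step, mean cross-entropy loss in nats) pairs.
loss_log = [(0, 10.8), (1000, 6.2), (10000, 3.4), (100000, 2.5)]

def to_perplexity(log):
    # Perplexity is exp(loss) when the loss is mean cross-entropy in nats.
    return [(step, math.exp(loss)) for step, loss in log]

for step, ppl in to_perplexity(loss_log):
    print(f"step {step:>6}: perplexity {ppl:,.1f}")
```

Because the exponential is monotonic, perplexity falls whenever the loss falls; it simply restates the loss on a scale that reads as an effective number of choices per token.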

Frequently asked

How is perplexity related to benchmarks?

Both are evaluation concepts. Perplexity is an intrinsic metric computed from the model's own next-token predictions on held-out text, while a benchmark is a standardized test that scores models on a fixed task, letting you compare them on equal footing. MMLU, HumanEval, and HELM are common examples of benchmarks.

Is Perplexity considered intermediate?

Perplexity is generally considered intermediate-level material in the AI and LLM space.

Benchmark · Evaluation

A benchmark is a standardized test that scores models on a fixed task, letting you compare them on equal footing. MMLU, HumanEval, and HELM are common examples.

Loss Function · Training

A loss function measures how wrong a model's predictions are. Training minimizes it. For LLMs the loss is the cross-entropy of predicted vs. actual next tokens.

Pretraining · Training

Pretraining is the initial training phase where an LLM learns to predict the next token on trillions of tokens of general text. It produces a base model that can be adapted later.

Tokenization · Inference

Tokenization is the process of splitting raw text into the discrete tokens an LLM consumes. Most modern LLMs use a learned byte-pair-encoding (BPE) tokenizer.

Sources

Wikipedia — Perplexity