Evaluation · intermediate

ELO Rating (Elo)

ELO (more properly Elo, after its inventor Arpad Elo; it is not an acronym) is a rating system originally from chess that converts pairwise win/loss results between players into a single skill number. Chatbot Arena uses it to rank LLMs from anonymous user votes.

Published May 31, 2026

Explanation

Every pairwise comparison updates both participants' ratings: winners gain points, losers lose points, and the size of the swing depends on the rating gap (beating a higher-rated opponent moves you more). Aggregated over millions of votes, ELO gives a stable global ranking.
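The update described above can be sketched in a few lines. This is a minimal illustration of the textbook formula, not Chatbot Arena's code: the function names are made up for this example, and the step size `k=32` is an assumed, commonly used value rather than anything the source specifies.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the Elo model (base 10, scale 400)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update(rating_a: float, rating_b: float, score_a: float,
           k: float = 32.0) -> tuple[float, float]:
    """Return the new (rating_a, rating_b) after one comparison.

    score_a is 1.0 if A won, 0.0 if A lost, 0.5 for a tie.
    k is the step size (assumed value; tune it for your setting).
    """
    exp_a = expected_score(rating_a, rating_b)
    delta = k * (score_a - exp_a)   # large when the result was surprising
    return rating_a + delta, rating_b - delta


# Underdog (1100) beats favorite (1300): a large swing.
upset = update(1100, 1300, score_a=1.0)
# Favorite (1300) beats underdog (1100): a small swing.
routine = update(1300, 1100, score_a=1.0)
```

With these inputs the upset moves each rating by about 24 points, while the routine win moves them by under 8. The points one side gains are exactly the points the other loses, so the pool's total rating is conserved.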

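Chatbot Arena (lmsys / lmarena.ai) is the canonical LLM application: anonymous users submit a prompt, see responses from two randomly selected, unnamed models, pick a winner, and the vote feeds the global leaderboard. Because ratings aggregate many votes, no single noisy judge moves the ranking much, and the resulting order tends to correlate well with internal evals. The leaderboard started with online ELO updates and later moved to fitting a Bradley-Terry model, which produces scores on the same ELO-style scale.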

Limitations: the prompt distribution is whatever Arena users submit, which may not match production traffic; voters tend to favor verbose, confident, polite answers; and older models' ratings drift as the active model pool changes.

Examples

Chatbot Arena: Claude Sonnet 4 at an ELO of ~1325; GPT-3.5 at ~1100.
An internal A/B test: collect 5,000 pairwise votes from real users, then compute an internal ELO to rank the prompt variants.
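An internal ranking of this kind could be computed roughly as follows. This is a sketch under stated assumptions: the vote log is invented, the names `elo_from_votes`, `k`, `initial`, and `passes` are hypothetical, and averaging over shuffled passes is one simple way to reduce the dependence of online ELO on vote order, not necessarily what any particular team or Chatbot Arena does.

```python
import random
from collections import defaultdict


def elo_from_votes(votes, k=4.0, initial=1000.0, passes=20, seed=0):
    """Average online-ELO ratings over several shuffled passes of the vote log.

    votes: list of (winner, loser) name pairs; ties are omitted for brevity.
    A small k keeps the final ratings from being dominated by the last few votes.
    """
    rng = random.Random(seed)
    totals = defaultdict(float)
    for _ in range(passes):
        ratings = defaultdict(lambda: initial)
        order = votes[:]
        rng.shuffle(order)
        for winner, loser in order:
            expected_win = 1.0 / (
                1.0 + 10 ** ((ratings[loser] - ratings[winner]) / 400.0)
            )
            delta = k * (1.0 - expected_win)
            ratings[winner] += delta
            ratings[loser] -= delta
        for name, rating in ratings.items():
            totals[name] += rating
    return {name: total / passes for name, total in totals.items()}


# Hypothetical log: variant "B" wins 3000 of 5000 head-to-head votes against "A".
votes = [("B", "A")] * 3000 + [("A", "B")] * 2000
ratings = elo_from_votes(votes)
ranking = sorted(ratings, key=ratings.get, reverse=True)  # best variant first
```

The same function handles more than two variants, since each vote only names the pair that was compared. For confidence intervals, one option is to rerun it on bootstrap resamples of the vote log.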

Frequently asked

How is ELO Rating related to Chatbot Arena?

Chatbot Arena is the best-known application of ELO rating to LLMs. It is a public evaluation platform where anonymous users submit prompts, see two random models' responses side by side, and vote for the better one; each vote feeds a global ELO leaderboard.

Is ELO Rating considered intermediate?

ELO Rating is generally considered intermediate-level material in the AI and LLM space.

Chatbot Arena · Evaluation

Chatbot Arena is a public LLM evaluation platform where anonymous users submit prompts, see two random models' responses side-by-side, vote for the better one, and contribute to a global ELO leaderboard.

Pairwise Comparison · Evaluation

Pairwise comparison asks a judge (human or LLM) to pick the better of two responses to the same prompt. The results aggregate to a win rate, and it is the dominant method for comparing model or prompt versions.

Win Rate · Evaluation

Win rate is the share of pairwise comparisons one candidate wins against another. It is the standard scalar for "model A is better than model B" in modern LLM evaluation.
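Win rate and ELO gap are two views of the same quantity, and the short sketch below converts between them using the standard Elo expected-score curve. The function names are illustrative, and ties are simply excluded, which is an assumption rather than a universal convention.

```python
import math


def win_rate(wins: int, losses: int) -> float:
    """Share of decided comparisons won (ties excluded)."""
    return wins / (wins + losses)


def implied_elo_gap(rate: float) -> float:
    """Rating gap at which the Elo model predicts this win rate (0 < rate < 1)."""
    return 400.0 * math.log10(rate / (1.0 - rate))


def expected_win_rate(gap: float) -> float:
    """Win rate the Elo model predicts for a given rating gap."""
    return 1.0 / (1.0 + 10 ** (-gap / 400.0))


rate = win_rate(wins=600, losses=400)   # 0.6
gap = implied_elo_gap(rate)             # roughly 70 rating points
```

Read in the other direction, a gap of 225 points (the difference between the ~1325 and ~1100 ratings quoted under Examples) corresponds to an expected head-to-head win rate of about 78.5% under this model.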

LLM-as-Judge · Evaluation

LLM-as-judge uses a strong LLM to score or compare outputs from other LLMs. It is how most production teams evaluate quality at scale when human review is too slow.

Benchmark · Evaluation

A benchmark is a standardized test that scores models on a fixed task, letting you compare them on equal footing. MMLU, HumanEval, and HELM are common examples.

Sources

Chatbot Arena methodology (arXiv)