Evaluation · intermediate
ELO Rating (Elo)
ELO (properly Elo, after its creator Arpad Elo) is a rating system originally from chess that converts pairwise win/loss results between players into a single skill number. Chatbot Arena uses it to rank LLMs from anonymous user votes.
Explanation
Every pairwise comparison updates both participants' ratings: the winner gains points and the loser loses the same number. The size of the swing depends on how surprising the result is given the rating gap, so beating a higher-rated opponent earns more than beating a lower-rated one. Aggregated over millions of votes, ELO gives a stable global ranking.
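The update rule described above can be sketched in a few lines. This is a minimal illustration of the textbook Elo formula, not Chatbot Arena's actual code; the function names, the K-factor of 32, and the 400-point scale are assumptions taken from common chess conventions.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of A against B (between 0 and 1) under the Elo model."""
    # The 400-point scale is the conventional chess choice (an assumption here).
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return both new ratings after one comparison.

    score_a is 1.0 if A won, 0.5 for a tie, 0.0 if A lost.
    k (the K-factor) caps how far one result can move a rating; 32 is an assumption.
    """
    delta = k * (score_a - expected_score(rating_a, rating_b))
    # Whatever A gains, B loses, so the total number of points is conserved.
    return rating_a + delta, rating_b - delta


# An underdog win moves ratings more than a favorite's win.
underdog_new, favorite_new = update(1000.0, 1200.0, score_a=1.0)
print(round(underdog_new - 1000.0, 1))  # about 24.3 points gained

favorite_new2, underdog_new2 = update(1200.0, 1000.0, score_a=1.0)
print(round(favorite_new2 - 1200.0, 1))  # about 7.7 points gained
```

With these assumed constants, a 200-point underdog gains roughly three times as many points for a win as the favorite would.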
Chatbot Arena (lmsys / lmarena.ai) is the canonical LLM application: anonymous users submit a prompt, see responses from two randomly chosen models, pick a winner, and the vote feeds into the global leaderboard. Because each vote moves a rating only slightly, the aggregate is fairly robust to noisy individual judges, and the resulting rankings tend to correlate well with teams' internal evals.
Limitations: the prompt distribution reflects what Arena users submit, not necessarily what production traffic looks like; voters tend to favor verbose, confident, polite answers; and older models' ratings drift as the active model pool changes.
Examples
- Chatbot Arena: Claude Sonnet 4 with an ELO of ~1325; GPT-3.5 around 1100.
- An internal A/B test: collect 5,000 pairwise votes from real users comparing two models or prompt variants, then compute an internal ELO to rank them.
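The internal A/B example could be computed roughly as follows. This is a sketch under stated assumptions: the vote data is made up for illustration, the names `elo_from_votes`, `variant_a`, and `variant_b` are hypothetical, and the small K-factor of 4 and the starting rating of 1000 are choices made here to keep ratings steady over thousands of votes.

```python
import random
from collections import defaultdict


def elo_from_votes(votes, k=4.0, initial=1000.0, seed=0):
    """Compute ratings from (winner, loser) pairs with sequential Elo updates."""
    votes = list(votes)
    # Sequential Elo depends on vote order, so shuffle with a fixed seed
    # to avoid an artifact of how the votes happen to be stored.
    random.Random(seed).shuffle(votes)
    ratings = defaultdict(lambda: initial)
    for winner, loser in votes:
        expected_win = 1.0 / (1.0 + 10 ** ((ratings[loser] - ratings[winner]) / 400.0))
        delta = k * (1.0 - expected_win)
        ratings[winner] += delta
        ratings[loser] -= delta
    return dict(ratings)


# Hypothetical data: 5000 votes, variant_a preferred in 3500 of them.
votes = [("variant_a", "variant_b")] * 3500 + [("variant_b", "variant_a")] * 1500
ratings = elo_from_votes(votes)
ranking = sorted(ratings, key=ratings.get, reverse=True)
print(ranking)
```

Because sequential updates are order dependent, averaging the result over several shuffles (different `seed` values) is one way to get a steadier estimate.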
Frequently asked
What is ELO Rating?
ELO is a rating system originally from chess that converts pairwise wins between players into a single skill number. Chatbot Arena uses it to rank LLMs from anonymous user votes.
What is an example of ELO Rating?
Chatbot Arena: Claude Sonnet 4 with an ELO of ~1325; GPT-3.5 around 1100.
How is ELO Rating related to Chatbot Arena?
Chatbot Arena is a prominent application of ELO Rating to LLM evaluation. It is a public platform where anonymous users submit prompts, see responses from two randomly chosen models side by side, vote for the better one, and thereby contribute to a global ELO leaderboard.
Is ELO Rating considered intermediate?
ELO Rating is generally considered intermediate-level material in the AI and LLM space.