ModelTerms

Evaluation · beginner

Win Rate

Win rate is the share of pairwise comparisons one candidate wins against another. It is the standard scalar for "model A is better than model B" in modern LLM evaluation.

Explanation

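If you run a pairwise eval where prompt v4 is judged better than v3 in 116 of 200 cases (with 12 ties), v4's win rate vs v3 is 116/(200-12) = 61.7% when ties are excluded from the denominator. Above 50% means v4 wins more often than it loses. Whether the gap is statistically significant depends on sample size: at ~150 decisive cases the 95% margin of error is roughly ±8 points, so reliably detecting a 5-point edge over 50% typically takes several hundred cases.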
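
The arithmetic above can be sketched in a few lines of Python. This is a minimal illustration, not any tool's implementation: the function names `win_rate` and `wilson_interval` are made up for this example, ties are dropped from the denominator as in the text, and the Wilson score interval is one common choice for putting error bars on a proportion.

```python
import math

def win_rate(wins: int, losses: int) -> float:
    """Share of decisive (non-tie) comparisons won."""
    decisive = wins + losses
    if decisive == 0:
        raise ValueError("no decisive comparisons")
    return wins / decisive

def wilson_interval(wins: int, decisive: int, z: float = 1.96):
    """Wilson score interval for a proportion (z=1.96 is roughly 95%)."""
    p = wins / decisive
    denom = 1 + z**2 / decisive
    center = (p + z**2 / (2 * decisive)) / denom
    half = z * math.sqrt(p * (1 - p) / decisive + z**2 / (4 * decisive**2)) / denom
    return center - half, center + half

# Numbers from the example in the text: 200 cases, 116 wins, 12 ties.
total, wins, ties = 200, 116, 12
losses = total - wins - ties          # 72 decisive losses
rate = win_rate(wins, losses)         # 116 / 188
low, high = wilson_interval(wins, wins + losses)
print(f"win rate = {rate:.1%}, ~95% interval = [{low:.1%}, {high:.1%}]")
```

For these counts the lower end of the interval sits above 50%, which is what "distinguishable from a coin flip" looks like in practice; with fewer cases the same win rate would produce a wider interval.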

Win rate dominates eval reporting because it's interpretable — anyone understands "wins 62% of the time." Modern prompt-versioning tools (LangSmith, Phoenix, Braintrust) compute it per slice automatically.

Cautions: win rate against a weak baseline is uninformative; aggregate win rate hides which slices regressed; LLM-judge biases (toward longer answers, toward a particular position in the prompt, or toward the judge's own model family) inflate or distort numbers.
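
One common way to reduce position bias is to judge each pair twice with the order swapped and count a win only when both orderings agree. The sketch below assumes a hypothetical `judge(prompt, first, second)` callable that returns "first", "second", or "tie"; the two toy judges stand in for a real LLM call and exist only to make the example runnable.

```python
from typing import Callable

def debiased_verdict(judge: Callable[[str, str, str], str],
                     prompt: str, a: str, b: str) -> str:
    """Return "a", "b", or "tie"; a win must survive swapping positions."""
    v1 = judge(prompt, a, b)  # a shown first
    v2 = judge(prompt, b, a)  # b shown first
    if v1 == "first" and v2 == "second":
        return "a"
    if v1 == "second" and v2 == "first":
        return "b"
    return "tie"  # inconsistent or tied verdicts are treated as a tie

# Toy judge with pure position bias: always picks whatever is shown first.
def always_first(prompt: str, first: str, second: str) -> str:
    return "first"

# Toy judge with pure length bias: picks the longer response.
def prefers_longer(prompt: str, first: str, second: str) -> str:
    if len(first) == len(second):
        return "tie"
    return "first" if len(first) > len(second) else "second"

print(debiased_verdict(always_first, "q", "x", "y"))
print(debiased_verdict(prefers_longer, "q", "long answer", "short"))
```

Swapping neutralizes the position-biased judge (its verdicts cancel into a tie) but not the length-biased one, which still picks the longer answer in both orderings. Length and family bias need separate controls.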

Examples

  • Llama 3 70B Instruct vs GPT-3.5: ~60% win rate on AlpacaEval.
  • Internal prompt v4 vs v3 on the FAQ category: 73% win rate, but on the long-document category: 48%.
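
The second example shows why per-slice reporting matters. A minimal sketch of slicing, assuming each comparison is stored as a `(slice_name, outcome)` pair with outcome in "win", "loss", or "tie"; the records below are invented for illustration and are not the figures quoted above.

```python
from collections import defaultdict

def win_rate_by_slice(records):
    """Map each slice to wins / (wins + losses); None if no decisive cases."""
    counts = defaultdict(lambda: {"win": 0, "loss": 0, "tie": 0})
    for slice_name, outcome in records:
        counts[slice_name][outcome] += 1  # KeyError on an unknown outcome
    rates = {}
    for name, c in counts.items():
        decisive = c["win"] + c["loss"]
        rates[name] = c["win"] / decisive if decisive else None
    return rates

# Made-up toy data: candidate outcomes tagged by category.
records = [
    ("faq", "win"), ("faq", "win"), ("faq", "win"), ("faq", "loss"),
    ("long-document", "win"), ("long-document", "loss"),
    ("long-document", "loss"), ("long-document", "tie"),
]

rates = win_rate_by_slice(records)
overall = win_rate_by_slice(("all", outcome) for _, outcome in records)["all"]
print(rates, overall)
```

In this toy data the overall win rate is above 50% even though the long-document slice is well below it, which is the regression an aggregate number would hide.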

Pairwise Comparison · Evaluation

Pairwise comparison asks a judge (human or LLM) to pick the better of two responses to the same prompt. The verdicts aggregate to a win rate, and it is the dominant method for comparing model or prompt versions.

LLM-as-Judge · Evaluation

LLM-as-judge uses a strong LLM to score or compare outputs from other LLMs. It is how most production teams evaluate quality at scale when human review is too slow.

Chatbot Arena · Evaluation

Chatbot Arena is a public LLM evaluation platform where anonymous users submit prompts, see two random models' responses side-by-side, vote for the better one, and contribute to a global Elo leaderboard.

Elo Rating · Evaluation

Elo is a rating system originally from chess that converts pairwise wins between players into a single skill number. Chatbot Arena uses it to rank LLMs from anonymous user votes.

Offline Evaluation · Evaluation

Offline evaluation runs a fixed dataset of inputs through a candidate model or prompt, scores each output, and reports aggregate quality — the standard way to compare changes before shipping.
