Evaluation · beginner
Win Rate
Win rate is the share of pairwise comparisons that one candidate wins against another. It is the standard scalar for "model A is better than model B" in modern LLM evaluation.
Explanation
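If you run a pairwise eval where prompt v4 is judged better than v3 in 116 of 200 cases, with 12 ties, v4's win rate vs v3 is 116/(200-12) = 116/188 = 61.7% (ties are excluded from the denominator here). Above 50% means v4 wins more often than it loses. Whether the gap is statistically significant depends on sample size: with about 150 decided cases the 95% margin of error is roughly ±8 points, so reliably resolving a 5-point difference from 50% takes closer to 400 decided cases or more.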
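The arithmetic above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed method: the function names are hypothetical, the Wilson score interval is one common choice for a binomial proportion that the text does not itself specify, and the sample-size helper uses a simple margin-of-error rule rather than a full power calculation.

```python
import math
from typing import Tuple


def win_rate(wins: int, total: int, ties: int = 0) -> float:
    """Win rate over decided comparisons (ties dropped from the denominator)."""
    decided = total - ties
    if decided <= 0:
        raise ValueError("no decided comparisons")
    return wins / decided


def wilson_interval(wins: int, decided: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a proportion; z=1.96 is roughly 95% coverage."""
    p = wins / decided
    denom = 1 + z**2 / decided
    center = (p + z**2 / (2 * decided)) / denom
    half = z * math.sqrt(p * (1 - p) / decided + z**2 / (4 * decided**2)) / denom
    return center - half, center + half


def margin_of_error(decided: int, p: float = 0.5, z: float = 1.96) -> float:
    """Normal-approximation half-width of the interval around p."""
    return z * math.sqrt(p * (1 - p) / decided)


def decided_needed(margin: float, p: float = 0.5, z: float = 1.96) -> int:
    """Decided comparisons needed for a given half-width (rough planning figure)."""
    return math.ceil(z**2 * p * (1 - p) / margin**2)


# Figures from the example: 116 wins, 200 cases, 12 ties.
rate = win_rate(wins=116, total=200, ties=12)
low, high = wilson_interval(wins=116, decided=200 - 12)
print(f"win rate {rate:.1%}, 95% interval [{low:.1%}, {high:.1%}]")
print(f"margin at 150 decided cases: {margin_of_error(150):.1%}")
print(f"decided cases for a 5-point margin: {decided_needed(0.05)}")
```

For the example figures the interval runs from roughly 55% to 68%, so it excludes 50%. The same helpers show why small evals are noisy: 150 decided cases give a margin of about 8 points, and a 5-point margin needs about 385.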
Win rate dominates eval reporting because it's interpretable: anyone understands "wins 62% of the time." Eval and prompt-versioning tools such as LangSmith, Phoenix, and Braintrust can report it per slice.
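Cautions: win rate against a weak baseline is uninformative; aggregate win rate hides which slices regressed; and LLM-judge biases (preferring longer answers, preferring a particular position in the prompt, or favoring outputs from the judge's own model family) inflate or distort the numbers.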
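To make the point about hidden regressions concrete, here is a small sketch of a per-slice breakdown. The records, slice names, and the `win_rates_by_slice` helper are made up for illustration and do not come from any particular tool; ties are excluded, as in the calculation earlier.

```python
from collections import defaultdict

# Hypothetical judged records: (slice name, outcome for the candidate).
records = [
    ("faq", "win"), ("faq", "win"), ("faq", "win"), ("faq", "loss"),
    ("long_document", "loss"), ("long_document", "win"),
    ("long_document", "loss"), ("long_document", "tie"),
]


def win_rates_by_slice(records):
    """Return {slice: win rate over decided comparisons} with ties skipped."""
    wins = defaultdict(int)
    decided = defaultdict(int)
    for slice_name, outcome in records:
        if outcome == "tie":
            continue  # ties do not count toward the denominator
        decided[slice_name] += 1
        if outcome == "win":
            wins[slice_name] += 1
    return {name: wins[name] / decided[name] for name in decided}


by_slice = win_rates_by_slice(records)

# Aggregate win rate over all decided comparisons, ignoring slices.
decided_outcomes = [outcome for _, outcome in records if outcome != "tie"]
overall = decided_outcomes.count("win") / len(decided_outcomes)

print(f"overall: {overall:.0%}")
for name, rate in by_slice.items():
    print(f"{name}: {rate:.0%}")
```

In this toy data the aggregate sits above 50% while the long-document slice is well below it, which is the pattern a single headline number would hide. With so few cases per slice the rates are only illustrative; real slices need enough decided comparisons for the per-slice numbers to mean anything.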
Examples
- Llama 3 70B Instruct vs GPT-3.5: ~60% win rate on AlpacaEval.
- Internal prompt v4 vs v3 on the FAQ category: 73% win rate, but on the long-document category: 48%.
Frequently asked
What is Win Rate?
Win rate is the share of pairwise comparisons that one candidate wins against another. It is the standard scalar for "model A is better than model B" in modern LLM evaluation.
What is an example of win rate?
Llama 3 70B Instruct vs GPT-3.5: ~60% win rate on AlpacaEval.
How is Win Rate related to Pairwise Comparison?
Win Rate and Pairwise Comparison are both evaluation concepts. Pairwise comparison asks a judge, human or LLM, to pick the better of two responses to the same prompt. Those individual judgments aggregate to a win rate, which makes pairwise comparison the dominant method for comparing model or prompt versions.
Is Win Rate considered beginner?
Win Rate is generally considered beginner-level material in the AI and LLM space.