Evaluation · intermediate
Pairwise Comparison (A/B eval, side-by-side eval)
Pairwise comparison asks a judge, human or LLM, to pick the better of two responses to the same prompt. The judgments aggregate into a win rate, and the method is the dominant way to compare model or prompt versions.
Explanation
Asking "is this output good?" tends to produce vague absolute scores that drift across raters. Asking "which of A or B is better?" tends to produce more stable preferences that aggregate into a clean win rate.
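The standard setup: take one prompt and two candidate responses, randomly assign which candidate is shown as "A" and which as "B," and ask a judge to pick the better one (or call it a tie). Repeat over many prompts, map each verdict back to the underlying candidate, and report each candidate's win rate.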
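The setup above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name pairwise_win_rate, the judge signature (prompt, first, second) returning "first", "second", or "tie", and the toy longer_is_better judge are all assumptions made for the example. A real judge would be a human rater or an LLM call.

```python
import random
from collections import Counter
from typing import Callable, Sequence

# Hypothetical judge interface: (prompt, first, second) -> "first" | "second" | "tie"
Judge = Callable[[str, str, str], str]


def pairwise_win_rate(
    prompts: Sequence[str],
    responses_a: Sequence[str],
    responses_b: Sequence[str],
    judge: Judge,
    seed: int = 0,
) -> dict:
    """Return the share of prompts won by candidate a, by candidate b, and tied."""
    rng = random.Random(seed)
    counts = Counter()
    for prompt, resp_a, resp_b in zip(prompts, responses_a, responses_b):
        # Randomize display order so the judge cannot associate a slot with a candidate.
        swapped = rng.random() < 0.5
        first, second = (resp_b, resp_a) if swapped else (resp_a, resp_b)
        verdict = judge(prompt, first, second)
        if verdict == "tie":
            counts["tie"] += 1
        elif (verdict == "first") != swapped:
            # "first" without a swap, or "second" with a swap, both mean candidate a.
            counts["a"] += 1
        else:
            counts["b"] += 1
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no comparisons were made")
    return {key: counts[key] / total for key in ("a", "b", "tie")}


def longer_is_better(prompt: str, first: str, second: str) -> str:
    """Toy stand-in judge (an assumption for the demo): prefers the longer response."""
    if len(first) == len(second):
        return "tie"
    return "first" if len(first) > len(second) else "second"
```

Calling pairwise_win_rate(prompts, responses_a, responses_b, longer_is_better) returns a dictionary with the keys "a", "b", and "tie". Reporting the tie share alongside the two win rates keeps the result unambiguous.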
LLM-as-judge handles bulk pairwise comparisons cheaply but has known biases: it tends to favor longer responses, the first-shown option, and responses from its own model family. Mitigate by randomizing position, judging each pair in both orders (A vs B and B vs A), and using a judge model stronger than the candidates.
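One way to apply the "both orders" mitigation is to count a win only when the judge prefers the same candidate in both presentations, and to treat disagreement as a tie. The sketch below assumes that convention; other conventions exist, such as averaging the two verdicts. The judge signature and the two toy judges are hypothetical, chosen only to show the effect on a position-biased judge.

```python
from typing import Callable

# Hypothetical judge interface: (prompt, first, second) -> "first" | "second" | "tie"
Judge = Callable[[str, str, str], str]


def judge_both_orders(prompt: str, resp_a: str, resp_b: str, judge: Judge) -> str:
    """Return "a", "b", or "tie", requiring agreement across both display orders."""
    verdict_ab = judge(prompt, resp_a, resp_b)  # candidate a shown first
    verdict_ba = judge(prompt, resp_b, resp_a)  # candidate b shown first
    a_wins_both = verdict_ab == "first" and verdict_ba == "second"
    b_wins_both = verdict_ab == "second" and verdict_ba == "first"
    if a_wins_both:
        return "a"
    if b_wins_both:
        return "b"
    # Explicit ties and order-dependent (inconsistent) verdicts both land here.
    return "tie"


def always_first(prompt: str, first: str, second: str) -> str:
    """Toy judge with pure position bias: always picks the first-shown response."""
    return "first"


def prefers_longer(prompt: str, first: str, second: str) -> str:
    """Toy judge with no position bias: picks the longer response."""
    if len(first) == len(second):
        return "tie"
    return "first" if len(first) > len(second) else "second"
```

With always_first as the judge, every pair collapses to "tie", so pure position bias cannot produce a win for either candidate. The cost is two judge calls per pair.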
Pairwise comparison is the engine behind Chatbot Arena, internal A/B prompt evals, and most DPO data collection.
Examples
- Comparing prompt v3 vs prompt v4 on 200 fixed examples: a GPT-4 judge prefers v4 in 58% of cases and calls 6% ties, leaving v3 preferred in the remaining 36%.
- Chatbot Arena: anonymous users pick between two random models' responses to the same prompt; the votes aggregate into Elo-style ratings.
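To illustrate how pairwise votes can turn into ratings, here is the textbook Elo update for a single comparison. It is a sketch of the general technique only and does not describe how Chatbot Arena computes its leaderboard. The K-factor of 32 and the starting rating of 1000 are illustrative assumptions.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that a beats b under the Elo model (logistic curve, scale 400)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Update both ratings after one comparison.

    score_a is 1.0 if a won, 0.0 if b won, and 0.5 for a tie.
    k (assumed to be 32 here) controls how far a single result moves the ratings.
    """
    expected_a = expected_score(rating_a, rating_b)
    delta = k * (score_a - expected_a)
    # The update is zero-sum: whatever a gains, b loses.
    return rating_a + delta, rating_b - delta


# Two models start level at an assumed rating of 1000; model a wins one vote.
new_a, new_b = elo_update(1000.0, 1000.0, score_a=1.0)
```

Because the two models start level, the expected score is 0.5 and the winner gains half of K. An upset win against a higher-rated model moves the ratings further than an expected win does.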
Frequently asked
What is Pairwise Comparison?
Pairwise comparison asks a judge, human or LLM, to pick the better of two responses to the same prompt. The judgments aggregate into a win rate, and the method is the dominant way to compare model or prompt versions.
What is an example of pairwise comparison?
Comparing prompt v3 vs prompt v4 on 200 fixed examples: a GPT-4 judge prefers v4 in 58% of cases and calls 6% ties, leaving v3 preferred in the remaining 36%.
How is Pairwise Comparison related to Win Rate?
Both are evaluation concepts: pairwise comparison is the procedure, and win rate is the number it yields. Win rate is the share of pairwise comparisons one candidate wins against another, and it is the standard scalar for "model A is better than model B" in modern LLM evaluation.
Is Pairwise Comparison considered intermediate?
Pairwise Comparison is generally considered intermediate-level material in the AI and LLM space.