ModelTerms

Evaluation · intermediate

Pairwise Comparison (A/B eval, side-by-side eval)

Pairwise comparison asks a judge (human or LLM) to pick the better of two responses to the same prompt. The judgments aggregate into a win rate, and the method is the dominant way to compare model or prompt versions.

Explanation

Asking "is this output good?" produces vague absolute scores that drift across raters. Asking "which of A or B is better?" produces stable preferences that aggregate into a clean win rate.

The standard setup: same prompt, two candidate responses, randomly shuffle which is "A" vs "B," ask a judge to pick the better one (or call it a tie). Average over many prompts, report A's win rate.
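The setup described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the judge signature (returning "first", "second", or "tie"), the function names, and the toy judge are all assumptions made for the example. In practice the judge would be a human rater or an LLM call.

```python
import random
from typing import Callable, Sequence, Tuple

# Hypothetical judge interface: (prompt, first_shown, second_shown)
# -> "first", "second", or "tie".
Judge = Callable[[str, str, str], str]


def compare_once(prompt: str, resp_a: str, resp_b: str,
                 judge: Judge, rng: random.Random) -> str:
    """Show the two responses in random order and return 'A', 'B', or 'tie'."""
    swapped = rng.random() < 0.5
    first, second = (resp_b, resp_a) if swapped else (resp_a, resp_b)
    verdict = judge(prompt, first, second)
    if verdict == "tie":
        return "tie"
    # Map the displayed position back to the underlying candidate.
    if verdict == "first":
        return "B" if swapped else "A"
    return "A" if swapped else "B"


def win_rate_a(examples: Sequence[Tuple[str, str, str]],
               judge: Judge, seed: int = 0) -> float:
    """Share of all comparisons that candidate A wins (ties count as non-wins)."""
    rng = random.Random(seed)
    outcomes = [compare_once(p, a, b, judge, rng) for p, a, b in examples]
    return outcomes.count("A") / len(outcomes)


def toy_judge(prompt: str, first: str, second: str) -> str:
    """Stand-in judge: prefers whichever response contains the word 'correct'."""
    f, s = "correct" in first, "correct" in second
    if f == s:
        return "tie"
    return "first" if f else "second"


examples = [
    ("q1", "correct x", "wrong"),
    ("q2", "correct y", "nope"),
    ("q3", "bad", "correct z"),
    ("q4", "meh", "meh2"),
]
print(win_rate_a(examples, toy_judge))  # 0.5 (A wins 2 of 4, one loss, one tie)
```

Because the toy judge looks only at content, the result does not depend on the shuffle. With a real judge, the shuffle keeps any position preference from systematically helping one candidate.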

LLM-as-judge handles bulk pairwise comparisons cheaply but has known biases: it tends to favor longer responses, the first-shown option, and responses from its own model family. Mitigate by randomizing position, judging each pair in both orders (A vs B and B vs A), and using a judge model stronger than the candidates.
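One way to apply the "both orders" mitigation is to count a win only when the judge picks the same candidate regardless of which position it was shown in. The sketch below assumes a hypothetical judge function that returns "first", "second", or "tie". Treating inconsistent verdicts as ties is one possible convention, not the only one.

```python
from typing import Callable

# Hypothetical judge interface: (prompt, first_shown, second_shown)
# -> "first", "second", or "tie".
Judge = Callable[[str, str, str], str]


def judge_both_orders(prompt: str, resp_a: str, resp_b: str, judge: Judge) -> str:
    """Judge A-vs-B and B-vs-A; return 'A' or 'B' only if both verdicts agree."""
    forward = judge(prompt, resp_a, resp_b)    # A shown first
    backward = judge(prompt, resp_b, resp_a)   # B shown first
    if forward == "first" and backward == "second":
        return "A"
    if forward == "second" and backward == "first":
        return "B"
    # Explicit ties and position-dependent (inconsistent) verdicts both land here.
    return "tie"


def always_first(prompt: str, first: str, second: str) -> str:
    """A maximally position-biased stand-in judge."""
    return "first"


def longer_wins(prompt: str, first: str, second: str) -> str:
    """A length-biased stand-in judge that is consistent across positions."""
    if len(first) == len(second):
        return "tie"
    return "first" if len(first) > len(second) else "second"


print(judge_both_orders("p", "x", "y", always_first))               # tie
print(judge_both_orders("p", "a long answer", "hi", longer_wins))   # A
```

The second call shows the limit of this check: swapping positions neutralizes position bias, but a length bias survives because it is consistent in both orders.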

Pairwise comparison is the engine behind Chatbot Arena, internal A/B prompt evals, and most DPO data collection.

Examples

  • Comparing prompt v3 vs prompt v4 on 200 fixed examples: GPT-4 judge picks v4 as better in 58% of cases (with 6% ties).
  • Chatbot Arena: anonymous users pick between two random models' responses to the same prompt; the votes aggregate into Elo ratings.
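To make the first example concrete, the sketch below works through the arithmetic. It reads "58% with 6% ties" as shares of all 200 examples, so 116 wins for v4, 12 ties, and 72 wins for v3. That reading is an assumption, as are the function name and the choice of a normal-approximation interval, which is a rough check rather than the only valid one.

```python
import math
from typing import Dict


def summarize(wins: int, ties: int, losses: int, z: float = 1.96) -> Dict[str, float]:
    """Report a few common win-rate conventions plus a rough 95% interval."""
    total = wins + ties + losses
    decided = wins + losses
    decisive = wins / decided                 # ties dropped
    # Normal-approximation standard error on the decisive win rate.
    se = math.sqrt(decisive * (1 - decisive) / decided)
    return {
        "raw": wins / total,                      # ties count as non-wins
        "ties_half": (wins + 0.5 * ties) / total, # each tie is half a win
        "decisive": decisive,
        "ci_low": decisive - z * se,
        "ci_high": decisive + z * se,
    }


# Assumed counts for the prompt v3 vs v4 example: 200 comparisons.
stats = summarize(wins=116, ties=12, losses=72)
for name, value in stats.items():
    print(f"{name}: {value:.3f}")
```

Under these assumptions the raw win rate is 0.58, the ties-as-half rate is 0.61, and the lower bound of the interval stays above 0.5, which suggests v4's lead is unlikely to be noise from the sample alone. Which convention to report is a design choice; stating it alongside the number avoids ambiguity.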


Win Rate · Evaluation

Win rate is the share of pairwise comparisons one candidate wins against another. It is the standard scalar for "model A is better than model B" in modern LLM evaluation.

LLM-as-Judge · Evaluation

LLM-as-judge uses a strong LLM to score or compare outputs from other LLMs. It is how most production teams evaluate quality at scale when human review is too slow.

Chatbot Arena · Evaluation

Chatbot Arena is a public LLM evaluation platform where anonymous users submit prompts, see two random models' responses side-by-side, vote for the better one, and contribute to a global Elo leaderboard.

Elo Rating · Evaluation

Elo is a rating system originally from chess that converts pairwise wins between players into a single skill number. Chatbot Arena uses it to rank LLMs from anonymous user votes.
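The conversion from pairwise results to a rating can be illustrated with the classic Elo update. The sketch below uses the conventional chess parameters (a 400-point scale and K = 32) as assumptions. It is not a description of how any particular leaderboard computes its numbers.

```python
from typing import Tuple


def expected_score(rating_a: float, rating_b: float) -> float:
    """Predicted score for A against B (1 = win, 0.5 = tie, 0 = loss)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_update(rating_a: float, rating_b: float,
               score_a: float, k: float = 32.0) -> Tuple[float, float]:
    """Return updated (rating_a, rating_b) after one comparison.

    score_a is 1.0 if A won, 0.0 if B won, 0.5 for a tie.
    k is an assumed step size; larger values react faster but are noisier.
    """
    delta = k * (score_a - expected_score(rating_a, rating_b))
    # The update is zero-sum: whatever A gains, B loses.
    return rating_a + delta, rating_b - delta


# Two models start level; A wins one vote.
print(elo_update(1000.0, 1000.0, score_a=1.0))  # (1016.0, 984.0)
```

An upset moves ratings more than an expected result, because the update scales with the gap between the actual and the predicted score.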

Preference Data · Training

Preference data is a collection of (chosen, rejected) response pairs for the same prompt. It is the fuel for DPO and reward-model training.
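Pairwise verdicts map naturally onto this shape. The sketch below turns a list of judged comparisons into (chosen, rejected) records and drops ties. The record keys and the input tuple layout are illustrative assumptions; a given training library may expect different field names.

```python
from typing import Dict, List, Sequence, Tuple

# Assumed input layout: (prompt, response_a, response_b, verdict),
# where verdict is "A", "B", or "tie".
Judgment = Tuple[str, str, str, str]


def to_preference_records(judgments: Sequence[Judgment]) -> List[Dict[str, str]]:
    """Convert pairwise verdicts into chosen/rejected records, skipping ties."""
    records = []
    for prompt, resp_a, resp_b, verdict in judgments:
        if verdict == "A":
            chosen, rejected = resp_a, resp_b
        elif verdict == "B":
            chosen, rejected = resp_b, resp_a
        else:
            continue  # a tie carries no preference signal in this scheme
        records.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return records


judgments = [
    ("Summarize the memo.", "Short summary.", "Rambling summary.", "A"),
    ("Translate 'hello'.", "hola?", "Hola.", "B"),
    ("Name a prime.", "7", "11", "tie"),
]
for record in to_preference_records(judgments):
    print(record)
```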

Offline Evaluation · Evaluation

Offline evaluation runs a fixed dataset of inputs through a candidate model or prompt, scores each output, and reports aggregate quality — the standard way to compare changes before shipping.
