Evaluation · beginner

Regression Testing (LLMs) (LLM regression test)

LLM regression testing is the practice of running every prompt or model change against a fixed set of "must-pass" examples — bug repros, edge cases, known failure modes — to catch quality regressions.

Published May 31, 2026

Explanation

In classic software, you write a test for every bug you fix so it never recurs. LLM apps need the same discipline, but with looser semantics: outputs vary in wording from run to run, so instead of exact-match assertions each test row carries a pass criterion ("the answer cites the policy doc", "the JSON has these fields", "the response does not mention the competitor by name").
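
The three example criteria above could be written as small predicate functions. This is a minimal sketch, not a standard API: the function names, the `[policy: ...]` citation format, and the competitor name used in the usage note are all assumptions made for illustration.

```python
import json
import re


def cites_policy_doc(output: str) -> bool:
    # Assumption: citations are rendered as "[policy: <doc name>]".
    return re.search(r"\[policy:\s*[^\]]+\]", output) is not None


def json_has_fields(output: str, required: set[str]) -> bool:
    # Passes only if the output parses as a JSON object containing every required key.
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and required.issubset(data.keys())


def avoids_name(output: str, name: str) -> bool:
    # Case-insensitive substring check; a real suite may need word boundaries or aliases.
    return name.lower() not in output.lower()
```

Each function takes the model output and returns a boolean, so a test row can store a criterion alongside its input, for example `lambda out: avoids_name(out, "Acme")` with a hypothetical competitor called "Acme".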

A regression suite typically includes: bug-repro traces from production, hand-curated edge cases, examples that broke in a previous release, and a few canonical good cases.
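
One way to keep those four sources visible is to tag each row with where it came from. The sketch below assumes a hypothetical `RegressionCase` record; the field names, source labels, and sample rows are invented for illustration and are not taken from any particular tool.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class RegressionCase:
    case_id: str
    source: str  # one of: "prod_bug", "edge_case", "past_regression", "canonical"
    prompt_input: str
    passes: Callable[[str], bool]  # pass criterion applied to the model output
    must_pass: bool = True


# Illustrative rows only; real rows would come from traces and curated examples.
SUITE = [
    RegressionCase("bug-001", "prod_bug", "What is the refund window?",
                   lambda out: "30 days" in out),
    RegressionCase("edge-001", "edge_case", "",
                   lambda out: len(out) > 0),
    RegressionCase("rel-001", "past_regression", "Summarize: hello",
                   lambda out: len(out) < 200),
    RegressionCase("ok-001", "canonical", "Say hi",
                   lambda out: "hi" in out.lower()),
]


def composition(suite: list[RegressionCase]) -> Counter:
    # Count rows per source so gaps in coverage are easy to spot.
    return Counter(case.source for case in suite)
```

Calling `composition(SUITE)` gives a per-source count, which makes it easier to notice when, say, the suite holds only canonical cases and no production bug repros.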

CI runs the whole suite on every PR; PRs that regress on must-pass cases are blocked; flaky tests get rewritten or dropped. The suite grows over time and becomes the codified memory of "things that have broken before."
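
A CI job along these lines could be sketched as follows. The policy shown is an assumption, not a prescribed rule: each case runs a few times, a case that fails every attempt blocks the PR, and a case with mixed results is reported as flaky so it can be rewritten or dropped. `generate` stands in for whatever function calls the prompt or model under test.

```python
from typing import Callable

# (case_id, input text, pass criterion applied to the output)
Case = tuple[str, str, Callable[[str], bool]]


def run_suite(
    generate: Callable[[str], str],
    cases: list[Case],
    attempts: int = 3,
) -> tuple[list[str], list[str]]:
    """Return (failed_ids, flaky_ids) after running every case `attempts` times."""
    failed: list[str] = []
    flaky: list[str] = []
    for case_id, text, passes in cases:
        results = [passes(generate(text)) for _ in range(attempts)]
        if not any(results):
            failed.append(case_id)  # consistent failure: a real regression
        elif not all(results):
            flaky.append(case_id)   # mixed results: rewrite or drop the test
    return failed, flaky


def exit_code(failed: list[str]) -> int:
    # A non-zero exit code is what makes the CI job block the PR.
    return 1 if failed else 0
```

A CI script would pass the result of `exit_code(failed)` to `sys.exit` and print the flaky IDs for follow-up. Whether flaky must-pass cases should also block is a team decision; this sketch only reports them.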

Examples

A PR template requires: zero regressions on the 30 must-pass examples; overall win rate ≥ 50%; cost per call within 10% of baseline.
A new bug-fix PR adds one row to the regression set before the prompt change ships.
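
The three thresholds in the PR template example can be checked mechanically. The sketch below is one possible reading: `EvalSummary` and its fields are hypothetical names, ties are assumed not to count as wins, and "within 10% of baseline" is interpreted as at most 10% more expensive than baseline.

```python
from dataclasses import dataclass


@dataclass
class EvalSummary:
    must_pass_failures: int        # failures among the must-pass examples
    wins: int                      # pairwise comparisons won against baseline
    comparisons: int               # total pairwise comparisons
    cost_per_call: float           # candidate cost, same unit as baseline
    baseline_cost_per_call: float


def gate(summary: EvalSummary) -> list[str]:
    """Return the reasons a PR should be blocked; an empty list means it passes."""
    problems: list[str] = []
    if summary.must_pass_failures > 0:
        problems.append(f"{summary.must_pass_failures} must-pass regression(s)")
    win_rate = summary.wins / summary.comparisons if summary.comparisons else 0.0
    if win_rate < 0.50:
        problems.append(f"win rate {win_rate:.0%} is below 50%")
    # Assumption: only cost increases are penalized, not decreases.
    if summary.cost_per_call > summary.baseline_cost_per_call * 1.10:
        problems.append("cost per call is more than 10% above baseline")
    return problems
```

Returning the list of reasons rather than a bare boolean lets the CI job post them on the PR, so the author sees which threshold was missed.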

When to use regression testing (LLMs)

As soon as you have repeat bugs or "we fixed that already, didn't we?" moments.

Frequently asked

How is Regression Testing (LLMs) related to Eval-Driven Development?

Regression Testing (LLMs) and Eval-Driven Development are both evaluation concepts. Eval-driven development is the LLM analog of test-driven development: you write evals for behavior before changing the prompt or model, and every change is graded against the same eval suite.

Eval-Driven Development · Evaluation

Eval-driven development is the LLM analog of test-driven development: you write evals for behavior before changing the prompt or model, and every change is graded against the same eval suite.

Offline Evaluation · Evaluation

Offline evaluation runs a fixed dataset of inputs through a candidate model or prompt, scores each output, and reports aggregate quality — the standard way to compare changes before shipping.

Online Evaluation · Evaluation

Online evaluation runs scoring functions over live production traffic — usually a sample of recent traces — to monitor quality continuously instead of relying solely on a fixed offline dataset.

User Feedback Loop · Evaluation

A user feedback loop ingests explicit signals — thumbs up/down, edits, regenerates, copy-to-clipboard — back into evaluation and fine-tuning, turning real usage into a continuous quality signal.

Sources

Hamel Husain — Your AI Product Needs Evals