Evaluation · beginner
Eval-Driven Development (EDD)
Eval-driven development is the LLM analog of test-driven development: you write evals for behavior before changing the prompt or model, and every change is graded against the same eval suite.
Explanation
In classic software, a passing test suite is the green light to merge. In LLM apps, the equivalent is an eval suite — a curated dataset of inputs with expected behaviors or graders that score outputs — that runs on every PR.
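As a rough illustration of what "a curated dataset of inputs with graders" can look like in code, here is a minimal sketch. The names `EvalCase`, `run_suite`, and `fake_model` are hypothetical and chosen for this example only; `fake_model` stands in for a real LLM call so the sketch runs offline.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    name: str
    prompt: str
    grader: Callable[[str], bool]  # returns True when the output is acceptable


def run_suite(model: Callable[[str], str], cases: list[EvalCase]) -> dict[str, bool]:
    """Run every case through the model and record pass/fail per case."""
    return {case.name: case.grader(model(case.prompt)) for case in cases}


# Hypothetical stand-in for a real LLM call (assumption for this sketch).
def fake_model(prompt: str) -> str:
    if "refund" in prompt.lower():
        return "Refunds are accepted within 30 days."
    return "I'm not sure."


# Illustrative cases only; a real suite would be curated from product behavior.
SUITE = [
    EvalCase("refund_policy", "What is your refund window?",
             lambda out: "30 days" in out),
    EvalCase("unknown_topic", "What is the CEO's shoe size?",
             lambda out: "not sure" in out.lower()),
]

results = run_suite(fake_model, SUITE)
pass_rate = sum(results.values()) / len(results)
```

In a CI setup, a script like this would run on every PR and report `results` and `pass_rate`, much as a unit-test runner reports passing and failing tests.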
The discipline: every bug report becomes an eval case; every new feature adds eval coverage before the prompt change ships; every prompt or model change is compared to baseline on the same eval set.
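The "every bug report becomes an eval case" habit can be as small as appending one row to a CSV. The sketch below assumes a hypothetical four-column layout (`case_id`, `input`, `expected_substring`, `source_ticket`) and uses an in-memory buffer instead of a real file; the ticket id and case contents are made up for illustration.

```python
import csv
import io

# Assumed column layout for the eval CSV (not prescribed by any tool).
FIELDS = ["case_id", "input", "expected_substring", "source_ticket"]


def add_bug_as_case(csv_file, case_id, user_input, expected_substring, ticket):
    """Append one regression case; the ticket id keeps the case traceable."""
    writer = csv.DictWriter(csv_file, fieldnames=FIELDS)
    writer.writerow({
        "case_id": case_id,
        "input": user_input,
        "expected_substring": expected_substring,
        "source_ticket": ticket,
    })


# In-memory stand-in for the eval CSV so the example is self-contained.
buffer = io.StringIO()
csv.DictWriter(buffer, fieldnames=FIELDS).writeheader()

# Hypothetical bug report turned into an eval row before the fix is merged.
add_bug_as_case(buffer, "trial-end-date-01", "When does my trial end?",
                "March 5", "BUG-123")

rows = list(csv.DictReader(io.StringIO(buffer.getvalue())))
```

Recording the source ticket alongside the case makes it easy to see later why a given row exists and which regression it guards against.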
The eval suite codifies "what good looks like." Without it, prompt edits are vibes; with it, every change is a measurable A/B comparison between two configurations. In practice, many production LLM teams converge on eval-driven development within 6-12 months.
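To make the "measurable A/B between two configurations" idea concrete, the following sketch compares per-case scores for a baseline and a candidate on the same eval set. The `compare` helper and the score values are hypothetical; in practice the scores would come from running graders over real model outputs.

```python
from statistics import mean


def compare(baseline: dict[str, float], candidate: dict[str, float]) -> dict:
    """Summarize how a candidate differs from the baseline, case by case."""
    if baseline.keys() != candidate.keys():
        raise ValueError("both configurations must be scored on the same eval set")
    deltas = {case: candidate[case] - baseline[case] for case in baseline}
    return {
        "mean_baseline": mean(baseline.values()),
        "mean_candidate": mean(candidate.values()),
        "improved": sorted(c for c, d in deltas.items() if d > 0),
        "regressed": sorted(c for c, d in deltas.items() if d < 0),
    }


# Made-up scores in [0, 1] for illustration only.
baseline_scores = {"refund_policy": 0.6, "tone": 0.8, "unknown_topic": 0.5}
candidate_scores = {"refund_policy": 0.9, "tone": 0.7, "unknown_topic": 0.5}

report = compare(baseline_scores, candidate_scores)
```

Refusing to compare runs with different case sets is a deliberate choice here: the comparison is only meaningful when both configurations are graded on identical inputs.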
Examples
- A team's prompt PR template requires: eval set updated if behavior changed, win rate against baseline ≥ 50%, no regressions on three named cases.
- Each new bug ticket becomes a row in the eval CSV before the fix gets merged.
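A PR gate like the one in the first example could be sketched as follows. This is one possible reading, not a standard: it assumes ties count as half a win, and the function names, case names, and scores are all hypothetical.

```python
def win_rate(baseline: dict[str, float], candidate: dict[str, float]) -> float:
    """Share of cases where the candidate beats the baseline (ties count half)."""
    wins = sum(candidate[c] > baseline[c] for c in baseline)
    ties = sum(candidate[c] == baseline[c] for c in baseline)
    return (wins + 0.5 * ties) / len(baseline)


def gate(baseline, candidate, protected_cases, min_win_rate=0.5):
    """Pass only if the win rate clears the bar and no protected case regresses."""
    regressions = [c for c in protected_cases if candidate[c] < baseline[c]]
    passed = win_rate(baseline, candidate) >= min_win_rate and not regressions
    return passed, regressions


# Made-up per-case scores; "protected" plays the role of the three named cases.
baseline = {"greeting": 1.0, "refund": 0.0, "escalation": 1.0, "tone": 0.5}
candidate = {"greeting": 1.0, "refund": 1.0, "escalation": 0.0, "tone": 0.5}
protected = ["greeting", "refund", "escalation"]

rate = win_rate(baseline, candidate)
passed, regressions = gate(baseline, candidate, protected)
```

In this example the candidate meets the 50% win-rate bar but still fails the gate, because it regresses on a protected case. That is the point of naming specific cases: an aggregate number alone can hide a regression that matters.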
When to use eval-driven development
Adopt it as soon as you have more than one prompt change per week or more than one engineer iterating on the same prompt.
Frequently asked
What is Eval-Driven Development?
Eval-driven development is the LLM analog of test-driven development: you write evals for behavior before changing the prompt or model, and every change is graded against the same eval suite.
What is an example of eval-driven development?
A team's prompt PR template requires: eval set updated if behavior changed, win rate against baseline ≥ 50%, no regressions on three named cases.
How is Eval-Driven Development related to Offline Evaluation?
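Both are evaluation concepts, and eval-driven development depends on offline evaluation. Offline evaluation runs a fixed dataset of inputs through a candidate model or prompt, scores each output, and reports aggregate quality; it is the standard way to compare changes before shipping. Eval-driven development is the workflow built around that step: evals are written before the prompt or model changes, and every change is graded against the same suite.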
When should I use eval-driven development?
Adopt it as soon as you have more than one prompt change per week or more than one engineer iterating on the same prompt.
Is Eval-Driven Development considered beginner?
Eval-Driven Development is generally considered beginner-level material in the AI and LLM space.