Evaluation · intermediate
Online Evaluation (online eval, production eval)
Online evaluation runs scoring functions over live production traffic — usually a sample of recent traces — to monitor quality continuously instead of relying solely on a fixed offline dataset.
Explanation
Offline eval tells you how a change performs on yesterday's known cases. Online eval tells you how it's performing on today's unknown ones.
Typical setup: a sampler picks 1-10% of production traces; LLM-as-judge graders score each sampled trace (faithfulness, relevance, etc.); scores are stored alongside the trace; a dashboard tracks rolling quality metrics; and alerts fire when scores drop by more than a set threshold.
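The setup above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not any particular tool's API: the `OnlineEvaluator` class, the `should_sample` helper, the trace dictionary shape, and the default baseline and threshold values are all hypothetical, and the judge is passed in as a plain function standing in for an LLM-as-judge call.

```python
import hashlib
from collections import deque
from statistics import mean
from typing import Callable, Deque, Dict, Optional

Trace = Dict[str, str]  # hypothetical trace shape: at least an "id" key


def should_sample(trace_id: str, rate: float) -> bool:
    """Deterministically select roughly `rate` of traces by hashing the trace id."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # integer in 0..9999
    return bucket < rate * 10_000


class OnlineEvaluator:
    """Samples traces, scores them with a judge, and checks a rolling mean."""

    def __init__(
        self,
        judge: Callable[[Trace], float],  # stand-in for an LLM-as-judge grader
        sample_rate: float = 0.05,        # assumed default: 5% of traces
        window: int = 200,                # number of recent scores to average
        baseline: float = 0.9,            # assumed expected score level
        max_drop: float = 0.1,            # alert if the mean falls this far below baseline
    ) -> None:
        self.judge = judge
        self.sample_rate = sample_rate
        self.baseline = baseline
        self.max_drop = max_drop
        self.stored_scores: Dict[str, float] = {}  # score kept alongside trace id
        self.recent: Deque[float] = deque(maxlen=window)

    def observe(self, trace: Trace) -> Optional[float]:
        """Score the trace if it is sampled; return None if it is skipped."""
        if not should_sample(trace["id"], self.sample_rate):
            return None
        score = self.judge(trace)
        self.stored_scores[trace["id"]] = score
        self.recent.append(score)
        return score

    def rolling_mean(self) -> Optional[float]:
        return mean(self.recent) if self.recent else None

    def should_alert(self) -> bool:
        current = self.rolling_mean()
        return current is not None and (self.baseline - current) > self.max_drop
```

Hashing the trace id, rather than drawing a random number, means the same trace always gets the same sampling decision, which makes reruns and debugging easier. A real deployment would persist scores in the tracing backend instead of an in-memory dictionary.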
The catch is cost: every LLM-judged online eval is another LLM call. Tactics: sample a smaller percentage, use cheaper judge models, run only flagged sessions through expensive evals, and batch eval runs hourly instead of running them inline.
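Two of these tactics, a cheap first-pass judge and batched runs, can be combined as in the sketch below. It assumes a hypothetical `evaluate_batch` function that would be called by an hourly job over the traces sampled in that hour; both judges are plain functions standing in for LLM calls, and the `flag_below` cutoff of 0.7 is an arbitrary illustrative value.

```python
from typing import Callable, Dict, Iterable, List

Trace = Dict[str, str]           # hypothetical trace shape: at least an "id" key
Judge = Callable[[Trace], float]


def evaluate_batch(
    traces: Iterable[Trace],
    cheap_judge: Judge,
    expensive_judge: Judge,
    flag_below: float = 0.7,     # assumed cutoff for escalating to the expensive judge
) -> List[Dict[str, object]]:
    """Score a batch of traces, escalating only low-scoring ones."""
    results: List[Dict[str, object]] = []
    for trace in traces:
        score = cheap_judge(trace)
        judge_used = "cheap"
        if score < flag_below:
            # Only flagged traces pay for the expensive judge call.
            score = expensive_judge(trace)
            judge_used = "expensive"
        results.append({"id": trace["id"], "score": score, "judge": judge_used})
    return results
```

With this shape, the expensive judge's cost scales with the fraction of traces the cheap judge flags rather than with total sampled volume.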
Online eval also feeds back into the offline set: production failures become regression cases.
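One way to close that loop is to turn low-scoring production traces into rows for the offline dataset. The sketch below is an assumption-laden illustration: the function names, the trace fields (`id`, `input`, `output`, `score`), the output row fields, and the 0.5 failure cutoff are all hypothetical, and JSONL is used only as a common choice of dataset format.

```python
import json
from typing import Dict, Iterable, List, Set


def failures_to_regression_cases(
    scored_traces: Iterable[Dict[str, object]],
    existing_inputs: Set[str],
    fail_below: float = 0.5,  # assumed cutoff for calling a trace a failure
) -> List[Dict[str, object]]:
    """Convert failing traces into new offline cases, skipping known inputs."""
    seen = set(existing_inputs)
    new_cases: List[Dict[str, object]] = []
    for trace in scored_traces:
        if trace["score"] >= fail_below or trace["input"] in seen:
            continue
        seen.add(trace["input"])
        new_cases.append(
            {
                "input": trace["input"],
                "bad_output": trace["output"],   # kept for reference, not as a target
                "source_trace_id": trace["id"],
                "online_score": trace["score"],
            }
        )
    return new_cases


def to_jsonl(cases: Iterable[Dict[str, object]]) -> str:
    """Serialize cases one JSON object per line, ready to append to a dataset file."""
    return "\n".join(json.dumps(case) for case in cases)
```

In practice a person would usually review these candidates and attach an expected output before they are added to the regression set.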
Examples
- Phoenix running a faithfulness eval on 5% of production RAG traces, with a dashboard charting the rolling 7-day mean.
- A customer support bot whose hallucination-detection eval auto-tags any session where the model contradicted its retrieved context.
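For illustration, a rolling 7-day mean like the one in the first example can be computed from timestamped scores as below. This is a plain-Python sketch, not Phoenix's API: the `rolling_mean` function and the `(timestamp, score)` tuple layout are assumptions.

```python
from datetime import datetime, timedelta
from statistics import mean
from typing import Iterable, Optional, Tuple


def rolling_mean(
    scored: Iterable[Tuple[datetime, float]],
    now: datetime,
    window: timedelta = timedelta(days=7),
) -> Optional[float]:
    """Mean of scores whose timestamp falls within `window` before `now`."""
    cutoff = now - window
    recent = [score for ts, score in scored if cutoff < ts <= now]
    return mean(recent) if recent else None
```

Recomputing this value at each dashboard refresh gives the rolling series that the chart plots.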
When to use online evaluation
Use it once offline eval is solid and you have meaningful production volume, to extend eval coverage from a fixed dataset to live traffic.
Frequently asked
What is Online Evaluation?
Online evaluation runs scoring functions over live production traffic — usually a sample of recent traces — to monitor quality continuously instead of relying solely on a fixed offline dataset.
What is an example of online evaluation?
Phoenix running a faithfulness eval on 5% of production RAG traces, with a dashboard charting the rolling 7-day mean.
How is Online Evaluation related to Offline Evaluation?
They are complementary: offline evaluation runs a fixed dataset of inputs through a candidate model or prompt, scores each output, and reports aggregate quality, which makes it the standard way to compare changes before shipping. Online evaluation scores a sample of live production traffic after shipping, and the failures it finds become new regression cases in the offline set.
When should I use online evaluation?
Use it once offline eval is solid and you have meaningful production volume, to extend eval coverage from a fixed dataset to live traffic.
Is Online Evaluation considered intermediate?
Online Evaluation is generally considered intermediate-level material in the AI and LLM space.