ModelTerms

Evaluation · intermediate

Online Evaluation (online eval, production eval)

Online evaluation runs scoring functions over live production traffic — usually a sample of recent traces — to monitor quality continuously instead of relying solely on a fixed offline dataset.

Explanation

Offline eval tells you how a change performs on yesterday's known cases. Online eval tells you how it's performing on today's unknown ones.

Typical setup: a sampler picks 1-10% of production traces; each sampled trace is run through LLM-as-judge graders (faithfulness, relevance, etc.); scores are stored alongside the trace; a dashboard tracks rolling quality metrics; alerts fire when a rolling score drops by more than a set threshold.
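The setup above can be sketched in a few lines. This is a minimal illustration, not any particular tool's API: the trace shape (a dict with an "id"), the judge callable returning a score between 0 and 1, and the constants SAMPLE_RATE, WINDOW, and ALERT_DROP are all assumptions made for the example.

```python
import hashlib
from collections import deque
from statistics import mean

SAMPLE_RATE = 0.05   # assumed: score 5% of traces
WINDOW = 200         # assumed: rolling window, counted in scored traces
ALERT_DROP = 0.10    # assumed: alert when the rolling mean falls this far below baseline


def should_sample(trace_id: str, rate: float = SAMPLE_RATE) -> bool:
    """Deterministic sampling: hash the trace id into [0, 1) and compare to the rate."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") / 2**32
    return bucket < rate


class RollingQuality:
    """Keeps the most recent judge scores and flags drops below a baseline."""

    def __init__(self, baseline: float, window: int = WINDOW, max_drop: float = ALERT_DROP):
        self.baseline = baseline
        self.max_drop = max_drop
        self.scores = deque(maxlen=window)  # old scores fall off automatically

    def add(self, score: float) -> None:
        self.scores.append(score)

    def rolling_mean(self) -> float:
        # With no scores yet, report the baseline so no alert fires.
        return mean(self.scores) if self.scores else self.baseline

    def should_alert(self) -> bool:
        return self.baseline - self.rolling_mean() > self.max_drop


def run_online_eval(traces, judge, monitor, rate: float = SAMPLE_RATE):
    """Score sampled traces with `judge` (a hypothetical callable returning 0..1)."""
    scored = []
    for trace in traces:
        if not should_sample(trace["id"], rate):
            continue
        score = judge(trace)
        scored.append({**trace, "faithfulness": score})  # score stored alongside the trace
        monitor.add(score)
    return scored
```

Hashing the trace id, rather than drawing a random number, means the same trace is always either in or out of the sample, which keeps reruns and backfills consistent.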

The catch is cost: every LLM-as-judge score is another LLM call. Tactics: sample a smaller percentage, use cheaper judge models, send only flagged sessions to expensive evals, and batch eval runs hourly instead of running them inline.
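A rough way to reason about these tactics is to estimate spend and to route traces through a cheap judge before an expensive one. The sketch below is illustrative only: the function names, the "flagged" field on a trace, the escalation threshold, and the traffic and price figures in the usage note are hypothetical.

```python
def daily_judge_cost(traces_per_day: int, sample_rate: float,
                     graders_per_trace: int, cost_per_call: float) -> float:
    """Estimated daily spend: sampled traces x graders per trace x cost per judge call."""
    return traces_per_day * sample_rate * graders_per_trace * cost_per_call


def tiered_score(trace, cheap_judge, expensive_judge, escalate_below: float = 0.7):
    """Run the cheap judge first; escalate only low-scoring or flagged traces.

    Returns (score, tier), where tier records which judge produced the score.
    """
    score = cheap_judge(trace)
    if score < escalate_below or trace.get("flagged", False):
        return expensive_judge(trace), "expensive"
    return score, "cheap"
```

With made-up numbers, daily_judge_cost(100_000, 0.05, 2, 0.002) comes to about 20 (in whatever currency cost_per_call uses); halving the sample rate halves the bill, which is why sampling is usually the first lever.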

Online eval also feeds back into the offline set: production failures become regression cases.
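One possible shape for that feedback step is sketched below. It assumes scored traces are dicts with "id", "input", "output", and a metric score; the field names, the fail_below cutoff, and the JSONL output format are assumptions for illustration, not a standard.

```python
import json


def to_regression_cases(scored_traces, metric: str = "faithfulness", fail_below: float = 0.5):
    """Turn low-scoring production traces into offline regression cases.

    Deduplicates on the input text so a repeated failure yields one case.
    """
    cases = []
    seen_inputs = set()
    for trace in scored_traces:
        score = trace.get(metric)
        if score is None or score >= fail_below:
            continue  # unscored or passing traces are not regression cases
        if trace["input"] in seen_inputs:
            continue
        seen_inputs.add(trace["input"])
        cases.append({
            "input": trace["input"],
            "bad_output": trace["output"],  # kept for reference, not as the expected answer
            "source_trace": trace["id"],
            "metric": metric,
            "score": score,
        })
    return cases


def to_jsonl(cases) -> str:
    """Serialize cases one JSON object per line, ready to append to an offline dataset file."""
    return "\n".join(json.dumps(case, sort_keys=True) for case in cases)
```

The failing output is stored only as context; someone still has to decide what the correct answer is before the case is useful as a regression test.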

Examples

  • Phoenix running a faithfulness eval on 5% of production RAG traces, with a dashboard charting the rolling 7-day mean.
  • A customer support bot whose hallucination-detection eval auto-tags any session where the model contradicted its retrieved context.

When to use online evaluation

Use it once offline eval is solid and you have meaningful production volume. It extends eval coverage from a fixed dataset to live traffic.

Frequently asked

How is Online Evaluation related to Offline Evaluation?

Offline evaluation runs a fixed dataset through a candidate model or prompt to compare changes before shipping; online evaluation scores a sample of live traffic after shipping. The two feed each other: production failures found online become regression cases in the offline set.

Offline Evaluation · Evaluation

Offline evaluation runs a fixed dataset of inputs through a candidate model or prompt, scores each output, and reports aggregate quality — the standard way to compare changes before shipping.

LLM Observability · Infrastructure

LLM observability is the practice of capturing, analyzing, and acting on every LLM call in a production system — inputs, outputs, latencies, costs, errors, and quality scores — so you can debug regressions and improve quality over time.

Arize Phoenix · Infrastructure

Arize Phoenix is an open-source LLM observability and evaluation tool. It ingests OpenTelemetry traces, renders them in a debug UI, and provides built-in LLM-as-judge evaluators for hallucination, relevance, and toxicity.

LLM-as-Judge · Evaluation

LLM-as-judge uses a strong LLM to score or compare outputs from other LLMs. It is how most production teams evaluate quality at scale when human review is too slow.

Drift Detection · Infrastructure

Drift detection watches for changes in the statistical distribution of inputs, outputs, or quality scores over time — so you can catch a model degrading in production before users complain.

User Feedback Loop · Evaluation

A user feedback loop feeds explicit signals (thumbs up/down) and implicit ones (edits, regenerates, copy-to-clipboard) back into evaluation and fine-tuning, turning real usage into a continuous quality signal.
