ModelTerms

Infrastructure · intermediate

Arize Phoenix (Phoenix)

Arize Phoenix is an open-source LLM observability and evaluation tool. It ingests OpenTelemetry traces, renders them in a debug UI, and provides built-in LLM-as-judge evaluators for hallucination, relevance, and toxicity.

Explanation

Phoenix is the OSS arm of Arize AI's commercial observability platform. It runs locally or hosted, accepts traces via OpenInference (an OpenTelemetry extension for GenAI), and surfaces them in a notebook-friendly UI.

Beyond viewing, Phoenix ships pre-built LLM-as-judge templates: hallucination detection (does the answer match the retrieved context?), Q&A relevance (does it answer the question?), RAG relevance (are the retrieved chunks on-topic?), toxicity, and summarization quality. You can run these evals on captured traces, get a pass/fail label per span, and slice the results by feature flag, prompt version, or user segment.
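To make the "slice by prompt version" step concrete, here is a minimal sketch in plain Python. It does not use the Phoenix API: the `results` rows, their field names, and the `failure_rate_by` helper are all hypothetical stand-ins for whatever per-span eval output you export.

```python
from collections import defaultdict

# Hypothetical eval output: one row per span, with a pass/fail label and
# the metadata to slice by. Field names are illustrative, not Phoenix's schema.
results = [
    {"span_id": "a1", "prompt_version": "v1", "label": "pass"},
    {"span_id": "a2", "prompt_version": "v1", "label": "fail"},
    {"span_id": "b1", "prompt_version": "v2", "label": "pass"},
    {"span_id": "b2", "prompt_version": "v2", "label": "pass"},
]


def failure_rate_by(rows, key):
    """Return {slice_value: fraction of rows labelled 'fail'}."""
    totals = defaultdict(int)
    fails = defaultdict(int)
    for row in rows:
        totals[row[key]] += 1
        if row["label"] == "fail":
            fails[row[key]] += 1
    return {value: fails[value] / totals[value] for value in totals}


rates = failure_rate_by(results, "prompt_version")
print(rates)  # {'v1': 0.5, 'v2': 0.0}
```

The same helper works for any other metadata key on the row, such as a feature flag or user segment.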

The pitch: instrument once with OpenInference and you get a debug UI, an eval harness, and a dataset builder from the same traces, with no extra instrumentation. It suits teams that want LLMOps tooling without committing to a vendor.

Examples

  • A team instruments their RAG pipeline with the Phoenix tracer, then runs the built-in faithfulness eval on yesterday's traffic to find sessions where the model contradicted the docs.
  • Phoenix notebook session: load traces from production, sample 500, run hallucination eval, save the failures as a regression test set.
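The notebook workflow in the second example (sample, evaluate, keep the failures) can be sketched as follows. This is an illustration under stated assumptions, not Phoenix's API: `build_regression_set`, `save_jsonl`, and `toy_judge` are hypothetical names, the trace dictionaries are made up, and the toy judge is a substring check standing in for a real LLM-as-judge call.

```python
import json
import random


def build_regression_set(traces, judge, sample_size=500, seed=0):
    """Sample traces, run the judge on each, and return the failures."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    sample = rng.sample(traces, min(sample_size, len(traces)))
    return [t for t in sample if judge(t) == "hallucinated"]


def save_jsonl(rows, path):
    """Write one JSON object per line, a common format for test sets."""
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")


def toy_judge(trace):
    # Stand-in for an LLM judge: "factual" only if the answer text
    # appears verbatim in the retrieved context.
    return "factual" if trace["answer"] in trace["context"] else "hallucinated"


traces = [
    {"id": 1, "context": "Refunds take 5 days.", "answer": "5 days"},
    {"id": 2, "context": "Refunds take 5 days.", "answer": "2 days"},
]
failures = build_regression_set(traces, toy_judge)
print([t["id"] for t in failures])  # [2]
```

Calling `save_jsonl(failures, "regression_set.jsonl")` would then persist the failing traces so they can be replayed after each prompt or model change.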

When to use Arize Phoenix

When you want open-source LLMOps tooling that works in notebooks, the IDE, and production with the same instrumentation.

Frequently asked

How is Arize Phoenix related to LLM Observability?

Arize Phoenix is one tool for implementing LLM observability; both terms are filed under infrastructure. LLM observability is the practice of capturing, analyzing, and acting on every LLM call in a production system (inputs, outputs, latencies, costs, errors, and quality scores) so you can debug regressions and improve quality over time.

Is Arize Phoenix considered intermediate?

Arize Phoenix is generally considered intermediate-level material in the AI and LLM space.

LLM Observability · Infrastructure

LLM observability is the practice of capturing, analyzing, and acting on every LLM call in a production system — inputs, outputs, latencies, costs, errors, and quality scores — so you can debug regressions and improve quality over time.

Tracing · Infrastructure

Tracing captures the full causal tree of an LLM request — the user input, retrieval calls, tool calls, intermediate prompts, and the final response — as a hierarchy of timed spans you can replay and inspect.

Span · Infrastructure

A span is a single unit of work within a trace — one LLM call, one tool call, one retrieval — with a start time, end time, attributes (model, tokens, cost), and a parent span that links it into the trace tree.
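As a rough illustration of how spans link into a trace tree, the sketch below models a span as a small dataclass and rebuilds the hierarchy from parent IDs. The `Span` class, `build_tree`, `render`, and the sample values are hypothetical and simplified; real OpenTelemetry spans carry more fields.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    span_id: str
    name: str
    start: float  # seconds since the trace began
    end: float
    parent_id: Optional[str] = None
    attributes: dict = field(default_factory=dict)
    children: list = field(default_factory=list)

    @property
    def duration(self):
        return self.end - self.start


def build_tree(spans):
    """Attach each span to its parent and return the root spans."""
    by_id = {s.span_id: s for s in spans}
    roots = []
    for s in spans:
        parent = by_id.get(s.parent_id)
        if parent is None:
            roots.append(s)  # no known parent: treat as a root
        else:
            parent.children.append(s)
    return roots


def render(span, depth=0):
    """Return indented lines, one per span, children ordered by start time."""
    lines = [f"{'  ' * depth}{span.name} ({span.duration:.2f}s)"]
    for child in sorted(span.children, key=lambda c: c.start):
        lines.extend(render(child, depth + 1))
    return lines


spans = [
    Span("1", "rag_request", 0.0, 1.5),
    Span("2", "retrieval", 0.1, 0.4, parent_id="1"),
    Span("3", "llm_call", 0.5, 1.5, parent_id="1",
         attributes={"model": "example-model", "tokens": 420}),
]
roots = build_tree(spans)
print("\n".join(render(roots[0])))
```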

LangSmith · Infrastructure

LangSmith is LangChain's commercial LLM observability and evaluation platform. It captures traces (LangChain-native and OTel), runs evaluations, manages prompt versions, and supports dataset curation.

Langfuse · Infrastructure

Langfuse is an open-source LLM observability platform with tracing, prompt management, evaluation, and a self-host option. It is a popular default for teams that want LangSmith-equivalent tooling without SaaS lock-in.

LLM-as-Judge · Evaluation

LLM-as-judge uses a strong LLM to score or compare outputs from other LLMs. It is how most production teams evaluate quality at scale when human review is too slow.

Online Evaluation · Evaluation

Online evaluation runs scoring functions over live production traffic — usually a sample of recent traces — to monitor quality continuously instead of relying solely on a fixed offline dataset.
