ModelTerms

Infrastructure · intermediate

LLM Observability (LLM ops, LLMOps)

LLM observability is the practice of capturing, analyzing, and acting on every LLM call in a production system — inputs, outputs, latencies, costs, errors, and quality scores — so you can debug regressions and improve quality over time.

Explanation

When a deterministic web service breaks, you read a stack trace. When an LLM application breaks, there is often no single fault to point at: the input was vague, the prompt was a paragraph long, retrieval pulled a stale doc, and the model hallucinated a function. None of that shows up in normal logs. LLM observability captures the whole call as a structured trace so you can replay, diff, and grade it.

The core moves: instrument every LLM call as a span; persist the input, output, model, latency, token counts, and cost; tag each span with its session, user, and feature flag; attach eval scores (LLM-as-judge, regex pass/fail, faithfulness, etc.); and let humans annotate examples for fine-tuning or regression tests.
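A minimal sketch of those moves in plain Python may make them concrete. Everything named here is hypothetical and not taken from any particular tool: the `LLMSpan` record, the `traced_call` wrapper, the `PRICE_PER_1K` table, the whitespace token count, and the in-memory `SPANS` list that stands in for a real trace store.

```python
import time
import uuid
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical per-1K-token prices in USD; real prices depend on the provider.
PRICE_PER_1K = {"example-model": {"input": 0.001, "output": 0.002}}


@dataclass
class LLMSpan:
    model: str
    prompt: str
    output: str
    latency_s: float
    input_tokens: int
    output_tokens: int
    cost_usd: float
    tags: dict = field(default_factory=dict)    # session, user, feature flag
    scores: dict = field(default_factory=dict)  # eval results attached later
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex)


SPANS: list[LLMSpan] = []  # stand-in for a real trace store


def count_tokens(text: str) -> int:
    # Crude whitespace count (an assumption); a real system would record
    # the token usage reported by the model provider.
    return len(text.split())


def traced_call(llm: Callable[[str], str], model: str, prompt: str, **tags) -> LLMSpan:
    """Run one LLM call and persist it as a span."""
    start = time.perf_counter()
    output = llm(prompt)
    latency = time.perf_counter() - start

    n_in, n_out = count_tokens(prompt), count_tokens(output)
    price = PRICE_PER_1K[model]
    cost = n_in / 1000 * price["input"] + n_out / 1000 * price["output"]

    span = LLMSpan(model, prompt, output, latency, n_in, n_out, cost, tags=tags)
    SPANS.append(span)
    return span


def fake_llm(prompt: str) -> str:
    # Placeholder for a real model client.
    return "Refunds are processed within 5 days."


span = traced_call(
    fake_llm, "example-model", "How long do refunds take?",
    session="s-1", feature="support-bot",
)
# Attach a regex-style pass/fail score after the fact.
span.scores["contains_days"] = 1.0 if "days" in span.output else 0.0
```

Keeping scores in a separate field from the call data reflects that evals usually run after the call returns, sometimes much later and more than once.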

Open-source tools (Arize Phoenix, Langfuse, OpenLLMetry) and commercial platforms (Arize AX, LangSmith, Weights & Biases Weave, Datadog LLM Observability, Helicone, Braintrust) all do this with slightly different ergonomics.

The practical payoff: you can answer "why was this answer bad?" and "did our last prompt change make things worse?" in minutes instead of days.

Examples

  • A support bot logs every (user message, retrieved docs, prompt, response, faithfulness score) tuple to Arize Phoenix; engineers replay bad sessions there.
  • A coding agent emits OpenTelemetry traces; LangSmith renders the multi-step chain so you can find which tool call failed.

When to use LLM observability

From day one of any production LLM application. Bolting it on later costs far more than wiring it up at the start.

Frequently asked

How is LLM Observability related to Tracing?

Tracing is the data-collection layer that LLM observability builds on. A trace records the full causal tree of an LLM request (user input, retrieval calls, tool calls, intermediate prompts, and the final response) as a hierarchy of timed spans. Observability adds cost tracking, eval scores, and human annotation on top of those traces.

Is LLM Observability considered intermediate?

LLM Observability is generally considered intermediate-level material in the AI and LLM space.

Tracing · Infrastructure

Tracing captures the full causal tree of an LLM request — the user input, retrieval calls, tool calls, intermediate prompts, and the final response — as a hierarchy of timed spans you can replay and inspect.

Span · Infrastructure

A span is a single unit of work within a trace — one LLM call, one tool call, one retrieval — with a start time, end time, attributes (model, tokens, cost), and a parent span that links it into the trace tree.
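As a rough illustration of how parent links turn a flat list of spans into a trace tree, here is a small sketch. The `Span` dataclass, the `render_tree` helper, and the sample values are all hypothetical; real tracing libraries define their own span types.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    span_id: str
    name: str
    start: float                      # seconds since the trace began
    end: float
    parent_id: Optional[str] = None   # None marks the root span
    attributes: dict = field(default_factory=dict)

    @property
    def duration(self) -> float:
        return self.end - self.start


def render_tree(spans: list[Span]) -> list[str]:
    """Return one indented line per span, children nested under parents."""
    children: dict[Optional[str], list[Span]] = {}
    for s in spans:
        children.setdefault(s.parent_id, []).append(s)

    lines: list[str] = []

    def walk(parent_id: Optional[str], depth: int) -> None:
        # Visit siblings in start-time order so the tree reads chronologically.
        for s in sorted(children.get(parent_id, []), key=lambda s: s.start):
            lines.append(f"{'  ' * depth}{s.name} ({s.duration:.2f}s)")
            walk(s.span_id, depth + 1)

    walk(None, 0)
    return lines


trace = [
    Span("a", "handle_request", 0.0, 1.9),
    Span("b", "retrieval", 0.1, 0.4, parent_id="a"),
    Span("c", "llm_call", 0.5, 1.8, parent_id="a",
         attributes={"model": "example-model", "tokens": 412}),
]
tree_lines = render_tree(trace)
print("\n".join(tree_lines))
```

Storing only `parent_id` on each span, rather than a nested structure, lets spans be emitted independently and assembled into a tree at read time.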

Arize Phoenix · Infrastructure

Arize Phoenix is an open-source LLM observability and evaluation tool. It ingests OpenTelemetry traces, renders them in a debug UI, and provides built-in LLM-as-judge evaluators for hallucination, relevance, and toxicity.

LangSmith · Infrastructure

LangSmith is LangChain's commercial LLM observability and evaluation platform. It captures traces (LangChain-native and OTel), runs evaluations, manages prompt versions, and supports dataset curation.

Langfuse · Infrastructure

Langfuse is an open-source LLM observability platform with tracing, prompt management, evaluation, and a self-host option. It is a popular default for teams that want LangSmith-equivalent tooling without SaaS lock-in.

Drift Detection · Infrastructure

Drift detection watches for changes in the statistical distribution of inputs, outputs, or quality scores over time — so you can catch a model degrading in production before users complain.
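One simple way to check for drift in a quality score is to ask how far the recent mean sits from a baseline, measured in standard errors. The sketch below assumes that approach; the function names, the threshold of 3.0, and the sample scores are illustrative choices, and production systems often use other tests (for example, distribution-distance measures) instead.

```python
from math import sqrt
from statistics import mean, stdev


def mean_shift_z(baseline: list[float], recent: list[float]) -> float:
    """z-score of the recent mean relative to the baseline distribution.

    Assumes the baseline has non-zero spread; stdev of a constant
    baseline would make the standard error zero.
    """
    standard_error = stdev(baseline) / sqrt(len(recent))
    return (mean(recent) - mean(baseline)) / standard_error


def has_drifted(baseline: list[float], recent: list[float],
                threshold: float = 3.0) -> bool:
    # Flag drift in either direction: a sudden jump is as suspicious as a drop.
    return abs(mean_shift_z(baseline, recent)) > threshold


# Hypothetical faithfulness scores: a reference period and two recent windows.
baseline_scores = [0.90, 0.88, 0.92, 0.91, 0.89, 0.90, 0.93, 0.87]
stable_window = [0.91, 0.89, 0.90, 0.92]
degraded_window = [0.78, 0.80, 0.75, 0.79]

print(has_drifted(baseline_scores, stable_window))
print(has_drifted(baseline_scores, degraded_window))
```

The same check can be pointed at input statistics such as prompt length, not only at quality scores.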

Online Evaluation · Evaluation

Online evaluation runs scoring functions over live production traffic — usually a sample of recent traces — to monitor quality continuously instead of relying solely on a fixed offline dataset.
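A sketch of that loop, assuming traces are plain dictionaries with a `response` field: sample some recent traces, run each scorer, and report the mean per scorer. The scorers, the `evaluate_sample` helper, and the trace shape are hypothetical.

```python
import random
import re
from statistics import mean


def cites_source(trace: dict) -> float:
    # Regex pass/fail: does the response contain a bracketed citation like [1]?
    return 1.0 if re.search(r"\[\d+\]", trace["response"]) else 0.0


def under_length_limit(trace: dict, limit: int = 50) -> float:
    # Pass if the response has at most `limit` whitespace-separated words.
    return 1.0 if len(trace["response"].split()) <= limit else 0.0


SCORERS = {"cites_source": cites_source, "under_length_limit": under_length_limit}


def evaluate_sample(traces: list[dict], sample_rate: float,
                    rng: random.Random) -> dict:
    """Score a random sample of traces; return the mean score per scorer."""
    sample = [t for t in traces if rng.random() < sample_rate]
    if not sample:
        return {}
    return {name: mean(fn(t) for t in sample) for name, fn in SCORERS.items()}


recent_traces = [
    {"response": "Refunds take 5 days [1]."},
    {"response": "See the policy page [2] for details."},
    {"response": "I think it is probably fine."},
    {"response": "Shipping is free over $50 [1]."},
]

# sample_rate=1.0 scores every trace in this tiny demo; production traffic
# would normally use a much smaller rate to keep evaluation cost down.
report = evaluate_sample(recent_traces, sample_rate=1.0, rng=random.Random(0))
print(report)
```

Passing the random generator in explicitly keeps the sampling reproducible, which helps when comparing reports across runs.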

Hallucination · Evaluation

A hallucination is a confidently stated, plausible-sounding LLM output that is factually wrong. It is the failure mode that most often surprises non-expert users.
