Learning path · 22 min · intermediate

LLMOps & observability

Tracing every call, watching drift, feeding back into evals. Production AI discipline.

Production LLM apps fail in ways logs do not catch. LLMOps is the discipline of tracing every call as structured spans, watching for input/output/quality drift, running evaluators over live traffic, and feeding user signals back into the next prompt iteration. This path walks the toolkit.

LLM Observability (LLM ops)
Why this step: The umbrella discipline. Frames everything that follows.
LLM observability is the practice of capturing, analyzing, and acting on every LLM call in a production system — inputs, outputs, latencies, costs, errors, and quality scores — so you can debug regressions and improve quality over time.
Infrastructure · intermediate
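As a rough illustration of what capturing a call can look like, the sketch below wraps a model call and records the fields named in this entry. `fake_llm`, `record_call`, `CallRecord`, and the per-token price are hypothetical stand-ins for illustration, not any vendor's API.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Optional

@dataclass
class CallRecord:
    prompt: str
    output: Optional[str] = None
    latency_s: float = 0.0
    cost_usd: float = 0.0
    error: Optional[str] = None
    scores: dict = field(default_factory=dict)  # quality scores attached later

LOG: list = []

# Hypothetical price; real pricing depends on the provider and model.
PRICE_PER_TOKEN_USD = 0.000002

def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call."""
    if not prompt:
        raise ValueError("empty prompt")
    return prompt.upper()

def record_call(llm: Callable[[str], str], prompt: str) -> CallRecord:
    rec = CallRecord(prompt=prompt)
    start = time.perf_counter()
    try:
        rec.output = llm(prompt)
    except Exception as exc:  # record the failure instead of losing it
        rec.error = f"{type(exc).__name__}: {exc}"
    rec.latency_s = time.perf_counter() - start
    # Crude token estimate (whitespace split), used only for the cost field.
    tokens = len(prompt.split()) + len((rec.output or "").split())
    rec.cost_usd = tokens * PRICE_PER_TOKEN_USD
    LOG.append(rec)
    return rec

ok = record_call(fake_llm, "hello world")
bad = record_call(fake_llm, "")
```

Failed calls are logged alongside successful ones, so error rates can be computed from the same records as latency and cost.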
Tracing (LLM tracing)
Why this step: The base primitive — the full causal tree of an LLM request as structured data.
Tracing captures the full causal tree of an LLM request — the user input, retrieval calls, tool calls, intermediate prompts, and the final response — as a hierarchy of timed spans you can replay and inspect.
Infrastructure · intermediate
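To make the "hierarchy of timed spans" concrete, here is a minimal sketch that rebuilds a trace tree from flat span records and renders it with durations. The record layout and the span names are assumptions made for this example; real tracing backends store more fields.

```python
from collections import defaultdict

# Hypothetical flat span records: (span_id, parent_id, name, start_ms, end_ms)
SPANS = [
    ("s1", None, "handle_request", 0, 950),
    ("s2", "s1", "retrieve_docs", 10, 210),
    ("s3", "s1", "llm_call", 220, 900),
    ("s4", "s3", "tool_call:search", 400, 600),
]

def render_trace(spans):
    """Return the trace as indented lines, children ordered by start time."""
    children = defaultdict(list)
    for span in spans:
        children[span[1]].append(span)  # group spans under their parent id
    lines = []

    def walk(parent_id, depth):
        ordered = sorted(children[parent_id], key=lambda s: s[3])
        for span_id, _, name, start, end in ordered:
            lines.append(f"{'  ' * depth}{name} ({end - start} ms)")
            walk(span_id, depth + 1)

    walk(None, 0)  # the root span is the one with no parent
    return lines

tree = render_trace(SPANS)
print("\n".join(tree))
```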
Span
Why this step: The building block of a trace.
A span is a single unit of work within a trace (one LLM call, one tool call, or one retrieval) with a start time, an end time, attributes (model, tokens, cost), and a reference to its parent span that links it into the trace tree. The root span is the only one without a parent.
Infrastructure · intermediate
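One way to picture a span in code is as a small record opened and closed around a unit of work. The sketch below assumes a hypothetical `span` context manager and `Span` dataclass written for this example; production systems would normally use an existing tracing SDK instead.

```python
import time
from contextlib import contextmanager
from contextvars import ContextVar
from dataclasses import dataclass, field
from itertools import count
from typing import Optional

@dataclass
class Span:
    span_id: int
    name: str
    parent_id: Optional[int]
    start: float
    end: Optional[float] = None
    attributes: dict = field(default_factory=dict)

_ids = count(1)
_current = ContextVar("current_span", default=None)  # the span currently open
FINISHED: list = []

@contextmanager
def span(name: str, **attributes):
    parent = _current.get()
    s = Span(next(_ids), name, parent.span_id if parent else None,
             time.perf_counter(), attributes=dict(attributes))
    token = _current.set(s)  # nested spans will see this one as their parent
    try:
        yield s
    finally:
        s.end = time.perf_counter()
        _current.reset(token)
        FINISHED.append(s)

with span("request") as root:
    with span("llm_call", model="example-model") as call:
        call.attributes["tokens"] = 42  # attributes can be added while open
```

Keeping the open span in a `ContextVar` means nested calls pick up their parent without it being passed explicitly.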
Arize Phoenix (Phoenix)
Why this step: The flagship open-source tool. Tracing + eval in one.
Arize Phoenix is an open-source LLM observability and evaluation tool. It ingests OpenTelemetry traces, renders them in a debug UI, and provides built-in LLM-as-judge evaluators for hallucination, relevance, and toxicity.
Infrastructure · intermediate
LangSmith
Why this step: The LangChain-native alternative. Hosted SaaS, integrated prompt-versioning.
LangSmith is LangChain's commercial LLM observability and evaluation platform. It captures traces (LangChain-native and OTel), runs evaluations, manages prompt versions, and supports dataset curation.
Infrastructure · intermediate
Langfuse
Why this step: The other major open-source option. EU-friendly self-hosting.
Langfuse is an open-source LLM observability platform with tracing, prompt management, evaluation, and a self-host option. It is a popular default for teams who want LangSmith-equivalent tooling without the SaaS lock-in.
Infrastructure · intermediate
User Feedback Loop (user feedback)
Why this step: Where production thumbs-up/down/edits become eval signal.
A user feedback loop ingests explicit signals (thumbs up/down, edits) and implicit ones (regenerates, copy-to-clipboard) back into evaluation and fine-tuning, turning real usage into a continuous quality signal.
Evaluation · intermediate
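A minimal sketch of how such signals might become eval data: each event is mapped to a score, scores are averaged per trace, and low-scoring traces are flagged as candidates for the eval set. The signal names, the weights in `SIGNAL_SCORES`, and the threshold are all assumptions chosen for illustration.

```python
from collections import defaultdict

# Hypothetical weights; tune them to how much each signal tells you.
SIGNAL_SCORES = {
    "thumbs_up": 1.0,
    "copy": 0.5,
    "regenerate": -0.5,
    "edit": -0.5,
    "thumbs_down": -1.0,
}

def aggregate_feedback(events):
    """Average the signal scores per trace_id, skipping unknown signals."""
    per_trace = defaultdict(list)
    for trace_id, signal in events:
        if signal in SIGNAL_SCORES:
            per_trace[trace_id].append(SIGNAL_SCORES[signal])
    return {tid: sum(vals) / len(vals) for tid, vals in per_trace.items()}

def select_for_eval(scores, threshold=0.0):
    """Traces scoring below the threshold are candidates for the eval set."""
    return sorted(tid for tid, score in scores.items() if score < threshold)

events = [
    ("t1", "thumbs_up"), ("t1", "copy"),
    ("t2", "regenerate"), ("t2", "thumbs_down"),
    ("t3", "scroll"),  # unknown signal, ignored
]
scores = aggregate_feedback(events)
failing = select_for_eval(scores)
```

Keying feedback on the trace ID is what lets a thumbs-down be joined back to the exact prompt, retrieval results, and response that produced it.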
Drift Detection (model drift)
Why this step: Catching input or quality drift before users complain.
Drift detection watches for changes in the statistical distribution of inputs, outputs, or quality scores over time — so you can catch a model degrading in production before users complain.
Infrastructure · advanced
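One common way to quantify a shift between two distributions is the Population Stability Index (PSI); it is only one option among several, and this entry does not prescribe it. The sketch below assumes a numeric metric such as response length in tokens, with hypothetical bin edges and sample data.

```python
import math

def psi(baseline, current, edges, eps=1e-6):
    """Population Stability Index over bins defined by ascending `edges`."""
    def proportions(values):
        counts = [0] * (len(edges) + 1)
        for v in values:
            idx = sum(v >= e for e in edges)  # number of edges at or below v
            counts[idx] += 1
        total = len(values)
        return [max(c / total, eps) for c in counts]  # eps avoids log(0)

    p, q = proportions(baseline), proportions(current)
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))

# Hypothetical response lengths (tokens) for two time windows.
edges = [100, 200]
baseline = [50, 60, 70, 150, 160, 170, 250, 260]
shifted = [250, 260, 270, 280, 150, 160, 290, 300]

no_drift = psi(baseline, baseline, edges)
drift = psi(baseline, shifted, edges)
```

A frequently quoted rule of thumb treats a PSI above roughly 0.25 as a significant shift, but the right alert threshold depends on the metric and the traffic.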
Embedding Drift
Why this step: The semantic-shift detection technique. Catches what statistics miss.
Embedding drift is a specific kind of drift detection — comparing the distribution of input or response embeddings between two time windows to surface semantic shifts that simple statistics would miss.
Infrastructure · advanced
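A simple way to compare two windows of embeddings is the cosine distance between their centroids. This is a deliberately crude sketch: it misses changes in spread or cluster structure, and the 2-D toy vectors below are made up for illustration (real embeddings have hundreds of dimensions or more).

```python
import math

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

def embedding_drift(window_a, window_b):
    """Cosine distance between the mean embeddings of two time windows."""
    return cosine_distance(centroid(window_a), centroid(window_b))

# Toy embeddings for two time windows.
last_week = [[1.0, 0.0], [0.9, 0.1]]
this_week = [[0.0, 1.0], [0.1, 0.9]]

stable = embedding_drift(last_week, last_week)
moved = embedding_drift(last_week, this_week)
```

Two windows with similar length and token statistics can still land far apart here, which is the kind of semantic shift that simple statistics would not show.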
Online Evaluation (online eval)
Why this step: The capstone — eval as a continuous production signal.
Online evaluation runs scoring functions over live production traffic — usually a sample of recent traces — to monitor quality continuously instead of relying solely on a fixed offline dataset.
Evaluation · intermediate
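The sampling-then-scoring idea can be sketched as follows. Sampling is keyed on a hash of the trace ID so the same trace is always in or out of the sample; `non_empty_answer` is a hypothetical placeholder evaluator, and the trace dictionaries are invented for the example.

```python
import hashlib

def in_sample(trace_id: str, rate: float) -> bool:
    """Deterministically keep about `rate` of traces, keyed on trace_id."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # value in [0, 1)
    return bucket < rate

def non_empty_answer(trace: dict) -> float:
    """Hypothetical evaluator: 1.0 if the response has any text, else 0.0."""
    return 1.0 if trace["output"].strip() else 0.0

def online_eval(traces, evaluator, rate=0.1):
    """Score the sampled traces and report the sample size and mean score."""
    scored = [evaluator(t) for t in traces if in_sample(t["trace_id"], rate)]
    mean = sum(scored) / len(scored) if scored else None
    return {"sampled": len(scored), "mean_score": mean}

# Invented traffic: every fourth response is empty.
traces = [
    {"trace_id": f"t{i}", "output": "" if i % 4 == 0 else "answer"}
    for i in range(200)
]
report = online_eval(traces, non_empty_answer, rate=1.0)
```

In practice the evaluator would more likely be an LLM-as-judge or a task-specific check, and the rate would be set well below 1.0 to control cost.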
