ModelTerms

Infrastructure · intermediate

LangSmith

LangSmith is LangChain's commercial LLM observability and evaluation platform. It captures traces (LangChain-native and OTel), runs evaluations, manages prompt versions, and supports dataset curation.

Explanation

LangSmith was one of the first widely adopted LLM observability products. It was originally tuned for LangChain pipelines and is now framework-agnostic via OpenTelemetry. Beyond tracing, it includes prompt versioning, an evaluator library (custom evaluators, LLM-as-judge, pairwise comparisons), and a hub for sharing prompt templates.

Like Phoenix, it leans into the "every saved trace is a test case" pattern: you sample production traces, mark good and bad ones, and use them both as a regression dataset and as fine-tuning data.
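The trace-to-dataset pattern can be sketched in plain Python. This is a minimal illustration under stated assumptions, not LangSmith SDK code: the `Trace` record, its fields, and the `split_labeled_traces` helper are hypothetical names, and the sample traces are made up.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Trace:
    """Hypothetical record of one saved production trace."""
    trace_id: str
    prompt: str
    response: str
    label: Optional[str] = None  # "good", "bad", or None if unreviewed


def split_labeled_traces(traces):
    """Turn reviewed traces into a regression set and fine-tuning pairs.

    Every labeled trace becomes a regression case; only traces marked
    "good" become fine-tuning examples.
    """
    regression_set = []
    finetune_pairs = []
    for t in traces:
        if t.label is None:
            continue  # unreviewed traces are skipped
        regression_set.append(
            {"input": t.prompt, "reference": t.response, "label": t.label}
        )
        if t.label == "good":
            finetune_pairs.append({"prompt": t.prompt, "completion": t.response})
    return regression_set, finetune_pairs


# Made-up traces for illustration only.
traces = [
    Trace("t1", "Refund policy?", "Refunds within 30 days.", "good"),
    Trace("t2", "Refund policy?", "We never refund.", "bad"),
    Trace("t3", "Shipping time?", "3-5 business days."),
]
regression_set, finetune_pairs = split_labeled_traces(traces)
```

Keeping the "bad" traces in the regression set is a deliberate choice in this sketch: they document failures that a later prompt or model version should no longer reproduce.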

Trade-offs vs Phoenix: LangSmith is a hosted SaaS product (a free tier exists, with paid plans for higher volume) and is better integrated with LangChain workflows; Phoenix is OSS-first, more notebook-friendly, and self-hostable.

Examples

  • A LangChain app with one line of setup: every chain run shows up in the LangSmith trace UI with input, output, intermediate steps, and per-step costs.
  • LangSmith dataset eval: pick 200 production examples, run two prompt versions side-by-side, see win rate per slice.
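
The "win rate per slice" in the second example can be computed from pairwise comparison results. The sketch below is plain Python, not the LangSmith API; the result format (`slice` and `winner` keys), the `win_rate_by_slice` helper, and the sample rows are assumptions made for illustration.

```python
from collections import defaultdict


def win_rate_by_slice(results):
    """Return, per slice, the fraction of non-tied comparisons won by version B.

    `results` is a list of dicts with keys "slice" and "winner",
    where winner is "A", "B", or "tie". Slices with only ties are omitted.
    """
    wins = defaultdict(int)
    decided = defaultdict(int)
    for r in results:
        if r["winner"] == "tie":
            continue  # ties do not count toward either version
        decided[r["slice"]] += 1
        if r["winner"] == "B":
            wins[r["slice"]] += 1
    return {s: wins[s] / decided[s] for s in decided}


# Made-up comparison outcomes for two prompt versions, A and B.
results = [
    {"slice": "billing", "winner": "B"},
    {"slice": "billing", "winner": "A"},
    {"slice": "billing", "winner": "B"},
    {"slice": "shipping", "winner": "A"},
    {"slice": "shipping", "winner": "tie"},
]
rates = win_rate_by_slice(results)
```

Breaking the rate down by slice matters because an overall win for one prompt version can hide a regression on a smaller category of inputs.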

Frequently asked

How is LangSmith related to LLM Observability?

LangSmith is a platform for practicing LLM observability: capturing, analyzing, and acting on every LLM call in a production system (inputs, outputs, latencies, costs, errors, and quality scores) so you can debug regressions and improve quality over time.

Is LangSmith considered intermediate?

LangSmith is generally considered intermediate-level material in the AI and LLM space.

LLM Observability · Infrastructure

LLM observability is the practice of capturing, analyzing, and acting on every LLM call in a production system — inputs, outputs, latencies, costs, errors, and quality scores — so you can debug regressions and improve quality over time.

Tracing · Infrastructure

Tracing captures the full causal tree of an LLM request — the user input, retrieval calls, tool calls, intermediate prompts, and the final response — as a hierarchy of timed spans you can replay and inspect.
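
A hierarchy of timed spans can be sketched with a small context-manager tracer. This is a toy illustration using only the standard library, not the LangSmith or OpenTelemetry API; `Span`, `Tracer`, and `render` are hypothetical names, and the sketch assumes a single root span per tracer.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Span:
    name: str
    start: float = 0.0
    end: float = 0.0
    children: List["Span"] = field(default_factory=list)

    @property
    def duration(self) -> float:
        return self.end - self.start


class Tracer:
    """Toy tracer that nests spans by tracking the currently open span."""

    def __init__(self):
        self.root: Optional[Span] = None
        self._stack: List[Span] = []

    @contextmanager
    def span(self, name):
        s = Span(name=name, start=time.perf_counter())
        if self._stack:
            self._stack[-1].children.append(s)  # child of the open span
        else:
            self.root = s  # first span opened becomes the root
        self._stack.append(s)
        try:
            yield s
        finally:
            s.end = time.perf_counter()
            self._stack.pop()


def render(span, depth=0):
    """Return the span tree as indented lines, one per span."""
    lines = ["  " * depth + span.name]
    for child in span.children:
        lines.extend(render(child, depth + 1))
    return lines


tracer = Tracer()
with tracer.span("request"):
    with tracer.span("retrieval"):
        pass
    with tracer.span("tool_call"):
        pass
    with tracer.span("final_response"):
        pass

tree = render(tracer.root)
```

Each span records its own start and end time, so the tree can be inspected step by step to see which retrieval or tool call dominated the request's latency.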

Arize Phoenix · Infrastructure

Arize Phoenix is an open-source LLM observability and evaluation tool. It ingests OpenTelemetry traces, renders them in a debug UI, and provides built-in LLM-as-judge evaluators for hallucination, relevance, and toxicity.

Langfuse · Infrastructure

Langfuse is an open-source LLM observability platform with tracing, prompt management, evaluation, and a self-host option. It is a popular default for teams that want LangSmith-equivalent tooling without the SaaS lock-in.

Online Evaluation · Evaluation

Online evaluation runs scoring functions over live production traffic — usually a sample of recent traces — to monitor quality continuously instead of relying solely on a fixed offline dataset.
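
As a rough sketch of the idea, the snippet below scores a random sample of recent traces with a scoring function. It is plain Python under stated assumptions: the trace format, the `sample_and_score` helper, and the `non_empty_response` scorer are hypothetical, and a real scorer would usually be an LLM-as-judge or a task-specific check.

```python
import random


def sample_and_score(traces, scorer, sample_rate, seed=None):
    """Score a random sample of traces and return (mean_score, n_scored).

    Each trace is kept with probability `sample_rate`. Returns (None, 0)
    when no trace was sampled.
    """
    rng = random.Random(seed)
    sampled = [t for t in traces if rng.random() < sample_rate]
    if not sampled:
        return None, 0
    scores = [scorer(t) for t in sampled]
    return sum(scores) / len(scores), len(scores)


def non_empty_response(trace):
    """Hypothetical scorer: 1.0 if the response contains text, else 0.0."""
    return 1.0 if trace["response"].strip() else 0.0


# Made-up recent traces; sample_rate=1.0 scores all of them.
recent = [
    {"response": "ok"},
    {"response": ""},
    {"response": "fine"},
    {"response": " "},
]
mean_score, n_scored = sample_and_score(recent, non_empty_response, sample_rate=1.0)
```

In practice the sample rate would be well below 1.0 so that scoring cost stays bounded as traffic grows.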

Offline Evaluation · Evaluation

Offline evaluation runs a fixed dataset of inputs through a candidate model or prompt, scores each output, and reports aggregate quality — the standard way to compare changes before shipping.
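
The loop described above can be sketched in a few lines of plain Python. The names `run_offline_eval`, `exact_match`, and `toy_candidate` are hypothetical, the dataset is made up, and the candidate is a lookup table standing in for a real model or prompt.

```python
def run_offline_eval(dataset, candidate, score):
    """Run every dataset input through `candidate` and aggregate the scores."""
    rows = []
    for example in dataset:
        output = candidate(example["input"])
        rows.append(
            {
                "input": example["input"],
                "output": output,
                "score": score(output, example["expected"]),
            }
        )
    mean = sum(r["score"] for r in rows) / len(rows) if rows else 0.0
    return {"mean_score": mean, "rows": rows}


def exact_match(output, expected):
    """Score 1.0 on a case-insensitive exact match, else 0.0."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


def toy_candidate(text):
    """Stand-in for a model call: a fixed lookup table with one wrong answer."""
    answers = {"2+2": "4", "capital of France": "Lyon"}
    return answers.get(text, "")


# Made-up fixed dataset for illustration only.
dataset = [
    {"input": "2+2", "expected": "4"},
    {"input": "capital of France", "expected": "Paris"},
]
report = run_offline_eval(dataset, toy_candidate, exact_match)
```

Because the dataset is fixed, running the same loop with a second candidate gives a directly comparable `mean_score`, which is what makes this useful before shipping a change.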
