Category
Infrastructure
The hardware and software that make large models practical.
Arize Phoenix is an open-source LLM observability and evaluation tool. It ingests OpenTelemetry traces, renders them in a debug UI, and provides built-in LLM-as-judge evaluators for hallucination, relevance, and toxicity.
intermediate
BFloat16 is a 16-bit floating-point format with FP32's 8-bit exponent range but only 7 explicit mantissa bits (8 bits of precision counting the implicit leading bit). It is the default precision for LLM training and most inference.
intermediate
Drift detection watches for changes in the statistical distribution of inputs, outputs, or quality scores over time, so you can catch a model degrading in production before users complain.
advanced
Embedding drift is a specific kind of drift detection: it compares the distribution of input or response embeddings between two time windows to surface semantic shifts that simple statistics would miss.
advanced
GPUs are the parallel processors that train and run nearly every modern AI model. Their throughput on matrix multiplication is what makes deep learning practical.
beginner
Langfuse is an open-source LLM observability platform with tracing, prompt management, evaluation, and a self-host option. It is a popular default for teams who want LangSmith-equivalent tooling without the SaaS lock-in.
intermediate
LangSmith is LangChain's commercial LLM observability and evaluation platform. It captures traces (LangChain-native and OTel), runs evaluations, manages prompt versions, and supports dataset curation.
intermediate
An LLM gateway is a proxy layer that sits between application code and one or more LLM providers, handling auth, rate-limit retries, cost tracking, observability, prompt caching, model routing, and PII redaction.
intermediate
LLM observability is the practice of capturing, analyzing, and acting on every LLM call in a production system (inputs, outputs, latencies, costs, errors, and quality scores) so you can debug regressions and improve quality over time.
intermediate
Mixed-precision training does the bulk of forward and backward computation in 16-bit floats (BF16 or FP16) while keeping master weights and certain accumulations in 32-bit. The result is faster training and lower memory use, typically with accuracy comparable to full FP32 training.
advanced
A model router picks the cheapest model that's likely to handle a given request well, based on a small classifier, embedding similarity, or rule-based filters, so you don't pay frontier prices for trivial queries.
intermediate
Pipeline parallelism splits the model by layer across GPUs: GPU 1 holds layers 0-15, GPU 2 holds layers 16-31, and so on. Forward passes flow through the pipeline like an assembly line.
advanced
Quantization reduces model weights from 16- or 32-bit floats to lower-precision types (INT8, INT4) so the model needs less memory and runs faster, usually with minor quality loss.
intermediate
A span is a single unit of work within a trace (one LLM call, one tool call, or one retrieval) with a start time, end time, attributes (model, tokens, cost), and a parent span that links it into the trace tree.
intermediate
Tensor parallelism shards individual layers across multiple GPUs, splitting each matrix multiplication so different GPUs compute different output dimensions in parallel.
advanced
TPUs are Google's custom AI accelerators, designed specifically for the matrix and reduction operations of neural networks. Used to train Gemini and large parts of Google's AI stack.
intermediate
Tracing captures the full causal tree of an LLM request (the user input, retrieval calls, tool calls, intermediate prompts, and the final response) as a hierarchy of timed spans you can replay and inspect.
intermediate
vLLM is an open-source high-throughput LLM serving engine. Its PagedAttention KV cache manager is the reason it dramatically outperforms naive serving setups.
advanced
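To make the BFloat16 entry above concrete, here is a minimal sketch that derives a BF16 bit pattern from an FP32 value by keeping its top 16 bits with round-to-nearest-even. The function names are illustrative, and NaN handling is deliberately left out.

```python
import struct


def float_to_bf16_bits(x: float) -> int:
    """Return the 16-bit BF16 pattern for x (round to nearest even, no NaN handling)."""
    # Reinterpret x as its 32-bit FP32 pattern.
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    # BF16 keeps FP32's sign and 8 exponent bits plus the top 7 mantissa bits.
    # Adding this bias before truncating rounds the dropped 16 bits to nearest even.
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    return ((bits + rounding_bias) >> 16) & 0xFFFF


def bf16_bits_to_float(b: int) -> float:
    """Expand a BF16 pattern back to a Python float by zero-filling the low 16 bits."""
    (x,) = struct.unpack("<f", struct.pack("<I", (b & 0xFFFF) << 16))
    return x


def split_fields(b: int) -> tuple:
    """Split a BF16 pattern into (sign, 8-bit exponent, 7-bit mantissa)."""
    return (b >> 15) & 1, (b >> 7) & 0xFF, b & 0x7F
```

Round-tripping a value such as 3.14159 through these two functions shows how few significant digits survive, while the exponent field stays identical to FP32's.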
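The embedding drift entry above can be illustrated with one simple proxy: the cosine distance between the mean embeddings of two time windows. This is a sketch under that assumption, not the method used by any particular tool; production systems often compare full distributions instead of centroids.

```python
import math


def centroid(vectors):
    """Mean vector of a non-empty list of equal-length embeddings."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine_distance(a, b):
    """1 - cosine similarity; 0 means same direction, 1 means orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def embedding_drift(reference_window, current_window):
    """Drift score between two windows of embeddings (centroid comparison)."""
    return cosine_distance(centroid(reference_window), centroid(current_window))
```

A score near 0 suggests the two windows cover similar semantic territory. The alert threshold is application-specific and would need to be tuned on real traffic.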
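As a rough illustration of the model router entry above, the following sketch uses the rule-based option. The model names, prices, keyword list, and word threshold are all hypothetical placeholders, not recommendations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    model: str                 # hypothetical model identifier
    cost_per_1k_tokens: float  # hypothetical price, for illustration only


CHEAP = Route("small-model", 0.1)
FRONTIER = Route("frontier-model", 5.0)

# Hypothetical signals that a request probably needs the stronger model.
HARD_KEYWORDS = ("prove", "refactor", "step by step", "analyze")


def route(prompt: str, max_cheap_words: int = 50) -> Route:
    """Pick the cheapest route that the rules consider adequate."""
    text = prompt.lower()
    if len(text.split()) > max_cheap_words:
        return FRONTIER  # long prompts are assumed to be harder
    if any(keyword in text for keyword in HARD_KEYWORDS):
        return FRONTIER
    return CHEAP
```

A classifier or embedding-similarity router would replace the body of `route` while keeping the same interface, so callers would not need to change.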
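The layer split described in the pipeline parallelism entry above can be sketched as a contiguous partition of layer indices across stages. The helper names are illustrative, and the forward pass here runs stages one after another; a real pipeline overlaps micro-batches so that all stages stay busy.

```python
def partition_layers(num_layers: int, num_stages: int):
    """Assign contiguous layer ranges to stages, as evenly as possible."""
    base, extra = divmod(num_layers, num_stages)
    stages, start = [], 0
    for stage in range(num_stages):
        size = base + (1 if stage < extra else 0)  # early stages absorb the remainder
        stages.append(range(start, start + size))
        start += size
    return stages


def pipeline_forward(x, layers, stages):
    """Run x through every stage in order; each stage would live on its own GPU."""
    for stage in stages:
        for layer_index in stage:
            x = layers[layer_index](x)
        # In a real system, the activation x is sent to the next device here.
    return x
```

With 32 layers and 2 stages this reproduces the 0-15 and 16-31 split from the entry.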
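For the quantization entry above, a minimal sketch of symmetric per-tensor INT8 quantization shows where the memory saving and the small quality loss both come from. This is one of several common schemes and is chosen here only for simplicity; the function names are illustrative.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float weights from the integers and the scale."""
    return [q * scale for q in quantized]
```

Each weight is stored in 8 bits plus one shared scale, and the reconstruction error per weight is bounded by about half the scale.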
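The span and tracing entries above describe a tree of timed spans linked by parent references. A minimal sketch of that data structure follows; the field names are assumptions for illustration and do not match any specific tracing SDK.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    span_id: str
    name: str
    start: float                      # seconds since the request began
    end: float
    parent_id: Optional[str] = None   # None marks the root span
    attributes: dict = field(default_factory=dict)

    @property
    def duration(self) -> float:
        return self.end - self.start


def render_trace(spans):
    """Return the trace tree as indented lines, children ordered by start time."""
    children = {}
    for span in spans:
        children.setdefault(span.parent_id, []).append(span)

    lines = []

    def walk(parent_id, depth):
        for span in sorted(children.get(parent_id, []), key=lambda s: s.start):
            lines.append("  " * depth + f"{span.name} ({span.duration:.2f}s)")
            walk(span.span_id, depth + 1)

    walk(None, 0)
    return lines
```

Feeding it a root request span with a retrieval span and an LLM-call span as children yields a three-line indented outline, which is essentially what a trace viewer renders graphically.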
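The tensor parallelism entry above can be illustrated with a column-wise split of a weight matrix, using plain Python lists in place of GPU tensors. It assumes the number of output columns divides evenly by the number of shards, and all names are illustrative.

```python
def matmul(x, w):
    """Plain matrix product of x (rows x k) and w (k x cols)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]


def split_columns(w, parts):
    """Split w into `parts` shards along the output (column) dimension."""
    step = len(w[0]) // parts  # assumes the column count is divisible by parts
    return [[row[i * step:(i + 1) * step] for row in w] for i in range(parts)]


def tensor_parallel_matmul(x, w, parts):
    """Each shard computes its slice of the output; slices are then concatenated."""
    partials = [matmul(x, shard) for shard in split_columns(w, parts)]  # one per GPU
    # Concatenating along the output dimension plays the role of an all-gather.
    return [sum((partial[r] for partial in partials), []) for r in range(len(x))]
```

The sharded result matches the unsharded product exactly, which is the property that lets each GPU hold only a fraction of the layer's weights.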