Category
Inference
What controls model behavior at generation time.
Batch APIs (OpenAI, Anthropic) accept up to 50K LLM requests in a single submission, run them asynchronously over hours, and return results at ~50% of the synchronous price. The cheap option for bulk processing.
beginnerBeam search explores several candidate continuations in parallel, keeping the top-k partial sequences at each step. Common in translation; rare in modern LLM chat.
intermediateThe context window is the maximum number of tokens an LLM can consider in a single call — prompt plus generated output combined.
beginnerContinuous batching lets new requests join an in-flight batch on the next decode step rather than waiting for the current batch to finish, dramatically raising GPU utilization on variable-length workloads.
advancedGreedy decoding always picks the single highest-probability next token. It is deterministic, fast, and often dull.
beginnerInference is what happens when you actually run a trained model on new input. For LLMs that means generating tokens one at a time, with sampling and a KV cache.
beginnerJSON mode is a provider-specific feature that forces the model to emit syntactically valid JSON. Stronger than asking nicely; weaker than full structured output with a schema.
beginnerA long-context model accepts very long inputs — 100K+ tokens, in some cases millions. Claude (200K), GPT-4o (128K), and Gemini 1.5 Pro (1M+) are current examples.
beginnerPrompt caching stores the KV-cache state of a long prefix (system prompt, large document, tool definitions) so subsequent calls that reuse it skip the prefill compute — cutting TTFT and cost by 50-90%.
intermediateSampling is the act of choosing the next token from the model's output distribution, typically after applying temperature and a truncation strategy like top-p or top-k.
intermediateSpeculative decoding speeds up generation by having a small "draft" model propose several tokens, then verifying them in a single batched call to the big model.
advancedStreaming returns tokens to the client as they're generated rather than holding the full response until completion. Implemented over Server-Sent Events (SSE) or WebSocket; what makes chat UIs feel fast.
beginnerStructured output constrains an LLM to emit text matching a schema — usually JSON. The model can be guaranteed to produce valid output that your code can parse without retries.
intermediateTemperature is a generation parameter that controls randomness. 0 is deterministic (always pick the most likely token); higher values produce more diverse, surprising output.
beginnerTime per output token (TPOT) is the average wall-clock delay between consecutive generated tokens during streaming. Determines how fast text appears once generation starts.
intermediateTime to first token (TTFT) is how long it takes from sending a request until the first response token arrives. The user-perceived latency metric for streaming chat.
intermediateA token is the basic unit an LLM reads and writes — usually a word piece (3-4 characters). LLMs are priced and sized by tokens, not words.
beginnerToken count is the number of tokens in a piece of text under a specific tokenizer. The unit of LLM pricing, context limits, and rate limits.
beginnerTokenization is the process of splitting raw text into the discrete tokens an LLM consumes. Most modern LLMs use a learned byte-pair-encoding (BPE) tokenizer.
intermediateTop-k restricts token sampling to the k highest-probability tokens, then samples from that set. A simpler alternative to top-p.
intermediateTop-p (nucleus sampling) restricts token selection to the smallest set of tokens whose cumulative probability reaches p. Common values are 0.9-0.95.
intermediate