ModelTerms

Category

Inference

What controls model behavior at generation time.

Batch API

Batch APIs (OpenAI, Anthropic) accept tens of thousands of LLM requests in a single submission (OpenAI caps a batch at 50,000), process them asynchronously within a completion window of up to 24 hours, and return results at roughly 50% of the synchronous price. The cheap option for bulk processing.

beginner
Beam Search

Beam search explores several candidate continuations in parallel, keeping the top-k partial sequences at each step. Common in translation; rare in modern LLM chat.

intermediate
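
A minimal sketch of the beam search idea above, assuming a hypothetical toy "model" that is just a lookup table of next-token probabilities. It shows how keeping two beams finds a higher-probability sequence than keeping one.

```python
import math

# Hypothetical toy model: next-token probabilities keyed by the last token.
TOY_MODEL = {
    "<s>": {"a": 0.6, "b": 0.4},
    "a": {"x": 0.5, "y": 0.5},
    "b": {"z": 0.9, "x": 0.1},
}

def beam_search(start, steps, k):
    # Each beam is (cumulative log-probability, token sequence).
    beams = [(0.0, [start])]
    for _ in range(steps):
        candidates = []
        for score, seq in beams:
            next_probs = TOY_MODEL.get(seq[-1], {})
            if not next_probs:
                candidates.append((score, seq))  # no continuation: carry forward
                continue
            for tok, p in next_probs.items():
                candidates.append((score + math.log(p), seq + [tok]))
        # Keep only the k best partial sequences.
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:k]
    return beams

wide = beam_search("<s>", steps=2, k=2)[0][1]    # ['<s>', 'b', 'z'], probability 0.36
narrow = beam_search("<s>", steps=2, k=1)[0][1]  # starts with 'a', probability 0.30
```

With k=1 the search commits to "a" at the first step and cannot recover; with k=2 it keeps "b" alive long enough to discover the better continuation.
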
Context Window

The context window is the maximum number of tokens an LLM can consider in a single call — prompt plus generated output combined.

beginner
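
Because prompt and output share the same budget, a request can be checked before it is sent. The sketch below is illustrative only; the function names and the 128,000-token window are assumptions, not any provider's API.

```python
def fits_context(prompt_tokens: int, max_output_tokens: int, context_window: int) -> bool:
    # Prompt and generated output count against the same limit.
    return prompt_tokens + max_output_tokens <= context_window

def max_output_budget(prompt_tokens: int, context_window: int) -> int:
    # Tokens left for generation once the prompt is accounted for.
    return max(0, context_window - prompt_tokens)

WINDOW = 128_000  # illustrative window size
ok = fits_context(120_000, 8_000, WINDOW)        # exactly fills the window
too_big = fits_context(125_000, 8_000, WINDOW)   # exceeds it by 5,000 tokens
remaining = max_output_budget(125_000, WINDOW)
```
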
Continuous Batching

Continuous batching lets new requests join an in-flight batch on the next decode step rather than waiting for the current batch to finish, dramatically raising GPU utilization on variable-length workloads.

advanced
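
The utilization gain can be seen in a toy simulation. This sketch assumes one generated token per request per decode step, ignores prefill, and uses made-up output lengths; real schedulers are considerably more involved.

```python
from collections import deque

def static_batching_steps(lengths, batch_size):
    # Each batch runs until its longest request finishes.
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps

def continuous_batching_steps(lengths, batch_size):
    queue = deque(lengths)
    running = []  # tokens still to generate for each in-flight request
    steps = 0
    while queue or running:
        # Fill any free slots before every decode step.
        while queue and len(running) < batch_size:
            running.append(queue.popleft())
        steps += 1
        # Each request emits one token; finished requests leave the batch.
        running = [r - 1 for r in running if r > 1]
    return steps

lengths = [10, 2, 2, 2]  # hypothetical output lengths in tokens
static_steps = static_batching_steps(lengths, batch_size=2)
continuous_steps = continuous_batching_steps(lengths, batch_size=2)
```

With these lengths the static scheduler needs 12 decode steps, because the second batch waits for the 10-token request, while the continuous scheduler needs 10, since short requests slot in beside the long one.
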
Greedy Decoding

Greedy decoding always picks the single highest-probability next token. It is deterministic, fast, and often dull.

beginner
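
As a sketch, greedy decoding is an argmax in a loop. The lookup table below is a hypothetical stand-in for a real model's next-token distribution.

```python
# Hypothetical toy model: next-token probabilities keyed by the last token.
TOY_TABLE = {
    "<s>": {"the": 0.7, "a": 0.3},
    "the": {"cat": 0.6, "dog": 0.4},
    "cat": {"</s>": 0.9, "sat": 0.1},
}

def greedy_decode(table, start="<s>", stop="</s>", max_tokens=10):
    seq = [start]
    while len(seq) <= max_tokens and seq[-1] != stop:
        probs = table[seq[-1]]
        seq.append(max(probs, key=probs.get))  # argmax, no randomness
    return seq

result = greedy_decode(TOY_TABLE)  # ['<s>', 'the', 'cat', '</s>']
```

Running it twice gives the same sequence, which is the determinism the definition refers to.
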
Inference

Inference is what happens when you actually run a trained model on new input. For LLMs that means generating tokens one at a time, with sampling and a KV cache.

beginner
JSON Mode

JSON mode is a provider-specific feature that forces the model to emit syntactically valid JSON. Stronger than asking nicely; weaker than full structured output with a schema.

beginner
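
The gap between "valid JSON" and "the JSON you wanted" can be illustrated with a small check on the reply. This is a sketch using only the standard library; the helper name and the example keys are hypothetical.

```python
import json

def parse_json_reply(text, required_keys):
    # json.loads raises json.JSONDecodeError if the syntax is invalid.
    data = json.loads(text)
    if not isinstance(data, dict):
        return data, list(required_keys)
    # Syntactically valid JSON can still lack fields your code expects.
    missing = [k for k in required_keys if k not in data]
    return data, missing

reply = '{"name": "Ada"}'  # hypothetical model output
data, missing = parse_json_reply(reply, ["name", "age"])
```

Here the reply parses cleanly yet `missing` contains "age", which is the case a schema-enforcing structured output mode is meant to rule out.
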
Long-Context Model

A long-context model accepts very long inputs — 100K+ tokens, in some cases millions. Claude (200K), GPT-4o (128K), and Gemini 1.5 Pro (1M+) are current examples.

beginner
Prompt Caching

Prompt caching stores the KV-cache state of a long prefix (system prompt, large document, tool definitions) so subsequent calls that reuse it skip the prefill compute for that prefix, cutting TTFT and the cost of the cached input tokens by roughly 50-90%, depending on the provider.

intermediate
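
A rough cost comparison makes the saving concrete. Everything numeric here is an assumption for illustration: the price of 3.0 per million input tokens is made up, the 0.1 multiplier corresponds to a 90% discount on cached tokens, and any surcharge for writing the cache is ignored.

```python
def input_cost(prefix_tokens, new_tokens, price_per_mtok, cached_multiplier=1.0):
    # Cached prefix tokens are billed at a fraction of the normal input price.
    prefix_cost = prefix_tokens * price_per_mtok * cached_multiplier
    new_cost = new_tokens * price_per_mtok
    return (prefix_cost + new_cost) / 1_000_000

PRICE = 3.0  # hypothetical price per million input tokens

uncached = input_cost(50_000, 500, PRICE)
cached = input_cost(50_000, 500, PRICE, cached_multiplier=0.1)
savings = 1 - cached / uncached
```

With a 50,000-token shared prefix and a 500-token question, the per-call input cost drops from about 0.15 to about 0.017, a saving of roughly 89%.
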
Sampling

Sampling is the act of choosing the next token from the model's output distribution, typically after applying temperature and a truncation strategy like top-p or top-k.

intermediate
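
In code, the core of sampling is a weighted random draw. The sketch below assumes the distribution has already been temperature-scaled and truncated; the probabilities are made up.

```python
import random

def sample_token(probs, rng):
    # Draw one token with probability proportional to its weight.
    tokens = list(probs)
    weights = [probs[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]

rng = random.Random(0)  # seeded so the run is repeatable
probs = {"the": 0.7, "a": 0.2, "an": 0.1}  # hypothetical next-token distribution
draws = [sample_token(probs, rng) for _ in range(2000)]
share_the = draws.count("the") / len(draws)  # close to 0.7
```
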
Speculative Decoding

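Speculative decoding speeds up generation by having a small "draft" model propose several tokens, then verifying them in a single forward pass of the large model. Tokens the large model agrees with are accepted; the first rejected token is replaced by the large model's own choice.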

advanced
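
The propose-then-verify loop can be sketched with two toy "models" that are just fixed token lists. This is the simplified greedy variant, where a draft token is accepted only if it matches the large model's argmax; the sampling variant uses a probabilistic accept/reject rule that is not shown. All names are hypothetical.

```python
TARGET = ["the", "cat", "sat", "on", "the", "mat"]  # what the large model would emit
DRAFT = ["the", "cat", "sat", "in", "the", "mat"]   # draft disagrees at position 3

def target_next(seq):
    return TARGET[len(seq)]

def draft_next(seq):
    return DRAFT[len(seq)]

def speculative_step(prefix, k):
    # 1. The draft model proposes k tokens autoregressively (cheap).
    proposed = []
    for _ in range(k):
        proposed.append(draft_next(prefix + proposed))
    # 2. The large model checks every position. A real system scores all
    #    k positions in one forward pass; this loop only mimics the result.
    accepted = []
    for tok in proposed:
        expected = target_next(prefix + accepted)
        if tok != expected:
            accepted.append(expected)  # first mismatch: take the target's token
            return accepted
        accepted.append(tok)
    # 3. All proposals accepted: the same pass yields one extra token.
    accepted.append(target_next(prefix + accepted))
    return accepted

step = speculative_step([], k=4)  # ['the', 'cat', 'sat', 'on']
```

One verification pass yields four tokens here, and the output is identical to what the large model would have produced alone.
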
Streaming (LLM Responses)

Streaming returns tokens to the client as they're generated rather than holding the full response until completion. Implemented over Server-Sent Events (SSE) or WebSocket; what makes chat UIs feel fast.

beginner
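
On the client side, consuming an SSE stream comes down to reading `data:` lines and treating a blank line as the end of an event. The parser below is a minimal sketch that ignores other SSE fields; the plain-text payloads and the `[DONE]` sentinel are assumptions, since payload format and end-of-stream markers are provider-specific.

```python
def iter_sse_data(lines):
    # Yield the data payload of each event; a blank line ends an event.
    buffer = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith("data:"):
            value = line[5:]
            if value.startswith(" "):
                value = value[1:]  # SSE strips one leading space after the colon
            buffer.append(value)
        elif line == "" and buffer:
            yield "\n".join(buffer)
            buffer = []

# Hypothetical stream as it might arrive over the wire.
stream = ["data: Hel\n", "\n", "data: lo\n", "\n", "data: [DONE]\n", "\n"]

chunks = []
for payload in iter_sse_data(stream):
    if payload == "[DONE]":  # assumed end-of-stream sentinel
        break
    chunks.append(payload)   # a UI would render each chunk immediately
text = "".join(chunks)
```
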
Structured Output

Structured output constrains an LLM to emit text matching a schema, usually JSON. When the provider enforces the schema during decoding, the output is guaranteed to match it, so your code can parse the result without retries.

intermediate
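
One way such a guarantee can be enforced is constrained decoding: at every step, tokens that would break the schema are masked out before the model's choice is taken. The sketch below works character by character against a tiny set of allowed strings; the scoring function is a hypothetical stand-in for a model, and real implementations compile a JSON schema or grammar instead.

```python
def constrained_decode(char_scores, allowed_outputs):
    out = ""
    while out not in allowed_outputs:
        # Characters that keep `out` a prefix of at least one allowed string.
        legal = sorted({s[len(out)] for s in allowed_outputs
                        if s.startswith(out) and len(s) > len(out)})
        scores = char_scores(out)
        # Pick the model's favourite among the legal characters only.
        out += max(legal, key=lambda c: scores.get(c, 0.0))
    return out

def toy_scores(out):
    # Hypothetical model: unconstrained, it would start with "m" (as in "maybe").
    return {"m": 0.5, "t": 0.3, "f": 0.2} if out == "" else {}

answer = constrained_decode(toy_scores, {"true", "false", "null"})  # 'true'
```

The model's top choice "m" is never legal, so the decoder falls back to its best legal option and the result is always one of the allowed strings.
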
Temperature

Temperature is a generation parameter that controls randomness by rescaling the model's logits before sampling. 0 is effectively greedy (always pick the most likely token); higher values produce more diverse, surprising output.

beginner
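
Mechanically, temperature divides the logits before the softmax. The sketch below uses made-up logits to show how the distribution sharpens at low temperature and flattens at high temperature.

```python
import math

def softmax_with_temperature(logits, temperature):
    if temperature <= 0:
        raise ValueError("temperature must be > 0; use argmax for greedy decoding")
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]  # hypothetical scores for three tokens
cold = softmax_with_temperature(logits, 0.1)   # top token gets almost all the mass
base = softmax_with_temperature(logits, 1.0)   # top token gets about 0.67
hot = softmax_with_temperature(logits, 10.0)   # close to uniform
```
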
Time per Output Token

Time per output token (TPOT) is the average wall-clock delay between consecutive generated tokens during streaming. Determines how fast text appears once generation starts.

intermediate
Time to First Token

Time to first token (TTFT) is how long it takes from sending a request until the first response token arrives. The user-perceived latency metric for streaming chat.

intermediate
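
Both latency metrics, TTFT and TPOT, fall out of the same timestamps: when the request was sent and when each token arrived. A minimal sketch, with hypothetical timings in seconds:

```python
def latency_metrics(request_sent, token_arrivals):
    if not token_arrivals:
        raise ValueError("no tokens received")
    # TTFT: delay until the first token shows up.
    ttft = token_arrivals[0] - request_sent
    # TPOT: average gap between consecutive tokens after the first.
    gaps = len(token_arrivals) - 1
    tpot = (token_arrivals[-1] - token_arrivals[0]) / gaps if gaps else 0.0
    return ttft, tpot

# Hypothetical timings: first token after 0.5 s, then one every 50 ms.
ttft, tpot = latency_metrics(0.0, [0.50, 0.55, 0.60, 0.65])
```

In a real client the timestamps would come from something like `time.monotonic()` recorded as each streamed chunk arrives.
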
Token

A token is the basic unit an LLM reads and writes, usually a word piece averaging roughly 3-4 characters of English text. LLMs are priced and sized by tokens, not words.

beginner
Token Count

Token count is the number of tokens in a piece of text under a specific tokenizer. The unit of LLM pricing, context limits, and rate limits.

beginner
Tokenization

Tokenization is the process of splitting raw text into the discrete tokens an LLM consumes. Most modern LLMs use a learned byte-pair-encoding (BPE) tokenizer.

intermediate
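
The "learned" part of BPE is a loop that repeatedly merges the most frequent adjacent pair of symbols. The sketch below performs a single merge on a tiny made-up corpus; production tokenizers work on bytes and run tens of thousands of merges.

```python
from collections import Counter

def most_frequent_pair(words):
    # Count adjacent symbol pairs, weighted by word frequency.
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get)

def merge_pair(words, pair):
    # Replace every occurrence of `pair` with one merged symbol.
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

# Hypothetical corpus: words split into characters, with their counts.
corpus = {("l", "o", "w"): 5, ("l", "o", "t"): 2, ("n", "e", "w"): 3}
pair = most_frequent_pair(corpus)      # ('l', 'o'), seen 7 times
corpus_after = merge_pair(corpus, pair)
```

After the merge, "lo" is a single vocabulary entry, so "low" costs two tokens instead of three.
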
Top-k

Top-k restricts token sampling to the k highest-probability tokens, then samples from that set. A simpler alternative to top-p.

intermediate
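
A minimal sketch of the top-k filter, using a made-up distribution: keep the k most likely tokens, then renormalize so the kept probabilities sum to 1 before sampling.

```python
def top_k_filter(probs, k):
    # Keep the k highest-probability tokens.
    kept = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in kept)
    # Renormalize so the survivors form a valid distribution.
    return {tok: p / total for tok, p in kept}

probs = {"a": 0.5, "b": 0.3, "c": 0.15, "d": 0.05}  # hypothetical distribution
filtered = top_k_filter(probs, k=2)  # roughly {'a': 0.625, 'b': 0.375}
```
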
Top-p

Top-p (nucleus sampling) restricts token selection to the smallest set of tokens whose cumulative probability reaches p. Common values are 0.9-0.95.

intermediate
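
A minimal sketch of the nucleus filter, using a made-up distribution: walk the tokens from most to least likely and stop as soon as the running total reaches p. Unlike top-k, the number of tokens kept varies with the shape of the distribution.

```python
def top_p_filter(probs, p):
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for tok, prob in ranked:
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= p:  # smallest set whose mass reaches p
            break
    total = sum(prob for _, prob in kept)
    return {tok: prob / total for tok, prob in kept}

probs = {"a": 0.5, "b": 0.3, "c": 0.15, "d": 0.05}  # hypothetical distribution
nucleus = top_p_filter(probs, p=0.9)  # keeps a, b and c; drops d
```
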