Inference

What controls model behavior at generation time.

Batch APIs (OpenAI, Anthropic) accept up to 50K LLM requests in a single submission, run them asynchronously over hours, and return results at ~50% of the synchronous price. The cheap option for bulk processing.

beginner

Beam Search

Beam search explores several candidate continuations in parallel, keeping the top-k partial sequences at each step. Common in translation; rare in modern LLM chat.

intermediate

Context Window

The context window is the maximum number of tokens an LLM can consider in a single call — prompt plus generated output combined.

beginner

Continuous Batching

Continuous batching lets new requests join an in-flight batch on the next decode step rather than waiting for the current batch to finish, dramatically raising GPU utilization on variable-length workloads.

advanced

Greedy Decoding

Greedy decoding always picks the single highest-probability next token. It is deterministic, fast, and often dull.

beginner

Inference

Inference is what happens when you actually run a trained model on new input. For LLMs that means generating tokens one at a time, with sampling and a KV cache.

beginner

JSON Mode

JSON mode is a provider-specific feature that forces the model to emit syntactically valid JSON. Stronger than asking nicely; weaker than full structured output with a schema.

beginner

Long-Context Model

A long-context model accepts very long inputs — 100K+ tokens, in some cases millions. Claude (200K), GPT-4o (128K), and Gemini 1.5 Pro (1M+) are current examples.

beginner

Prompt Caching

Prompt caching stores the KV-cache state of a long prefix (system prompt, large document, tool definitions) so subsequent calls that reuse it skip the prefill compute — cutting TTFT and cost by 50-90%.

intermediate

Sampling

Sampling is the act of choosing the next token from the model's output distribution, typically after applying temperature and a truncation strategy like top-p or top-k.

intermediate

Speculative Decoding

Speculative decoding speeds up generation by having a small "draft" model propose several tokens, then verifying them in a single batched call to the big model.

advanced

Streaming (LLM Responses)

Streaming returns tokens to the client as they're generated rather than holding the full response until completion. Implemented over Server-Sent Events (SSE) or WebSocket; what makes chat UIs feel fast.

beginner

Structured Output

Structured output constrains an LLM to emit text matching a schema — usually JSON. The model can be guaranteed to produce valid output that your code can parse without retries.

intermediate

Temperature

Temperature is a generation parameter that controls randomness. 0 is deterministic (always pick the most likely token); higher values produce more diverse, surprising output.

beginner

Time per Output Token

Time per output token (TPOT) is the average wall-clock delay between consecutive generated tokens during streaming. Determines how fast text appears once generation starts.

intermediate

Time to First Token

Time to first token (TTFT) is how long it takes from sending a request until the first response token arrives. The user-perceived latency metric for streaming chat.

intermediate

Token

A token is the basic unit an LLM reads and writes — usually a word piece (3-4 characters). LLMs are priced and sized by tokens, not words.

beginner

Token Count

Token count is the number of tokens in a piece of text under a specific tokenizer. The unit of LLM pricing, context limits, and rate limits.

beginner

Tokenization

Tokenization is the process of splitting raw text into the discrete tokens an LLM consumes. Most modern LLMs use a learned byte-pair-encoding (BPE) tokenizer.

intermediate

Top-k

Top-k restricts token sampling to the k highest-probability tokens, then samples from that set. A simpler alternative to top-p.

intermediate

Top-p

Top-p (nucleus sampling) restricts token selection to the smallest set of tokens whose cumulative probability reaches p. Common values are 0.9-0.95.

intermediate