Learning path · 26 min · advanced

Inference engineering

Make models fast and cheap to serve. The metrics, the techniques, the trade-offs.

Training gets the headlines, but inference is the line item on your monthly bill. This path walks the metrics (TTFT, TPOT), the techniques (prompt caching, continuous batching, speculative decoding, quantization), and the production architecture (LLM gateway, model router).

Inference
Why this step: The setup — what inference is and why it dominates the AI economy now.
Inference is what happens when you actually run a trained model on new input. For LLMs that means generating tokens one at a time, with sampling and a KV cache.
Inference · beginner
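To make "generating tokens one at a time, with sampling" concrete, here is a minimal sketch of an autoregressive loop. The `toy_logits` function is a hypothetical stand-in for a real model forward pass, and the vocabulary size, temperature, and seed are illustrative assumptions.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities; a lower temperature sharpens them."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def toy_logits(tokens, vocab_size):
    """Hypothetical stand-in for a model: strongly favors (last token + 1)."""
    favored = (tokens[-1] + 1) % vocab_size
    return [5.0 if i == favored else 0.0 for i in range(vocab_size)]

def generate(prompt, max_new_tokens, vocab_size=8, temperature=1.0, seed=0):
    """Autoregressive loop: each sampled token is appended and fed back in."""
    rng = random.Random(seed)
    tokens = list(prompt)  # prompt must contain at least one token
    for _ in range(max_new_tokens):
        probs = softmax(toy_logits(tokens, vocab_size), temperature)
        next_token = rng.choices(range(vocab_size), weights=probs, k=1)[0]
        tokens.append(next_token)
    return tokens
```

Each loop iteration depends on the token produced by the previous one, which is why decoding is sequential and why the per-token cost matters so much.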
KV Cache · key-value cache
Why this step: The dominant memory cost. Knowing this is the prerequisite for everything else.
The KV cache stores the key and value vectors of all earlier tokens during generation so the model does not recompute them at every step. It is the main memory cost of LLM inference.
Architecture · advanced
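A back-of-the-envelope estimate shows why the KV cache is the main memory cost. The sketch below uses the standard size formula for a decoder-only transformer; the model dimensions in the usage note are assumptions chosen for illustration, not the specs of any particular model.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_value=2):
    """Estimate KV cache size in bytes (bytes_per_value=2 assumes FP16/BF16)."""
    # The leading 2 counts one key vector plus one value vector
    # per token, per layer, per KV head.
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)

# Assumed config: 32 layers, 32 KV heads, head dimension 128, FP16 values.
per_token = kv_cache_bytes(32, 32, 128, seq_len=1)
full_context = kv_cache_bytes(32, 32, 128, seq_len=4096)
```

With these assumed numbers the cache costs 0.5 MiB per token, so a single 4,096-token sequence needs 2 GiB. Cutting the number of KV heads to 8 (as grouped-query attention does) would cut that to 0.5 GiB.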
Time to First Token · TTFT
Why this step: The user-perceived latency metric for streaming chat.
Time to first token (TTFT) is the time from sending a request until the first response token arrives. It covers network time, queueing, and prompt prefill, and it is the latency users perceive in streaming chat.
Inference · intermediate
Time per Output Token · TPOT
Why this step: The streaming-speed metric. Together with TTFT, this defines "feel".
Time per output token (TPOT) is the average wall-clock delay between consecutive generated tokens during streaming. It determines how fast text appears once generation starts.
Inference · intermediate
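Both metrics fall out of the token arrival timestamps. The sketch below assumes you have recorded when the request was sent and when each token arrived (for example with `time.perf_counter()`); the function name and argument names are illustrative.

```python
def latency_metrics(request_sent_at, token_arrival_times):
    """Return (ttft, tpot) in the same time unit as the inputs."""
    if not token_arrival_times:
        raise ValueError("need at least one token arrival time")
    # TTFT: delay until the first token shows up.
    ttft = token_arrival_times[0] - request_sent_at
    # TPOT: mean gap between consecutive tokens after the first.
    gaps = [later - earlier for earlier, later
            in zip(token_arrival_times, token_arrival_times[1:])]
    tpot = sum(gaps) / len(gaps) if gaps else 0.0
    return ttft, tpot

# Request sent at t=0.0 s; four tokens arrive at these times.
ttft, tpot = latency_metrics(0.0, [0.5, 0.75, 1.0, 1.25])
```

Here TTFT is 0.5 s and TPOT is 0.25 s. Total response time for n tokens is roughly TTFT + TPOT × (n − 1), which is why the two are tuned separately.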
Streaming (LLM Responses) · SSE streaming
Why this step: Why TTFT and TPOT exist as separate concerns at all.
Streaming returns tokens to the client as they're generated rather than holding the full response until completion. It is usually implemented over Server-Sent Events (SSE) or WebSocket, and it is what makes chat UIs feel fast.
Inference · beginner
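On the wire, an SSE stream is plain text: `data:` lines grouped into events, with a blank line ending each event. The sketch below parses such a stream from any iterable of lines. The `[DONE]` end marker is an assumption (some APIs use it, others signal completion differently), and `collect_text` is an illustrative helper.

```python
def iter_sse_data(lines):
    """Yield the data payload of each complete SSE event."""
    buffer = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":  # a blank line ends the current event
            if buffer:
                yield "\n".join(buffer)
                buffer = []
            continue
        if line.startswith(":"):  # comment line, often used as a keep-alive
            continue
        field, _, value = line.partition(":")
        if value.startswith(" "):  # one leading space after the colon is dropped
            value = value[1:]
        if field == "data":
            buffer.append(value)
    # An event not terminated by a blank line is discarded.

def collect_text(lines, done_marker="[DONE]"):
    """Join streamed chunks until the assumed end-of-stream marker."""
    parts = []
    for payload in iter_sse_data(lines):
        if payload == done_marker:
            break
        parts.append(payload)
    return "".join(parts)

stream = ["data: Hel\n", "\n", ": ping\n", "data: lo\n", "\n", "data: [DONE]\n", "\n"]
text = collect_text(stream)
```

A real client would render each payload as it arrives instead of joining them at the end; that incremental rendering is what the user experiences as speed.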
Prompt Caching · prefix caching
Why this step: Can cut cost by roughly 50-90% on long-prefix workloads.
Prompt caching stores the KV-cache state of a long prefix (system prompt, large document, tool definitions) so subsequent calls that reuse it skip the prefill compute for that prefix. On workloads dominated by a shared prefix this can cut TTFT and cost by roughly 50-90%; the actual saving depends on the cache hit rate and the provider's pricing.
Inference · intermediate
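The saving comes from how many prompt tokens still need prefill after a cache hit. The toy cache below only tracks which token prefixes are cached; it is a simplified sketch, not how any specific engine or provider implements caching (real systems typically cache in fixed-size blocks and evict old entries). All names are hypothetical.

```python
class ToyPrefixCache:
    """Remembers token prefixes whose KV state we pretend to have stored."""

    def __init__(self):
        self._prefixes = set()

    def store(self, tokens):
        self._prefixes.add(tuple(tokens))

    def longest_hit(self, tokens):
        """Length of the longest cached prefix that the prompt starts with."""
        tokens = tuple(tokens)
        best = 0
        for prefix in self._prefixes:
            n = len(prefix)
            if n > best and tokens[:n] == prefix:
                best = n
        return best

def prefill_tokens_needed(cache, prompt_tokens):
    """Tokens that must still be run through prefill after a cache lookup."""
    return len(prompt_tokens) - cache.longest_hit(prompt_tokens)

system_prompt = list(range(1000))  # stand-in for a 1,000-token shared prefix
cache = ToyPrefixCache()
cache.store(system_prompt)
needed = prefill_tokens_needed(cache, system_prompt + [5000, 5001])
```

In this example only 2 of the 1,002 prompt tokens need prefill. A prompt that differs in its first token gets no hit at all, which is why stable content belongs at the start of the prompt.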
Continuous Batching · in-flight batching
Why this step: A core feature of serving engines like vLLM. It yields several-fold throughput gains by refilling GPU batch slots as soon as a request finishes.
Continuous batching lets new requests join an in-flight batch on the next decode step rather than waiting for the current batch to finish, dramatically raising GPU utilization on variable-length workloads.
Inference · advanced
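A small simulation shows where the gain comes from. It counts decode steps needed to finish a queue of requests under each policy, assuming every request needs at least one output token and that one decode step produces one token for every running request. Function names and the workload are illustrative.

```python
from collections import deque

def static_batching_steps(output_lengths, slots):
    """Each batch runs until its longest request finishes."""
    steps = 0
    for start in range(0, len(output_lengths), slots):
        batch = output_lengths[start:start + slots]
        steps += max(batch)  # short requests leave their slot idle
    return steps

def continuous_batching_steps(output_lengths, slots):
    """A waiting request takes over a slot on the step after it frees up."""
    waiting = deque(output_lengths)
    running = []  # tokens still to generate for each in-flight request
    steps = 0
    while waiting or running:
        while waiting and len(running) < slots:
            running.append(waiting.popleft())
        steps += 1  # one decode step emits one token per running request
        running = [left - 1 for left in running if left > 1]
    return steps

lengths = [10, 2, 2, 2, 2, 2]  # one long request among short ones
static_steps = static_batching_steps(lengths, slots=2)
continuous_steps = continuous_batching_steps(lengths, slots=2)
```

With this workload the static schedule takes 14 steps and the continuous one takes 10, because the short requests cycle through the second slot while the long one is still running. The gap widens as output lengths become more uneven.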
Speculative Decoding
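Why this step: Often a 2-3x decode speedup with no quality loss, paid for with the extra compute and memory of a draft model.
Speculative decoding speeds up generation by having a small "draft" model propose several tokens, then verifying them in a single batched call to the big model. Tokens the big model agrees with are kept and the first one it rejects is replaced with its own choice, so the output distribution matches the big model's.
Inference · advanced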
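The propose-then-verify loop can be sketched for the greedy case, where verification means "did the draft pick the same token the big model would have picked". The two toy models are hypothetical, and the per-position calls to `target_next` stand in for what a real engine does in one batched forward pass. Sampling-based variants use a probabilistic accept/reject rule that this sketch does not cover.

```python
def target_next(tokens):
    """Hypothetical big model: counts upward modulo 10."""
    return (tokens[-1] + 1) % 10

def draft_next(tokens):
    """Hypothetical draft model: agrees with the target except after a 2."""
    return 0 if tokens[-1] == 2 else (tokens[-1] + 1) % 10

def speculative_generate(target_next, draft_next, prompt, max_new_tokens, k=4):
    tokens = list(prompt)
    goal = len(tokens) + max_new_tokens
    target_passes = 0
    while len(tokens) < goal:
        # Draft phase: the small model proposes k tokens, one at a time.
        draft = []
        for _ in range(k):
            draft.append(draft_next(tokens + draft))
        # Verify phase: counted as ONE target pass (batched in a real engine).
        target_passes += 1
        accepted = []
        for i in range(k + 1):
            expected = target_next(tokens + accepted)
            accepted.append(expected)  # equals draft[i] whenever they agree
            if i == k or draft[i] != expected:
                break  # bonus token after k matches, or correction on mismatch
        tokens.extend(accepted)
    return tokens[:goal], target_passes

output, passes = speculative_generate(target_next, draft_next, [0], 12, k=4)
```

The output is identical to decoding with the target model alone, but the 12 new tokens take 3 target passes instead of 12. The speedup depends on how often the draft agrees with the target; a poor draft model wastes its own compute and gains nothing.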
Quantization
Why this step: The other big inference lever — smaller weights, faster compute.
Quantization reduces model weights from 16- or 32-bit floats to lower-precision types (INT8, INT4) so the model needs less memory and runs faster, usually with minor quality loss.
Infrastructure · intermediate
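The simplest form of the idea is symmetric absmax quantization to INT8: pick one scale for the whole tensor, round each weight to the nearest integer step, and multiply back when needed. This is a minimal sketch of that one scheme in pure Python; production methods (per-channel scales, INT4 grouping, calibration) are more involved, and the weights here are made up.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] with a single shared scale."""
    largest = max(abs(w) for w in weights)
    scale = largest / 127 if largest > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale

def dequantize(quantized, scale):
    """Recover approximate float weights from the stored integers."""
    return [q * scale for q in quantized]

weights = [0.5, -1.27, 0.0, 1.0]  # illustrative values
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)
```

Each restored weight is within half a quantization step (`scale / 2`) of the original. Storing one byte per weight instead of two (FP16) or four (FP32) is where the memory saving comes from.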
vLLM
Why this step: The reference open-source serving engine that combines most of the above.
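vLLM is an open-source high-throughput LLM serving engine. Its PagedAttention KV cache manager allocates cache memory in fixed-size blocks instead of one contiguous region per request, which cuts fragmentation and lets more requests fit on the GPU. That is the main reason it outperforms naive serving setups.
Infrastructure · advanced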
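The paging idea behind a block-based KV cache manager can be sketched with a toy allocator: each request gets a block table, and a new physical block is claimed only when the previous one fills up. This illustrates the concept only; it is not vLLM's code or API, and the class name, block size, and method names are assumptions.

```python
class ToyBlockAllocator:
    """Hands out fixed-size KV cache blocks on demand."""

    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # request id -> list of physical block ids
        self.lengths = {}       # request id -> tokens stored so far

    def append_token(self, request_id):
        table = self.block_tables.setdefault(request_id, [])
        length = self.lengths.get(request_id, 0)
        if length == len(table) * self.block_size:  # no block yet, or last is full
            if not self.free_blocks:
                raise MemoryError("out of KV cache blocks")
            table.append(self.free_blocks.pop())
        self.lengths[request_id] = length + 1

    def release(self, request_id):
        """Return a finished request's blocks to the pool."""
        self.free_blocks.extend(self.block_tables.pop(request_id, []))
        self.lengths.pop(request_id, None)

allocator = ToyBlockAllocator(num_blocks=8, block_size=16)
for _ in range(20):
    allocator.append_token("a")  # 20 tokens occupy 2 blocks
for _ in range(5):
    allocator.append_token("b")  # 5 tokens occupy 1 block
```

A request wastes at most one partly filled block, whereas reserving a contiguous region sized for the maximum context wastes everything the request never uses.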
LLM Gateway · model gateway
Why this step: The proxy layer that handles cross-cutting concerns in production.
An LLM gateway is a proxy layer that sits between application code and one or more LLM providers — handling auth, rate-limit retries, cost tracking, observability, prompt caching, model routing, and PII redaction.
Infrastructure · intermediate
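Two of those concerns, rate-limit retries and cost tracking, fit in a short sketch. Everything here is hypothetical: `RateLimitError`, the `call_provider` callable (assumed to return the text and the number of tokens billed), and the flat per-1k-token price. A real gateway would sit behind an HTTP interface and handle many more cases.

```python
import time

class RateLimitError(Exception):
    """Hypothetical error a provider client raises on a rate-limit response."""

class ToyGateway:
    def __init__(self, call_provider, price_per_1k_tokens,
                 max_retries=3, sleep=time.sleep):
        self.call_provider = call_provider  # prompt -> (text, tokens_used)
        self.price_per_1k_tokens = price_per_1k_tokens
        self.max_retries = max_retries
        self.sleep = sleep  # injectable so tests need not actually wait
        self.total_cost = 0.0

    def complete(self, prompt):
        for attempt in range(self.max_retries + 1):
            try:
                text, tokens_used = self.call_provider(prompt)
            except RateLimitError:
                if attempt == self.max_retries:
                    raise  # retries exhausted: surface the error to the caller
                self.sleep(0.5 * 2 ** attempt)  # exponential backoff: 0.5s, 1s, 2s
                continue
            # Cost tracking: charge only for calls that succeeded.
            self.total_cost += tokens_used / 1000 * self.price_per_1k_tokens
            return text
```

Because application code only talks to `complete`, retry policy and cost accounting can change in one place without touching any caller.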
Model Router · LLM router
Why this step: How you avoid paying frontier prices for trivial queries.
A model router picks the cheapest model that's likely to handle a given request well — based on a small classifier, embedding similarity, or rule-based filters — so you don't pay frontier prices for trivial queries.
Infrastructure · intermediate
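The rule-based variant is the easiest to sketch. The model names, the keyword list, and the word-count threshold below are all made-up placeholders; a production router would more likely use a trained classifier or embedding similarity and would be tuned against real traffic.

```python
def route(prompt, cheap_model="small-model", frontier_model="frontier-model",
          max_cheap_words=40):
    """Send short, simple-looking prompts to the cheap model."""
    # Hypothetical markers of requests that tend to need a stronger model.
    hard_markers = ("prove", "refactor", "step by step", "analyze")
    text = prompt.lower()
    looks_hard = (len(text.split()) > max_cheap_words
                  or any(marker in text for marker in hard_markers))
    return frontier_model if looks_hard else cheap_model

easy = route("What is the capital of France?")
hard = route("Prove that the sum of two even numbers is even.")
```

The first prompt goes to the cheap model and the second to the frontier model. Rules like these are cheap to run but brittle, so it is worth logging routing decisions and checking how often the cheap model's answers were good enough.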
