ModelTerms

Learning path · 26 min · advanced

Inference engineering

Make models fast and cheap to serve. The metrics, the techniques, the trade-offs.

Training gets the headlines, but inference is the line item on your monthly bill. This path walks through the metrics (TTFT, TPOT), the techniques (prompt caching, continuous batching, speculative decoding, quantization), and the production architecture (LLM gateway, model router).

  1. Inference

    Why this step: The setup — what inference is and why it dominates the AI economy now.

    Inference is what happens when you actually run a trained model on new input. For LLMs that means generating tokens one at a time, with sampling and a KV cache.

    Read full entry → Inference · beginner
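The token-by-token loop described above can be sketched as follows. This is a minimal illustration, not a real model: `toy_logits` is a hypothetical stand-in for a forward pass, the vocabulary is eight integer token ids, and the KV cache is left out for brevity.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities; a lower temperature sharpens them."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def toy_logits(context, vocab_size):
    """Hypothetical stand-in for a model forward pass: favors (last token + 1)."""
    favored = (context[-1] + 1) % vocab_size
    return [4.0 if tok == favored else 0.0 for tok in range(vocab_size)]

def generate(prompt, max_new_tokens, vocab_size=8, temperature=1.0, seed=0):
    """Autoregressive decoding: sample one token, append it, repeat."""
    rng = random.Random(seed)
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        probs = softmax(toy_logits(tokens, vocab_size), temperature)
        next_token = rng.choices(range(vocab_size), weights=probs, k=1)[0]
        tokens.append(next_token)
    return tokens
```

Each new token depends on everything generated before it, which is why decoding is sequential and why caching earlier computation matters.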
  2. KV Cache (key-value cache)

    Why this step: The dominant memory cost. Knowing this is the prerequisite for everything else.

    The KV cache stores the key and value vectors of all earlier tokens during generation so the model does not recompute them at every step. It is the main memory cost of LLM inference.

    Read full entry → Architecture · advanced
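To see why the KV cache dominates memory, a back-of-the-envelope estimate helps. The function below is a sketch of the usual sizing arithmetic; the parameter names and the example configuration (32 layers, 32 KV heads, head dimension 128, 2 bytes per value) are illustrative assumptions, not figures from this path.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_value=2):
    """Estimate KV cache size in bytes.

    The leading 2 accounts for one key tensor plus one value tensor per layer.
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)

# Hypothetical configuration: 32 layers, 32 KV heads, head_dim 128, 16-bit values.
per_token = kv_cache_bytes(32, 32, 128, seq_len=1)        # 524288 bytes = 512 KiB
per_sequence = kv_cache_bytes(32, 32, 128, seq_len=4096)  # 2 GiB for one sequence
```

Because the size scales linearly with both sequence length and batch size, long contexts and large batches compete for the same GPU memory.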
  3. Time to First Token (TTFT)

    Why this step: The user-perceived latency metric for streaming chat.

    Time to first token (TTFT) is how long it takes from sending a request until the first response token arrives. It includes queueing and the prefill pass over the prompt, so it tends to grow with prompt length.

    Read full entry → Inference · intermediate
  4. Time per Output Token (TPOT)

    Why this step: The streaming-speed metric. Together with TTFT, this defines "feel".

    Time per output token (TPOT) is the average wall-clock delay between consecutive generated tokens during streaming. It determines how fast text appears once generation starts.

    Read full entry → Inference · intermediate
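Both latency metrics, TTFT and TPOT, can be derived from client-side timestamps. The helper below is a minimal sketch: `latency_metrics` is a hypothetical name, and it assumes you recorded the time the request was sent plus the arrival time of each token, all in seconds.

```python
def latency_metrics(request_sent_at, token_arrival_times):
    """Return (ttft, tpot) in seconds from recorded timestamps.

    ttft: delay from sending the request to the first token.
    tpot: mean gap between consecutive tokens after the first;
          None when fewer than two tokens arrived.
    """
    if not token_arrival_times:
        raise ValueError("no tokens were received")
    ttft = token_arrival_times[0] - request_sent_at
    if len(token_arrival_times) < 2:
        return ttft, None
    decode_time = token_arrival_times[-1] - token_arrival_times[0]
    tpot = decode_time / (len(token_arrival_times) - 1)
    return ttft, tpot

# Request sent at t=0.0, tokens arrive at 0.5, 0.75, 1.0 and 1.25 seconds.
ttft, tpot = latency_metrics(0.0, [0.5, 0.75, 1.0, 1.25])  # (0.5, 0.25)
```

Excluding the first token from TPOT keeps prefill time out of the streaming-speed number, so the two metrics can be tuned separately.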
  5. Streaming (LLM Responses) (SSE streaming)

    Why this step: Why TTFT and TPOT exist as separate concerns at all.

    Streaming returns tokens to the client as they're generated rather than holding the full response until completion. It is implemented over Server-Sent Events (SSE) or WebSocket, and it is what makes chat UIs feel fast.

    Read full entry → Inference · beginner
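On the client side, consuming an SSE stream mostly means splitting it into events and reading their `data:` fields. The parser below is a simplified sketch that handles only `data` fields and comment lines; `iter_sse_data` is a hypothetical helper, and real clients also need to handle `event`, `id`, and `retry` fields plus reconnection.

```python
def iter_sse_data(lines):
    """Yield the data payload of each event from an iterable of SSE lines."""
    buffer = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the current event.
            if buffer:
                yield "\n".join(buffer)
                buffer = []
            continue
        if line.startswith(":"):
            continue  # comment line, often used as a keep-alive
        field, _, value = line.partition(":")
        if value.startswith(" "):
            value = value[1:]  # a single leading space after the colon is dropped
        if field == "data":
            buffer.append(value)

# Illustrative stream: two token events with a keep-alive comment in between.
sample = ["data: Hel\n", "\n", ": ping\n", "data: lo\n", "\n"]
text = "".join(iter_sse_data(sample))
```

Because each event is yielded as soon as its blank-line terminator arrives, the caller can render partial text without waiting for the full response.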
  6. Prompt Caching (prefix caching)

    Why this step: A cost cut of up to 50-90% on long-prefix workloads.

    Prompt caching stores the KV-cache state of a long prefix (system prompt, large document, tool definitions) so that subsequent calls reusing it skip the prefill compute for that prefix. Depending on the provider and how much of the prompt is shared, this can cut TTFT and input cost by roughly 50-90%.

    Read full entry → Inference · intermediate
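The saving from prompt caching comes from skipping prefill on the shared prefix. The toy class below illustrates the bookkeeping only: `PrefixCache` is a hypothetical name, the stored "state" is a placeholder string rather than real key and value tensors, and production systems typically match prefixes at block granularity rather than by whole-prefix hash.

```python
import hashlib

class PrefixCache:
    """Toy prefix cache that counts how many tokens need prefill compute."""

    def __init__(self):
        self._store = {}
        self.prefill_tokens_computed = 0

    def _key(self, tokens):
        joined = " ".join(map(str, tokens))
        return hashlib.sha256(joined.encode("utf-8")).hexdigest()

    def prefill(self, prefix_tokens, suffix_tokens):
        """Process a request; return True if the prefix was served from cache."""
        key = self._key(prefix_tokens)
        hit = key in self._store
        if not hit:
            # Placeholder for the real KV tensors of the prefix.
            self._store[key] = f"kv-state-for-{len(prefix_tokens)}-tokens"
            self.prefill_tokens_computed += len(prefix_tokens)
        # The request-specific suffix always has to be computed.
        self.prefill_tokens_computed += len(suffix_tokens)
        return hit

cache = PrefixCache()
system_prompt = list(range(1000))                   # 1000-token shared prefix
first_hit = cache.prefill(system_prompt, [1, 2, 3])  # miss: 1003 tokens computed
second_hit = cache.prefill(system_prompt, [4, 5])    # hit: only 2 more tokens
```

In this sketch the second request computes 2 tokens instead of 1002, which is where the TTFT and cost reduction on long shared prefixes comes from.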
  7. Continuous Batching (in-flight batching)

    Why this step: The scheduling technique behind engines such as vLLM. It delivers multi-x throughput by refilling batch slots as soon as sequences finish.

    Continuous batching lets new requests join an in-flight batch on the next decode step rather than waiting for the current batch to finish, dramatically raising GPU utilization on variable-length workloads.

    Read full entry → Inference · advanced
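A small simulation makes the difference from static batching concrete. This is a sketch under simplifying assumptions: all requests are queued at time zero, every request produces at least one token, each decode step costs the same regardless of batch contents, and both function names are hypothetical.

```python
from collections import deque

def static_batching_steps(output_lengths, batch_size):
    """Each batch runs until its longest request finishes."""
    steps = 0
    for start in range(0, len(output_lengths), batch_size):
        batch = output_lengths[start:start + batch_size]
        steps += max(batch)
    return steps

def continuous_batching_steps(output_lengths, batch_size):
    """Free slots are refilled from the queue before every decode step."""
    waiting = deque(output_lengths)
    active = []  # remaining tokens for each in-flight request
    steps = 0
    while waiting or active:
        while waiting and len(active) < batch_size:
            active.append(waiting.popleft())
        steps += 1  # one decode step produces one token per active request
        active = [remaining - 1 for remaining in active if remaining > 1]
    return steps

lengths = [2, 10, 2, 2, 2, 2]  # one long request among short ones
static_steps = static_batching_steps(lengths, batch_size=2)          # 14
continuous_steps = continuous_batching_steps(lengths, batch_size=2)  # 10
```

With static batching the short request in the first batch finishes early and its slot sits idle for eight steps; continuous batching hands that slot to the next waiting request instead.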
  8. Speculative Decoding

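    Why this step: Often a 2-3x decode speedup with no quality loss, paid for with extra compute on the draft model. The actual gain depends on how often the draft's tokens are accepted.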

    Speculative decoding speeds up generation by having a small "draft" model propose several tokens, then verifying them in a single batched call to the big model.

    Read full entry → Inference · advanced
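The draft-then-verify loop can be sketched for the greedy case, where a drafted token is accepted only if it equals the token the large model would have picked. Everything here is illustrative: `target_next` and `draft_next` are toy stand-ins for models, and the inner verification loop stands in for what a real engine does in one forward pass of the large model. Sampling-based variants use a rejection-sampling rule instead of exact matching.

```python
def target_next(context):
    """Toy 'large model': always predicts (last token + 1) mod 10."""
    return (context[-1] + 1) % 10

def draft_next(context):
    """Toy 'draft model': agrees with the target except after token 4."""
    return 0 if context[-1] == 4 else (context[-1] + 1) % 10

def speculative_decode(target, draft, prompt, max_new_tokens, k=4):
    """Greedy speculative decoding; returns (tokens, number of target passes)."""
    tokens = list(prompt)
    goal = len(prompt) + max_new_tokens
    target_passes = 0
    while len(tokens) < goal:
        # 1. The draft model proposes k tokens one at a time (cheap).
        proposed = []
        for _ in range(k):
            proposed.append(draft(tokens + proposed))
        # 2. The target checks all k positions; counted as one verification pass.
        target_passes += 1
        accepted = []
        for i in range(k):
            expected = target(tokens + proposed[:i])
            if proposed[i] != expected:
                accepted.append(expected)  # replace the first mismatch and stop
                break
            accepted.append(proposed[i])
        else:
            accepted.append(target(tokens + proposed))  # bonus token: all k matched
        tokens.extend(accepted)
    return tokens[:goal], target_passes

output, passes = speculative_decode(target_next, draft_next, [1], max_new_tokens=12)
```

Every verification pass yields at least one token chosen by the target, so the output matches plain greedy decoding with the large model while needing fewer large-model passes whenever the draft guesses well.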
  9. Quantization

    Why this step: The other big inference lever — smaller weights, faster compute.

    Quantization reduces model weights from 16- or 32-bit floats to lower-precision types (INT8, INT4) so the model needs less memory and runs faster, usually with minor quality loss.

    Read full entry → Infrastructure · intermediate
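As a concrete picture of what "lower precision" means, here is a minimal sketch of symmetric per-tensor INT8 quantization using a single absmax scale. It is one simple scheme among many (production methods usually quantize per channel or per group and may calibrate on data), and the function names are hypothetical.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale

def dequantize(quantized, scale):
    """Recover approximate float weights from the integers."""
    return [q * scale for q in quantized]

weights = [0.5, -1.0, 0.25, 0.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
# Rounding error per weight is bounded by half a quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

Each weight now needs one byte instead of two or four, and the rounding error is the source of the minor quality loss mentioned above.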
  10. vLLM

    Why this step: The reference open-source serving engine that combines most of the above.

    vLLM is an open-source high-throughput LLM serving engine. Its PagedAttention KV cache manager, combined with continuous batching, is a main reason it dramatically outperforms naive serving setups.

    Read full entry → Infrastructure · advanced
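The idea behind PagedAttention is to store the KV cache in fixed-size blocks and give each sequence a block table, much like virtual-memory paging. The allocator below is a conceptual toy only: it is not vLLM's code or API, `PagedKVAllocator` is a hypothetical name, and it tracks block ids rather than real tensors.

```python
class PagedKVAllocator:
    """Toy block allocator: sequences grow one fixed-size block at a time."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # seq_id -> list of physical block ids
        self.lengths = {}       # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        """Reserve room for one more token, allocating a block only when needed."""
        length = self.lengths.get(seq_id, 0)
        if length % self.block_size == 0:
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            self.block_tables.setdefault(seq_id, []).append(self.free_blocks.pop())
        self.lengths[seq_id] = length + 1

    def free(self, seq_id):
        """Return all blocks of a finished sequence to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

allocator = PagedKVAllocator(num_blocks=4, block_size=16)
for _ in range(17):
    allocator.append_token("request-a")  # 17 tokens occupy 2 blocks
blocks_in_use = len(allocator.block_tables["request-a"])
```

Allocating on demand means a sequence never reserves memory for tokens it has not generated, and freed blocks can be reused immediately by other requests.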
  11. LLM Gateway (model gateway)

    Why this step: The proxy layer that handles cross-cutting concerns in production.

    An LLM gateway is a proxy layer that sits between application code and one or more LLM providers — handling auth, rate-limit retries, cost tracking, observability, prompt caching, model routing, and PII redaction.

    Read full entry → Infrastructure · intermediate
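Two of the gateway's responsibilities, rate-limit retries and cost tracking, fit in a short sketch. All names here are hypothetical: `Gateway`, `RateLimitError`, and the `provider` callable (assumed to return a `(text, tokens_used)` pair) do not correspond to any specific product or SDK, and the flat per-token price is a simplification.

```python
import time

class RateLimitError(Exception):
    """Hypothetical error a provider client raises when it is rate limited."""

class Gateway:
    """Toy gateway: retries with exponential backoff and tracks spend."""

    def __init__(self, provider, price_per_1k_tokens, max_retries=3,
                 base_delay=0.5, sleep=time.sleep):
        self.provider = provider
        self.price_per_1k_tokens = price_per_1k_tokens
        self.max_retries = max_retries
        self.base_delay = base_delay
        self.sleep = sleep  # injectable so tests do not have to wait
        self.total_tokens = 0
        self.total_cost = 0.0

    def complete(self, prompt):
        for attempt in range(self.max_retries + 1):
            try:
                text, tokens_used = self.provider(prompt)
            except RateLimitError:
                if attempt == self.max_retries:
                    raise  # out of retries, surface the error to the caller
                self.sleep(self.base_delay * 2 ** attempt)
                continue
            self.total_tokens += tokens_used
            self.total_cost += tokens_used / 1000 * self.price_per_1k_tokens
            return text
```

Keeping this logic in one proxy layer means application code calls `complete` and never reimplements retries or accounting per provider.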
  12. Model Router (LLM router)

    Why this step: How you avoid paying frontier prices for trivial queries.

    A model router picks the cheapest model that's likely to handle a given request well — based on a small classifier, embedding similarity, or rule-based filters — so you don't pay frontier prices for trivial queries.

    Read full entry → Infrastructure · intermediate
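The simplest of the routing strategies mentioned above, rule-based filters, can be sketched in a few lines. The model names, the keyword list, and the word-count threshold are all made-up placeholders; a real router would tune them against quality measurements or replace them with a classifier.

```python
HARD_MARKERS = ("prove", "refactor", "step by step", "analyze")

def route(prompt, cheap_model="small-model", frontier_model="frontier-model",
          max_cheap_words=50):
    """Send short, simple-looking prompts to the cheap model, the rest upward."""
    text = prompt.lower()
    too_long = len(text.split()) > max_cheap_words
    looks_hard = any(marker in text for marker in HARD_MARKERS)
    return frontier_model if too_long or looks_hard else cheap_model

easy = route("What is the capital of France?")
hard = route("Prove that the square root of 2 is irrational.")
```

Here `easy` resolves to the cheap model and `hard` to the frontier model; the cost saving depends on what share of traffic the rules can safely keep on the cheaper tier.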
