ModelTerms

Inference · advanced

Speculative Decoding

Speculative decoding speeds up generation by having a small "draft" model propose several tokens ahead, then having the big model verify all of them in a single batched forward pass.

Explanation

If the draft model's proposals match what the big model would have picked anyway, each big-model call yields several tokens instead of one. When a proposal is rejected, the big model's own prediction at that position is used instead, so every verification pass still produces at least one token. The worst case is therefore close to regular decoding, plus the cost of running the draft model.
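The draft-then-verify loop described above can be sketched as follows. This is a minimal illustration assuming greedy decoding and toy next-token functions; `speculative_step`, `generate`, `greedy`, `draft_next` and `target_next` are hypothetical names invented for this sketch, not part of any library. A real system would score all positions in one batched forward pass rather than in a Python loop.

```python
from typing import Callable, List, Tuple

# Hypothetical stand-in for a model: maps a token prefix to the single
# next token it would pick under greedy decoding.
NextToken = Callable[[List[int]], int]


def speculative_step(prefix: List[int], draft: NextToken,
                     target: NextToken, k: int) -> List[int]:
    """Run one draft-then-verify round and return the newly accepted tokens."""
    # 1. The draft model proposes k tokens autoregressively.
    proposed: List[int] = []
    for _ in range(k):
        proposed.append(draft(prefix + proposed))

    # 2. The target model gives its choice at each of the k + 1 positions.
    #    This loop only simulates what would be one batched pass.
    target_choice = [target(prefix + proposed[:i]) for i in range(k + 1)]

    # 3. Keep proposals up to the first disagreement, then substitute
    #    the target's token at that position and stop.
    accepted: List[int] = []
    for i in range(k):
        if proposed[i] != target_choice[i]:
            accepted.append(target_choice[i])
            return accepted
        accepted.append(proposed[i])
    # Every proposal matched, so the target's extra prediction comes for free.
    accepted.append(target_choice[k])
    return accepted


def generate(prefix: List[int], draft: NextToken, target: NextToken,
             k: int, n_tokens: int) -> Tuple[List[int], int]:
    """Generate n_tokens and count how many verification passes were needed."""
    out = list(prefix)
    passes = 0
    while len(out) - len(prefix) < n_tokens:
        out.extend(speculative_step(out, draft, target, k))
        passes += 1
    return out[:len(prefix) + n_tokens], passes


def greedy(prefix: List[int], target: NextToken, n_tokens: int) -> List[int]:
    """Baseline: one target call per generated token."""
    out = list(prefix)
    for _ in range(n_tokens):
        out.append(target(out))
    return out


# Toy models: the target counts upward modulo 10; the draft agrees
# except that it guesses wrong whenever the last token is 2.
def target_next(prefix: List[int]) -> int:
    return (prefix[-1] + 1) % 10


def draft_next(prefix: List[int]) -> int:
    return 9 if prefix[-1] == 2 else (prefix[-1] + 1) % 10


if __name__ == "__main__":
    tokens, passes = generate([0], draft_next, target_next, k=4, n_tokens=12)
    print(tokens, passes)
```

Under greedy decoding the result is identical to calling the target model once per token; only the number of target passes changes. Each round returns at least one token even when the first proposal is wrong, which is why the worst case stays close to ordinary decoding.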

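In practice, speedups of around 2-3x are typical, though the actual gain depends on how often the draft model's proposals are accepted. The output distribution does not change, because rejected proposals are caught and replaced with tokens drawn according to the big model. Most production inference stacks (vLLM, TensorRT-LLM) support speculative decoding out of the box.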
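To illustrate why sampling can leave the output distribution unchanged, the sketch below uses the acceptance rule commonly described for speculative sampling: a draft token x drawn from the draft distribution q is accepted with probability min(1, p[x] / q[x]), where p is the big model's distribution, and on rejection a replacement is drawn from the normalised positive part of p - q. The function names and the example distributions are assumptions made for this sketch, not taken from any particular library.

```python
import random
from typing import List


def residual(p: List[float], q: List[float]) -> List[float]:
    """Normalised max(0, p - q): the distribution used after a rejection."""
    diff = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    total = sum(diff)
    # When p == q nothing is ever rejected, so this fallback is never sampled.
    return [d / total for d in diff] if total > 0 else list(p)


def accept_or_resample(x: int, p: List[float], q: List[float],
                       rng: random.Random) -> int:
    """Return the token to emit, given draft token x sampled from q."""
    if rng.random() < min(1.0, p[x] / q[x]):
        return x  # proposal accepted
    # Proposal rejected: draw a corrected token from the residual.
    return rng.choices(range(len(p)), weights=residual(p, q))[0]


def output_distribution(p: List[float], q: List[float]) -> List[float]:
    """Exact distribution of the emitted token, computed analytically."""
    # Probability of proposing x and accepting it: q[x] * min(1, p[x] / q[x]).
    accept = [min(pi, qi) for pi, qi in zip(p, q)]
    reject_mass = 1.0 - sum(accept)
    res = residual(p, q)
    return [a + reject_mass * r for a, r in zip(accept, res)]


if __name__ == "__main__":
    p = [0.5, 0.3, 0.2]  # big model (illustrative numbers)
    q = [0.2, 0.5, 0.3]  # draft model (illustrative numbers)
    print(output_distribution(p, q))
```

The analytic check in `output_distribution` returns the big model's distribution p whatever q is, provided q assigns nonzero probability to every token it can propose. A poor draft model lowers the acceptance rate and therefore the speedup, but it does not alter what is generated.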

Examples

  • Llama 3 70B accelerated with Llama 3 8B as the draft model.
  • Medusa: adds extra decoding heads to the big model itself, so no separate draft model is needed.

Frequently asked

How is Speculative Decoding related to Inference?

Speculative decoding is an inference-time optimization. Inference is what happens when you actually run a trained model on new input. For LLMs that means generating tokens one at a time, with sampling and a KV cache. Speculative decoding reduces the number of sequential big-model passes that this token-by-token loop requires.

Is Speculative Decoding considered advanced?

Yes. Speculative decoding is generally considered advanced-level material in the AI and LLM space.
