Inference · advanced
Speculative Decoding
Speculative decoding speeds up generation by having a small "draft" model propose several tokens, then verifying them in a single batched call to the big model.
Explanation
If the draft model proposes tokens that match what the big model would have picked anyway, you get several tokens per big-model call. Even when every proposal is rejected, each verification pass still yields one token from the big model, so the worst case is close to regular decoding plus the cost of running the draft model.
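The draft-then-verify loop can be sketched for the greedy case as follows. This is a minimal illustration, not any library's implementation: the "models" are hypothetical toy functions that map a token sequence to the next token, and names such as `speculative_greedy`, `toy_target`, and `toy_draft` are made up for this example. A real stack would score all drafted positions in one batched forward pass; the toy only counts that pass.

```python
from typing import Callable, List, Tuple

# Hypothetical stand-in for a model: token sequence -> greedy next token.
NextToken = Callable[[List[int]], int]


def speculative_greedy(target: NextToken, draft: NextToken, prompt: List[int],
                       max_new: int, k: int = 4) -> Tuple[List[int], int]:
    """Greedy speculative decoding. Returns (tokens, number of target passes)."""
    seq = list(prompt)
    target_passes = 0
    while len(seq) - len(prompt) < max_new:
        # 1. The draft model proposes k tokens one at a time (cheap).
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(seq + proposal))
        # 2. The target scores every drafted prefix. Counted as ONE pass,
        #    because a real implementation batches these positions together.
        target_passes += 1
        verified = [target(seq + proposal[:i]) for i in range(k + 1)]
        # 3. Keep the longest prefix on which draft and target agree.
        n_ok = 0
        while n_ok < k and proposal[n_ok] == verified[n_ok]:
            n_ok += 1
        # 4. Append the accepted tokens plus the target's own token at the
        #    first mismatch (or its extra token if everything matched).
        seq.extend(proposal[:n_ok] + [verified[n_ok]])
    return seq[:len(prompt) + max_new], target_passes


def plain_greedy(target: NextToken, prompt: List[int], max_new: int) -> List[int]:
    """Baseline: one target pass per generated token."""
    seq = list(prompt)
    for _ in range(max_new):
        seq.append(target(seq))
    return seq


def toy_target(seq: List[int]) -> int:
    return (seq[-1] * 3 + 1) % 7


def toy_draft(seq: List[int]) -> int:
    # Agrees with the target except when the last token is 5.
    return 0 if seq[-1] == 5 else toy_target(seq)


tokens, passes = speculative_greedy(toy_target, toy_draft, [1], max_new=12)
print(tokens == plain_greedy(toy_target, [1], 12), passes)
```

Every pass appends at least one token chosen by the target, so a draft that is always wrong degrades to one token per target pass, while a draft that mostly agrees needs far fewer passes for the same output.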
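In practice, reported speedups are typically around 2-3x, depending on how often the big model accepts the draft's tokens and how cheap the draft model is to run. The output distribution does not change: with greedy decoding a drafted token is kept only if it matches the big model's choice, and with sampling a rejection-sampling rule accepts or replaces each drafted token so that the result is distributed exactly as if the big model had sampled alone. Most production inference stacks (vLLM, TensorRT-LLM) support speculative decoding out of the box.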
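The claim that the output distribution is unchanged under sampling rests on a rejection-sampling rule: accept a drafted token x with probability min(1, p(x)/q(x)), where p is the big model's distribution and q is the draft's, and on rejection resample from max(0, p - q) renormalized. The sketch below shows that rule for a single position, with small hand-written probability lists standing in for model outputs; the function names are hypothetical.

```python
import random
from typing import List


def accept_or_resample(p: List[float], q: List[float], drafted: int,
                       rng: random.Random) -> int:
    """Verify one drafted token. p = target probs, q = draft probs.

    Assumes `drafted` was sampled from q, so q[drafted] > 0.
    """
    # Accept the drafted token with probability min(1, p/q).
    if rng.random() < min(1.0, p[drafted] / q[drafted]):
        return drafted
    # On rejection, resample from the residual max(0, p - q); random.choices
    # normalizes the weights itself.
    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    return rng.choices(range(len(p)), weights=residual)[0]


def output_distribution(p: List[float], q: List[float]) -> List[float]:
    """Exact distribution of the returned token when the draft samples from q."""
    # P(draft picks i and it is accepted) = q_i * min(1, p_i / q_i) = min(p_i, q_i)
    accept = [min(pi, qi) for pi, qi in zip(p, q)]
    reject_mass = 1.0 - sum(accept)
    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    z = sum(residual)
    return [a + (reject_mass * r / z if z > 0 else 0.0)
            for a, r in zip(accept, residual)]


p = [0.5, 0.3, 0.2]  # illustrative target distribution
q = [0.2, 0.6, 0.2]  # illustrative draft distribution
print(output_distribution(p, q))  # equals p up to floating-point rounding
```

In a full decoder this check would run position by position over the drafted tokens and stop at the first rejection, since later drafted tokens were conditioned on the rejected one.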
Examples
- Llama 3 70B accelerated by Llama 3 8B as draft.
- Medusa: adds extra decoding heads to the big model that predict several future tokens, so no separate draft model is needed.
Frequently asked
How is Speculative Decoding related to Inference?
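Speculative decoding is an inference-time optimization. Inference is what happens when you run a trained model on new input; for LLMs that means generating tokens one at a time, with sampling and a KV cache. Speculative decoding targets that token-by-token loop: it changes how many tokens each big-model pass yields, not the model itself.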
Is Speculative Decoding considered advanced?
Speculative Decoding is generally considered advanced-level material in the AI and LLM space.