Inference · advanced

Speculative Decoding

Speculative decoding speeds up generation by having a small "draft" model propose several tokens, then verifying them in a single batched call to the big model.

Published May 29, 2026

Explanation

If the draft model's proposals match what the big model would have picked anyway, each big-model call yields several tokens instead of one. Even when every proposal is rejected, the verification pass still produces one token from the big model, so the worst case costs about the same as regular decoding plus the overhead of running the draft model.
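The draft-and-verify round described above can be sketched for the simplest case, greedy decoding. This is a minimal illustration, not any library's implementation: the function name `speculative_step`, the `NextToken` type, and the two toy models are all hypothetical, and the "batched" verification is simulated with a plain loop.

```python
from typing import Callable, List

# Hypothetical stand-in for a model: maps a token prefix to the single
# next token it would pick under greedy decoding.
NextToken = Callable[[List[int]], int]


def speculative_step(prefix: List[int], draft: NextToken,
                     target: NextToken, k: int) -> List[int]:
    """Return the tokens produced by one draft-and-verify round (greedy case)."""
    # 1. The draft model proposes k tokens one after another.
    proposed: List[int] = []
    for _ in range(k):
        proposed.append(draft(prefix + proposed))

    # 2. The big model scores all k + 1 positions. A real system does this in
    #    one batched forward pass; this loop only simulates that.
    target_choice = [target(prefix + proposed[:i]) for i in range(k + 1)]

    # 3. Keep proposals up to the first mismatch, then append the big model's
    #    own token at that position (a correction, or a bonus token if all
    #    k proposals were accepted).
    accepted: List[int] = []
    for i in range(k):
        if proposed[i] != target_choice[i]:
            break
        accepted.append(proposed[i])
    accepted.append(target_choice[len(accepted)])
    return accepted


# Toy models (assumptions for illustration): the big model counts upward
# mod 10; the draft agrees except right after token 4.
def toy_target(tokens: List[int]) -> int:
    return (tokens[-1] + 1) % 10


def toy_draft(tokens: List[int]) -> int:
    return 0 if tokens[-1] == 4 else (tokens[-1] + 1) % 10
```

Each round returns between 1 and k + 1 tokens for a single big-model pass. With these toy models, a round starting from the prefix [5] with k = 3 accepts all three proposals and gains a bonus token, while a round starting from [4] rejects the first proposal and falls back to the one token the big model would have produced anyway.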

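In practice, reported speedups are typically around 2-3x, depending on how often the draft model's proposals are accepted. The output distribution does not change: verification accepts or rejects each proposal in a way that makes the final tokens follow the big model's distribution, so rejected proposals are caught and corrected. Popular production inference stacks such as vLLM and TensorRT-LLM support speculative decoding out of the box.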
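When tokens are sampled rather than picked greedily, the usual way to keep the output distribution unchanged is a rejection-sampling rule: accept a drafted token with probability min(1, p/q), and on rejection resample from the leftover probability mass. The sketch below shows that rule for a single position. The function name `verify_token` and its parameters are illustrative assumptions, with `p_target` and `q_draft` standing for the big model's and the draft model's next-token probabilities.

```python
import random
from typing import List


def verify_token(token: int, p_target: List[float], q_draft: List[float],
                 rng: random.Random) -> int:
    """Accept the drafted token or resample, so the result follows p_target.

    Assumes `token` was sampled from q_draft, so q_draft[token] > 0.
    """
    # Accept with probability min(1, p / q).
    if rng.random() < min(1.0, p_target[token] / q_draft[token]):
        return token
    # On rejection, resample from the residual max(0, p - q). Passing the
    # unnormalised residual as weights is enough; choices() normalises it.
    residual = [max(0.0, p - q) for p, q in zip(p_target, q_draft)]
    return rng.choices(range(len(residual)), weights=residual, k=1)[0]
```

For example, with p_target = [0.6, 0.3, 0.1] and q_draft = [0.2, 0.5, 0.3], token 0 is always accepted when drafted (probability 0.2), and every rejection resamples to token 0 (total rejection probability 0.4), so token 0 comes out with probability 0.6, exactly as under the big model.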

Examples

Llama 3 70B accelerated by Llama 3 8B as draft.
Medusa: adds extra decoding heads to the big model itself, so no separate draft model is needed.

Frequently asked

How is Speculative Decoding related to Inference?

Speculative decoding is an inference-time optimization. Inference is what happens when you actually run a trained model on new input. For LLMs that means generating tokens one at a time, with sampling and a KV cache; speculative decoding reduces the number of big-model calls needed to produce those tokens.

Is Speculative Decoding considered advanced?

Speculative Decoding is generally considered advanced-level material in the AI and LLM space.

Inference · Inference

Inference is what happens when you actually run a trained model on new input. For LLMs that means generating tokens one at a time, with sampling and a KV cache.

vLLM · Infrastructure

vLLM is an open-source high-throughput LLM serving engine. Its PagedAttention KV cache manager is the reason it dramatically outperforms naive serving setups.

KV Cache · Architecture

The KV cache stores the key and value vectors of all earlier tokens during generation so the model does not recompute them at every step. It is the main memory cost of LLM inference.

Quantization · Infrastructure

Quantization reduces model weights from 16- or 32-bit floats to lower-precision types (INT8, INT4) so the model needs less memory and runs faster, usually with minor quality loss.

Sources

Speculative Decoding (arXiv)