Prompting · intermediate

Test-Time Compute (inference-time compute)

Test-time compute is the extra reasoning, sampling, or search a model can do at inference time to improve quality — more thinking tokens, more candidate answers, or verifier-guided search.

Published May 30, 2026

Explanation

Most pre-2024 LLMs started answering right away, so the compute spent on a query was fixed by the model size and the length of the answer. Reasoning models (o1, R1) instead generate many "thinking" tokens before the user-visible answer. Beyond long chain-of-thought (CoT), test-time compute can also mean best-of-N sampling, tree search over candidate steps, verifier-guided reranking, and self-consistency voting.
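As a rough illustration of self-consistency voting, the sketch below samples several answers and keeps the most common one. The `sample_answer` callable and the `make_noisy_solver` helper are hypothetical stand-ins for a real sampled model call, not part of any library.

```python
import random
from collections import Counter
from typing import Callable, List, Tuple


def self_consistency(sample_answer: Callable[[], str], n: int) -> Tuple[str, float]:
    """Draw n independent answers and return the most common one with its vote share."""
    if n < 1:
        raise ValueError("n must be at least 1")
    answers: List[str] = [sample_answer() for _ in range(n)]
    # most_common(1) returns [(answer, count)]; ties go to the answer seen first.
    winner, votes = Counter(answers).most_common(1)[0]
    return winner, votes / n


def make_noisy_solver(correct: str, wrong: List[str],
                      p_correct: float, seed: int) -> Callable[[], str]:
    """Hypothetical stand-in for a sampled model call: right with probability p_correct."""
    rng = random.Random(seed)

    def sample() -> str:
        return correct if rng.random() < p_correct else rng.choice(wrong)

    return sample


if __name__ == "__main__":
    solver = make_noisy_solver("42", ["41", "43", "44"], p_correct=0.6, seed=0)
    answer, share = self_consistency(solver, n=25)
    print(f"majority answer: {answer} (vote share {share:.2f})")
```

Voting only helps when answers can be compared for equality (a number, a label, a normalized string). Free-form outputs usually need a scorer instead.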

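The empirical finding driving this: on problems where a model already has some chance of success, spending extra compute at inference can improve results more than spending a comparable amount of compute on a larger model. On the hardest problems the benefit shrinks, and more training tends to help more. This makes test-time compute a third scaling axis alongside parameters and data.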

Practically, more test-time compute means longer latency and higher per-call cost, so it is reserved for harder tasks.
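The trade-off can be made concrete with back-of-the-envelope arithmetic. The sketch below is illustrative only: the `CallProfile` fields, the prices, and the decode speed are made-up assumptions, and it assumes thinking tokens are billed like output tokens and that no prompt caching applies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CallProfile:
    prompt_tokens: int
    thinking_tokens: int  # reasoning tokens, billed as output in this sketch
    answer_tokens: int
    samples: int = 1      # N for best-of-N or voting


def estimate_cost(p: CallProfile, usd_per_1k_input: float,
                  usd_per_1k_output: float) -> float:
    """Total cost if every sample re-sends the prompt and generates its own output."""
    input_tokens = p.prompt_tokens * p.samples
    output_tokens = (p.thinking_tokens + p.answer_tokens) * p.samples
    return (input_tokens / 1000 * usd_per_1k_input
            + output_tokens / 1000 * usd_per_1k_output)


def estimate_latency_s(p: CallProfile, tokens_per_second: float,
                       parallel: bool = True) -> float:
    """Decode time only: samples overlap when parallel, add up when sequential."""
    per_sample = (p.thinking_tokens + p.answer_tokens) / tokens_per_second
    return per_sample if parallel else per_sample * p.samples


if __name__ == "__main__":
    # Hypothetical prices and speed, chosen only to make the ratio visible.
    direct = CallProfile(prompt_tokens=500, thinking_tokens=0, answer_tokens=300)
    heavy = CallProfile(prompt_tokens=500, thinking_tokens=4000,
                        answer_tokens=300, samples=8)
    for name, profile in (("direct", direct), ("heavy", heavy)):
        cost = estimate_cost(profile, usd_per_1k_input=1.0, usd_per_1k_output=2.0)
        wait = estimate_latency_s(profile, tokens_per_second=100.0)
        print(f"{name}: cost={cost:.2f} latency={wait:.1f}s")
```

Parallel sampling raises cost but not wall-clock time, while longer thinking raises both. That difference is one reason the two techniques are often budgeted separately.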

Examples

o1 thinking for 30 seconds before answering a math olympiad problem.
Best-of-32 sampling on a coding task, then choosing the highest-scored answer.
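A minimal sketch of the best-of-N example, assuming the scorer is a small set of unit tests. The three `attempt_*` functions are hand-written stand-ins for sampled model outputs, and `score` and `best_of_n` are illustrative names. A real system might score with a reward model or hidden tests instead.

```python
from typing import Callable, List, Sequence, Tuple

Candidate = Callable[[int], int]
TestCase = Tuple[int, int]  # (input, expected output)


def score(candidate: Candidate, tests: Sequence[TestCase]) -> float:
    """Fraction of test cases the candidate passes; exceptions count as failures."""
    passed = 0
    for arg, expected in tests:
        try:
            if candidate(arg) == expected:
                passed += 1
        except Exception:
            pass
    return passed / len(tests)


def best_of_n(candidates: Sequence[Candidate],
              tests: Sequence[TestCase]) -> Tuple[int, float]:
    """Return (index, score) of the highest-scoring candidate; ties go to the earliest."""
    if not candidates:
        raise ValueError("need at least one candidate")
    scores = [score(c, tests) for c in candidates]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]


# Stand-ins for N sampled solutions to "return n factorial".
def attempt_a(n: int) -> int:  # off by one: skips the final factor
    out = 1
    for k in range(1, n):
        out *= k
    return out


def attempt_b(n: int) -> int:  # wrong base case: returns 0 for n == 0
    return n * attempt_b(n - 1) if n > 1 else n


def attempt_c(n: int) -> int:  # correct
    out = 1
    for k in range(2, n + 1):
        out *= k
    return out


TESTS: List[TestCase] = [(0, 1), (1, 1), (3, 6), (5, 120)]

if __name__ == "__main__":
    index, best_score = best_of_n([attempt_a, attempt_b, attempt_c], TESTS)
    print(f"picked candidate {index} with score {best_score:.2f}")
```

The quality of the pick is bounded by the scorer: if the tests miss an edge case, a wrong candidate can tie with a correct one and win on position alone.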

When to use test-time compute

Whenever quality matters more than latency — math, code, research, structured planning.

Frequently asked

How is Test-Time Compute related to Reasoning Model?

A reasoning model is one way of spending test-time compute: it uses the extra compute to think step by step before answering. OpenAI o1/o3, DeepSeek R1, and Anthropic's Claude models with extended thinking are examples.

Is Test-Time Compute considered intermediate?

Test-Time Compute is generally considered intermediate-level material in the AI and LLM space.

Reasoning Model · Architecture

A reasoning model spends extra compute thinking step by step before answering. OpenAI o1/o3, DeepSeek R1, and Anthropic's Claude models with extended thinking are reasoning models.

Chain-of-Thought · Prompting

Chain-of-thought prompting asks the model to show its reasoning step by step before giving a final answer. It often substantially improves accuracy on multi-step problems such as arithmetic and logic puzzles.

Inference · Inference

Inference is what happens when you actually run a trained model on new input. For LLMs that means generating tokens one at a time, with sampling and a KV cache.

Sampling · Inference

Sampling is the act of choosing the next token from the model's output distribution, typically after applying temperature and a truncation strategy like top-p or top-k.

Scaling Laws · Training

Scaling laws are the empirical power-law relationship between model size, training data, training compute, and resulting loss. They predict that larger models trained on more data keep improving in a smooth, forecastable way.

Sources

Scaling LLM Test-Time Compute Optimally (arXiv)