ModelTerms

Prompting · intermediate

Self-Consistency

Self-consistency samples N chain-of-thought completions for the same problem and takes the majority answer. It improves accuracy on math and reasoning tasks at roughly N× the inference cost.

Explanation

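Under sampling (temperature above 0), chain-of-thought produces a different reasoning path on each run. Self-consistency exploits this by sampling many paths and voting on the final answer only, ignoring the intermediate reasoning. Wrong paths tend to scatter across different wrong answers, while correct paths converge on the same one, so the correct answer usually collects the most votes.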
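The sample-then-vote loop described above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: `sample_fn` is a hypothetical stand-in for whatever call returns one sampled completion from your model, and `extract_answer` assumes the final answer is the last number in the completion.

```python
import re
from collections import Counter
from typing import Callable, Optional


def extract_answer(completion: str) -> Optional[str]:
    """Return the last number in the completion (assumed answer format)."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion)
    return numbers[-1] if numbers else None


def self_consistency(sample_fn: Callable[[str], str], prompt: str, n: int = 5) -> Optional[str]:
    """Sample n completions and return the most common extracted answer."""
    answers = [extract_answer(sample_fn(prompt)) for _ in range(n)]
    # Completions with no parseable answer do not get a vote.
    votes = Counter(a for a in answers if a is not None)
    return votes.most_common(1)[0][0] if votes else None


# Hypothetical canned completions standing in for five sampled model outputs.
canned = iter([
    "3 + 4 = 7, so the answer is 7",
    "3 * 4 = 12, so the answer is 12",
    "4 + 3 = 7. Answer: 7",
    "I think it is 8",
    "Adding gives 7",
])
result = self_consistency(lambda prompt: next(canned), "What is 3 + 4?", n=5)
print(result)  # 7
```

Only the final answers are compared, so two completions that reason differently but land on the same number count as the same vote.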

The result: substantial accuracy gains on math and reasoning benchmarks (e.g. roughly 10 points or more on GSM8K, depending on the model) at the cost of N× inference compute. N is typically 5 to 40, with diminishing returns as N grows.
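A rough way to see why voting helps, and why returns diminish, is a toy calculation. The sketch below makes two strong simplifying assumptions that do not hold exactly in practice: each sample is correct independently with probability `p`, and there are only two possible answers, so the vote is won whenever more than half of the samples are correct.

```python
from math import comb


def majority_accuracy(p: float, n: int) -> float:
    """Probability that more than half of n independent samples are correct.

    Toy model: two possible answers, independent samples, n odd.
    """
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(n // 2 + 1, n + 1)
    )


for n in (1, 5, 15, 41):
    print(n, round(majority_accuracy(0.6, n), 3))
# n=5 gives about 0.683, up from 0.6 for a single sample
```

Real samples from one model are correlated, and most tasks have more than two possible answers, so measured gains will differ from this idealized curve.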

In 2024-2026, frontier reasoning models (o1, R1) are trained to explore and revise several lines of reasoning inside one long chain of thought before committing to an answer, which captures much of the same benefit. Explicit self-consistency at the application level is now most relevant for non-reasoning models.

Examples

  • A GSM8K eval: sample 32 CoT completions per problem, take the majority numeric answer.
  • A SQL-generation app: sample 5 candidate queries, run each against the database, and pick one that returns the expected row count, preferring the result that the most candidates agree on.
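The SQL example can be sketched with the standard-library `sqlite3` module. Everything here is illustrative: the `orders` table, the candidate queries (standing in for sampled model outputs), and the `pick_query` helper are assumptions made for the sketch. Candidates that fail to execute or return the wrong row count are dropped, and the remaining ones vote by result set.

```python
import sqlite3
from collections import Counter


def pick_query(conn, candidates, expected_rows):
    """Return the first candidate whose result set wins the vote, or None."""
    scored = []
    for sql in candidates:
        try:
            rows = conn.execute(sql).fetchall()
        except sqlite3.Error:
            continue  # invalid SQL gets no vote
        if len(rows) == expected_rows:
            # Sort rows so that row order does not split the vote.
            scored.append((sql, tuple(sorted(rows))))
    if not scored:
        return None
    winner, _ = Counter(result for _, result in scored).most_common(1)[0]
    return next(sql for sql, result in scored if result == winner)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, status TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, "open"), (2, "open"), (3, "closed")],
)

# Hypothetical sampled candidates for "list the ids of open orders".
candidates = [
    "SELECT id FROM orders WHERE status = 'open'",
    "SELECT id FROM orders WHERE status = 'closed'",   # wrong row count
    "SELECT id FROM ordrs WHERE status = 'open'",      # no such table
    "SELECT id FROM orders WHERE status != 'closed'",  # agrees with the first
    "SELECT id FROM orders WHERE id > 1",              # right count, other rows
]
print(pick_query(conn, candidates, expected_rows=2))
```

Voting on execution results rather than on query text lets differently written but equivalent queries count as the same answer.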

When to use self-consistency

Use it when the task has a single final answer that can be compared across samples or checked automatically (math, logic, code that compiles) and N× compute is acceptable.

Frequently asked

How is Self-Consistency related to Chain-of-Thought?

Self-consistency builds on chain-of-thought. Chain-of-thought prompting asks the model to show its reasoning step by step before giving a final answer; self-consistency samples several such reasoning chains and takes a majority vote over their final answers.

Chain-of-Thought · Prompting

Chain-of-thought prompting asks the model to show its reasoning step by step before giving a final answer. It dramatically improves performance on multi-step problems.

Test-Time Compute · Prompting

Test-time compute is the extra reasoning, sampling, or search a model can do at inference time to improve quality — more thinking tokens, more candidate answers, or verifier-guided search.

Sampling · Inference

Sampling is the act of choosing the next token from the model's output distribution, typically after applying temperature and a truncation strategy like top-p or top-k.

Reasoning Model · Architecture

A reasoning model spends extra compute thinking step-by-step before answering. OpenAI o1/o3, DeepSeek R1, and Anthropic's extended thinking are reasoning models.

Temperature · Inference

Temperature is a generation parameter that controls randomness. At 0, decoding is greedy (always pick the most likely token) and effectively deterministic; higher values produce more diverse, surprising output.
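To make the effect of temperature concrete, here is a small sketch of temperature-scaled sampling over raw logits. It is a generic illustration rather than any particular library's implementation; `sample_token` and the example logits are hypothetical.

```python
import math
import random


def sample_token(logits, temperature, rng=random):
    """Pick a token index from logits after temperature scaling."""
    if temperature == 0:
        # Greedy decoding: always the most likely token.
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


logits = [1.0, 3.0, 2.0]
print(sample_token(logits, temperature=0))  # 1
print(sample_token(logits, temperature=1.0, rng=random.Random(0)))
```

Self-consistency depends on a temperature above 0: with greedy decoding every sample would follow the same path and the vote would be meaningless.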
