Inference · beginner

Temperature

Temperature is a generation parameter that controls how random token selection is. At 0, decoding is greedy (the most likely token is always picked), which makes output effectively deterministic; higher values produce more diverse, surprising output.

Published May 29, 2026

Explanation

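Mathematically, the logits are divided by the temperature before the softmax is applied. A temperature of 1 uses the model's native distribution; values below 1 sharpen the distribution (more conservative); values above 1 flatten it (more creative). Because dividing by 0 is undefined, a temperature of 0 is handled as a special case: greedy decoding.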
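
To make the scaling concrete, here is a minimal sketch in plain Python. The function name `softmax_with_temperature` and the three example logits are made up for illustration; real inference stacks do the same arithmetic on tensors over the whole vocabulary.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Divide logits by the temperature, then apply a softmax."""
    if temperature <= 0:
        # Temperature 0 is not a valid divisor; callers handle it as greedy argmax.
        raise ValueError("temperature must be > 0")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for three candidate tokens.
logits = [2.0, 1.0, 0.0]
for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

Running it should show the top token's probability shrinking as the temperature rises, while the ranking of the tokens stays the same.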

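In practice, common starting points are 0.0-0.3 for factual or code generation, around 0.7 for general chat, and 1.0 or higher for brainstorming. APIs cap the value (typically at 1.0 or 2.0, depending on the provider) because very high temperatures flatten the distribution so much that output degrades into gibberish.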

Temperature is usually paired with top-p or top-k, which truncate the distribution's long tail before a token is sampled.
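
The truncation step can be sketched as follows. This is an illustrative version under simple assumptions: `probs` is a list of probabilities that has already had temperature applied, and the helper names `top_k_filter` and `top_p_filter` are hypothetical, not taken from any particular library.

```python
import random

def top_k_filter(probs, k):
    """Keep the k most likely tokens, zero out the rest, renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept = set(order[:k])
    total = sum(probs[i] for i in kept)
    return [probs[i] / total if i in kept else 0.0 for i in range(len(probs))]

def top_p_filter(probs, p):
    """Keep the smallest set of tokens whose cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = set(), 0.0
    for i in order:
        kept.add(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in kept)
    return [probs[i] / total if i in kept else 0.0 for i in range(len(probs))]

# Made-up distribution over four tokens.
probs = [0.5, 0.3, 0.15, 0.05]
filtered = top_p_filter(probs, 0.9)  # drops the 0.05 tail token

# Sample one token index from the truncated distribution.
rng = random.Random(0)
token = rng.choices(range(len(filtered)), weights=filtered, k=1)[0]
print(filtered, token)
```

Either filter removes low-probability tokens entirely, so a high temperature can add variety among plausible tokens without ever selecting from the tail.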

Examples

Temperature 0: the same prompt gives the same response nearly every time (hosted APIs can still show small run-to-run differences).
Temperature 1.0: a chat model gives varied but coherent responses.
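
The contrast between the two settings can be reproduced with a toy next-token picker. This is a sketch under stated assumptions: `next_token` is a hypothetical helper, the logits are invented, and a real model would repeat this step once per generated token.

```python
import math
import random

def next_token(logits, temperature, rng):
    """Pick a token index: greedy at temperature 0, sampled otherwise."""
    if temperature == 0:
        # Greedy decoding: always the highest-scoring token.
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]  # unnormalized softmax
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

logits = [2.0, 1.5, 0.5]  # made-up scores for three tokens
rng = random.Random(42)

greedy = [next_token(logits, 0, rng) for _ in range(5)]
sampled = [next_token(logits, 1.0, rng) for _ in range(200)]

print(greedy)                # the same index on every draw
print(sorted(set(sampled)))  # more than one distinct index
```

At temperature 0 the random generator is never consulted, which is why the toy version repeats exactly; at 1.0 the choice follows the model's probabilities, so repeated draws differ.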

When to use temperature

Use a low temperature for code and extraction, a medium one for chat, and a high one for creative writing.

Frequently asked

How is Temperature related to Top-p?

Both are sampling controls applied at inference time. Temperature reshapes the whole probability distribution, while top-p (nucleus sampling) restricts token selection to the smallest set of tokens whose cumulative probability reaches p. Common top-p values are 0.9-0.95.

Is Temperature considered beginner?

Temperature is generally considered beginner-level material in the AI and LLM space.

Top-p · Inference

Top-p (nucleus sampling) restricts token selection to the smallest set of tokens whose cumulative probability reaches p. Common values are 0.9-0.95.

Top-k · Inference

Top-k restricts token sampling to the k highest-probability tokens, then samples from that set. A simpler alternative to top-p.

Sampling · Inference

Sampling is the act of choosing the next token from the model's output distribution, typically after applying temperature and a truncation strategy like top-p or top-k.

Inference · Inference

Inference is what happens when you actually run a trained model on new input. For LLMs that means generating tokens one at a time, with sampling and a KV cache.

Sources

OpenAI API — Chat completions