Temperature

LLMs

A scalar that divides logits before softmax at sampling time — lower temperature makes the model more deterministic, higher makes it more random.


In one line

Divide the next-token logits by T before sampling — T close to 0 picks the top token, T large flattens the distribution.

What it actually means

At sampling time, the model produces a logit per vocabulary token. Dividing all logits by T and then applying softmax rescales the distribution: T < 1 makes the peak sharper (more deterministic), T > 1 makes it flatter (more diverse). T = 0 is treated as greedy argmax — APIs special-case it, since dividing by zero is undefined. Temperature is usually combined with top-k or top-p (nucleus) truncation so you don’t sample from the long tail. Typical ranges: 0 for exact structured output, 0.2–0.5 for factual Q&A, 0.7–1.0 for creative writing, above 1.0 for exploration.
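The rescaling is easy to see directly. A minimal sketch (the function name and example logits are illustrative, not from any library):

```python
import math

def softmax_with_temperature(logits, T):
    """Divide logits by T, then apply softmax. T < 1 sharpens, T > 1 flattens."""
    scaled = [x / T for x in logits]
    m = max(scaled)                         # subtract max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]
print(softmax_with_temperature(logits, 1.0))   # baseline distribution
print(softmax_with_temperature(logits, 0.5))   # sharper: top token dominates
print(softmax_with_temperature(logits, 2.0))   # flatter: closer to uniform
```

Running this shows the top token's probability rising as T drops and the distribution approaching uniform as T grows — the same logits, just rescaled before softmax.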

Why it matters

Temperature is the single knob you turn first when tuning an LLM’s behavior. Hallucinating too much? Lower it. Output feels robotic or repetitive? Raise it. Getting inconsistent JSON? Drop it to 0. It costs nothing to try, and the effect is immediate.

Example

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
messages = [{"role": "user", "content": "What year was the first Moon landing?"}]

response = client.chat.completions.create(
    model="gpt-4o",
    messages=messages,
    temperature=0.2,   # factual, consistent
    top_p=0.9,
)
print(response.choices[0].message.content)

You’ll hear it when

  • Tuning any LLM prompt in production.
  • Debugging inconsistent or over-creative outputs.
  • Reading sampling documentation for a model API.
  • Comparing deterministic evals at temperature 0.

Related terms