Temperature
A scalar that divides logits before softmax at sampling time — lower temperature makes sampling more deterministic, higher temperature makes it more random.
In one line
Divide the next-token logits by T before sampling — T close to 0 picks the top token, T large flattens the distribution.
What it actually means
At sampling time, the model produces a logit per vocabulary token. Dividing all logits by T and then applying softmax rescales the distribution: T < 1 makes the peak sharper (more deterministic), T > 1 makes it flatter (more diverse). T = 0 is equivalent to greedy argmax. Temperature is usually combined with top-k or top-p (nucleus) truncation so you don’t sample from the long tail. Typical ranges: 0 for exact structured output, 0.2–0.5 for factual Q&A, 0.7–1.0 for creative writing, above 1.0 for exploration.
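The rescaling above is easy to verify directly. This is a minimal sketch (with made-up logits, not from any real model) showing that dividing by a smaller T sharpens the softmax distribution and a larger T flattens it:

```python
import math

def softmax_with_temperature(logits, T):
    # Divide logits by T, then apply a numerically stable softmax.
    scaled = [x / T for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # hypothetical next-token logits

base = softmax_with_temperature(logits, 1.0)
sharp = softmax_with_temperature(logits, 0.5)  # peakier than T = 1
flat = softmax_with_temperature(logits, 2.0)   # flatter than T = 1

# Lower T concentrates probability on the top token;
# higher T spreads it across the vocabulary.
```

As T shrinks toward 0 the top token's probability approaches 1, which is why APIs treat T = 0 as greedy argmax rather than actually dividing by zero.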
Why it matters
Temperature is the single knob you turn first when tuning an LLM’s behavior. Hallucinating too much? Lower it. Output feels robotic or repetitive? Raise it. Getting inconsistent JSON? Drop it to 0. It costs nothing to try, and the effect is immediate.
Example
from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o",
    messages=messages,  # your chat history
    temperature=0.2,    # factual, consistent
    top_p=0.9,          # nucleus truncation, combined with temperature
)
You’ll hear it when
- Tuning any LLM prompt in production.
- Debugging inconsistent or over-creative outputs.
- Reading sampling documentation for a model API.
- Comparing deterministic evals at temperature 0.