Temperature $T > 0$ is a scalar parameter that controls the sharpness of a softmax distribution:
$$P_T(x_i) = \frac{e^{z_i / T}}{\sum_j e^{z_j / T}}$$
This is the standard softmax form, except that the logits are divided by $T$ before exponentiation.
Effects:
- $T = 1$: standard softmax.
- $T \to 0$: distribution concentrates on the argmax token. Equivalent to greedy decoding (up to tie-breaking).
- $T \to \infty$: distribution approaches uniform. Maximum diversity.
- $T < 1$: sharpens the distribution, increasing the most-likely token's probability.
- $T > 1$: smooths the distribution, increasing diversity.
Mathematical effect: temperature scales the entropy of the distribution. Lower $T$ lowers entropy (sharper distribution, more deterministic); higher $T$ raises entropy (smoother distribution, more diverse). In the high-temperature limit $T \to \infty$ the distribution approaches uniform with maximum entropy $\log K$ over $K$ tokens; in the low-temperature limit $T \to 0$ entropy approaches $0$ as all mass collapses onto the argmax.
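The effects above can be checked directly. A minimal sketch (function names are illustrative, not from any particular library):

```python
import math

def softmax_T(logits, T):
    """Temperature-scaled softmax: divide logits by T, then normalise."""
    m = max(z / T for z in logits)  # subtract the max for numerical stability
    exps = [math.exp(z / T - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def entropy(p):
    """Shannon entropy in nats."""
    return -sum(q * math.log(q) for q in p if q > 0)

logits = [2.0, 1.0, 0.1]
# Entropy increases monotonically with T:
# entropy at T=0.5 < entropy at T=1.0 < entropy at T=5.0,
# and at large T the distribution approaches uniform (entropy -> log 3).
```

Evaluating `softmax_T(logits, T)` at a few temperatures shows the sharpening ($T < 1$) and smoothing ($T > 1$) described above.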
Interpretation in physics: in statistical mechanics, $P(x) \propto e^{-E(x) / T}$ is the Boltzmann distribution at temperature $T$. Low $T$ concentrates probability on low-energy states; high $T$ spreads it. The same intuition applies here: identify $E(x) = -z(x)$, then "temperature" has its physical meaning.
Typical values for LLM generation:
- $T = 0$ (or near 0, e.g. 0.01): deterministic, for tasks with a single correct answer (math, factual QA, structured output).
- $T = 0.7$ to $1.0$: standard for chat / general-purpose.
- $T = 1.0$ to $1.5$: for creative writing, brainstorming.
- $T > 2.0$: usually unstable; output tends to degrade into incoherence.
For knowledge distillation (Hinton, Vinyals, Dean 2015): training a smaller "student" to match a larger "teacher" benefits from $T > 1$ on the teacher's outputs. The smoothed distribution carries information about the relative likelihoods of incorrect classes that helps the student learn. Distillation loss:
$$\mathcal{L}_\mathrm{distill} = T^2 \cdot D_\mathrm{KL}(P_T^\mathrm{teacher} \| P_T^\mathrm{student})$$
The $T^2$ factor compensates for the gradients of the soft-target loss scaling as $1/T^2$, keeping their magnitude roughly constant as $T$ varies (and comparable to any hard-target loss used alongside).
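A minimal sketch of the distillation loss above, assuming teacher and student logits are plain Python lists (no autograd; in practice this would be written against a deep-learning framework):

```python
import math

def softmax(z, T=1.0):
    """Temperature-scaled softmax over a list of logits."""
    m = max(x / T for x in z)
    e = [math.exp(x / T - m) for x in z]
    s = sum(e)
    return [x / s for x in e]

def distill_loss(teacher_logits, student_logits, T):
    """T^2 * KL(P_T^teacher || P_T^student), as in the equation above."""
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return T * T * kl
```

The loss is zero when the student matches the teacher exactly, and raising $T$ exposes more of the teacher's relative preferences among incorrect classes.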
Calibration via temperature scaling (Guo et al. 2017): after training, learn a single $T$ that minimises NLL on a held-out set. Surprisingly effective at fixing miscalibration in modern deep networks.
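A sketch of temperature scaling under the simplest possible optimiser, a grid search over $T$ (Guo et al. use gradient descent; function names here are illustrative):

```python
import math

def nll(logits_batch, labels, T):
    """Average negative log-likelihood of the labels under temperature T."""
    total = 0.0
    for z, y in zip(logits_batch, labels):
        m = max(x / T for x in z)
        log_Z = m + math.log(sum(math.exp(x / T - m) for x in z))
        total += log_Z - z[y] / T  # -log softmax_T(z)[y]
    return total / len(labels)

def fit_temperature(logits_batch, labels):
    """Pick the single scalar T minimising held-out NLL (grid 0.5 to 3.5)."""
    grid = [0.5 + 0.05 * i for i in range(61)]
    return min(grid, key=lambda T: nll(logits_batch, labels, T))
```

Since only one scalar is fitted, the held-out set can be small and overfitting is not a practical concern.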
In sampling pipelines: typically applied first (rescale logits), then top-$k$ or top-$p$ filtering, then sampling.
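The pipeline order can be sketched as follows, using top-$k$ as the filtering step (the function name and defaults are illustrative):

```python
import math
import random

def sample_token(logits, T=0.8, top_k=50):
    """1) rescale by temperature, 2) keep top-k logits, 3) sample."""
    scaled = [z / T for z in logits]
    keep = sorted(range(len(scaled)), key=scaled.__getitem__, reverse=True)[:top_k]
    m = max(scaled[i] for i in keep)
    weights = [math.exp(scaled[i] - m) for i in keep]  # unnormalised softmax
    return random.choices(keep, weights=weights, k=1)[0]
```

With `top_k=1` this reduces to greedy decoding regardless of `T`; applying temperature before filtering matters because rescaling after truncation would renormalise over a different support.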
Related terms: Softmax, Top-k Sampling, Top-p (Nucleus) Sampling, Expected Calibration Error, Knowledge Distillation
Discussed in:
- Chapter 12: Sequence Models