Temperature $T > 0$ is a scalar parameter that controls the sharpness of a softmax distribution:
$$P_T(x_i) = \frac{e^{z_i / T}}{\sum_j e^{z_j / T}}$$
This is the standard softmax form, except that the logits are divided by $T$ before exponentiation.
Effects:
- $T = 1$: standard softmax.
- $T \to 0$: distribution concentrates on the argmax token. Equivalent to greedy decoding (up to tie-breaking).
- $T \to \infty$: distribution approaches uniform. Maximum diversity.
- $T < 1$: sharpens the distribution, increasing the most-likely token's probability.
- $T > 1$: smooths the distribution, increasing diversity.
Mathematical effect: temperature scales the entropy of the distribution. Lower $T$ lowers entropy (sharper distribution, more deterministic); higher $T$ raises entropy (smoother distribution, more diverse). In the high-temperature limit $T \to \infty$ the distribution approaches uniform with maximum entropy $\log K$ over $K$ tokens; in the low-temperature limit $T \to 0$ entropy approaches $0$ as all mass collapses onto the argmax.
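The effects above can be checked directly. A minimal sketch (function names are illustrative, not from any particular library):

```python
import math

def softmax_T(logits, T):
    """Temperature-scaled softmax: divide logits by T, then normalise."""
    m = max(z / T for z in logits)  # subtract the max for numerical stability
    exps = [math.exp(z / T - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def entropy(p):
    """Shannon entropy in nats."""
    return -sum(q * math.log(q) for q in p if q > 0)

logits = [2.0, 1.0, 0.1]
# Entropy increases monotonically with T:
# entropy at T=0.5 < entropy at T=1.0 < entropy at T=5.0,
# and at large T the distribution approaches uniform (entropy -> log 3).
```

Evaluating `softmax_T(logits, T)` at a few temperatures shows the sharpening ($T < 1$) and smoothing ($T > 1$) described above.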
Interpretation in physics: in statistical mechanics, $P(x) \propto e^{-E(x) / T}$ is the Boltzmann distribution at temperature $T$. Low $T$ concentrates probability on low-energy states; high $T$ spreads it. The same intuition applies here: identify $E(x) = -z(x)$, then "temperature" has its physical meaning.
Typical values for LLM generation:
- $T = 0$ (or near 0, e.g. 0.01): deterministic, for tasks with a single correct answer (math, factual QA, structured output).
- $T = 0.7$ to $1.0$: standard for chat / general-purpose.
- $T = 1.0$ to $1.5$: for creative writing, brainstorming.
- $T > 2.0$: usually unstable; output tends to degrade into incoherence.
For knowledge distillation (Hinton, Vinyals, Dean 2015): training a smaller "student" to match a larger "teacher" benefits from $T > 1$ on the teacher's outputs. The smoothed distribution carries information about the relative likelihoods of incorrect classes that helps the student learn. Distillation loss:
$$\mathcal{L}_\mathrm{distill} = T^2 \cdot D_\mathrm{KL}(P_T^\mathrm{teacher} \| P_T^\mathrm{student})$$
The $T^2$ factor compensates for the gradients of the soft-target loss scaling as $1/T^2$, keeping their magnitude roughly constant as $T$ varies (and comparable to any hard-target loss used alongside).
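A minimal sketch of the distillation loss above, assuming teacher and student logits are plain Python lists (no autograd; in practice this would be written against a deep-learning framework):

```python
import math

def softmax(z, T=1.0):
    """Temperature-scaled softmax over a list of logits."""
    m = max(x / T for x in z)
    e = [math.exp(x / T - m) for x in z]
    s = sum(e)
    return [x / s for x in e]

def distill_loss(teacher_logits, student_logits, T):
    """T^2 * KL(P_T^teacher || P_T^student), as in the equation above."""
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return T * T * kl
```

The loss is zero when the student matches the teacher exactly, and raising $T$ exposes more of the teacher's relative preferences among incorrect classes.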
Calibration via temperature scaling (Guo et al. 2017): after training, learn a single $T$ that minimises NLL on a held-out set. Surprisingly effective at fixing miscalibration in modern deep networks.
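A sketch of temperature scaling under the simplest possible optimiser, a grid search over $T$ (Guo et al. use gradient descent; function names here are illustrative):

```python
import math

def nll(logits_batch, labels, T):
    """Average negative log-likelihood of the labels under temperature T."""
    total = 0.0
    for z, y in zip(logits_batch, labels):
        m = max(x / T for x in z)
        log_Z = m + math.log(sum(math.exp(x / T - m) for x in z))
        total += log_Z - z[y] / T  # -log softmax_T(z)[y]
    return total / len(labels)

def fit_temperature(logits_batch, labels):
    """Pick the single scalar T minimising held-out NLL (grid 0.5 to 3.5)."""
    grid = [0.5 + 0.05 * i for i in range(61)]
    return min(grid, key=lambda T: nll(logits_batch, labels, T))
```

Since only one scalar is fitted, the held-out set can be small and overfitting is not a practical concern.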
In sampling pipelines: typically applied first (rescale logits), then top-$k$ or top-$p$ filtering, then sampling.
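The pipeline order can be sketched as follows, using top-$k$ as the filtering step (the function name and defaults are illustrative):

```python
import math
import random

def sample_token(logits, T=0.8, top_k=50):
    """1) rescale by temperature, 2) keep top-k logits, 3) sample."""
    scaled = [z / T for z in logits]
    keep = sorted(range(len(scaled)), key=scaled.__getitem__, reverse=True)[:top_k]
    m = max(scaled[i] for i in keep)
    weights = [math.exp(scaled[i] - m) for i in keep]  # unnormalised softmax
    return random.choices(keep, weights=weights, k=1)[0]
```

With `top_k=1` this reduces to greedy decoding regardless of `T`; applying temperature before filtering matters because rescaling after truncation would renormalise over a different support.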
Related terms: Softmax, Top-k Sampling, Top-p (Nucleus) Sampling, Expected Calibration Error, Knowledge Distillation
Discussed in:
- Chapter 12: Sequence Models