2.5 Determinants and trace
A square matrix is a small table of numbers, but the linear map it represents is a much richer object. To get a feel for what a matrix is doing without staring at every individual entry, it helps to summarise it. Two scalar summaries appear throughout deep learning. The first is the determinant, written $\det(\mathbf{A})$ or $|\mathbf{A}|$. It tells you, in a single number, how much the matrix stretches or shrinks volumes, whether it flips orientation, and, most usefully, whether the matrix is invertible. The second is the trace, written $\text{tr}(\mathbf{A})$, the sum of the diagonal entries; it equals the sum of the eigenvalues, which makes it a structural quantity rather than an incidental one.
The two are invariants, numbers the matrix carries regardless of how you choose coordinates. Determinants reappear in normalising flows, in the multivariate Gaussian density, and in change-of-variable arguments. Trace reappears in the Frobenius norm, in regularisers, in KL divergences between Gaussians, and in the matrix calculus identities of §2.9.
Determinant: definition
For the smallest non-trivial case, a $2 \times 2$ matrix, the determinant has a particularly simple formula:
$$\det\begin{pmatrix} a & b \\ c & d \end{pmatrix} = ad - bc.$$
You multiply the entries on the leading diagonal, multiply the entries on the anti-diagonal, and subtract. That is all. As a numerical example, take
$$\mathbf{A} = \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix}, \qquad \det(\mathbf{A}) = 1 \cdot 4 - 2 \cdot 3 = 4 - 6 = -2.$$
The negative sign is genuinely meaningful, not a bookkeeping accident: it tells you that the map represented by $\mathbf{A}$ flips orientation. We will return to the geometric meaning in the next subsection.
For a $3 \times 3$ matrix the formula is longer but follows the same spirit. Using cofactor expansion along the first row,
$$\det\begin{pmatrix} a & b & c \\ d & e & f \\ g & h & i \end{pmatrix} = a(ei - fh) - b(di - fg) + c(dh - eg).$$
Each term is the entry from the top row, multiplied by the determinant of the $2 \times 2$ submatrix you obtain by crossing out that entry's row and column, with alternating plus and minus signs. The pattern continues recursively: a $4 \times 4$ determinant is a sum of four $3 \times 3$ determinants, each of which is itself a sum of three $2 \times 2$ determinants.
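A short check of the cofactor formula against a library determinant makes the pattern concrete; the matrix below is just an arbitrary example.

```python
import numpy as np

M = np.array([[2.0, 1.0, 3.0],
              [0.0, 4.0, 1.0],
              [5.0, 2.0, 0.0]])
a, b, c = M[0]
d, e, f = M[1]
g, h, i = M[2]

# Cofactor expansion along the first row, exactly as in the formula above.
cofactor = a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)
print(cofactor, np.linalg.det(M))   # both are -59, up to floating-point noise
```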
The fully general definition is a sum over all $n!$ permutations of the indices $\{1, 2, \dots, n\}$, with each term carrying the sign of the permutation. That definition is mathematically clean but computationally hopeless: $10! = 3{,}628{,}800$ already, and $20!$ is roughly $2.4 \times 10^{18}$. Real software never computes determinants this way. Instead, libraries factor the matrix using LU decomposition, $\mathbf{A} = \mathbf{P}\mathbf{L}\mathbf{U}$, where $\mathbf{L}$ is lower triangular with ones on the diagonal, $\mathbf{U}$ is upper triangular, and $\mathbf{P}$ is a permutation matrix. The determinant of a triangular matrix is just the product of its diagonal entries, so $\det(\mathbf{A}) = \pm\, u_{11} u_{22} \cdots u_{nn}$, where the sign comes from the parity of $\mathbf{P}$. The cost is $O(n^3)$, the same as solving a linear system, fast enough to be practical for matrices in the thousands.
A practical aside is worth flagging here. Numerical determinants of large matrices overflow or underflow easily, because multiplying $n$ numbers each of size around $10$ produces a result of size around $10^n$. For this reason serious code never computes $\det(\mathbf{A})$ as a single floating-point number when $n$ is large; it computes $\log|\det(\mathbf{A})|$ instead, by summing the logarithms of the diagonal entries of $\mathbf{U}$. The sign is tracked separately. Whenever you see a model report a log-determinant, as you will throughout the normalising-flow literature, this is why.
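A short NumPy illustration of both points; `numpy.linalg.slogdet` returns the sign and $\log|\det|$ separately, computed from a factorisation internally, and stays finite where a naive determinant overflows.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
A = rng.normal(size=(n, n))           # a generic large matrix

# Naive determinant: a product of ~500 pivots, which overflows at this size.
naive = np.linalg.det(A)

# Stable route: sign and log|det| kept separate.
sign, logabsdet = np.linalg.slogdet(A)

print(naive)                          # typically inf or -inf
print(sign, logabsdet)                # e.g. -1.0 and a moderate log value
```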
Geometric interpretation: signed volume
The determinant has a clean geometric meaning that makes the algebraic formula feel inevitable rather than arbitrary. Take the unit cube in $\mathbb{R}^n$, the cube spanned by the standard basis vectors $\mathbf{e}_1, \mathbf{e}_2, \dots, \mathbf{e}_n$. Apply the linear map $\mathbf{A}$. Each basis vector $\mathbf{e}_i$ becomes the $i$-th column of $\mathbf{A}$, and the unit cube becomes the parallelepiped spanned by those columns. The volume of that parallelepiped is exactly $|\det(\mathbf{A})|$.
A few worked cases pin down the intuition. The identity matrix $\mathbf{I}$ leaves the unit cube alone, so its determinant should be $1$, and indeed $\det(\mathbf{I}) = 1$. The diagonal matrix
$$\text{diag}(2, 3) = \begin{pmatrix} 2 & 0 \\ 0 & 3 \end{pmatrix}$$
stretches the unit square by a factor of two horizontally and three vertically, producing a $2 \times 3$ rectangle whose area is $6$. The determinant agrees: $2 \cdot 3 - 0 \cdot 0 = 6$. The matrix
$$\begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix}$$
has both columns pointing in the same direction, so the parallelogram it spans is degenerate: it has collapsed to a line and has zero area. The determinant agrees once more: $1 \cdot 1 - 1 \cdot 1 = 0$.
The sign of the determinant carries extra information that the absolute value alone cannot. A positive determinant means the map preserves orientation; a negative determinant means it flips orientation, like a mirror reflection. A rotation in the plane has determinant $+1$. A reflection has determinant $-1$. A determinant of zero means the map has crushed at least one dimension flat: the matrix is singular and has no inverse. This last point is the workhorse: in numerical contexts you almost never want to invert a matrix whose determinant is exactly zero, and you should be suspicious of one whose determinant is very small.
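The three cases, rotation, reflection, and collapse, are easy to check numerically; a small sketch:

```python
import numpy as np

theta = np.pi / 4
rotation = np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])
reflection = np.array([[1.0,  0.0],
                       [0.0, -1.0]])          # mirror across the horizontal axis
collapse = np.array([[1.0, 1.0],
                     [1.0, 1.0]])             # both columns identical

print(np.linalg.det(rotation))    # ~ +1.0 : orientation preserved
print(np.linalg.det(reflection))  #   -1.0 : orientation flipped
print(np.linalg.det(collapse))    # ~  0.0 : area crushed onto a line
```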
This volume-scaling story shows up explicitly in modern AI through the change-of-variables formula for probability densities. If $\mathbf{Y} = \mathbf{f}(\mathbf{X})$ is an invertible transformation with Jacobian $\mathbf{J}$, the densities are related by $p_Y(\mathbf{y}) = p_X(\mathbf{x}) / |\det(\mathbf{J})|$. Normalising flows, an entire family of generative models, are designed precisely so that this Jacobian determinant is cheap to compute, allowing the log-likelihood to be evaluated exactly during training.
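A quick sanity check of the formula for the simplest invertible transformation, a linear map, where the Jacobian is just the matrix itself. This is purely illustrative; flow models use far richer maps.

```python
import numpy as np
from scipy.stats import multivariate_normal

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])                    # Jacobian of the linear map y = A x
x = np.array([0.3, -0.7])
y = A @ x

p_x = multivariate_normal(np.zeros(2), np.eye(2)).pdf(x)   # X ~ N(0, I)
p_y = multivariate_normal(np.zeros(2), A @ A.T).pdf(y)     # Y = A X ~ N(0, A A^T)

# Change of variables: p_Y(y) = p_X(x) / |det J|, with J = A here.
print(p_y, p_x / abs(np.linalg.det(A)))       # the two numbers agree
```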
It also pays to think about what a small but non-zero determinant means. A matrix with $\det(\mathbf{A}) = 10^{-12}$ is technically invertible, but only just; the parallelepiped it produces has almost no volume, and floating-point arithmetic will struggle to invert it reliably. Numerical analysts call such matrices ill-conditioned, and the determinant alone is not the right diagnostic; the condition number, introduced in §2.7 alongside the SVD, is. But the determinant gives a useful first warning, especially when it suddenly drops several orders of magnitude during training.
Properties
Determinants obey a small set of identities that you will reach for again and again when manipulating expressions in matrix calculus. They are worth memorising rather than rederiving.
- $\det(\mathbf{I}_n) = 1$. The identity preserves volume.
- $\det(\mathbf{A}\mathbf{B}) = \det(\mathbf{A})\det(\mathbf{B})$. The determinant is multiplicative: composing two maps multiplies their volume-scaling factors.
- $\det(\mathbf{A}^\top) = \det(\mathbf{A})$. Transposition does not change the determinant.
- $\det(\alpha\mathbf{A}) = \alpha^n \det(\mathbf{A})$. Scaling every entry by $\alpha$ scales every column by $\alpha$, and there are $n$ columns, so the volume picks up $\alpha^n$.
- $\det(\mathbf{A}^{-1}) = 1/\det(\mathbf{A})$. The inverse map undoes the volume scaling, so its determinant is the reciprocal. (This identity also tells you immediately why $\mathbf{A}$ has no inverse when $\det(\mathbf{A}) = 0$: division by zero.)
- $\det(\mathbf{A}) = 0 \iff \mathbf{A}$ is singular $\iff$ $\mathbf{A}$ is rank deficient $\iff$ at least one eigenvalue of $\mathbf{A}$ is zero. These four conditions are different ways of saying the same thing.
- $\det(\mathbf{A}) = \prod_{i=1}^n \lambda_i$. The determinant equals the product of the eigenvalues, counted with multiplicity. This is one of the most useful identities in matrix analysis. It tells you that the volume-scaling factor is fully determined by the spectrum.
- $\det(\mathbf{P}^{-1}\mathbf{A}\mathbf{P}) = \det(\mathbf{A})$. The determinant is a similarity invariant: it does not depend on the basis you choose, only on the underlying linear map.
The multiplicative property is what makes the determinant compatible with composition, and it explains why eigenvalues, which arise from a single matrix, combine through products to give the determinant. The contrast with the trace, which combines eigenvalues through addition, is one of the central themes of spectral theory.
It is also worth checking the identities against the running example. With $\mathbf{A} = \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix}$ we computed $\det(\mathbf{A}) = -2$. The transpose $\mathbf{A}^\top = \begin{pmatrix} 1 & 3 \\ 2 & 4 \end{pmatrix}$ has determinant $1 \cdot 4 - 3 \cdot 2 = -2$, confirming $\det(\mathbf{A}^\top) = \det(\mathbf{A})$. Scaling by $\alpha = 2$ gives $2\mathbf{A} = \begin{pmatrix} 2 & 4 \\ 6 & 8 \end{pmatrix}$, whose determinant is $2 \cdot 8 - 4 \cdot 6 = -8 = 2^2 \cdot (-2)$, exactly $\alpha^n \det(\mathbf{A})$ with $n = 2$. Tiny checks like this catch sign errors before they propagate into larger derivations.
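The same kind of check scales to the rest of the list; a few lines of NumPy exercise the identities on random matrices (random matrices are invertible with probability one, so the inverse and similarity identities apply).

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(4, 4))
B = rng.normal(size=(4, 4))
P = rng.normal(size=(4, 4))           # generically invertible

det = np.linalg.det
print(np.isclose(det(A @ B), det(A) * det(B)))               # multiplicative
print(np.isclose(det(A.T), det(A)))                          # transpose
print(np.isclose(det(3 * A), 3**4 * det(A)))                 # scaling, n = 4
print(np.isclose(det(np.linalg.inv(A)), 1 / det(A)))         # inverse
print(np.isclose(det(np.linalg.inv(P) @ A @ P), det(A)))     # similarity
```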
Trace: definition
The trace is the simplest scalar you can extract from a square matrix:
$$\text{tr}(\mathbf{A}) = \sum_{i=1}^n A_{ii}.$$
Just add up the diagonal entries. For our running $2 \times 2$ example,
$$\text{tr}\begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix} = 1 + 4 = 5.$$
There is no special formula for $3 \times 3$ or $n \times n$: the rule is the same at every size. Despite its simplicity, the trace has three properties that make it appear constantly in derivations.
The first is linearity: $\text{tr}(\alpha\mathbf{A} + \beta\mathbf{B}) = \alpha\,\text{tr}(\mathbf{A}) + \beta\,\text{tr}(\mathbf{B})$. This follows immediately from the definition, since the diagonal of a sum is the sum of the diagonals.
The second is the cyclic property:
$$\text{tr}(\mathbf{A}\mathbf{B}) = \text{tr}(\mathbf{B}\mathbf{A}),$$
whenever both products are defined, which happens even when $\mathbf{A}$ and $\mathbf{B}$ are not square individually, provided $\mathbf{A}$ is $m \times n$ and $\mathbf{B}$ is $n \times m$. This cyclic invariance generalises to longer products:
$$\text{tr}(\mathbf{A}\mathbf{B}\mathbf{C}) = \text{tr}(\mathbf{B}\mathbf{C}\mathbf{A}) = \text{tr}(\mathbf{C}\mathbf{A}\mathbf{B}).$$
You may rotate the factors freely under a trace, but you may not swap them arbitrarily; only cyclic shifts are allowed. This single identity is what lets you simplify gradients in matrix calculus, rearrange expressions involving covariances, and prove identities about Gaussian distributions without ever writing out indices.
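A small numerical illustration of both halves of that statement, with non-square factors for the two-matrix case and square factors for the three-matrix case:

```python
import numpy as np

rng = np.random.default_rng(2)

# Non-square factors: A is 3x5, B is 5x3, so AB is 3x3 and BA is 5x5.
A = rng.normal(size=(3, 5))
B = rng.normal(size=(5, 3))
print(np.isclose(np.trace(A @ B), np.trace(B @ A)))            # True

# Square factors: cyclic shifts agree, arbitrary swaps do not.
X = rng.normal(size=(4, 4))
Y = rng.normal(size=(4, 4))
Z = rng.normal(size=(4, 4))
print(np.isclose(np.trace(X @ Y @ Z), np.trace(Y @ Z @ X)))    # True  (cyclic shift)
print(np.isclose(np.trace(X @ Y @ Z), np.trace(X @ Z @ Y)))    # False (non-cyclic swap)
```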
The third property is the spectral one:
$$\text{tr}(\mathbf{A}) = \sum_{i=1}^n \lambda_i.$$
The trace equals the sum of the eigenvalues. Together with $\det(\mathbf{A}) = \prod_i \lambda_i$, this places the trace and the determinant at the two ends of the characteristic polynomial. Every coefficient of that polynomial is a similarity invariant of the matrix, and trace and determinant are simply the simplest two: up to sign, the coefficient of $\lambda^{n-1}$ and the constant term.
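Both spectral identities are easy to confirm numerically; for a real matrix the complex eigenvalues come in conjugate pairs, so their sum and product are real up to rounding.

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.normal(size=(5, 5))
eigenvalues = np.linalg.eigvals(A)                  # complex in general

# Trace = sum of eigenvalues, determinant = product of eigenvalues.
print(np.isclose(np.trace(A), eigenvalues.sum().real))
print(np.isclose(np.linalg.det(A), eigenvalues.prod().real))
```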
Where determinants appear in AI
Determinants are not abstract trivia. They surface in several distinct corners of modern machine learning.
Normalising flows. The log-likelihood of a flow-based generative model is written
$$\log p_X(\mathbf{x}) = \log p_Z(\mathbf{f}(\mathbf{x})) + \log\left|\det \mathbf{J}_{\mathbf{f}}(\mathbf{x})\right|,$$
where $\mathbf{f}$ is the invertible flow and $\mathbf{J}_{\mathbf{f}}$ is its Jacobian. The whole architectural challenge of flows is choosing layers whose Jacobian determinants are cheap to compute. Affine coupling layers in RealNVP and Glow give triangular Jacobians, so their determinants are products of the diagonal entries.
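A minimal sketch of why this works, assuming a RealNVP-style split of the input into two halves; the "scale" and "shift" networks here are stand-in linear maps, not any particular library's layers.

```python
import numpy as np

rng = np.random.default_rng(4)
d = 6                                            # input dimension, split in two
W_s = rng.normal(size=(d // 2, d // 2)) * 0.1    # stand-in "scale network" s(x1)
W_t = rng.normal(size=(d // 2, d // 2)) * 0.1    # stand-in "shift network" t(x1)

def coupling_forward(x):
    """y1 = x1,  y2 = x2 * exp(s(x1)) + t(x1).  The Jacobian is triangular."""
    x1, x2 = x[:d // 2], x[d // 2:]
    s, t = W_s @ x1, W_t @ x1
    y = np.concatenate([x1, x2 * np.exp(s) + t])
    log_det = s.sum()                            # log|det J| = sum of the log-scales
    return y, log_det

x = rng.normal(size=d)
y, log_det = coupling_forward(x)

# Cross-check against the dense Jacobian built by finite differences.
eps = 1e-6
J = np.zeros((d, d))
for i in range(d):
    step = np.zeros(d)
    step[i] = eps
    J[:, i] = (coupling_forward(x + step)[0] - y) / eps
print(log_det, np.linalg.slogdet(J)[1])          # the two log-determinants agree closely
```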
Multivariate Gaussian densities. The density of a multivariate Gaussian is
$$p(\mathbf{x}) = \frac{1}{\sqrt{(2\pi)^d \det(\boldsymbol{\Sigma})}} \exp\left(-\tfrac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^\top \boldsymbol{\Sigma}^{-1} (\mathbf{x}-\boldsymbol{\mu})\right).$$
The $\det(\boldsymbol{\Sigma})$ in the normalising constant ensures the density integrates to one regardless of the shape of the covariance ellipsoid. In Bayesian inference, the determinants of posterior covariances control how confident the model is.
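In code, the density is almost always evaluated in log space precisely because of this determinant; a minimal sketch, with the function name illustrative rather than any library's API:

```python
import numpy as np

def gaussian_log_density(x, mu, cov):
    """Log-density of N(mu, cov) at x, using a log-determinant for stability."""
    d = x.shape[0]
    diff = x - mu
    _, logdet = np.linalg.slogdet(cov)           # cov is positive definite, so sign = +1
    quad = diff @ np.linalg.solve(cov, diff)
    return -0.5 * (d * np.log(2 * np.pi) + logdet + quad)

mu = np.zeros(3)
cov = np.array([[2.0, 0.3, 0.0],
                [0.3, 1.0, 0.2],
                [0.0, 0.2, 0.5]])
print(gaussian_log_density(np.array([0.5, -1.0, 0.2]), mu, cov))
```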
Geometric image transforms. Affine transformations used for data augmentation (rotations, scalings, shears, reflections) are matrices whose determinants tell you whether the transformation preserves area and orientation. A determinant of $\pm 1$ means the augmentation is volume-preserving.
Where trace appears in AI
The trace is everywhere in matrix calculus and loss-function design, often hidden inside identities you have to learn to recognise.
Frobenius norm. The squared Frobenius norm of a matrix is the sum of the squares of its entries, and it has a clean trace expression:
$$\|\mathbf{A}\|_F^2 = \sum_{i,j} A_{ij}^2 = \text{tr}(\mathbf{A}^\top \mathbf{A}).$$
This is the standard $L^2$-style regulariser on weight matrices in deep networks: weight decay corresponds to adding $\lambda \|\mathbf{W}\|_F^2$ to the loss.
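The identity is a one-line check in NumPy:

```python
import numpy as np

rng = np.random.default_rng(5)
W = rng.normal(size=(3, 4))

frobenius_sq = (W ** 2).sum()                    # sum of squared entries
trace_form = np.trace(W.T @ W)                   # tr(W^T W)
print(np.isclose(frobenius_sq, trace_form))                       # True
print(np.isclose(np.linalg.norm(W, 'fro') ** 2, trace_form))      # True
```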
Effective rank. When $\boldsymbol{\Sigma}$ is a covariance matrix, the ratio $(\text{tr}\,\boldsymbol{\Sigma})^2 / \|\boldsymbol{\Sigma}\|_F^2 = \left(\sum_i \lambda_i\right)^2 / \sum_i \lambda_i^2$ gives a continuous measure of how many directions actually carry signal, a softer alternative to the discrete rank. This shows up in analyses of representation learning and in pruning.
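For instance, a covariance with one dominant eigenvalue has effective rank close to $1$ even though its discrete rank is full; a small sketch, with a made-up spectrum:

```python
import numpy as np

rng = np.random.default_rng(7)
eigvals = np.array([10.0, 0.5, 0.3, 0.1, 0.05])      # one dominant direction
Q, _ = np.linalg.qr(rng.normal(size=(5, 5)))         # random orthonormal basis
Sigma = Q @ np.diag(eigvals) @ Q.T                   # full-rank covariance

effective_rank = np.trace(Sigma) ** 2 / np.linalg.norm(Sigma, 'fro') ** 2
print(effective_rank)        # close to 1, even though rank(Sigma) = 5
```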
KL divergence between Gaussians. The closed-form KL divergence between two multivariate Gaussians involves a trace term, $\text{tr}(\boldsymbol{\Sigma}_1^{-1}\boldsymbol{\Sigma}_2)$, alongside a determinant ratio and a quadratic form in the means. Variational autoencoders evaluate this expression at every step of training.
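Written out with that indexing, the closed form is
$$\text{KL}\big(\mathcal{N}(\boldsymbol{\mu}_2, \boldsymbol{\Sigma}_2)\,\|\,\mathcal{N}(\boldsymbol{\mu}_1, \boldsymbol{\Sigma}_1)\big) = \tfrac{1}{2}\left[\text{tr}(\boldsymbol{\Sigma}_1^{-1}\boldsymbol{\Sigma}_2) + (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)^\top \boldsymbol{\Sigma}_1^{-1}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2) - d + \ln\frac{\det\boldsymbol{\Sigma}_1}{\det\boldsymbol{\Sigma}_2}\right],$$
where $d$ is the dimension; the trace term, the quadratic form in the means, and the log-determinant ratio are all visible at a glance.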
Spectral normalisation. Spectral normalisation, used in GANs and elsewhere to stabilise training, controls the largest singular value of weight matrices. The trace of $\mathbf{W}^\top \mathbf{W}$ is the sum of squared singular values and provides a useful proxy when the largest one is hard to compute exactly.
Matrix-calculus shortcuts. Most identities in matrix calculus rely on the trace. The gradient of $\text{tr}(\mathbf{A}\mathbf{X})$ with respect to $\mathbf{X}$ is $\mathbf{A}^\top$; the gradient of $\text{tr}(\mathbf{X}^\top \mathbf{A} \mathbf{X})$ with respect to $\mathbf{X}$ is $(\mathbf{A} + \mathbf{A}^\top)\mathbf{X}$. These would be intractable as index-laden sums but become routine once you recognise the trace and apply the cyclic property. We will use these in §2.9 to derive backpropagation cleanly without indexing every entry.
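Both gradient identities can be verified with a finite-difference check, which is also a handy habit whenever you derive a new matrix gradient; a short sketch:

```python
import numpy as np

rng = np.random.default_rng(6)
A = rng.normal(size=(3, 3))
X = rng.normal(size=(3, 3))
eps = 1e-6

def numerical_grad(f, X):
    """Central finite-difference gradient of a scalar function of a matrix."""
    G = np.zeros_like(X)
    for i in range(X.shape[0]):
        for j in range(X.shape[1]):
            E = np.zeros_like(X)
            E[i, j] = eps
            G[i, j] = (f(X + E) - f(X - E)) / (2 * eps)
    return G

# d/dX tr(A X) = A^T
g1 = numerical_grad(lambda M: np.trace(A @ M), X)
print(np.allclose(g1, A.T, atol=1e-5))

# d/dX tr(X^T A X) = (A + A^T) X
g2 = numerical_grad(lambda M: np.trace(M.T @ A @ M), X)
print(np.allclose(g2, (A + A.T) @ X, atol=1e-5))
```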
What you should take away
- The determinant is the signed volume-scaling factor of the linear map, and it equals zero precisely when the matrix is non-invertible.
- The determinant is multiplicative, equals the product of the eigenvalues, and is computed in practice via $O(n^3)$ LU decomposition, never the $n!$-term permutation sum.
- The trace is the sum of the diagonal entries, equals the sum of the eigenvalues, and is linear with the cyclic property $\text{tr}(\mathbf{A}\mathbf{B}) = \text{tr}(\mathbf{B}\mathbf{A})$.
- Determinants drive normalising flows, the multivariate Gaussian normalising constant, and Bayesian posteriors. Trace drives the Frobenius norm, KL divergences, weight-decay regularisation, and most matrix-calculus simplifications.
- Both quantities are similarity invariants, properties of the underlying linear map rather than of the chosen basis, which is why §2.6 will recognise them as the simplest symmetric polynomials in the eigenvalues, and why they reappear so insistently throughout matrix calculus and probabilistic modelling.