2.15 Exercises
Exercises are graded by an estimated time-on-task: ★ a few minutes; ★★ around fifteen minutes; ★★★ an hour or more, possibly involving coding.
E2.1 (★). For $\mathbf{u} = (1, -2, 3)$ and $\mathbf{v} = (-2, 0, 4)$, compute $\mathbf{u} + \mathbf{v}$, $3\mathbf{u} - 2\mathbf{v}$, $\mathbf{u} \cdot \mathbf{v}$, $\|\mathbf{u}\|_1$, $\|\mathbf{u}\|_2$, $\|\mathbf{u}\|_\infty$, and the cosine similarity between $\mathbf{u}$ and $\mathbf{v}$.
E2.2 (★). Show that the $\ell^1$, $\ell^2$, and $\ell^\infty$ norms each satisfy the triangle inequality. (You may use the fact that $|a + b| \le |a| + |b|$ for real numbers; for the $\ell^2$ case you will also need the Cauchy–Schwarz inequality.)
E2.3 (★). Write down the standard basis of $\mathbb{R}^4$ and verify that any vector $(v_1, v_2, v_3, v_4)$ has unique coordinates with respect to this basis.
E2.4 (★★). Find the projection of $\mathbf{u} = (2, 3, 5)$ onto $\mathbf{v} = (1, 1, 1)$, and verify that $\mathbf{u}$ minus that projection is orthogonal to $\mathbf{v}$.
E2.5 (★). Compute the matrix product $\mathbf{A}\mathbf{B}$ and (where defined) $\mathbf{B}\mathbf{A}$ for $$\mathbf{A} = \begin{bmatrix} 2 & -1 \\ 0 & 3 \\ 1 & 4 \end{bmatrix}, \qquad \mathbf{B} = \begin{bmatrix} 1 & 2 & 0 \\ -1 & 1 & 5 \end{bmatrix}.$$
E2.6 (★★). Show that the matrices $$\mathbf{A} = \begin{bmatrix} 1 & 1 \\ 0 & 1 \end{bmatrix}, \qquad \mathbf{B} = \begin{bmatrix} 1 & 0 \\ 1 & 1 \end{bmatrix}$$ do not commute, and compute $\mathbf{A}\mathbf{B} - \mathbf{B}\mathbf{A}$.
E2.7 (★★). Prove that $(\mathbf{A}\mathbf{B})^\top = \mathbf{B}^\top \mathbf{A}^\top$ for compatible matrices.
E2.8 (★★). Construct a $3 \times 3$ matrix with rank exactly 2. Find its column space and null space explicitly.
E2.9 (★★). Compute the inverse of $\mathbf{A} = \begin{bmatrix} 4 & 7 \\ 2 & 6 \end{bmatrix}$ by hand. Verify $\mathbf{A}\mathbf{A}^{-1} = \mathbf{I}$.
E2.10 (★★). Show that for any invertible square matrix $\mathbf{A}$, the product rule $\det(\mathbf{A}\mathbf{B}) = \det(\mathbf{A})\det(\mathbf{B})$ implies $\det(\mathbf{A}^{-1}) = 1/\det(\mathbf{A})$.
E2.11 (★★). Find the eigenvalues and eigenvectors of $\mathbf{A} = \begin{bmatrix} 5 & 2 \\ 2 & 5 \end{bmatrix}$. Verify the spectral theorem: $\mathbf{A} = \mathbf{Q}\boldsymbol{\Lambda}\mathbf{Q}^\top$ with orthogonal $\mathbf{Q}$.
E2.12 (★★). Find the eigenvalues of the rotation matrix $\mathbf{R}_\theta = \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix}$. Discuss the geometric meaning of complex eigenvalues here.
E2.13 (★★). A symmetric matrix has eigenvalues $\{3, 1, -2\}$. What is its trace? Its determinant? Is it positive definite?
E2.14 (★★). Compute by hand the SVD of $\mathbf{A} = \begin{bmatrix} 2 & 0 \\ 0 & -3 \\ 0 & 0 \end{bmatrix}$.
E2.15 (★★). State and prove the Eckart–Young theorem in the Frobenius norm. (Hint: write $\mathbf{A} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^\top$ and use orthogonal invariance of $\|\cdot\|_F$.)
E2.16 (★★★). Implement power iteration in NumPy to find the dominant eigenvalue of a random $100 \times 100$ symmetric matrix. Plot the convergence of the Rayleigh quotient as a function of iteration number, on a log scale.
E2.17 (★★). Show that $\nabla_{\mathbf{x}} (\mathbf{x}^\top \mathbf{A} \mathbf{x}) = (\mathbf{A} + \mathbf{A}^\top)\mathbf{x}$ and specialise to the symmetric case.
E2.18 (★★). Derive the gradient of the cross-entropy loss $L = -\sum_i y_i \log \hat y_i$, where $\hat y_i = \operatorname{softmax}(\mathbf{z})_i$, with respect to the logits $\mathbf{z}$. Show that the answer simplifies to $\hat{\mathbf{y}} - \mathbf{y}$.
E2.19 (★★). Derive the gradient of the regularised least-squares loss $L = \|\mathbf{X}\boldsymbol{\beta} - \mathbf{y}\|_2^2 + \lambda \|\boldsymbol{\beta}\|_2^2$, set it to zero, and solve to get the ridge regression estimator $\hat{\boldsymbol{\beta}} = (\mathbf{X}^\top \mathbf{X} + \lambda \mathbf{I})^{-1}\mathbf{X}^\top \mathbf{y}$.
E2.20 (★★). Express the trace of $\mathbf{A}\mathbf{B}$ as a sum over indices and use it to show $\operatorname{tr}(\mathbf{A}\mathbf{B}) = \operatorname{tr}(\mathbf{B}\mathbf{A})$ even when $\mathbf{A}\mathbf{B}$ and $\mathbf{B}\mathbf{A}$ have different sizes.
E2.21 (★★). Show that for any matrix $\mathbf{A}$, the matrices $\mathbf{A}^\top \mathbf{A}$ and $\mathbf{A}\mathbf{A}^\top$ are positive semi-definite. Show they have the same non-zero eigenvalues.
E2.22 (★★★). Implement PCA from scratch on the Iris dataset. Project the four-dimensional features onto the top two principal components and produce a scatter plot coloured by class. Compare with sklearn.decomposition.PCA. Comment on which classes are linearly separable in the 2D projection.
E2.23 (★★). A weight matrix in a neural network has shape $(\text{out}, \text{in})$. Given a batch of inputs of shape $(\text{batch}, \text{in})$, write the einsum that produces the output of shape $(\text{batch}, \text{out})$.
E2.24 (★★★). In NumPy, implement the multi-head self-attention score computation $S_{bhst} = \sum_d Q_{bhsd} K_{bhtd} / \sqrt{d_h}$, where $d_h$ is the head dimension, then apply softmax over the last axis and contract with the values $V_{bhtd}$ to produce $O_{bhsd}$. Verify your shapes for batch 2, 4 heads, sequence length 8, head dimension 16.
E2.25 (★★). Compute the condition number of the $5 \times 5$ Hilbert matrix $H_{ij} = 1/(i + j - 1)$ in NumPy. Generate a known $\boldsymbol{\beta}$, form $\mathbf{y} = \mathbf{H}\boldsymbol{\beta}$, and recover $\boldsymbol{\beta}$ via (a) np.linalg.solve(H, y), (b) np.linalg.solve(H.T @ H, H.T @ y), (c) np.linalg.lstsq(H, y). Compare reconstruction errors.
E2.26 (★★★). Write a NumPy function stable_logsumexp(z) that computes $\log \sum_i e^{z_i}$ without overflowing or underflowing. Test it on inputs containing $z_i = 1000$.
E2.27 (★★). Show that for an orthogonal matrix $\mathbf{Q}$, $\|\mathbf{Q}\mathbf{x}\|_2 = \|\mathbf{x}\|_2$ for every $\mathbf{x}$. What does this mean for the determinant of $\mathbf{Q}$?
E2.28 (★★). Let $\mathbf{A}$ be a $4 \times 4$ matrix with characteristic polynomial $(\lambda - 1)^2(\lambda - 2)(\lambda - 3)$. What is its trace? Determinant? Is it necessarily diagonalisable?
E2.29 (★★★). Implement Gram–Schmidt orthonormalisation. Compare its numerical accuracy on a $20 \times 10$ matrix against the modified Gram–Schmidt variant and against np.linalg.qr. Plot $\|\mathbf{Q}^\top \mathbf{Q} - \mathbf{I}\|_F$ for each method.
E2.30 (★★). A dataset has $1000$ images at $64 \times 64$ pixels (greyscale). Stored raw, this is $4{,}096{,}000$ floats. After PCA with the top $50$ components, how many floats are needed to store the components and the projected data? What is the compression ratio?
E2.31 (★★). Show that for a symmetric matrix $\mathbf{A}$ with eigenvalues $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_n$, the maximum value of $\mathbf{x}^\top \mathbf{A}\mathbf{x}$ subject to $\|\mathbf{x}\| = 1$ is $\lambda_1$, attained at the corresponding eigenvector. (This is the Rayleigh quotient maximum.)
E2.32 (★★). Argue from the SVD that the operator norm $\|\mathbf{A}\|_2 := \sup_{\mathbf{x} \ne 0} \|\mathbf{A}\mathbf{x}\|_2 / \|\mathbf{x}\|_2$ equals the largest singular value $\sigma_1$.
Solution sketches for selected exercises
E2.1. $\mathbf{u} + \mathbf{v} = (-1, -2, 7)$. $3\mathbf{u} - 2\mathbf{v} = (3 + 4, -6, 9 - 8) = (7, -6, 1)$. $\mathbf{u}\cdot\mathbf{v} = -2 + 0 + 12 = 10$. $\|\mathbf{u}\|_1 = 6$. $\|\mathbf{u}\|_2 = \sqrt{14}$. $\|\mathbf{u}\|_\infty = 3$. $\|\mathbf{v}\|_2 = \sqrt{20}$. Cosine similarity $= 10 / (\sqrt{14}\sqrt{20}) = 10/\sqrt{280} \approx 0.598$.
E2.4. $\mathbf{u}\cdot \mathbf{v} = 2 + 3 + 5 = 10$, $\mathbf{v}\cdot\mathbf{v} = 3$. Vector projection: $(10/3)(1, 1, 1) = (10/3, 10/3, 10/3)$. Residual: $(2 - 10/3, 3 - 10/3, 5 - 10/3) = (-4/3, -1/3, 5/3)$. Dot with $\mathbf{v}$: $-4/3 - 1/3 + 5/3 = 0$ ✓.
E2.5. $\mathbf{A}$ is $3\times 2$ and $\mathbf{B}$ is $2\times 3$. $\mathbf{A}\mathbf{B}$ is $3\times 3$. Row 1: $(2\cdot 1 - 1\cdot(-1),\ 2\cdot 2 - 1\cdot 1,\ 2\cdot 0 - 1\cdot 5) = (3, 3, -5)$. Row 2: $(0\cdot 1 + 3\cdot(-1),\ 0\cdot 2 + 3\cdot 1,\ 0\cdot 0 + 3\cdot 5) = (-3, 3, 15)$. Row 3: $(1\cdot 1 + 4\cdot(-1),\ 1\cdot 2 + 4\cdot 1,\ 1\cdot 0 + 4\cdot 5) = (-3, 6, 20)$. So $\mathbf{A}\mathbf{B} = \begin{bmatrix} 3 & 3 & -5 \\ -3 & 3 & 15 \\ -3 & 6 & 20 \end{bmatrix}$. $\mathbf{B}\mathbf{A}$ is $2\times 2$: $\begin{bmatrix} 1\cdot 2 + 2\cdot 0 + 0\cdot 1 & 1\cdot(-1) + 2\cdot 3 + 0\cdot 4 \\ -1\cdot 2 + 1\cdot 0 + 5\cdot 1 & -1\cdot(-1) + 1\cdot 3 + 5\cdot 4 \end{bmatrix} = \begin{bmatrix} 2 & 5 \\ 3 & 24 \end{bmatrix}$.
E2.7. $((\mathbf{A}\mathbf{B})^\top)_{ij} = (\mathbf{A}\mathbf{B})_{ji} = \sum_k a_{jk} b_{ki} = \sum_k (\mathbf{A}^\top)_{kj} (\mathbf{B}^\top)_{ik} = \sum_k (\mathbf{B}^\top)_{ik} (\mathbf{A}^\top)_{kj} = (\mathbf{B}^\top \mathbf{A}^\top)_{ij}$.
E2.9. $\det \mathbf{A} = 24 - 14 = 10$. So $\mathbf{A}^{-1} = \frac{1}{10}\begin{bmatrix} 6 & -7 \\ -2 & 4 \end{bmatrix} = \begin{bmatrix} 0.6 & -0.7 \\ -0.2 & 0.4 \end{bmatrix}$. Multiplying back: $\mathbf{A}\mathbf{A}^{-1} = \begin{bmatrix} 4\cdot 0.6 + 7\cdot(-0.2) & 4\cdot(-0.7) + 7\cdot 0.4 \\ 2\cdot 0.6 + 6\cdot(-0.2) & 2\cdot(-0.7) + 6\cdot 0.4 \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$ ✓.
E2.11. Characteristic polynomial $(5-\lambda)^2 - 4 = (\lambda - 3)(\lambda - 7)$. Eigenvalues $3$ and $7$. For $\lambda = 7$, $-2v_1 + 2v_2 = 0$ gives $\mathbf{v}_1 = (1, 1)/\sqrt 2$. For $\lambda = 3$, $2v_1 + 2v_2 = 0$ gives $\mathbf{v}_2 = (1, -1)/\sqrt 2$. Stack into $\mathbf{Q} = \frac{1}{\sqrt 2}\begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix}$, $\boldsymbol{\Lambda} = \operatorname{diag}(7, 3)$. $\mathbf{Q}^\top \mathbf{Q} = \mathbf{I}$. Verify $\mathbf{Q}\boldsymbol{\Lambda}\mathbf{Q}^\top$ recovers $\mathbf{A}$ ✓.
E2.13. Trace $3 + 1 - 2 = 2$. Determinant $3 \cdot 1 \cdot (-2) = -6$. Not positive definite (negative eigenvalue).
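E2.14. $\mathbf{A}^\top \mathbf{A} = \operatorname{diag}(4, 9)$, so the singular values in decreasing order are $\sigma_1 = 3$ and $\sigma_2 = 2$. The right singular vectors swap the two coordinates, and the sign of the $-3$ entry is absorbed into $\mathbf{U}$: $$\mathbf{A} = \underbrace{\begin{bmatrix} 0 & 1 & 0 \\ -1 & 0 & 0 \\ 0 & 0 & 1 \end{bmatrix}}_{\mathbf{U}} \underbrace{\begin{bmatrix} 3 & 0 \\ 0 & 2 \\ 0 & 0 \end{bmatrix}}_{\boldsymbol{\Sigma}} \underbrace{\begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix}}_{\mathbf{V}^\top}.$$ Multiplying back recovers $\mathbf{A}$ ✓.

E2.16. A minimal NumPy sketch, one of many reasonable implementations (the seed, iteration count, and variable names are arbitrary choices). Power iteration converges to the eigenvalue of largest magnitude, so the reference value is picked accordingly:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
B = rng.standard_normal((100, 100))
A = (B + B.T) / 2                       # random symmetric matrix

x = rng.standard_normal(100)
x /= np.linalg.norm(x)
rayleigh = []
for _ in range(200):
    x = A @ x                           # one power-iteration step
    x /= np.linalg.norm(x)
    rayleigh.append(x @ A @ x)          # Rayleigh quotient estimate

evals = np.linalg.eigvalsh(A)
lam = evals[np.argmax(np.abs(evals))]   # dominant eigenvalue (largest |lambda|)

plt.semilogy(np.abs(np.array(rayleigh) - lam))
plt.xlabel("iteration")
plt.ylabel(r"$|x^\top A x - \lambda_{\max}|$")
plt.show()
```

The convergence rate is governed by the ratio $|\lambda_2/\lambda_1|$ of the two largest-magnitude eigenvalues; for a random symmetric matrix this gap can be small, so expect a shallow slope.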
E2.17. $f = \sum_{ij} a_{ij} x_i x_j$. $\partial f/\partial x_k = \sum_j a_{kj} x_j + \sum_i a_{ik} x_i = (\mathbf{A}\mathbf{x})_k + (\mathbf{A}^\top \mathbf{x})_k$, so $\nabla f = (\mathbf{A} + \mathbf{A}^\top)\mathbf{x}$. Symmetric: $2\mathbf{A}\mathbf{x}$.
E2.18. $\hat y_i = e^{z_i}/\sum_j e^{z_j}$. $L = -\sum_i y_i \log \hat y_i = -\sum_i y_i z_i + \sum_i y_i \log Z$ where $Z = \sum_j e^{z_j}$. With $\sum_i y_i = 1$, $L = -\sum_i y_i z_i + \log Z$. $\partial L/\partial z_k = -y_k + e^{z_k}/Z = -y_k + \hat y_k = (\hat{\mathbf{y}} - \mathbf{y})_k$.
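A quick finite-difference check of this gradient in NumPy (a sketch; the seed, dimension, and step size are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.standard_normal(5)
y = np.zeros(5)
y[2] = 1.0                                  # one-hot target

def loss(z):
    p = np.exp(z - z.max())                 # stable softmax
    p /= p.sum()
    return -np.sum(y * np.log(p))

p = np.exp(z - z.max())
p /= p.sum()
analytic = p - y                            # the claimed gradient
eps = 1e-6
numeric = np.array([(loss(z + eps * e) - loss(z - eps * e)) / (2 * eps)
                    for e in np.eye(5)])
print(np.max(np.abs(analytic - numeric)))   # small, ~1e-10
```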
E2.19. Gradient: $2\mathbf{X}^\top(\mathbf{X}\boldsymbol{\beta} - \mathbf{y}) + 2\lambda \boldsymbol{\beta}$. Set to zero: $(\mathbf{X}^\top \mathbf{X} + \lambda \mathbf{I})\boldsymbol{\beta} = \mathbf{X}^\top \mathbf{y}$. The matrix $\mathbf{X}^\top \mathbf{X} + \lambda \mathbf{I}$ is invertible whenever $\lambda > 0$ (its eigenvalues are at least $\lambda$). So $\hat{\boldsymbol{\beta}} = (\mathbf{X}^\top \mathbf{X} + \lambda \mathbf{I})^{-1}\mathbf{X}^\top \mathbf{y}$.
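The closed form is easy to sanity-check numerically: the gradient should vanish at $\hat{\boldsymbol{\beta}}$ (a sketch with arbitrary shapes and $\lambda$):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 5))
y = rng.standard_normal(50)
lam = 0.1

beta_hat = np.linalg.solve(X.T @ X + lam * np.eye(5), X.T @ y)
grad = 2 * X.T @ (X @ beta_hat - y) + 2 * lam * beta_hat
print(np.linalg.norm(grad))                 # ~0 up to rounding
```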
E2.20. $(\mathbf{A}\mathbf{B})_{ii} = \sum_j a_{ij} b_{ji}$, so $\operatorname{tr}(\mathbf{A}\mathbf{B}) = \sum_i \sum_j a_{ij} b_{ji} = \sum_j \sum_i b_{ji} a_{ij} = \sum_j (\mathbf{B}\mathbf{A})_{jj} = \operatorname{tr}(\mathbf{B}\mathbf{A})$. Note both products are square: for $\mathbf{A} \in \mathbb{R}^{m \times n}$ and $\mathbf{B} \in \mathbb{R}^{n \times m}$, $\mathbf{A}\mathbf{B}$ is $m \times m$ and $\mathbf{B}\mathbf{A}$ is $n \times n$, so the trace makes sense for each.
E2.21. $\mathbf{x}^\top \mathbf{A}^\top \mathbf{A} \mathbf{x} = \|\mathbf{A}\mathbf{x}\|_2^2 \ge 0$ and equals zero iff $\mathbf{A}\mathbf{x} = \mathbf{0}$, hence positive semi-definite. The non-zero eigenvalues match because $\mathbf{A}^\top \mathbf{A}$ and $\mathbf{A}\mathbf{A}^\top$ share the SVD: their eigenvalues are the squared singular values, padded with zeros in different ways.
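E2.22. A minimal sketch via the SVD of the centred data (plotting omitted; `plt.scatter(Z[:, 0], Z[:, 1], c=y)` suffices). Each component's sign is arbitrary, so compare against scikit-learn up to sign:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, y = load_iris(return_X_y=True)
Xc = X - X.mean(axis=0)                      # centre the features
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
Z = Xc @ Vt[:2].T                            # project onto top-2 principal components

Z_sk = PCA(n_components=2).fit_transform(X)
print(np.allclose(np.abs(Z), np.abs(Z_sk)))  # True: agreement up to sign
```

In this projection setosa separates linearly from the other two classes, while versicolor and virginica overlap slightly.

E2.23. The einsum is `np.einsum('oi,bi->bo', W, X)`, equivalent to `X @ W.T`:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((3, 5))              # (out, in)
X = rng.standard_normal((7, 5))              # (batch, in)
Y = np.einsum('oi,bi->bo', W, X)
assert Y.shape == (7, 3)
assert np.allclose(Y, X @ W.T)               # same contraction in matrix form
```

E2.24. A sketch (the softmax helper is our own; shapes follow the exercise):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

b, h, s, d_h = 2, 4, 8, 16
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((b, h, s, d_h)) for _ in range(3))

S = np.einsum('bhsd,bhtd->bhst', Q, K) / np.sqrt(d_h)  # attention scores
A = softmax(S, axis=-1)                                # normalise over t
O = np.einsum('bhst,bhtd->bhsd', A, V)                 # weighted sum of values
assert S.shape == (2, 4, 8, 8) and O.shape == (2, 4, 8, 16)
```

E2.25. A sketch; the known coefficient vector is an arbitrary choice. Expect the normal equations (b) to be markedly worse, since forming $\mathbf{H}^\top \mathbf{H}$ squares the condition number:

```python
import numpy as np

n = 5
i, j = np.indices((n, n))
H = 1.0 / (i + j + 1)                        # 0-based i, j: equals 1/(i + j - 1) with 1-based indices
print(np.linalg.cond(H))                     # ~4.8e5

beta = np.arange(1.0, n + 1)
y = H @ beta
estimates = {
    "solve":      np.linalg.solve(H, y),
    "normal eqs": np.linalg.solve(H.T @ H, H.T @ y),
    "lstsq":      np.linalg.lstsq(H, y, rcond=None)[0],
}
for name, est in estimates.items():
    print(name, np.linalg.norm(est - beta))
```

E2.26. Shift by the maximum so the largest exponent is $e^0 = 1$: this cannot overflow, and the identity $\log\sum_i e^{z_i} = m + \log\sum_i e^{z_i - m}$ holds for any shift $m$. A sketch:

```python
import numpy as np

def stable_logsumexp(z):
    z = np.asarray(z, dtype=float)
    m = np.max(z)                            # shift so exp() sees only values <= 0
    return m + np.log(np.sum(np.exp(z - m)))

# 1000 + log(2) ~ 1000.6931; the naive np.log(np.sum(np.exp(z))) returns inf
print(stable_logsumexp([1000.0, 1000.0]))
```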
E2.27. $\|\mathbf{Q}\mathbf{x}\|_2^2 = \mathbf{x}^\top \mathbf{Q}^\top \mathbf{Q}\mathbf{x} = \mathbf{x}^\top \mathbf{x} = \|\mathbf{x}\|_2^2$. Determinant: $\det(\mathbf{Q}^\top \mathbf{Q}) = (\det \mathbf{Q})^2 = \det \mathbf{I} = 1$, so $\det \mathbf{Q} = \pm 1$.
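E2.29. A sketch of both variants (classical projects against the original column; modified re-orthogonalises the running residual). On a well-conditioned random matrix all three methods give similar errors; the gap opens up on ill-conditioned inputs:

```python
import numpy as np

def gram_schmidt(A, modified=False):
    A = np.asarray(A, dtype=float)
    m, n = A.shape
    Q = np.zeros((m, n))
    for k in range(n):
        v = A[:, k].copy()
        for i in range(k):
            coef = Q[:, i] @ (v if modified else A[:, k])
            v -= coef * Q[:, i]              # remove the component along q_i
        Q[:, k] = v / np.linalg.norm(v)
    return Q

rng = np.random.default_rng(0)
A = rng.standard_normal((20, 10))
for name, Q in [("classical GS", gram_schmidt(A)),
                ("modified GS", gram_schmidt(A, modified=True)),
                ("np.linalg.qr", np.linalg.qr(A)[0])]:
    print(name, np.linalg.norm(Q.T @ Q - np.eye(10)))
```

For the plot, repeat over several matrices (or increasing condition numbers) and plot the three $\|\mathbf{Q}^\top \mathbf{Q} - \mathbf{I}\|_F$ values side by side.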
E2.30. Top-50 components: store $\mathbf{V}_{50} \in \mathbb{R}^{4096 \times 50}$ and projections $\mathbf{X}_{50} \in \mathbb{R}^{1000 \times 50}$, totalling $4096 \cdot 50 + 1000 \cdot 50 = 254{,}800$ floats versus the original $4{,}096{,}000$. Compression ratio $\approx 6.2\%$, i.e. roughly 16× reduction.
E2.31. Diagonalise $\mathbf{A} = \mathbf{Q}\boldsymbol{\Lambda}\mathbf{Q}^\top$. Substitute $\mathbf{y} = \mathbf{Q}^\top \mathbf{x}$; the constraint $\|\mathbf{x}\| = 1$ becomes $\|\mathbf{y}\| = 1$ since $\mathbf{Q}$ is orthogonal. Then $\mathbf{x}^\top \mathbf{A}\mathbf{x} = \sum_i \lambda_i y_i^2 \le \lambda_1 \sum_i y_i^2 = \lambda_1$, with equality when $\mathbf{y} = \mathbf{e}_1$, i.e. $\mathbf{x} = \mathbf{q}_1$, the eigenvector for $\lambda_1$.
E2.32. $\|\mathbf{A}\mathbf{x}\|_2^2 = \mathbf{x}^\top \mathbf{A}^\top \mathbf{A}\mathbf{x}$. By E2.31, the maximum of this on $\|\mathbf{x}\|=1$ is the largest eigenvalue of $\mathbf{A}^\top \mathbf{A}$, which is $\sigma_1^2$. So $\|\mathbf{A}\|_2 = \sigma_1$.