Exercises

  1. Mean vs median sensitivity. A dataset has values $\{2, 4, 4, 5, 6, 6, 7\}$. Compute the mean and median. Now replace 7 with 700 and recompute. Quantify the change in each.

  2. Bessel's correction. Show by direct calculation that for an i.i.d. sample of size $n$ from any distribution with finite variance $\sigma^2$, $\mathbb{E}[\frac{1}{n}\sum(X_i - \bar X)^2] = \frac{n-1}{n}\sigma^2$. Hint: expand the square and use $\operatorname{Var}(\bar X) = \sigma^2/n$.

  3. Skew detection. Generate 10000 samples from each of: $\mathcal{N}(0,1)$, $\operatorname{Exp}(1)$, and $\operatorname{Lognormal}(0,1)$. Compute the sample skewness for each and explain which sign corresponds to a long right tail. (Starter sketch below.)

  4. CLT at work. Roll 30 fair six-sided dice and record the average. Repeat 10000 times. Plot the histogram of averages. What distribution does it approach, with what mean and standard deviation? Confirm numerically. (Starter sketch below.)

  5. Bias–variance MSE. Two estimators of $\theta = 5$: $\hat\theta_1$ has bias 0 and variance 9; $\hat\theta_2$ has bias 1 and variance 4. Which has lower MSE? By how much?

  6. Sufficient statistic. Show that for $X_1, \ldots, X_n \sim \operatorname{Poisson}(\lambda)$, the sum $T = \sum X_i$ is sufficient for $\lambda$. Use the factorisation theorem.

  7. MLE for exponential. Derive the MLE of $\lambda$ for an i.i.d. sample from $\operatorname{Exp}(\lambda)$ with density $f(x) = \lambda e^{-\lambda x}$, $x > 0$.

  8. MLE for uniform. $X_1, \ldots, X_n \sim \operatorname{Uniform}(0, \theta)$. Derive the MLE. Why is it biased? Compute its bias.

  9. Fisher information for Bernoulli. Verify $I(\theta) = 1/[\theta(1-\theta)]$. Where is information maximised? Minimised? Interpret.

  10. Beta–Bernoulli posterior. Starting with $\operatorname{Beta}(2, 2)$ and observing 8 successes in 10 trials, find the posterior distribution, posterior mean, posterior mode, and 95% credible interval. (Starter sketch below.)

  11. MAP from Gaussian prior. For a single observation $x$ of $X \sim \mathcal{N}(\theta, 1)$ with prior $\theta \sim \mathcal{N}(0, \tau^2)$, derive the posterior distribution and the MAP estimate. Show how the MAP relates to ridge regression.

  12. Confidence interval coverage. Simulate 1000 datasets from $\mathcal{N}(0, 1)$ with $n = 20$. For each, construct a 95% $t$-CI for the mean. What fraction contain 0? It should be $\approx 95\%$. (Starter sketch below.)

  13. Power calculation. For a one-sample $z$-test of $H_0: \mu = 0$ at $\alpha = 0.05$ with $\sigma = 1$ and $n = 100$, what is the power against $\mu = 0.2$? Against $\mu = 0.5$? Against $\mu = 0.1$? (Starter sketch below.)

  14. P-value misinterpretations. State three common misinterpretations of "the p-value is 0.03" and explain why each is wrong.

  15. Bonferroni vs BH. With 20 tests producing p-values $\{0.001, 0.005, 0.01, 0.02, 0.03, 0.04, 0.045, 0.05, 0.06, 0.08, 0.1, 0.12, 0.15, 0.2, 0.3, 0.4, 0.5, 0.6, 0.8, 0.9\}$ at $\alpha = 0.05$, which are rejected by Bonferroni? By Benjamini–Hochberg? (Starter sketch below.)

  16. Bootstrap for correlation. Generate $n = 50$ from a bivariate Gaussian with $\rho = 0.6$. Compute the sample correlation, and bootstrap a 95% CI. Compare to the Fisher $z$-transform CI. (Starter sketch below.)

  17. OLS by hand. Given $X = [(1, 1), (1, 2), (1, 3), (1, 4)]$ (intercept and one feature) and $Y = (2, 5, 4, 8)$, compute $\hat\beta_{\text{OLS}}$ by hand. Verify with `numpy.linalg.lstsq`. (Starter sketch below.)

  18. Logistic regression intercept-only. For data $y_1, \ldots, y_n \in \{0, 1\}$ with no features, show that the intercept-only logistic-regression MLE recovers $\hat p = \bar y$.

  19. GLM canonical link. State the canonical link for the Gaussian, Bernoulli, Binomial, Poisson, and Gamma distributions and explain what type of response each pairing is suited to.

  20. James–Stein on simulated data. Simulate $k = 10$ values $\mu_i \sim \mathcal{N}(0, 5)$, then $X_i \sim \mathcal{N}(\mu_i, 1)$. Compute the MLE risk $\sum(X_i - \mu_i)^2$ and the James–Stein risk $\sum(\hat\mu_i^{\text{JS}} - \mu_i)^2$. Repeat 1000 times, average, and compare. (Starter sketch below.)

  21. Cross-validation variance. Cross-validation gives a single risk estimate. Estimate its standard error via repeated $K$-fold CV. Why might this underestimate the true CV variance?

  22. AIC vs BIC. Show that for $n > e^2 \approx 7.4$, BIC penalises additional parameters more heavily than AIC. What does this imply about model choice as $n$ grows?

  23. Simpson's paradox construction. Construct a $2 \times 2 \times 2$ contingency table with two treatments and two outcomes within two strata where treatment A wins in each stratum but B wins overall.

  24. Confounder vs mediator. Define each. Give a clinical example. Should you adjust for a mediator when estimating a total causal effect? Why or why not?

  25. Paired test power. Two models scored on the same 200 test examples have correlated errors ($\rho = 0.85$) and differ by 1.5% on average with per-example SD 0.30. Does a paired $t$-test detect the difference at $\alpha = 0.05$? Compare to the unpaired test. (Starter sketch below.)

  26. Bootstrap for AUC. Implement a bootstrap CI for the AUC of a binary classifier on a 1000-example test set. Compare its width with the asymptotic Hanley–McNeil formula. (Starter sketch below.)

  27. Bias–variance simulation. Reproduce the bias–variance simulation of Section 5.16 with $k$-NN regression for $k \in \{1, 3, 10, 30\}$. Identify the sweet spot.

  28. Double descent demonstration. Train a series of two-layer neural networks of increasing width on a small classification dataset. Track training and test error. Do you observe double descent? Why or why not?

  29. MLE for linear regression with heteroscedastic noise. Suppose $y_i \sim \mathcal{N}(x_i^\top\beta, \sigma_i^2)$ with known $\sigma_i$. Derive the MLE of $\beta$ and show that it is weighted least squares with weights $1/\sigma_i^2$.

  30. Predictive distribution comparison. For a Gaussian likelihood with unknown mean (variance known) and conjugate Gaussian prior, derive the posterior predictive distribution and compare its variance to (a) the sampling variance of $\bar X$ and (b) the variance of a single observation. Interpret.

  31. Calibration as a statistical property. A binary classifier outputs probabilities. Define what it means for the classifier to be calibrated. Construct an empirical calibration curve from 10000 predictions. How does it relate to reliability diagrams? (Starter sketch below.)

  32. Reporting CIs in a benchmark paper. Choose a recent ML paper that compares its method to a baseline. Calculate (or extract) the test-set size and reported metric. Estimate the 95% CI on the difference. Was the comparison statistically conclusive?
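
Starter code sketches

The sketches below are minimal starting points for the simulation-style exercises, not worked solutions. They assume numpy and scipy (and, for Exercise 26, scikit-learn); where an exercise supplies no data, the sketch generates placeholder data that you should replace with your own.

For Exercise 3, a sketch that draws the three samples and computes sample skewness with scipy.stats.skew:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 10_000
samples = {
    "N(0,1)": rng.normal(0.0, 1.0, n),
    "Exp(1)": rng.exponential(1.0, n),
    "Lognormal(0,1)": rng.lognormal(0.0, 1.0, n),
}
for name, x in samples.items():
    print(f"{name:15s} sample skewness = {stats.skew(x):+.3f}")
```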
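
For Exercise 4, a sketch of the dice simulation; the histogram step is left to you (e.g. matplotlib):

```python
import numpy as np

rng = np.random.default_rng(0)
avgs = rng.integers(1, 7, size=(10_000, 30)).mean(axis=1)  # 10000 averages of 30 dice

# CLT prediction: approximately normal with mean 3.5 and sd sqrt(35/12)/sqrt(30),
# since a single fair die has mean 3.5 and variance 35/12
print("simulated:", avgs.mean(), avgs.std(ddof=1))
print("predicted:", 3.5, (35 / 12) ** 0.5 / 30 ** 0.5)
```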
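
For Exercise 10, a numerical check of the conjugate Beta update; derive the posterior by hand first, then compare:

```python
from scipy import stats

# Beta(2, 2) prior updated with 8 successes and 2 failures
a, b = 2 + 8, 2 + 2
post = stats.beta(a, b)
print("posterior mean:", post.mean())
print("posterior mode:", (a - 1) / (a + b - 2))  # valid since a, b > 1
print("95% equal-tailed credible interval:", post.interval(0.95))
```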
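
For Exercise 12, a coverage simulation for the 95% $t$-interval:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, reps = 20, 1000
tcrit = stats.t.ppf(0.975, df=n - 1)

covered = 0
for _ in range(reps):
    x = rng.normal(0.0, 1.0, n)
    half = tcrit * x.std(ddof=1) / np.sqrt(n)       # half-width of the t-CI
    covered += (x.mean() - half) <= 0 <= (x.mean() + half)
print("coverage:", covered / reps)
```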
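
For Exercise 13, a sketch assuming a two-sided test (for a one-sided test, drop the second term and use $z_\alpha$):

```python
import numpy as np
from scipy.stats import norm

alpha, sigma, n = 0.05, 1.0, 100
z = norm.ppf(1 - alpha / 2)
for mu in (0.1, 0.2, 0.5):
    delta = mu * np.sqrt(n) / sigma                 # shift of the test statistic under mu
    power = norm.cdf(delta - z) + norm.cdf(-delta - z)
    print(f"mu = {mu}: power = {power:.3f}")
```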
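
For Exercise 15, a sketch of both multiple-testing procedures:

```python
import numpy as np

p = np.array([0.001, 0.005, 0.01, 0.02, 0.03, 0.04, 0.045, 0.05, 0.06, 0.08,
              0.1, 0.12, 0.15, 0.2, 0.3, 0.4, 0.5, 0.6, 0.8, 0.9])
m, alpha = len(p), 0.05

print("Bonferroni rejects:", p[p <= alpha / m])

# Benjamini-Hochberg: find the largest k with p_(k) <= (k/m) * alpha,
# then reject every hypothesis with p-value at or below p_(k)
ps = np.sort(p)
ok = np.nonzero(ps <= np.arange(1, m + 1) / m * alpha)[0]
cut = ps[ok.max()] if ok.size else -np.inf
print("BH rejects:", p[p <= cut])
```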
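
For Exercise 16, a pairs-bootstrap sketch with the Fisher $z$ comparison:

```python
import numpy as np

rng = np.random.default_rng(0)
n, rho, B = 50, 0.6, 5000
xy = rng.multivariate_normal([0, 0], [[1, rho], [rho, 1]], size=n)
r = np.corrcoef(xy[:, 0], xy[:, 1])[0, 1]

boot = np.empty(B)
for b in range(B):
    idx = rng.integers(0, n, n)                     # resample (x, y) pairs together
    boot[b] = np.corrcoef(xy[idx, 0], xy[idx, 1])[0, 1]
print("sample r:", r)
print("bootstrap 95% CI:", np.percentile(boot, [2.5, 97.5]))

# Fisher z-transform CI: tanh(arctanh(r) +/- 1.96 / sqrt(n - 3))
z = np.arctanh(r)
print("Fisher z CI:", np.tanh(z + np.array([-1.96, 1.96]) / np.sqrt(n - 3)))
```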
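
For Exercise 17, a numerical check for your hand calculation:

```python
import numpy as np

X = np.array([[1, 1], [1, 2], [1, 3], [1, 4]], dtype=float)
y = np.array([2, 5, 4, 8], dtype=float)

# Normal equations: (X^T X) beta = X^T y
beta_hand = np.linalg.solve(X.T @ X, X.T @ y)
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hand, beta_lstsq)
```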
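
For Exercise 20, a sketch of the risk comparison; it reads $\mathcal{N}(0, 5)$ as variance 5 and uses the plain (not positive-part) James–Stein estimator, so adjust if you intend otherwise:

```python
import numpy as np

rng = np.random.default_rng(0)
k, reps = 10, 1000
risk_mle = risk_js = 0.0
for _ in range(reps):
    mu = rng.normal(0.0, np.sqrt(5.0), k)           # sd = sqrt(5), i.e. variance 5
    x = rng.normal(mu, 1.0)
    js = (1 - (k - 2) / np.sum(x**2)) * x           # James-Stein shrinkage toward 0
    risk_mle += np.sum((x - mu) ** 2)
    risk_js += np.sum((js - mu) ** 2)
print("average MLE risk:", risk_mle / reps)         # theory: equals k = 10
print("average JS risk: ", risk_js / reps)
```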
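
For Exercise 25, a simulation sketch; it assumes "per-example SD 0.30" describes each model's per-example scores. Repeat over many seeds (or derive the analytic power) for a cleaner answer:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, rho, sd, diff = 200, 0.85, 0.30, 0.015

cov = sd**2 * np.array([[1, rho], [rho, 1]])        # correlated per-example scores
scores = rng.multivariate_normal([0.0, diff], cov, size=n)

print("paired:  ", stats.ttest_rel(scores[:, 1], scores[:, 0]))
print("unpaired:", stats.ttest_ind(scores[:, 1], scores[:, 0]))
```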
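
For Exercise 26, a bootstrap-CI sketch using scikit-learn's roc_auc_score on placeholder labels and scores; substitute your classifier's outputs:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 1000
y = rng.integers(0, 2, n)                           # placeholder labels
s = y + rng.normal(0.0, 1.0, n)                     # placeholder scores

auc = roc_auc_score(y, s)
boot = []
for _ in range(2000):
    idx = rng.integers(0, n, n)
    if y[idx].min() != y[idx].max():                # need both classes in the resample
        boot.append(roc_auc_score(y[idx], s[idx]))
print("AUC:", auc)
print("bootstrap 95% CI:", np.percentile(boot, [2.5, 97.5]))
```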
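
For Exercise 31, a binned calibration-curve sketch; the placeholder model here is calibrated by construction, so the two columns should roughly agree:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
p = rng.uniform(0, 1, n)                            # placeholder predicted probabilities
y = rng.binomial(1, p)                              # outcomes drawn from those probabilities

edges = np.linspace(0, 1, 11)                       # 10 equal-width bins
which = np.digitize(p, edges[1:-1])
for b in range(10):
    mask = which == b
    if mask.any():
        print(f"bin {b}: mean prediction = {p[mask].mean():.3f}, "
              f"empirical frequency = {y[mask].mean():.3f}")
```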
