5.3 Descriptive Statistics
Before fitting any model, look at your data. Descriptive statistics give you the tools for that first look, and skipping it is the most reliable way to ship a broken model.
Central tendency
- Mean: $\bar{x} = \frac{1}{n}\sum_{i=1}^n x_i$. Minimises the sum of squared deviations $\sum (x_i - c)^2$ over $c$. Sensitive to outliers: one extreme value can shift it substantially.
- Median: the middle value when sorted. Far more robust to outliers; minimises the sum of absolute deviations $\sum |x_i - c|$.
- Mode: the most frequent value. The only sensible measure for purely categorical data.
A trimmed mean (remove the top and bottom $k\%$ then average) interpolates between mean and median, trading some efficiency under Gaussian noise for robustness against contamination. In ML, the trimmed mean is the basis of robust aggregators in federated learning that defend against Byzantine workers.
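A $k\%$ trimmed mean is easy to sketch directly in NumPy (the cut fractions below are illustrative; `scipy.stats.trim_mean` packages the same idea as a library call):

```python
import numpy as np

def trimmed_mean(x, k=0.1):
    """Drop the lowest and highest fraction k of values, then average.

    With k=0 this is the ordinary mean; as k approaches 0.5 it
    approaches the median.
    """
    x = np.sort(np.asarray(x, dtype=float))
    cut = int(len(x) * k)              # points trimmed from each end
    return x[cut:len(x) - cut].mean()

x = np.array([4, 8, 6, 5, 3, 12, 7])
print(trimmed_mean(x, k=0.0))   # plain mean, 45/7 ~ 6.43
print(trimmed_mean(x, k=0.15))  # drops 3 and 12, averages the middle five: 6.0
```

Note how trimming one point from each end already removes the influence of the outlying value 12.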
Spread
- Variance (sample): $s^2 = \frac{1}{n-1}\sum(x_i - \bar{x})^2$. The $n-1$ (Bessel's correction) makes $s^2$ an unbiased estimator of the population variance $\sigma^2$. Why $n-1$? Because we estimated the mean from the same data, we consumed one degree of freedom: the deviations $x_i - \bar{x}$ are constrained to sum to zero.
- Standard deviation: $s = \sqrt{s^2}$. Same units as the data.
- Interquartile range (IQR = $Q_3 - Q_1$): a robust spread measure used in box plots; an outlier is conventionally a point more than $1.5 \times \text{IQR}$ from the nearest quartile.
- Median absolute deviation (MAD): $\operatorname{median}(|x_i - \operatorname{median}(x)|)$. Even more robust; widely used in robust ML and computer vision.
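All four spread measures take only a few lines of NumPy; a sketch on the small dataset used in the worked example below:

```python
import numpy as np

x = np.array([4, 8, 6, 5, 3, 12, 7], dtype=float)

s2 = x.var(ddof=1)                 # sample variance (Bessel's correction)
s = x.std(ddof=1)                  # sample standard deviation
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1                      # interquartile range
mad = np.median(np.abs(x - np.median(x)))  # median absolute deviation

# conventional box-plot outlier fences: 1.5 * IQR beyond each quartile
lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = x[(x < lo) | (x > hi)]
print(s2, s, iqr, mad, outliers)
```

On this sample the upper fence lands at exactly 12, so the outlying point sits right on the boundary rather than beyond it, a reminder that the $1.5 \times \text{IQR}$ rule is a convention, not a law.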
Shape
- Skewness measures asymmetry. Income, wealth, file sizes, and event-count distributions are typically right-skewed. Skewness motivates the use of the log transform.
- Kurtosis measures tail heaviness relative to a Gaussian (which has kurtosis 3, or "excess kurtosis" 0). Financial returns, gradient magnitudes during deep learning training, and biological measurements are often heavy-tailed; Gaussian-based variance estimates underestimate the frequency of extreme events.
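Both shape measures are standardised moments, so they can be estimated directly; a minimal sketch, with simulated Gaussian and log-normal samples as illustrations:

```python
import numpy as np

def skewness(x):
    """Third standardised moment (population ddof=0 standardisation)."""
    x = np.asarray(x, dtype=float)
    z = (x - x.mean()) / x.std()
    return np.mean(z**3)

def excess_kurtosis(x):
    """Fourth standardised moment minus 3: zero for a Gaussian."""
    x = np.asarray(x, dtype=float)
    z = (x - x.mean()) / x.std()
    return np.mean(z**4) - 3.0

rng = np.random.default_rng(0)
gauss = rng.normal(size=100_000)
lognorm = rng.lognormal(size=100_000)  # right-skewed and heavy-tailed

print(skewness(gauss), excess_kurtosis(gauss))      # both near 0
print(skewness(lognorm), excess_kurtosis(lognorm))  # both large, positive
```

Taking `np.log(lognorm)` would recover an exactly Gaussian sample, which is the intuition behind the log transform for right-skewed data.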
In AI, knowing the shape of your data guides:
- Choice of transform (log, Box–Cox, Yeo–Johnson) before linear models that assume Gaussian residuals.
- Choice of loss function (squared error implicitly assumes Gaussianity; Huber loss handles heavier tails; absolute loss handles still heavier).
- Distributional assumptions in generative models (a Student-$t$ output head can be far better calibrated than Gaussian for long-tailed targets).
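The tail behaviour of these losses is visible directly in how one gross residual contributes to each; a sketch (the residual values and Huber's $\delta = 1$ are illustrative):

```python
import numpy as np

def squared(r):
    return 0.5 * r**2

def absolute(r):
    return np.abs(r)

def huber(r, delta=1.0):
    # quadratic near zero, linear in the tails: an outlier contributes
    # O(|r|) to the total loss, not O(r^2)
    small = np.abs(r) <= delta
    return np.where(small, 0.5 * r**2, delta * (np.abs(r) - 0.5 * delta))

residuals = np.array([0.1, -0.3, 0.2, 8.0])  # one gross outlier
print(squared(residuals).sum())   # ~32.07, dominated by the outlier
print(huber(residuals).sum())     # 7.57, outlier enters only linearly
print(absolute(residuals).sum())  # 8.6
```

Under squared error the single outlier accounts for essentially the whole objective, which is exactly how a few bad labels can drag a regression fit.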
Multivariate summaries
For data with many features, the covariance matrix $\Sigma$ has entries $\Sigma_{ij} = \operatorname{Cov}(X_i, X_j)$, and the correlation matrix rescales each entry to $[-1, 1]$. Scatter-plot matrices and correlation heatmaps reveal clusters of redundant features that motivate dimensionality reduction.
A practical concern: in high-dimensional regression, predictors that are highly correlated cause the OLS solution to become unstable. The variance inflation factor $\text{VIF}_j = 1/(1 - R_j^2)$, where $R_j^2$ is the coefficient of determination from regressing feature $j$ on all the others, quantifies this. A VIF above 5 or 10 is conventionally treated as concerning. PCA, ridge regression, or lasso can each address the issue but do so in different ways: PCA changes the basis, ridge shrinks coefficients while keeping all features, lasso zeroes out some entirely.
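The VIF formula can be sketched with plain least squares; the synthetic features here (one near-duplicate pair, one independent column) are illustrative:

```python
import numpy as np

def vif(X, j):
    """VIF of column j: regress it on the other columns, then 1/(1 - R^2)."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([others, np.ones(len(X))])  # add an intercept
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    r2 = 1 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))
    return 1.0 / (1.0 - r2)

rng = np.random.default_rng(1)
x1 = rng.normal(size=200)
x2 = x1 + 0.05 * rng.normal(size=200)  # nearly a copy of x1
x3 = rng.normal(size=200)              # independent of the others
X = np.column_stack([x1, x2, x3])
print([round(vif(X, j), 1) for j in range(3)])  # x1, x2 huge; x3 near 1
```

The near-duplicate pair produces VIFs in the hundreds, far past the conventional threshold of 5 or 10, while the independent column stays close to the minimum value of 1.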
Anscombe's quartet, why you plot
Frank Anscombe's 1973 four-dataset construction is the canonical reminder that summary statistics conceal structure. Each of his four datasets has identical means, identical variances, identical correlations, and identical regression lines, yet the scatter plots could hardly be more different: one is a clean linear relationship, one is a clear curve, one has an outlier driving the entire fit, and one has an extreme high-leverage point that fixes the slope from a single observation.
The modern echo of Anscombe's quartet is the Datasaurus Dozen (Matejka and Fitzmaurice, 2017), a set of 13 datasets with identical summary statistics to two decimal places, including one in the unmistakable shape of a dinosaur. The lesson is unchanged. Always plot your data. Histograms expose multimodality. Box plots expose outliers. Pairs plots expose redundancy. QQ-plots expose distributional shape. In ML, this discipline extends to plotting feature distributions per class, embedding visualisations (UMAP, t-SNE), confusion matrices, and calibration curves. Models that look fine on aggregate metrics fail on subgroups; only plotting reveals the failure.
Worked example, hand calculation
Consider the small dataset $\mathbf{x} = (4, 8, 6, 5, 3, 12, 7)$, $n = 7$.
Mean: $\bar x = (4+8+6+5+3+12+7)/7 = 45/7 \approx 6.43$.
Sorted: $(3, 4, 5, 6, 7, 8, 12)$, median $= 6$.
Squared deviations from $\bar x$:
$$(4-6.43)^2 = 5.90,\ (8-6.43)^2 = 2.47,\ (6-6.43)^2 = 0.18,$$ $$(5-6.43)^2 = 2.04,\ (3-6.43)^2 = 11.76,\ (12-6.43)^2 = 31.04,\ (7-6.43)^2 = 0.33.$$
Sum (carrying full precision in $\bar x = 45/7$) $= 53.71$. Sample variance $s^2 = 53.71/6 \approx 8.95$. Sample SD $s \approx 2.99$.
Note how the value 12 dominates the sum of squared deviations (31.04 of 53.71). One outlying observation drives most of the variance. The median (6) is unmoved by this point; the mean (6.43) is.
import numpy as np

x = np.array([4, 8, 6, 5, 3, 12, 7])
# ddof=1 applies Bessel's correction, matching the hand calculation above
print(f"mean={x.mean():.3f} median={np.median(x):.3f} "
      f"std={x.std(ddof=1):.3f} IQR={np.percentile(x,75)-np.percentile(x,25):.3f}")
Sampling
A simple random sample gives every possible subset of size $n$ equal probability. Other schemes (stratified, cluster, systematic) exploit known population structure to improve efficiency. In machine learning, the i.i.d. assumption (training examples drawn independently from a fixed distribution) is itself a sampling assumption. When it breaks (selection bias, distribution shift, label leakage), models can fail badly in deployment. Many notorious AI failures (a chest X-ray model that learns "this hospital uses portable X-rays for sicker patients", a recidivism model that learns historical policing patterns) are sampling failures, not learning failures. No amount of model sophistication compensates for an unrepresentative sample.
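The contrast between simple random and stratified sampling shows up clearly on a population with a rare stratum; a sketch with hypothetical sizes (a 5% positive class):

```python
import numpy as np

rng = np.random.default_rng(0)

# hypothetical population: 5% positives among 1000 units
labels = np.array([1] * 50 + [0] * 950)

# simple random sample of 40: the rare-class count fluctuates,
# and can even be zero
srs = rng.choice(labels, size=40, replace=False)

# stratified sample: fix per-stratum counts to match the population rate
pos_idx = np.flatnonzero(labels == 1)
neg_idx = np.flatnonzero(labels == 0)
strat = np.concatenate([
    rng.choice(pos_idx, size=2, replace=False),   # 5% of 40
    rng.choice(neg_idx, size=38, replace=False),
])
print(srs.sum(), labels[strat].sum())  # stratified count is exactly 2
```

Stratification guarantees the rare class is represented in every draw, which is why class-stratified train/test splits are standard practice for imbalanced classification.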