8.13 Autoencoders

An autoencoder is a neural network that learns to compress its input and then reconstruct it. The network is trained as a single pipeline: the first half, the encoder, squeezes the input through a narrow bottleneck; the second half, the decoder, expands the bottleneck back out and tries to recover the original. The network is judged purely by how close the reconstruction is to the input. There are no human labels involved. The input is the target. That is what makes autoencoders an unsupervised method: they extract structure from unlabelled data.

The interesting object is not the reconstruction itself, which is usually only an approximation, but the bottleneck. Because the network is forced to push everything it knows about the input through this narrow channel, the values in the bottleneck become a compressed, learned summary of the input, what the literature calls a latent code or representation. Once the network is trained, the encoder alone is a useful tool: it gives you a low-dimensional vector for any input, which you can use for visualisation, as features for a downstream classifier, for fast nearest-neighbour search, or as input to another model entirely.

Autoencoders are the workhorse of four practical tasks: non-linear dimensionality reduction beyond what PCA can manage; denoising corrupted signals; anomaly detection in streams of sensor or transaction data; and providing the architectural backbone of variational autoencoders, modern self-supervised vision pretraining, and the latent-space tricks that make Stable Diffusion tractable. They sit on the bridge between classical unsupervised statistics and modern representation learning.

A bridge to the section just above. §8.8 introduced PCA: the optimal linear compression of data into a small number of orthogonal directions. It is fast, has a closed-form solution via the SVD, and is a sensible default for any continuous dataset. But linearity is also its ceiling. If the structure in your data lies on a curved manifold (pixels of digits rotating, faces under different lighting, words in a sentence), then no linear projection will ever capture it cleanly. §8.13 lifts that ceiling. Replace the linear encoder and decoder of PCA with multilayer neural networks and you have a non-linear analogue: the autoencoder. PCA becomes the special case where the activations are linear, and the autoencoder is what you get when you let those layers bend.

Symbols Used Here
$\mathbf{x}$: input
$\mathbf{z}$: latent code
$\hat{\mathbf{x}}$: reconstruction
$f_{\text{enc}}, f_{\text{dec}}$: encoder, decoder networks
$\theta_{\text{enc}}, \theta_{\text{dec}}$: their parameters

The architecture

The encoder is a function $f_{\text{enc}}: \mathbb{R}^d \to \mathbb{R}^k$ with parameters $\theta_{\text{enc}}$. Given an input $\mathbf{x} \in \mathbb{R}^d$, it produces the latent code $\mathbf{z} = f_{\text{enc}}(\mathbf{x}) \in \mathbb{R}^k$. The decoder is the mirror function $f_{\text{dec}}: \mathbb{R}^k \to \mathbb{R}^d$ with parameters $\theta_{\text{dec}}$, taking the latent code and producing a reconstruction $\hat{\mathbf{x}} = f_{\text{dec}}(\mathbf{z})$. Composed, they form a single map from input space back to itself.

Training minimises a reconstruction loss averaged over the dataset:

$$ \mathcal{L}(\theta_{\text{enc}}, \theta_{\text{dec}}) = \frac{1}{n} \sum_{i=1}^{n} \bigl\lVert \mathbf{x}^{(i)} - f_{\text{dec}}(f_{\text{enc}}(\mathbf{x}^{(i)})) \bigr\rVert^2. $$

Squared error is the conventional choice for continuous data such as embeddings, sensor readings, or normalised pixel intensities. For binary or pixel-probability data, binary cross-entropy is more natural and tends to give crisper reconstructions. Other losses appear in specialised settings: perceptual losses comparing features from a pretrained vision network for image autoencoders, or adversarial losses combined with a reconstruction term in autoencoder-GAN hybrids.
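In PyTorch the two standard choices look like this (a minimal sketch; perceptual and adversarial losses need a pretrained network or a discriminator and are not shown here):

import torch.nn as nn

# Continuous targets (embeddings, sensor readings, unbounded values):
# squared error, with no activation on the decoder output.
mse = nn.MSELoss()

# Binary or [0, 1] pixel-probability targets: binary cross-entropy,
# with a sigmoid on the decoder output.
bce = nn.BCELoss()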

The defining structural choice is the bottleneck: the latent dimension $k$ is much smaller than the input dimension $d$. This is what forces the network to compress. If $k \geq d$ and the network has enough capacity, it can simply learn the identity function, store the input verbatim in the hidden layer and copy it back, which teaches it nothing. Setting $k \ll d$ removes that escape route. The network must throw information away, and the only way to keep the reconstruction loss low is to throw away the least useful information, retaining whatever is most predictive of the rest of the input. That implicit priority ordering is precisely what makes the bottleneck a meaningful representation.

In practice the encoder and decoder are stacks of fully connected or convolutional layers with non-linear activations between them. ReLU is standard in the hidden layers; the output activation is chosen to match the data range, sigmoid for pixels in [0, 1], no activation for unbounded continuous targets. Optimisation is by stochastic gradient descent with Adam, exactly as for any other neural network. There is nothing exotic about the training procedure; the unusual aspect is only that the labels are the inputs themselves.

Worked example: MNIST autoencoder

MNIST is the canonical sandbox for autoencoders. Each example is a 28 × 28 greyscale image of a handwritten digit, flattened into a 784-dimensional vector with pixel intensities normalised to [0, 1]. The dataset has 60,000 training images and 10,000 test images. Labels exist but are not used during autoencoder training.

A reasonable architecture: encoder 784 → 256 → 64 → 8, decoder 8 → 64 → 256 → 784. ReLU between hidden layers, sigmoid at the output of the decoder so reconstructions stay in [0, 1]. The latent dimension of 8 is aggressive (the network must compress each image into eight floating-point numbers), but it is enough for digits.

In PyTorch the model is short:

import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, d=784, k=8):
        super().__init__()
        # Encoder: 784 -> 256 -> 64 -> 8, ReLU between hidden layers.
        self.encoder = nn.Sequential(
            nn.Linear(d, 256), nn.ReLU(),
            nn.Linear(256, 64), nn.ReLU(),
            nn.Linear(64, k),
        )
        # Decoder mirrors the encoder; sigmoid keeps outputs in [0, 1].
        self.decoder = nn.Sequential(
            nn.Linear(k, 64), nn.ReLU(),
            nn.Linear(64, 256), nn.ReLU(),
            nn.Linear(256, d), nn.Sigmoid(),
        )

    def forward(self, x):
        z = self.encoder(x)              # latent code
        return self.decoder(z), z        # reconstruction and code

Train for fifty epochs with Adam at learning rate $10^{-3}$, batch size 128, using binary cross-entropy on flattened pixels. On a single GPU this finishes in a few minutes. Reconstruction loss falls quickly during the first few epochs and then plateaus.
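A minimal training loop to match, assuming torchvision is available for the data; the dataset root path is illustrative:

import torch
from torch import nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# MNIST pixels arrive in [0, 1] via ToTensor; the root path is an assumption.
data = datasets.MNIST(root="data", train=True, download=True,
                      transform=transforms.ToTensor())
loader = DataLoader(data, batch_size=128, shuffle=True)

device = "cuda" if torch.cuda.is_available() else "cpu"
model = Autoencoder().to(device)
optimiser = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.BCELoss()                       # pixels in [0, 1], sigmoid output

for epoch in range(50):
    for images, _ in loader:                   # labels exist but are ignored
        x = images.view(images.size(0), -1).to(device)   # flatten 28 x 28 -> 784
        x_hat, _ = model(x)
        loss = criterion(x_hat, x)
        optimiser.zero_grad()
        loss.backward()
        optimiser.step()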

What does the network learn? The eight-dimensional latents pick up the major axes of variation in the dataset. If you plot training images projected to two of those eight dimensions, points cluster by digit class even though no class label was ever shown to the network. Other axes encode stroke thickness, slant, the size of the loop in a six or a nine, and the height-to-width ratio. The network has discovered, without supervision, that these are the most informative properties to preserve.

Reconstructions look like blurred originals. The blur is not a bug; it is the right behaviour. Eight dimensions cannot represent every pixel-level detail, so the decoder produces the most likely digit consistent with the latent code, smoothing over fine variation. A digit drawn with a wobbly pen comes back smooth. A faint digit comes back cleaner. This averaging behaviour is the reason the reconstructions of mean-squared-error autoencoders are often described as blurry: minimising squared error per pixel means the optimal output, given an uncertain latent, is the conditional mean, which is smooth.

The latent space supports interesting operations. Interpolating between the codes of a 1 and a 7, then decoding the intermediate points, produces a smooth morph from one digit to the other passing through plausible intermediate shapes. Adding the difference between codes for thick and thin sevens to a different digit thickens that digit. These behaviours are not guaranteed (the decoder is only trained to reconstruct training points), but they emerge naturally because the bottleneck has organised the data by its underlying factors of variation.
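A sketch of the interpolation, assuming the trained model from the example above and two flattened 784-dimensional tensors x1 and x7 holding a 1 and a 7:

import torch

with torch.no_grad():
    z1 = model.encoder(x1)
    z7 = model.encoder(x7)
    steps = torch.linspace(0, 1, 10)
    # Decode evenly spaced points on the line between the two codes.
    morphs = [model.decoder((1 - t) * z1 + t * z7) for t in steps]
    # Each element of `morphs` is a 784-vector; reshape to 28 x 28 to view.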

Linear autoencoder = PCA

If you strip every non-linearity out of an autoencoder, make the encoder a single linear layer $\mathbf{z} = W_{\text{enc}} \mathbf{x}$, the decoder a single linear layer $\hat{\mathbf{x}} = W_{\text{dec}} \mathbf{z}$, and use squared-error reconstruction loss, the global optimum is exactly PCA. The encoder weights span the same subspace as the top $k$ eigenvectors of the data covariance matrix, and the reconstruction $\hat{\mathbf{x}}$ is the projection of $\mathbf{x}$ onto that subspace.

This is the classical result of Bourlard and Kamp (1988) and Baldi and Hornik (1989): the loss surface of the linear autoencoder has a single global minimum (up to an invertible linear transformation of the latent space), all other critical points are saddle points, and gradient descent finds a basis of the principal subspace. The autoencoder need not recover the exact eigenvectors (any invertible mixing of them, with the decoder adjusted to match, gives the same reconstruction loss), but the subspace is the same, and the principal directions can be recovered afterwards, for example by running PCA on the latent codes or on the data projected into the learned subspace.

Two consequences follow. First, this confirms that autoencoders generalise PCA: turning on activation non-linearities is the only difference, and that is what lets autoencoders capture curved structure that PCA cannot. Second, it tells you what to expect in practice. If you suspect your autoencoder is not learning anything useful, train a linear one as a sanity check; its reconstruction error is the PCA reconstruction error, and a deep autoencoder that does worse than PCA is broken.
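The sanity check is easy to run. A sketch using scikit-learn, assuming X is an (n, d) array holding the same training data fed to the deep autoencoder:

import numpy as np
from sklearn.decomposition import PCA

pca = PCA(n_components=8)           # same latent dimension as the autoencoder
Z = pca.fit_transform(X)            # project onto the top-k principal subspace
X_hat = pca.inverse_transform(Z)    # reconstruct from the projection
pca_error = np.mean((X - X_hat) ** 2)

# A deep autoencoder with the same latent dimension should do at least this well;
# if its reconstruction error is higher, something in the training is broken.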

Variants

The basic undercomplete autoencoder is the starting point; several variants tune the inductive bias for specific purposes.

Denoising autoencoder (Vincent et al. 2008). Corrupt the input (add Gaussian noise, mask pixels, drop tokens) and train the network to reconstruct the clean version. The network cannot rely on memorising the input and must learn what features are stable under the corruption process. This connects directly to score matching and is the conceptual ancestor of diffusion models.
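In code this is a small change to the training loop above (a sketch; the noise level is illustrative, and the corruption could equally be pixel masking):

# Denoising variant: corrupt the input, reconstruct the clean target.
noise_std = 0.3                                  # corruption strength (illustrative)
x_noisy = (x + noise_std * torch.randn_like(x)).clamp(0.0, 1.0)
x_hat, _ = model(x_noisy)
loss = criterion(x_hat, x)                       # the target is the clean input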

Sparse autoencoder. Add a penalty that forces most latent units to be zero on most inputs, either via an $L_1$ penalty on activations or a KL term enforcing a low target average activation. The result is a dictionary-like representation in which a small number of features fire for any given input. Modern sparse autoencoders trained on transformer activations (Anthropic 2024, OpenAI 2024) have become a central tool of mechanistic interpretability, recovering human-interpretable concepts inside language models.
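A sketch of the $L_1$ version, reusing the model and criterion from the MNIST example; the penalty weight is illustrative:

# Sparse variant: add an L1 penalty on the latent activations.
lam = 1e-3                                   # sparsity weight (illustrative)
x_hat, z = model(x)
loss = criterion(x_hat, x) + lam * z.abs().mean()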

Contractive autoencoder (Rifai et al. 2011). Penalise the Frobenius norm of the encoder Jacobian $\lVert \partial f_{\text{enc}}/\partial \mathbf{x} \rVert_F^2$. This forces the representation to change slowly as the input changes, encoding invariance to small perturbations.
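A sketch for a single flattened input, using torch.autograd.functional.jacobian; create_graph=True is needed so the penalty contributes gradients, and in practice the Jacobian term is usually computed more cheaply per mini-batch:

from torch.autograd.functional import jacobian

# Contractive variant, shown for one input x of shape (784,).
x_hat, z = model(x)
J = jacobian(model.encoder, x, create_graph=True)   # encoder Jacobian, shape (k, d)
penalty = (J ** 2).sum()                            # squared Frobenius norm
loss = criterion(x_hat, x) + 1e-4 * penalty         # penalty weight is illustrative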

Variational autoencoder (Kingma and Welling 2014; covered in Chapter 14). The encoder outputs the parameters of a probability distribution over the latent rather than a single point, and training maximises the evidence lower bound (ELBO). This converts the autoencoder into a proper generative model and produces a smooth, probabilistic latent space that supports principled sampling.

Tied vs untied weights

A common older design choice is to tie the decoder weights to the transpose of the encoder weights: if the encoder has weight matrix $W$, the decoder uses $W^\top$. This halves the number of parameters and adds an inductive bias that pairs the analysis and synthesis directions, similar in spirit to using transposed convolutions. It also keeps the linear autoencoder genuinely equivalent to PCA in form, not just in subspace.
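A minimal sketch of a one-layer tied-weight autoencoder: the same weight matrix is used for encoding and, transposed, for decoding (F.linear computes $\mathbf{x} W^\top + \mathbf{b}$, so passing $W$ and then $W^\top$ ties the two directions):

import torch
import torch.nn as nn
import torch.nn.functional as F

class TiedAutoencoder(nn.Module):
    def __init__(self, d=784, k=8):
        super().__init__()
        self.W = nn.Parameter(torch.randn(k, d) * 0.01)   # shared weight matrix
        self.b_enc = nn.Parameter(torch.zeros(k))
        self.b_dec = nn.Parameter(torch.zeros(d))

    def forward(self, x):
        z = F.linear(x, self.W, self.b_enc)                          # encode with W
        x_hat = torch.sigmoid(F.linear(z, self.W.t(), self.b_dec))   # decode with W^T
        return x_hat, z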

In modern deep autoencoders weights are usually left untied, partly because optimisers cope easily with the extra parameters and partly because deep networks benefit from the additional flexibility. Tied weights remain useful when the dataset is small, when interpretability matters, or when there is a principled reason to expect the encoder and decoder to be each other's inverse.

Anomaly detection with autoencoders

Train an autoencoder on data that represents normal operation. The network learns to reconstruct that data well. At inference time, reconstruct each test instance and compute the reconstruction error $\lVert \mathbf{x} - f_{\text{dec}}(f_{\text{enc}}(\mathbf{x})) \rVert$. Inputs that look like the training data reconstruct well and have low error. Inputs that depart from the training distribution (anomalies, faults, intrusions) reconstruct poorly, because the network has never had a reason to model that part of the input space, so their reconstruction error is high. Set a threshold and flag whatever exceeds it.
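A sketch of the scoring and thresholding step. It assumes a model trained on normal data, a tensor X_val_normal of known-normal validation examples, a test tensor X_test, and a 99th-percentile threshold; all three names and the percentile are illustrative:

import torch

def reconstruction_errors(model, X):
    # Per-example mean squared reconstruction error; X has shape (n, d).
    with torch.no_grad():
        X_hat, _ = model(X)
        return ((X - X_hat) ** 2).mean(dim=1)

# Calibrate the threshold on known-normal validation data.
errors_val = reconstruction_errors(model, X_val_normal)
threshold = torch.quantile(errors_val, 0.99)          # percentile is illustrative

# Flag anything above the threshold at test time.
flags = reconstruction_errors(model, X_test) > threshold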

The approach is appealing because it requires only normal data to train, which is plentiful in most real systems where anomalies are by definition rare. It works in manufacturing (vibration or vision data from machines), where worn or damaged parts produce subtly different signals; in network intrusion detection, where unusual traffic patterns reconstruct poorly compared with day-to-day flows; in sensor monitoring on aircraft, ships, or industrial plants; and in medical signal monitoring, where abnormal ECG or EEG segments stand out.

Practical caveats. The choice of bottleneck size matters: too small and even normal inputs reconstruct poorly, drowning the signal in noise; too large and the network reconstructs anomalies as well as it reconstructs normal data, defeating the point. The threshold must be calibrated on a validation set containing labelled examples of both classes, even if training was unsupervised. And the method assumes that anomalies look genuinely different from normal data in the feature space: a sufficiently subtle anomaly that lies on the same manifold will not be caught.

Where autoencoders appear in modern AI

Autoencoders look unfashionable next to transformers, but they are everywhere underneath the surface of modern systems.

Variational autoencoders (Chapter 14) are the proper generative-model version of the autoencoder. Their structured latent space supports sampling, interpolation, and conditional generation, and they remain a workhorse for representation learning where uncertainty matters.

VQ-VAE (van den Oord et al. 2017) replaces the continuous latent with a discrete codebook of vectors. Each input patch is mapped to its nearest codebook entry. This yields a sequence of integer indices that downstream models (autoregressive transformers, language models) can treat as tokens. DALL-E used a discrete VAE of this kind to tokenise images. Jukebox used hierarchical VQ-VAEs to tokenise raw audio. Tokenising continuous signals is the move that lets language-model architectures generate images, audio, and video.
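A sketch of the nearest-codebook lookup at the core of VQ-VAE; the codebook size and latent dimension are illustrative, and the straight-through gradient trick used during training is omitted:

import torch

codebook = torch.randn(512, 64)    # 512 entries of dimension 64 (illustrative)

def quantise(z):
    # z has shape (n, 64): one continuous latent per patch.
    dists = torch.cdist(z, codebook)    # pairwise distances, shape (n, 512)
    indices = dists.argmin(dim=1)       # integer token for each patch
    z_q = codebook[indices]             # quantised latents passed to the decoder
    return indices, z_q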

Stable Diffusion's latent VAE. Diffusion models on raw 512 × 512 pixels are expensive. Stable Diffusion trains a KL-regularised VAE that compresses 512 × 512 images to a 64 × 64 spatial latent, runs the diffusion process in that compressed space, and decodes to pixels at the end. This single trick, diffusion in a learned latent rather than in pixel space, is what made text-to-image generation tractable on consumer hardware.

Masked autoencoders for self-supervised pretraining (He et al. 2022). Mask 75% of the patches in an image and train a transformer to reconstruct the missing patches from the visible ones. The encoder learns rich visual representations without labels, transferring to ImageNet classification, segmentation, and detection at a level competitive with supervised pretraining. The same masked-reconstruction principle drives BERT for text and many speech models.

Learned compression. Image and video codecs based on autoencoder architectures now match or beat hand-engineered codecs such as JPEG and HEVC at low bit rates, particularly on perceptual quality metrics, and are likely to dominate the next generation of media compression standards.

What you should take away

  1. An autoencoder is an encoder–decoder neural network trained to reconstruct its input through a narrow bottleneck; the bottleneck is the learned compressed representation.
  2. A linear autoencoder with squared-error loss is exactly PCA; non-linear activations are what lift autoencoders beyond linear dimensionality reduction onto curved manifolds.
  3. Variants (denoising, sparse, contractive, variational) change the inductive bias to suit specific goals such as robustness, interpretability, or generative modelling.
  4. Anomaly detection works by training on normal data and flagging high reconstruction error on test inputs, with the bottleneck size and threshold both requiring careful calibration.
  5. Modern systems lean on autoencoders pervasively: VQ-VAE tokenisers, the latent VAE inside Stable Diffusion, masked autoencoders for self-supervised vision pretraining, and learned compression codecs all build on the same compress-then-reconstruct idea.
