Glossary

Graph Convolutional Network

The graph convolutional network (GCN), introduced by Kipf and Welling in 2017, is among the simplest and most influential message-passing GNNs. A single layer is defined as:

$$H^{(l+1)} = \sigma\!\left(\hat A H^{(l)} W^{(l)}\right)$$

where $H^{(l)} \in \mathbb{R}^{N \times d_l}$ stacks the node embeddings, $W^{(l)}$ is a learnable weight matrix, $\sigma$ is a non-linearity (usually ReLU), and

$$\hat A = \tilde D^{-1/2} \tilde A \tilde D^{-1/2}, \qquad \tilde A = A + I, \qquad \tilde D_{ii} = \sum_j \tilde A_{ij}$$

Adding the identity $I$ to the adjacency $A$ inserts self-loops so a node retains its own information; the $\tilde D^{-1/2}$ normalisation symmetrically rescales each entry by the square roots of the connected nodes' degrees, which keeps the spectrum of $\hat A$ within $[-1, 1]$ and prevents activations from exploding or vanishing as layers stack.
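As a concrete sketch, $\hat A$ can be precomputed once before training. The snippet below uses a dense NumPy array for readability; practical implementations use sparse matrices, and the function name here is ours, not from any particular library:

```python
import numpy as np

def normalized_adjacency(A: np.ndarray) -> np.ndarray:
    """Compute A_hat = D~^{-1/2} (A + I) D~^{-1/2} for a dense adjacency matrix A."""
    A_tilde = A + np.eye(A.shape[0])       # add self-loops
    d = A_tilde.sum(axis=1)                # degrees of the self-loop-augmented graph
    d_inv_sqrt = 1.0 / np.sqrt(d)          # entries of D~^{-1/2}
    # Scale rows by d_v^{-1/2} and columns by d_u^{-1/2}:
    return A_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
```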

The spectral motivation is that a true graph convolution multiplies a signal by a filter $g_\theta(L)$ in the eigenbasis of the graph Laplacian $L = I - D^{-1/2} A D^{-1/2}$. Computing the eigendecomposition is $\mathcal{O}(N^3)$, infeasible for large graphs. Approximating $g_\theta(L)$ with a first-order Chebyshev polynomial and tying coefficients yields exactly the propagation rule above. The GCN is therefore both a localised spectral filter and a per-node operation that updates each node from a normalised average of itself and its neighbours.
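Spelling out that last step: the first-order Chebyshev expansion (taking $\lambda_{\max} \approx 2$) gives

$$g_\theta \star x \approx \theta_0 x + \theta_1 (L - I)\,x = \theta_0 x - \theta_1 D^{-1/2} A D^{-1/2} x,$$

and tying $\theta = \theta_0 = -\theta_1$ leaves $\theta \left(I + D^{-1/2} A D^{-1/2}\right) x$. Replacing $I + D^{-1/2} A D^{-1/2}$, whose eigenvalues lie in $[0, 2]$, with the renormalised $\tilde D^{-1/2} \tilde A \tilde D^{-1/2}$ restores numerical stability under repeated application and yields the layer above.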

Per node, the rule is:

$$h_v^{(l+1)} = \sigma\!\left(\sum_{u \in \mathcal{N}(v) \cup \{v\}} \frac{1}{\sqrt{\tilde d_v \tilde d_u}}\, W^{(l)} h_u^{(l)}\right)$$

where $\tilde d_v = \tilde D_{vv}$ is the degree of $v$ after self-loops are added.
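Written as an explicit loop, the same update looks like the following sketch, where `A_tilde` and `d` are the self-loop-augmented adjacency and degree vector; the loop computes the matrix form $\sigma(\hat A H W)$ row by row:

```python
def gcn_layer_per_node(A_tilde, d, H, W):
    """Per-node GCN update; row-by-row equivalent of relu(A_hat @ H @ W)."""
    H_out = np.zeros((H.shape[0], W.shape[1]))
    for v in range(H.shape[0]):
        for u in np.nonzero(A_tilde[v])[0]:      # neighbours of v, plus v itself
            H_out[v] += H[u] @ W / np.sqrt(d[v] * d[u])
    return np.maximum(H_out, 0.0)                # ReLU
```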

GCNs are simple, fast (linear in the number of edges), and a strong baseline for semi-supervised node classification: given labels for a small fraction of nodes plus the full graph, predict labels for the rest. On the citation networks Cora, Citeseer, and Pubmed, a two-layer GCN achieves accuracies that earlier methods needed elaborate engineering to match.
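A minimal forward pass for such a two-layer GCN, reusing the `normalized_adjacency` helper sketched earlier; the toy graph, shapes, and initialisation below are illustrative, not the paper's setup:

```python
rng = np.random.default_rng(0)
N, d_in, d_hid, n_classes = 6, 8, 16, 3

# Random undirected toy graph (symmetric adjacency, no self-loops yet):
A = np.triu((rng.random((N, N)) < 0.4).astype(float), k=1)
A = A + A.T
A_hat = normalized_adjacency(A)             # precomputed once

X  = rng.normal(size=(N, d_in))             # input node features
W1 = 0.1 * rng.normal(size=(d_in, d_hid))
W2 = 0.1 * rng.normal(size=(d_hid, n_classes))

H1     = np.maximum(A_hat @ X @ W1, 0.0)    # layer 1: sigma(A_hat H W) with ReLU
logits = A_hat @ H1 @ W2                    # layer 2; softmax + cross-entropy follow in training
```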

Limitations are well-documented. Stacking more than two or three layers causes over-smoothing, because repeated normalised averaging acts as a low-pass filter; the GCN cannot learn different importances for different neighbours (the GAT addresses this); it assumes a fixed graph at training and inference (GraphSAGE addresses this with neighbour sampling); and it cannot distinguish certain non-isomorphic graphs that the Weisfeiler–Lehman test separates. Nonetheless, the GCN remains the canonical first thing to try on a graph-learning problem and the reference point against which more elaborate architectures are measured.

Related terms: Graph Neural Network, Graph Attention Network, Message Passing Neural Network, Convolution
