The graph attention network (GAT), introduced by Velickovic et al. in 2018, applies the attention mechanism of transformers to graphs. Whereas the GCN weights every neighbour by a fixed function of the two nodes' degrees, the GAT lets the network learn how much each neighbour should contribute to a node's update.
For each pair of connected nodes $(v, u)$, the GAT computes an unnormalised attention score:
$$e_{vu} = \mathrm{LeakyReLU}\!\left(a^\top [W h_v \,\|\, W h_u]\right)$$
where $W \in \mathbb{R}^{d' \times d}$ is a shared linear projection, $\|$ denotes concatenation, $a \in \mathbb{R}^{2d'}$ is a learned vector, and the LeakyReLU has slope $0.2$. Scores are normalised by softmax over each node's neighbourhood:
$$\alpha_{vu} = \frac{\exp(e_{vu})}{\sum_{k \in \mathcal{N}(v)} \exp(e_{vk})}$$
The new node representation is the attention-weighted sum of transformed neighbour features:
$$h_v' = \sigma\!\left(\sum_{u \in \mathcal{N}(v)} \alpha_{vu}\, W h_u\right)$$
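To make the three steps concrete, here is a minimal NumPy sketch of a single attention head. The dense adjacency matrix, the self-loops, the `-1e9` masking constant, and the `tanh` nonlinearity are illustrative choices for this sketch, not part of the original formulation:

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    """LeakyReLU with the 0.2 negative slope used above."""
    return np.where(x > 0, x, slope * x)

def gat_layer(H, adj, W, a):
    """One single-head GAT layer on a dense adjacency matrix.

    H   : (n, d)   input node features h_v
    adj : (n, n)   adjacency with self-loops (1 where u is in N(v))
    W   : (d, d')  shared linear projection
    a   : (2*d',)  learned attention vector
    """
    Z = H @ W                                   # W h_v for every node, shape (n, d')
    d_out = Z.shape[1]
    a_src, a_dst = a[:d_out], a[d_out:]         # a^T [z_v || z_u] = a_src.z_v + a_dst.z_u
    e = leaky_relu(Z @ a_src[:, None] + (Z @ a_dst[:, None]).T)  # (n, n) scores e_vu
    e = np.where(adj > 0, e, -1e9)              # restrict the softmax to each neighbourhood
    alpha = np.exp(e - e.max(axis=1, keepdims=True))
    alpha = alpha / alpha.sum(axis=1, keepdims=True)             # softmax over N(v)
    return np.tanh(alpha @ Z)                   # tanh stands in for the nonlinearity sigma

# Toy usage: 4 nodes, 5 input features, 8 output features.
rng = np.random.default_rng(0)
adj = np.array([[1, 1, 0, 0],
                [1, 1, 1, 0],
                [0, 1, 1, 1],
                [0, 0, 1, 1]], dtype=float)
H = rng.normal(size=(4, 5))
W = rng.normal(size=(5, 8)) * 0.1
a = rng.normal(size=(16,)) * 0.1
H_new = gat_layer(H, adj, W, a)                 # shape (4, 8)
```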
As in transformers, multi-head attention is used: $K$ independent attention mechanisms run in parallel, and their outputs are concatenated (in middle layers) or averaged (in the final layer). In the concatenated case this gives:
$$h_v' = \big\|_{k=1}^{K} \sigma\!\left(\sum_{u \in \mathcal{N}(v)} \alpha_{vu}^{(k)} W^{(k)} h_u\right)$$
This stabilises learning and lets different heads focus on different relational patterns.
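Continuing the sketch above (all names and shapes remain illustrative), the concatenation variant is simply $K$ independent calls to the single-head layer:

```python
def multi_head_gat(H, adj, Ws, a_s):
    """Concatenate K independent heads (the middle-layer variant shown above).
    A final layer would instead average the heads' outputs."""
    heads = [gat_layer(H, adj, W, a) for W, a in zip(Ws, a_s)]
    return np.concatenate(heads, axis=1)        # shape (n, K * d')

# K = 3 heads with the toy graph and shapes from the previous sketch.
Ws  = [rng.normal(size=(5, 8)) * 0.1 for _ in range(3)]
a_s = [rng.normal(size=(16,)) * 0.1 for _ in range(3)]
H_multi = multi_head_gat(H, adj, Ws, a_s)       # shape (4, 3 * 8) = (4, 24)
```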
Compared to the GCN, the GAT has two practical advantages. First, it is inductive: because attention weights depend only on node features, not on the global graph structure, the same model can be applied to graphs not seen at training time. Second, the attention weights offer a degree of interpretability: one can inspect which neighbours a node attends to most strongly, which is useful in chemistry and biology where an edge with high attention may correspond to a chemically meaningful interaction.
GATs match or exceed GCN performance on standard node-classification benchmarks (Cora, Citeseer, Pubmed) and the inductive PPI benchmark, often by a few percentage points. They scale linearly with the number of edges per layer, so they are tractable on graphs with millions of edges when combined with neighbour sampling. The GAT is the canonical bridge between graph learning and the attention mechanism that has come to dominate the rest of deep learning, and it foreshadows graph transformers that attend over all pairs of nodes rather than only edges.
Related terms: Graph Neural Network, Graph Convolutional Network, Message Passing Neural Network, Attention Mechanism, Transformer
Discussed in:
- Chapter 9: Neural Networks, Graph Neural Networks