The dot product (also called inner product or scalar product) of two vectors $\mathbf{u}, \mathbf{v} \in \mathbb{R}^n$ is the scalar
$$\mathbf{u} \cdot \mathbf{v} = \sum_{i=1}^n u_i v_i = u_1 v_1 + u_2 v_2 + \cdots + u_n v_n.$$
The result is a single number, not a vector. The operation is computationally cheap (one multiply-accumulate per dimension) and geometrically rich, which makes it the workhorse of numerical linear algebra and the most common primitive in modern AI hardware. A modern GPU's tensor cores are essentially massively parallel dot-product engines.
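As a minimal sketch of the definition, the sum can be written directly in plain Python. The function name `dot` is an illustrative choice rather than anything fixed by the text; in practice a library routine such as NumPy's `numpy.dot` does the same job much faster.

```python
def dot(u, v):
    """Return the dot product of two equal-length sequences of numbers."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    total = 0
    # One multiply-accumulate per dimension.
    for ui, vi in zip(u, v):
        total += ui * vi
    return total


print(dot([1, 2, 3], [4, 5, 6]))  # 1*4 + 2*5 + 3*6 = 32
```

The explicit loop mirrors the summation term by term; the length check is an added safeguard, since the definition only makes sense for vectors of the same dimension.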
Geometric interpretation
The fundamental geometric identity is
$$\mathbf{u} \cdot \mathbf{v} = \|\mathbf{u}\| \, \|\mathbf{v}\| \, \cos\theta,$$
where $\theta$ is the angle between the two (nonzero) vectors and $\|\mathbf{u}\| = \sqrt{\mathbf{u} \cdot \mathbf{u}}$ is the Euclidean norm. Because the norms are positive, the sign of the dot product is the sign of $\cos\theta$, and three regimes follow:
- Zero dot product ($\cos\theta = 0$): the vectors are orthogonal ($\theta = 90^\circ$).
- Positive dot product: the vectors point in broadly similar directions ($\theta < 90^\circ$).
- Negative dot product: the vectors point in broadly opposing directions ($\theta > 90^\circ$).
Dividing by the norms removes magnitude and yields the cosine similarity
$$\mathrm{cos\,sim}(\mathbf{u}, \mathbf{v}) = \frac{\mathbf{u} \cdot \mathbf{v}}{\|\mathbf{u}\| \, \|\mathbf{v}\|} \in [-1, +1],$$
the standard similarity metric in information retrieval, recommendation systems, embedding-based search (e.g. RAG over documents) and face recognition.
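The three regimes and the normalisation can be sketched with the standard library alone. The helper names (`norm`, `cosine_similarity`, `angle_degrees`) are hypothetical, and the zero-vector check and the clamping step are added assumptions: the first because the formula divides by the norms, the second to absorb floating-point rounding.

```python
import math


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def norm(u):
    """Euclidean norm, ||u|| = sqrt(u . u)."""
    return math.sqrt(dot(u, u))


def cosine_similarity(u, v):
    nu, nv = norm(u), norm(v)
    if nu == 0 or nv == 0:
        raise ValueError("cosine similarity is undefined for the zero vector")
    # Clamp to [-1, 1] so rounding error cannot push the value out of range.
    return max(-1.0, min(1.0, dot(u, v) / (nu * nv)))


def angle_degrees(u, v):
    return math.degrees(math.acos(cosine_similarity(u, v)))


print(cosine_similarity([1, 0], [0, 1]))   # orthogonal vectors: 0.0
print(cosine_similarity([1, 0], [-1, 0]))  # opposite vectors: -1.0
print(cosine_similarity([1, 2], [2, 4]))   # same direction, different lengths
```

The last call returns a value at or very near 1 even though the second vector is twice as long, which is the sense in which dividing by the norms removes magnitude.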
Algebraic properties
The dot product is commutative ($\mathbf{u} \cdot \mathbf{v} = \mathbf{v} \cdot \mathbf{u}$), bilinear (linear in each argument), and positive-definite ($\mathbf{u} \cdot \mathbf{u} \geq 0$ with equality iff $\mathbf{u} = \mathbf{0}$). These three properties are exactly the axioms of a real inner product, so $\mathbb{R}^n$ with the dot product is an inner product space. The same axioms (with symmetry replaced by conjugate symmetry for complex spaces) underpin the Hilbert-space generalisations central to functional analysis and quantum mechanics.
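A quick numerical spot check can make these properties concrete. The sketch below uses small integer vectors so that every comparison is exact; the vectors, scalars, and helper names are arbitrary illustrative choices, and a check on one example is of course not a proof.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def add(u, v):
    return [ui + vi for ui, vi in zip(u, v)]


def scale(c, u):
    return [c * ui for ui in u]


u, v, w = [1, -2, 3], [4, 0, -1], [2, 5, 7]
a, b = 3, -2

# Commutativity: u . v == v . u
commutative = dot(u, v) == dot(v, u)

# Linearity in the first argument: (a u + b v) . w == a (u . w) + b (v . w)
linear = dot(add(scale(a, u), scale(b, v)), w) == a * dot(u, w) + b * dot(v, w)

# Positive-definiteness: u . u > 0 for nonzero u, and 0 . 0 == 0
zero = [0, 0, 0]
positive_definite = dot(u, u) > 0 and dot(zero, zero) == 0

print(commutative, linear, positive_definite)  # True True True
```

Linearity in the second argument follows from linearity in the first together with commutativity, which is why only one side is checked.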
Role in machine learning
The dot product is the basic mechanism of similarity and prediction throughout AI:
- A linear classifier scores an input $\mathbf{x}$ as $\mathbf{w} \cdot \mathbf{x} + b$. The decision boundary is the hyperplane where this dot product equals zero.
- A single neuron computes a dot product of its weights with its inputs, then applies an activation: $a = \sigma(\mathbf{w} \cdot \mathbf{x} + b)$.
- Attention in transformers computes scaled dot products $\mathbf{q} \cdot \mathbf{k} / \sqrt{d_k}$ between query and key vectors to determine how much one token should attend to another.
- Matrix multiplication is nothing more than a batched collection of dot products: $(AB)_{ij} = \mathbf{a}_i \cdot \mathbf{b}_j$ where $\mathbf{a}_i$ is the $i$-th row of $A$ and $\mathbf{b}_j$ is the $j$-th column of $B$.
- Convolution is a sliding dot product between filter and patch.
- Word embeddings are evaluated by analogy tasks such as $\text{king} - \text{man} + \text{woman} \approx \text{queen}$, typically assessed by cosine similarity, that is, a normalised dot product.
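Several of the items above can be sketched with a single `dot` helper. This is an illustrative sketch under stated assumptions, not a reference implementation: the activation $\sigma$ is taken to be the logistic sigmoid, matrices are represented as lists of rows, and the attention scores are passed through a softmax, which is the standard transformer step but is not spelled out in the list. All function names are hypothetical.

```python
import math


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def neuron(w, x, b):
    """Single neuron a = sigma(w . x + b), assuming a logistic sigmoid."""
    z = dot(w, x) + b
    return 1.0 / (1.0 + math.exp(-z))


def matmul(A, B):
    """(AB)_ij = (row i of A) . (column j of B); A and B are lists of rows."""
    columns_of_B = list(zip(*B))
    return [[dot(row, col) for col in columns_of_B] for row in A]


def attention_weights(q, keys):
    """Softmax over the scaled dot products q . k / sqrt(d_k)."""
    d_k = len(q)
    scores = [dot(q, k) / math.sqrt(d_k) for k in keys]
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


print(neuron([0.5, -0.5], [1.0, 1.0], 0.0))        # z = 0, so sigma(z) = 0.5
print(matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19, 22], [43, 50]]
print(attention_weights([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]]))
```

In the last call the query aligns with the first key and is orthogonal to the second, so the first weight is the larger of the two and the weights sum to 1.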
Hardware
Because so much of deep learning reduces to dot products, modern accelerators are designed around them. NVIDIA's tensor cores, Google's TPU systolic array, and Apple's Neural Engine all perform fused multiply-add operations at vast scale. The FLOP counts quoted for AI hardware (e.g. an H100's 989 TFLOP/s in BF16) are essentially dot-product rates.
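To see why FLOP counts amount to dot-product rates, the arithmetic can be sketched as follows. It assumes the usual convention of counting each multiply-add as two floating-point operations; the matrix size is an arbitrary example, and the quoted rate is treated as a peak, so the resulting time is an idealised lower bound rather than a measured figure.

```python
def dot_flops(n):
    """A length-n dot product: n multiply-adds, conventionally 2n FLOPs."""
    return 2 * n


def matmul_flops(m, k, n):
    """An (m x k) by (k x n) product is m*n dot products of length k."""
    return m * n * dot_flops(k)


PEAK_FLOPS = 989e12  # the BF16 figure quoted above, used as a peak rate

flops = matmul_flops(4096, 4096, 4096)
ideal_seconds = flops / PEAK_FLOPS

print(flops)                # 2 * 4096**3 = 137438953472
print(ideal_seconds * 1e6)  # idealised lower bound in microseconds
```

Under these assumptions a single 4096-by-4096 matrix product takes on the order of 139 microseconds at best; real workloads run slower because of memory traffic and imperfect utilisation.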
Understanding the dot product is the single most important step toward fluency in AI mathematics.
Related terms: Matrix Multiplication, Attention Mechanism, Embedding
Discussed in:
- Chapter 4: Probability, Mathematical Foundations
- Chapter 8: Unsupervised Learning, The Transformer