2.2 Vectors, vector spaces, and norms

A vector is, at heart, just a list of numbers. When a spam filter looks at an incoming email, it does not "read" it the way you do; it counts how many times each word in its dictionary appears, and the resulting list of word-counts, perhaps thirty thousand numbers long, is the email, as the machine sees it. That list is a vector. When a face-recognition system inspects a photograph, every pixel is converted to a number representing brightness or colour, and the photograph becomes a long list of those numbers. That list, again, is a vector. Whenever an AI system meets the world, the first thing it does is turn the world into a list of numbers.

Neural networks are layers of operations performed on vectors. Transformers compute relationships between vectors. Word embeddings, image embeddings, audio embeddings, all the "embeddings" that modern AI systems pass between their components, are vectors. The operations defined in this section, vector addition, scalar multiplication, the dot product, and the norm, are the operations from which all the famous architectures are assembled.

Symbols Used Here
$\mathbb{R}$: the real numbers (any decimal: $-3.7$, $0$, $\pi$, $1.4 \times 10^{6}$, etc.)
$\mathbb{R}^n$: the set of $n$-dimensional vectors of real numbers
$\mathbf{x}$: a vector (bold-face lowercase letter); a list of $n$ real numbers
$x_i$: the $i$-th entry of vector $\mathbf{x}$ (a single real number); $i$ ranges from 1 to $n$
$n$, $d$: the dimension of a vector space (a positive integer); we use $n$ and $d$ interchangeably, favouring $d$ when the dimension is a "data" quantity such as the number of features
$\mathbf{0}$: the zero vector $(0, 0, \ldots, 0)$
$\mathbf{x} + \mathbf{y}$: vector addition (entry-wise)
$\alpha \mathbf{x}$: scalar multiplication by $\alpha \in \mathbb{R}$ (multiply every entry by $\alpha$)
$\mathbf{x}^\top \mathbf{y}$: dot product (also written $\langle \mathbf{x}, \mathbf{y} \rangle$ or $\mathbf{x} \cdot \mathbf{y}$); a single real number
$\|\mathbf{x}\|$: the (Euclidean) length / norm of $\mathbf{x}$; a non-negative real number
$\|\mathbf{x}\|_p$: the $p$-norm; a non-negative real number
$\angle(\mathbf{x}, \mathbf{y})$: the angle between $\mathbf{x}$ and $\mathbf{y}$, in radians

What a vector actually is

A vector is an ordered list of real numbers. Two clarifying words there: ordered, meaning the position of each number matters (the list $(3, 4)$ is not the same vector as $(4, 3)$); and real, meaning each entry is drawn from $\mathbb{R}$, the set of real numbers. The set $\mathbb{R}$ contains every decimal you have ever met: integers like $-3$ and $0$ and $7$, simple fractions like $0.5$, irrational numbers like $\pi$ and $\sqrt{2}$, and very small or very large numbers like $1.4 \times 10^{6}$. We will use real numbers throughout this chapter; the few places where complex numbers matter will be flagged explicitly.

If a vector contains $n$ real numbers we say it lives in $\mathbb{R}^n$, pronounced "R-en" or "R to the $n$". Read aloud, the symbol $\mathbb{R}^n$ is a label for "the collection of all possible lists of $n$ real numbers". For instance:

  • $(3, 4)$ is a vector in $\mathbb{R}^2$, because it contains two real numbers.
  • $(1, -1, 2)$ is a vector in $\mathbb{R}^3$, because it contains three real numbers.
  • The list of $300$ numbers a Word2Vec model produces for the word "cat" is a vector in $\mathbb{R}^{300}$.
  • A modern language-model embedding for the same word might have $768$, $1{,}536$, or even $4{,}096$ entries; we would write it as a vector in $\mathbb{R}^{768}$, $\mathbb{R}^{1{,}536}$, or $\mathbb{R}^{4{,}096}$.

The number $n$ is called the dimension of the vector space. In some textbooks and papers you will see the letter $d$ used instead, particularly when the dimension represents some "data" quantity such as the number of features or the size of an embedding. For us $n$ and $d$ mean the same thing; we will switch between them for whichever reads more naturally in context.

We write vectors in two equivalent ways. As a row, $\mathbf{x} = (x_1, x_2, \ldots, x_n)$, with each $x_i$ a real number called the $i$-th entry or $i$-th component of the vector. As a column,

$$\mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}.$$

The two layouts hold the same information; the column form merely matters once we begin multiplying vectors by matrices in the next section. We use bold lowercase letters ($\mathbf{x}$, $\mathbf{y}$, $\mathbf{v}$) for vectors and ordinary italic letters with subscripts ($x_1$, $x_2$, $x_i$) for their entries, a small but consistent typographic distinction that prevents a great deal of confusion later on.

You may have heard a vector described as "an arrow with a length and a direction". That description is not wrong; it is one of two faces of the same object. The arrow picture is a geometric vector: an entity living in two- or three-dimensional space, with a tail and a head. The list-of-numbers picture is an algebraic vector. Once we choose a coordinate system, typically the standard $x$-axis, $y$-axis, $z$-axis of school geometry, every arrow corresponds to a unique list of numbers and every list of numbers corresponds to a unique arrow. They are the same object, dressed up two ways. For most of AI we work directly with the algebraic version, because $1{,}536$-dimensional arrows are difficult to draw, but the geometric intuition will keep being useful.

Vectors as points in space

There is a third, equivalent way to think about a vector: as a point in space. A two-dimensional vector $(3, 4)$ can be drawn on graph paper by going $3$ units to the right and $4$ units up, then placing a dot. That dot, at coordinates $(3, 4)$, is the vector. If we prefer the arrow picture, we draw an arrow from the origin $(0, 0)$, the special point with all entries zero, to that dot. The end at the origin is the tail; the end at $(3, 4)$ is the head.

A three-dimensional vector $(1, -1, 2)$ likewise becomes a point in a room: $1$ unit east of a chosen corner, $1$ unit south of it, and $2$ units up from the floor. We can still picture it. A vector with $768$ entries, meanwhile, lives in a $768$-dimensional space we cannot picture directly, but it is still a point, and the rules for combining and measuring it are exactly the same as the rules for the two-dimensional case. The trick we will use repeatedly is to develop intuition with two- or three-dimensional pictures, and then trust that the algebraic operations carry that intuition into thousands of dimensions unchanged.

A small worked example. Take a piece of graph paper, mark the origin, and plot the vector $\mathbf{x} = (3, 4)$. The tail is at $(0, 0)$, the head is at $(3, 4)$, and the arrow runs diagonally up and to the right. Now plot $\mathbf{y} = (-1, 2)$: the head is one unit to the left of the origin and two units above it. These two arrows are vectors in $\mathbb{R}^2$. We will combine and measure them in the next subsections.

Vector addition and scalar multiplication

Two operations form the foundation of all linear algebra: adding two vectors, and multiplying a vector by a number.

Vector addition is performed entry by entry. Given two vectors $\mathbf{x}$ and $\mathbf{y}$ of the same dimension, their sum $\mathbf{x} + \mathbf{y}$ is the vector whose $i$-th entry is $x_i + y_i$. In symbols,

$$\mathbf{x} + \mathbf{y} = (x_1 + y_1, \, x_2 + y_2, \, \ldots, \, x_n + y_n).$$

A worked numerical example. Let $\mathbf{x} = (3, 1)$ and $\mathbf{y} = (1, 2)$. Then

$$\mathbf{x} + \mathbf{y} = (3 + 1, \, 1 + 2) = (4, 3).$$

Geometrically, place the tail of $\mathbf{y}$ at the head of $\mathbf{x}$; the new arrow from the origin to the head of the relocated $\mathbf{y}$ is $\mathbf{x} + \mathbf{y}$. This is the so-called tip-to-tail construction. Equivalently, the parallelogram with sides $\mathbf{x}$ and $\mathbf{y}$ has $\mathbf{x} + \mathbf{y}$ as its diagonal. Both constructions agree, and both produce the entry-wise sum.

Scalar multiplication stretches a vector. Given a vector $\mathbf{x}$ and a real number $\alpha$ (called a scalar), the product $\alpha \mathbf{x}$ is the vector obtained by multiplying every entry by $\alpha$:

$$\alpha \mathbf{x} = (\alpha x_1, \, \alpha x_2, \, \ldots, \, \alpha x_n).$$

A worked numerical example. With $\mathbf{x} = (3, 1)$,

$$2 \cdot \mathbf{x} = (2 \cdot 3, \, 2 \cdot 1) = (6, 2).$$

Geometrically, $2 \mathbf{x}$ is an arrow pointing in the same direction as $\mathbf{x}$ but twice as long. More generally, a scalar greater than $1$ stretches the vector, a scalar between $0$ and $1$ shrinks it (so $0.5\,\mathbf{x}$ is half as long), a negative scalar flips it to point the opposite way, and $0$ crushes it to the origin.
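
Both operations are a single line each in code. A minimal NumPy sketch, reproducing the worked examples above (the array names are ours, not standard):

```python
import numpy as np

x = np.array([3.0, 1.0])
y = np.array([1.0, 2.0])

# Entry-wise addition: (3 + 1, 1 + 2) = (4, 3)
print(x + y)        # [4. 3.]

# Scalar multiplication: multiply every entry by 2
print(2 * x)        # [6. 2.]

# A general combination, alpha*x + beta*y
alpha, beta = 0.5, -1.0
print(alpha * x + beta * y)   # [ 0.5 -1.5]
```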

These two operations interact through several familiar laws:

  • Commutativity of addition: $\mathbf{x} + \mathbf{y} = \mathbf{y} + \mathbf{x}$.
  • Associativity of addition: $(\mathbf{x} + \mathbf{y}) + \mathbf{z} = \mathbf{x} + (\mathbf{y} + \mathbf{z})$.
  • Distributivity: $\alpha(\mathbf{x} + \mathbf{y}) = \alpha \mathbf{x} + \alpha \mathbf{y}$ and $(\alpha + \beta) \mathbf{x} = \alpha \mathbf{x} + \beta \mathbf{x}$.
  • Identity: there is a special vector $\mathbf{0} = (0, 0, \ldots, 0)$, called the zero vector, satisfying $\mathbf{x} + \mathbf{0} = \mathbf{x}$ for every $\mathbf{x}$. The scalar $1$ satisfies $1 \cdot \mathbf{x} = \mathbf{x}$.

These laws are not deep facts: each follows immediately from the entry-wise definition and the corresponding law for ordinary numbers. But together they tell us that vectors behave almost exactly like numbers under addition and scalar multiplication. The set $\mathbb{R}^n$ together with these two operations is called a real vector space, and the abstract definition of a vector space lifts these properties to objects that are not obviously lists of numbers (polynomials, continuous functions, even all $h \times w$ greyscale images). For our purposes, a vector space is just $\mathbb{R}^n$ with addition and scalar multiplication. Any expression of the form $\alpha_1 \mathbf{v}_1 + \alpha_2 \mathbf{v}_2 + \cdots + \alpha_k \mathbf{v}_k$, built from a handful of vectors using the two operations, is called a linear combination, and the set of all linear combinations of a given collection of vectors is called their span. A basis is a minimal collection whose span is the whole space; the standard basis of $\mathbb{R}^n$ consists of the vectors with a single $1$ and the rest zeros. We will say more about bases and span later in this section.
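
A brief sketch of the standard basis and a linear combination, assuming NumPy; `np.eye(3)` gives the three standard basis vectors of $\mathbb{R}^3$ as the rows of the identity matrix:

```python
import numpy as np

# Standard basis of R^3: e1 = (1,0,0), e2 = (0,1,0), e3 = (0,0,1)
e1, e2, e3 = np.eye(3)

# Every vector is a linear combination of the standard basis vectors,
# with its own entries as the coefficients.
v = np.array([1.0, -1.0, 2.0])
reconstructed = v[0] * e1 + v[1] * e2 + v[2] * e3
print(np.allclose(v, reconstructed))   # True
```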

The dot product: how aligned are two vectors?

The dot product of two vectors of the same dimension is a single real number obtained by multiplying corresponding entries and adding the results. In symbols, for $\mathbf{x}, \mathbf{y} \in \mathbb{R}^n$,

$$\mathbf{x}^\top \mathbf{y} = \sum_{i=1}^{n} x_i y_i = x_1 y_1 + x_2 y_2 + \cdots + x_n y_n.$$

The same operation is written in three ways in the literature: $\mathbf{x}^\top \mathbf{y}$ (the transpose notation, common in machine learning), $\langle \mathbf{x}, \mathbf{y} \rangle$ (the inner-product notation, common in pure mathematics), and $\mathbf{x} \cdot \mathbf{y}$ (the dot notation, common in physics). All three mean exactly the same thing. We will use $\mathbf{x}^\top \mathbf{y}$ throughout, because it matches the matrix conventions of the next section.

A worked numerical example. Let $\mathbf{x} = (3, 1)$ and $\mathbf{y} = (1, 2)$. Then

$$\mathbf{x}^\top \mathbf{y} = 3 \cdot 1 + 1 \cdot 2 = 3 + 2 = 5.$$

The dot product of two vectors in $\mathbb{R}^2$ takes two multiplications and one addition; the dot product of two $1{,}000$-dimensional vectors takes a thousand multiplications and $999$ additions. In every case the answer is a single number, and that number carries surprising geometric information.
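
A minimal sketch of the same calculation in NumPy; the explicit entry-wise sum and the built-in `@` operator produce the identical number:

```python
import numpy as np

x = np.array([3.0, 1.0])
y = np.array([1.0, 2.0])

# Multiply corresponding entries and add: 3*1 + 1*2 = 5
print(np.sum(x * y))   # 5.0
print(x @ y)           # 5.0, the same dot product
```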

The geometric meaning of the dot product is given by the identity

$$\mathbf{x}^\top \mathbf{y} = \|\mathbf{x}\| \, \|\mathbf{y}\| \, \cos\theta,$$

where $\|\mathbf{x}\|$ and $\|\mathbf{y}\|$ are the lengths of the two vectors (defined formally in the next subsection) and $\theta$ is the angle between them. Three things follow from this identity, and each has direct consequences for AI:

The dot product measures projection. If $\mathbf{y}$ has length $1$, then $\mathbf{x}^\top \mathbf{y}$ is exactly the (signed) length of the shadow $\mathbf{x}$ casts on the line through $\mathbf{y}$. We use this idea every time we ask "how much of $\mathbf{x}$ points in the direction of $\mathbf{y}$?", a question that arises in least-squares regression, in the attention mechanism of Transformers, and in dozens of other places. The formal name for this shadow is the projection of $\mathbf{x}$ onto $\mathbf{y}$.

The sign of the dot product tells us whether the vectors point in similar directions. Because the lengths $\|\mathbf{x}\|$ and $\|\mathbf{y}\|$ are non-negative, the sign of $\mathbf{x}^\top \mathbf{y}$ is the sign of $\cos\theta$. A positive dot product means the angle is less than $90°$: the vectors broadly agree. A negative dot product means the angle is greater than $90°$: they broadly disagree. A dot product of zero means the angle is exactly $90°$.

A zero dot product means orthogonality. Two vectors are orthogonal, at right angles to one another, exactly when their dot product is zero. In two and three dimensions this is the familiar notion of perpendicularity; in higher dimensions it generalises directly. Orthogonal vectors carry information that is, in a precise sense, independent: a change along one of them does not show up as movement along the other.

A concrete example. In modern natural-language processing, every word is mapped by a learned embedding to a high-dimensional vector. The training procedure arranges these vectors so that words used in similar contexts end up with similar directions. If we measure the dot product (or its scaled version, cosine similarity, defined below) between the vector for "king" and the vector for "queen", we get a large positive number; the two vectors point in similar directions. The dot product between "king" and "monarch" is similarly large. The dot product between "king" and "banana", however, is close to zero; the two concepts share little. The same dot-product machinery, scaled up, is what a Transformer's attention mechanism uses to decide which words in a sentence to focus on. Every "attention score" you see in a diagram of GPT or BERT is, underneath, a dot product.

The cosine similarity between two non-zero vectors is the dot product divided by the product of the lengths:

$$\operatorname{cos\_sim}(\mathbf{x}, \mathbf{y}) = \frac{\mathbf{x}^\top \mathbf{y}}{\|\mathbf{x}\| \, \|\mathbf{y}\|} = \cos\theta.$$

It always lies between $-1$ and $+1$: $-1$ for vectors pointing in exactly opposite directions, $0$ for orthogonal vectors, and $+1$ for vectors pointing in exactly the same direction. Because dividing by the lengths cancels out any rescaling of either vector, cosine similarity compares vectors purely by direction, a property exploited by every modern vector database and every retrieval-augmented language model.
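
A sketch of cosine similarity in NumPy (the function name `cosine_similarity` is ours; it assumes both inputs are non-zero):

```python
import numpy as np

def cosine_similarity(x, y):
    """Dot product divided by the product of Euclidean norms; always in [-1, 1]."""
    return (x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))

x = np.array([3.0, 1.0])
y = np.array([1.0, 2.0])
print(cosine_similarity(x, y))          # ~0.707
print(cosine_similarity(x, 10 * y))     # unchanged: rescaling y cancels out
```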

Norms: how long is a vector?

A norm is a way of measuring the length of a vector. The most familiar norm, the one we have been calling "length" without defining, is the Euclidean norm.

The Euclidean norm, also called the 2-norm or $L^2$ norm, of a vector $\mathbf{x} \in \mathbb{R}^n$ is

$$\|\mathbf{x}\|_2 = \sqrt{\sum_{i=1}^{n} x_i^2} = \sqrt{x_1^2 + x_2^2 + \cdots + x_n^2}.$$

A worked numerical example. For $\mathbf{x} = (3, 4)$,

$$\|\mathbf{x}\|_2 = \sqrt{3^2 + 4^2} = \sqrt{9 + 16} = \sqrt{25} = 5.$$

This is exactly Pythagoras's theorem: the length of the diagonal of a $3$-by-$4$ rectangle is $5$. Pythagoras still works in higher dimensions; the formula above is just the same calculation taken across all $n$ entries. The Euclidean norm is the workhorse of machine learning; it is the "length" implied whenever a paper writes $\|\mathbf{x}\|$ without a subscript, and it has two pleasant properties: it does not change under rotations of the coordinate system, and it is differentiable everywhere except at the zero vector, which makes it convenient for gradient-based optimisation.

There are other norms. The 1-norm or $L^1$ norm is the sum of absolute values:

$$\|\mathbf{x}\|_1 = \sum_{i=1}^{n} |x_i| = |x_1| + |x_2| + \cdots + |x_n|.$$

For $\mathbf{x} = (3, 4)$ we get $\|\mathbf{x}\|_1 = |3| + |4| = 7$. Geometrically, the 1-norm is the distance you would walk if you could move only parallel to the coordinate axes; it is sometimes called the "Manhattan distance" because it is the distance a taxi must drive on a city grid. The 1-norm shows up everywhere sparsity matters: an $L^1$ regularisation penalty on a model's weights pushes many of them to be exactly zero, producing sparse, interpretable models. This is the basis of Lasso regression and a recurring tool in modern interpretability research.

The infinity-norm or maximum norm is the largest absolute entry:

$$\|\mathbf{x}\|_\infty = \max_i |x_i|.$$

For $\mathbf{x} = (3, 4)$ we get $\|\mathbf{x}\|_\infty = \max(3, 4) = 4$. The infinity-norm answers "what is the worst single entry?" and is the natural measure when no individual coordinate may exceed some budget. It appears in adversarial robustness (an attacker bounded in infinity-norm may shift each pixel of an image by at most $\epsilon$) and in interval arithmetic.

All three are special cases of the general $L^p$ norm, parameterised by a real number $p \ge 1$:

$$\|\mathbf{x}\|_p = \left( \sum_{i=1}^{n} |x_i|^p \right)^{1/p}.$$

Setting $p = 2$ gives the Euclidean norm; setting $p = 1$ gives the 1-norm; letting $p \to \infty$ gives the maximum norm. The family is sometimes called the $\ell^p$ norms (using a script $\ell$); the notation $L^p$ is more common when working with functions rather than vectors, but the two are used almost interchangeably in machine-learning papers.
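
All of these norms are available through NumPy's `np.linalg.norm` via its `ord` argument; a minimal sketch for the running example $\mathbf{x} = (3, 4)$:

```python
import numpy as np

x = np.array([3.0, 4.0])

print(np.linalg.norm(x))              # 5.0   Euclidean norm (default, p = 2)
print(np.linalg.norm(x, ord=1))       # 7.0   1-norm: |3| + |4|
print(np.linalg.norm(x, ord=np.inf))  # 4.0   infinity-norm: max(|3|, |4|)
print(np.linalg.norm(x, ord=3))       # ~4.498 general p-norm with p = 3
```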

A note on the so-called "$L^0$ norm". Some papers write $\|\mathbf{x}\|_0$ for the count of non-zero entries of $\mathbf{x}$. This is not actually a norm (it fails one of the defining axioms), but it is the natural measure of sparsity, and the $L^1$ norm above is best understood as a tractable, convex relaxation of $L^0$.

Why do norms matter for AI? They appear in three roles. First, as regularisation penalties: adding $\lambda \|\mathbf{w}\|_2^2$ to a loss function (so-called weight decay or $L^2$ regularisation) discourages large weights and reduces overfitting; adding $\lambda \|\mathbf{w}\|_1$ (Lasso) encourages sparsity. Second, as loss functions: mean squared error is the squared Euclidean norm of the residual vector divided by the number of entries, $\frac{1}{n}\|\mathbf{y} - \hat{\mathbf{y}}\|_2^2$. Third, as similarity measures: the Euclidean distance $\|\mathbf{x} - \mathbf{y}\|_2$ between two embedding vectors quantifies how different they are, and is a standard retrieval metric alongside cosine similarity. The choice of norm is a modelling decision: $L^2$ for smooth optimisation and rotation-invariant geometry, $L^1$ for sparsity, $L^\infty$ for worst-case bounds.
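
A sketch of the three roles in NumPy. The weight vector `w`, targets `y_true`, and predictions `y_pred` are toy values, and `lam` stands for the regularisation strength $\lambda$:

```python
import numpy as np

w      = np.array([0.5, -2.0, 0.0, 1.5])   # toy weight vector
y_true = np.array([1.0, 0.0, 1.0])         # toy targets
y_pred = np.array([0.8, 0.1, 0.6])         # toy predictions
lam    = 0.01

# 1. Regularisation penalties
l2_penalty = lam * np.sum(w ** 2)      # weight decay: lam * ||w||_2^2
l1_penalty = lam * np.sum(np.abs(w))   # Lasso-style:  lam * ||w||_1

# 2. Loss: mean squared error, (1/n) * ||y - y_hat||_2^2
mse = np.mean((y_true - y_pred) ** 2)

# 3. Similarity: Euclidean distance between two vectors
dist = np.linalg.norm(y_true - y_pred)

print(l2_penalty, l1_penalty, mse, dist)
```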

A unit vector is a vector with Euclidean norm equal to $1$. Given any non-zero vector $\mathbf{x}$, the operation $\mathbf{x} \mapsto \mathbf{x} / \|\mathbf{x}\|_2$ produces a unit vector pointing in the same direction; we call this normalising the vector. Normalisation is essential whenever we want to compare directions while ignoring magnitudes; this idea underlies cosine similarity and the "RMSNorm" layer used inside many modern Transformer architectures.
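
Normalising is a one-liner; a minimal sketch (the function name is ours, and the `eps` guard is there because the zero vector cannot be normalised):

```python
import numpy as np

def normalise(x, eps=1e-12):
    """Return x scaled to unit Euclidean norm; eps guards against division by zero."""
    return x / max(np.linalg.norm(x), eps)

x = np.array([3.0, 4.0])
u = normalise(x)
print(u)                   # [0.6 0.8]
print(np.linalg.norm(u))   # 1.0
```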

Angles, projection, orthogonality

The angle $\theta$ between two non-zero vectors is determined by the dot product and the norms. Rearranging the geometric identity from earlier,

$$\cos\theta = \frac{\mathbf{x}^\top \mathbf{y}}{\|\mathbf{x}\| \, \|\mathbf{y}\|},$$

so the angle itself is

$$\theta = \angle(\mathbf{x}, \mathbf{y}) = \arccos\!\left( \frac{\mathbf{x}^\top \mathbf{y}}{\|\mathbf{x}\| \, \|\mathbf{y}\|} \right),$$

where $\arccos$ is the inverse cosine and the result is in radians, between $0$ and $\pi$.

A worked numerical example. Let $\mathbf{x} = (1, 0)$ and $\mathbf{y} = (1, 1)$. Then $\mathbf{x}^\top \mathbf{y} = 1 \cdot 1 + 0 \cdot 1 = 1$, $\|\mathbf{x}\| = \sqrt{1^2 + 0^2} = 1$, and $\|\mathbf{y}\| = \sqrt{1^2 + 1^2} = \sqrt{2}$. So

$$\cos\theta = \frac{1}{1 \cdot \sqrt{2}} = \frac{1}{\sqrt{2}} \approx 0.707,$$

giving $\theta = \arccos(1/\sqrt{2}) = \pi/4$ radians $= 45°$. The vectors $(1, 0)$ and $(1, 1)$ meet at a $45°$ angle, exactly what the picture suggests.
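
The same calculation in NumPy; `np.clip` guards against the cosine drifting a hair outside $[-1, 1]$ through floating-point rounding, which would make `arccos` return NaN:

```python
import numpy as np

x = np.array([1.0, 0.0])
y = np.array([1.0, 1.0])

cos_theta = (x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))
theta = np.arccos(np.clip(cos_theta, -1.0, 1.0))

print(theta)               # ~0.7854 radians, i.e. pi/4
print(np.degrees(theta))   # 45.0
```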

Two vectors are orthogonal when their dot product is zero; equivalently, when the angle between them is $90°$. The zero vector is conventionally orthogonal to everything. Orthogonality matters because orthogonal directions carry independent information: features that are mutually orthogonal contribute to the geometry of a dataset without overlapping. Three places this fact is decisive:

  • The columns of a rotation matrix are mutually orthogonal unit vectors. Rotating coordinates is a change of orthogonal basis.
  • Principal-component analysis finds an orthogonal set of directions in which the data has progressively decreasing variance; we will derive PCA from the singular value decomposition in section 2.8.
  • The projection of one vector onto another decomposes the first into a piece along the second and a piece orthogonal to the second. The orthogonal piece is what is "left over" after the projection is removed, a manoeuvre at the heart of least-squares regression and of Gram–Schmidt orthogonalisation.
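
The decomposition in the last bullet is two lines of code. A minimal sketch (the names `parallel` and `perp` are ours), projecting $\mathbf{x}$ onto $\mathbf{y}$ and checking that the leftover piece is orthogonal to $\mathbf{y}$:

```python
import numpy as np

x = np.array([3.0, 1.0])
y = np.array([1.0, 2.0])

# Component of x along y: ((x . y) / (y . y)) * y
parallel = ((x @ y) / (y @ y)) * y
# Component of x orthogonal to y: what is left over after removing the projection
perp = x - parallel

print(parallel)                  # [1. 2.]
print(perp)                      # [ 2. -1.]
print(np.isclose(perp @ y, 0))   # True: the leftover piece is orthogonal to y
```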

In high-dimensional vector spaces a hyperplane is the natural generalisation of a line in two dimensions or a plane in three: it is the set of vectors orthogonal to a fixed direction $\mathbf{w}$, possibly shifted from the origin. Linear classifiers separate two classes by such a hyperplane; the equation $\mathbf{w}^\top \mathbf{x} + b = 0$ defines the boundary, and the sign of $\mathbf{w}^\top \mathbf{x} + b$ assigns the class. We will see this geometry again when we discuss support vector machines and the linear layers of neural networks.
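
A sketch of that decision rule; the weight vector `w` and bias `b` below are arbitrary toy values for illustration, not a trained model:

```python
import numpy as np

w = np.array([2.0, -1.0])   # normal vector to the separating hyperplane (toy values)
b = -1.0                    # bias / offset (toy value)

def classify(x):
    """Assign +1 or -1 according to which side of the hyperplane w.x + b = 0 the point lies on."""
    return 1 if w @ x + b > 0 else -1

print(classify(np.array([2.0, 1.0])))   # w.x + b = 4 - 1 - 1 =  2 > 0  -> +1
print(classify(np.array([0.0, 1.0])))   # w.x + b = 0 - 1 - 1 = -2 < 0  -> -1
```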

What you should take away

  1. A vector is an ordered list of $n$ real numbers, an element of $\mathbb{R}^n$. The same object can equally be pictured as a point in space or as an arrow from the origin to that point.
  2. Vector addition combines two vectors entry by entry; scalar multiplication stretches a vector by multiplying every entry by the same number. These two operations make $\mathbb{R}^n$ a vector space.
  3. The dot product $\mathbf{x}^\top \mathbf{y}$ takes two vectors and returns a single number; geometrically it equals $\|\mathbf{x}\| \, \|\mathbf{y}\| \cos\theta$, so it measures how aligned the two vectors are.
  4. A norm measures the length of a vector. The Euclidean norm $\|\mathbf{x}\|_2$ is Pythagoras's theorem in $n$ dimensions; the 1-norm and the infinity-norm capture different notions of size; the general $L^p$ norm interpolates between them.
  5. Two vectors are orthogonal when their dot product is zero; geometrically, they meet at right angles. Orthogonal vectors carry independent information.
  6. Projection decomposes one vector into a component along another vector and a component orthogonal to it; this single fact underlies least-squares regression, Gram–Schmidt orthogonalisation and PCA.
  7. Almost everything in deep learning, from a perceptron to a Transformer, is built from dot products and norms applied at scale. Master these objects in two and three dimensions and the high-dimensional case follows by direct analogy.
