Simultaneous Localisation And Mapping (SLAM) is the problem of building a map of an unknown environment while simultaneously tracking the robot's pose within it. It is a chicken-and-egg problem: localising requires a map, but mapping requires knowing where you are. SLAM solves both problems jointly.
Formally, given controls $u_{1:t}$ (odometry), measurements $z_{1:t}$ (camera frames, laser scans, depth images), and an initial pose $x_0$, the goal is to estimate the joint posterior over robot trajectory $x_{1:t}$ and map $m$:
$$p(x_{1:t}, m \mid z_{1:t}, u_{1:t})$$
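Under the usual Markov assumptions (each pose depends only on the previous pose and the current control, each measurement only on the current pose and the map), this posterior factorises into a map prior, a motion model, and a measurement model; every algorithm below exploits this structure:

$$p(x_{1:t}, m \mid z_{1:t}, u_{1:t}) \propto p(m) \prod_{\tau=1}^{t} p(x_\tau \mid x_{\tau-1}, u_\tau)\, p(z_\tau \mid x_\tau, m)$$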
The map representation varies with the application: landmark-based maps (a list of 3D points), occupancy grids (a 2D array of cells, each storing the probability that it is occupied), elevation maps, or dense meshes.
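As a concrete illustration, here is a minimal occupancy-grid sketch (the class name, grid dimensions, and log-odds increments are illustrative choices, not a standard API); storing log-odds turns the per-cell Bayesian update into a simple addition:

```python
import numpy as np

class OccupancyGrid:
    """Toy occupancy grid: each cell stores the log-odds of being occupied."""

    def __init__(self, width_m=20.0, height_m=20.0, resolution_m=0.05):
        rows = int(height_m / resolution_m)
        cols = int(width_m / resolution_m)
        self.log_odds = np.zeros((rows, cols))  # log-odds 0 = probability 0.5

    def update(self, row, col, hit):
        # Illustrative increments: a sensor hit raises the log-odds,
        # a ray passing through the cell (a miss) lowers it.
        self.log_odds[row, col] += 0.85 if hit else -0.4

    def occupancy(self):
        # Convert log-odds back to occupancy probabilities in [0, 1].
        return 1.0 / (1.0 + np.exp(-self.log_odds))
```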
Three algorithmic families dominate.
EKF-SLAM maintains a single Gaussian over a state vector that concatenates the current pose with all landmark positions: $\mu = [x_t; m_1; m_2; \ldots; m_K]$. The covariance $\Sigma$ is dense because every pair of landmarks becomes correlated through the shared trajectory, so storage and update cost grow as $\mathcal{O}(K^2)$ in the number of landmarks, and EKF-SLAM scales only to a few hundred landmarks. It dominated the early literature (Smith and Cheeseman 1986).
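A toy sketch of the EKF-SLAM state bookkeeping (all dimensions and values are illustrative) makes the quadratic growth concrete: with a 3-DoF pose and $K$ 2D landmarks the covariance is a dense $(3 + 2K) \times (3 + 2K)$ matrix, and observing a new landmark appends rows and columns to it:

```python
import numpy as np

# State: 3-DoF pose [x, y, theta] stacked with K 2D landmarks, so the
# dense covariance has (3 + 2K)^2 entries.
POSE_DIM, LM_DIM = 3, 2
K = 4
n = POSE_DIM + LM_DIM * K

mu = np.zeros(n)       # mean  [x_t; m_1; ...; m_K]
Sigma = np.eye(n)      # dense covariance: every landmark pair is correlated

def augment(mu, Sigma, new_lm, init_cov):
    # Append a newly observed landmark to the state. A real EKF-SLAM update
    # would also fill in the cross-covariance between the new landmark and
    # the current pose; this sketch zeroes it for brevity.
    mu = np.concatenate([mu, new_lm])
    zeros = np.zeros((Sigma.shape[0], LM_DIM))
    Sigma = np.block([[Sigma, zeros], [zeros.T, init_cov]])
    return mu, Sigma

mu, Sigma = augment(mu, Sigma, np.array([1.0, 2.0]), 0.5 * np.eye(LM_DIM))
print(Sigma.shape)     # (13, 13): grows from (3 + 2K) to (3 + 2(K + 1))
```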
Particle-filter SLAM (FastSLAM, Montemerlo et al. 2002) factorises the posterior using the conditional independence $p(m \mid x_{1:t}, z_{1:t}) = \prod_k p(m_k \mid x_{1:t}, z_{1:t})$: given the trajectory, the landmarks are independent of one another. Each particle holds a trajectory hypothesis plus a small Kalman filter per landmark, with the map marginalised analytically (Rao--Blackwellisation). This scales to thousands of landmarks.
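A sketch of the resulting data structure (field names are illustrative) shows why this is cheaper: each particle carries one trajectory hypothesis plus an independent low-dimensional EKF per landmark, instead of one dense joint covariance over everything:

```python
import numpy as np
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class LandmarkEKF:
    mean: np.ndarray   # 2D landmark position estimate
    cov: np.ndarray    # 2x2 covariance, maintained by a tiny Kalman filter

@dataclass
class Particle:
    pose: np.ndarray                                  # current [x, y, theta]
    weight: float = 1.0
    landmarks: Dict[int, LandmarkEKF] = field(default_factory=dict)

# M particles with K landmarks each: O(M K) storage rather than O(K^2).
particles = [Particle(pose=np.zeros(3)) for _ in range(100)]
```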
Graph-SLAM (Lu and Milios 1997, Kaess et al. 2008) is the modern dominant approach. The trajectory and landmarks become nodes in a factor graph, and each measurement (odometry, landmark observation, loop closure) becomes a factor encoding a probabilistic constraint. The MAP estimate is the solution to a non-linear least-squares problem:
$$x^*, m^* = \arg\min_{x_{1:t},\, m} \sum_{(i,j) \in \mathcal{E}} \|z_{ij} - h(x_i, m_j)\|^2_{\Sigma_{ij}^{-1}}$$
solved with sparse Gauss--Newton or Levenberg--Marquardt. Tools include g2o, GTSAM, and Ceres. Loop closure, recognising that the robot has returned to a previously visited place, adds long-range constraints that dramatically tighten the estimate; bag-of-words place recognition (DBoW2) and learned descriptors detect loop closures in real time.
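A minimal pose-graph example, assuming GTSAM's Python bindings (the square trajectory, noise values, and perturbed initial guesses are all illustrative): odometry factors chain consecutive poses, and a single loop-closure factor ties the final pose back to the start, letting the optimiser redistribute the accumulated drift:

```python
import numpy as np
import gtsam

graph = gtsam.NonlinearFactorGraph()
prior_noise = gtsam.noiseModel.Diagonal.Sigmas(np.array([0.1, 0.1, 0.05]))
odom_noise = gtsam.noiseModel.Diagonal.Sigmas(np.array([0.2, 0.2, 0.1]))

# Anchor the first pose, then drive a 2 m square with 90-degree turns.
graph.add(gtsam.PriorFactorPose2(0, gtsam.Pose2(0, 0, 0), prior_noise))
for i in range(4):
    graph.add(gtsam.BetweenFactorPose2(i, i + 1,
                                       gtsam.Pose2(2, 0, np.pi / 2),
                                       odom_noise))
# Loop closure: the relative pose between node 4 and node 0 is identity.
graph.add(gtsam.BetweenFactorPose2(4, 0, gtsam.Pose2(0, 0, 0), odom_noise))

# Initial guesses: the true square, perturbed to mimic odometry drift.
truth = [gtsam.Pose2(0, 0, 0), gtsam.Pose2(2, 0, np.pi / 2),
         gtsam.Pose2(2, 2, np.pi), gtsam.Pose2(0, 2, -np.pi / 2),
         gtsam.Pose2(0, 0, 0)]
initial = gtsam.Values()
for i, p in enumerate(truth):
    initial.insert(i, gtsam.Pose2(p.x() + 0.1, p.y() - 0.1, p.theta() + 0.05))

result = gtsam.LevenbergMarquardtOptimizer(graph, initial).optimize()
print(result.atPose2(4))  # pulled back towards the origin by the loop closure
```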
Visual SLAM uses cameras as the primary sensor. ORB-SLAM (Mur-Artal et al. 2015) extracts ORB features, tracks them across frames, builds a sparse 3D point map, and runs bundle adjustment in a back-end thread; it runs in real time on a CPU. LSD-SLAM and DSO are direct variants that minimise photometric error on raw pixel intensities rather than tracking features. Dense visual SLAM systems like KinectFusion, ElasticFusion, and BundleFusion reconstruct full 3D surfaces and meshes.
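The feature-tracking front end is easy to prototype with OpenCV (the frame file names here are placeholders): detect ORB keypoints in two frames and match their binary descriptors by Hamming distance, producing the 2D correspondences that pose estimation and mapping consume:

```python
import cv2

img1 = cv2.imread("frame1.png", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("frame2.png", cv2.IMREAD_GRAYSCALE)

orb = cv2.ORB_create(nfeatures=1000)
kp1, des1 = orb.detectAndCompute(img1, None)
kp2, des2 = orb.detectAndCompute(img2, None)

# crossCheck keeps only mutually-best matches, a cheap outlier filter.
matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = sorted(matcher.match(des1, des2), key=lambda m: m.distance)

# Matched pixel coordinates, ready for e.g. essential-matrix estimation.
pts1 = [kp1[m.queryIdx].pt for m in matches[:200]]
pts2 = [kp2[m.trainIdx].pt for m in matches[:200]]
```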
Modern frontiers blend deep learning with SLAM: learned features (SuperPoint, R2D2), learned matchers (SuperGlue), and end-to-end systems (DROID-SLAM, NeRF-based SLAM) that jointly optimise camera pose and a neural radiance field. SLAM is the spatial intelligence that lets autonomous cars, drones, augmented-reality headsets, and warehouse robots operate in the world without GPS, and it is one of the cleanest practical applications of probabilistic inference and non-linear optimisation in modern AI.
Related terms: Kalman Filter, Extended Kalman Filter, Particle Filter, Bayes' Theorem
Discussed in:
- Chapter 12: Sequence Models, Robotics and Control