The autonomous driving stack is the hierarchical software architecture that turns raw sensor streams into steering, throttle and brake commands. The classical formulation, dating back to the DARPA Urban Challenge (2007), decomposes the problem into four layers: perception, prediction, planning and control. Each layer is a self-contained module with defined inputs and outputs. Modern stacks blur these boundaries with end-to-end neural components, but the four-layer schema remains the dominant mental model.
Perception ingests lidar point clouds, camera images and radar returns, and outputs a structured scene description: 3D bounding boxes for detected objects with associated classes and tracking IDs; semantic segmentation of drivable surface and lane geometry; traffic-light and sign states; and free-space occupancy. Modern perception networks unify multi-modal inputs into a bird's-eye-view (BEV) feature map via lift-splat-shoot (LSS) or transformer-based view transformation (e.g. BEVFormer), fusing temporal context across frames.
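A minimal numpy sketch of the lift-splat idea, assuming a single camera with known pinhole intrinsics; `lift_to_bev` and all parameter names are illustrative, not from any published implementation. Each pixel feature is weighted by a categorical depth distribution (lift) and accumulated into a ground-plane grid (splat):

```python
import numpy as np

def lift_to_bev(feat, depth_logits, intrinsics, depth_bins, bev_shape, bev_res):
    """feat: (H, W, C) image features; depth_logits: (H, W, D) per-pixel depth scores.
    bev_shape: (cells_forward, cells_lateral); bev_res: metres per cell."""
    H, W, C = feat.shape
    # Softmax over candidate depths: a categorical depth distribution per pixel.
    p = np.exp(depth_logits - depth_logits.max(-1, keepdims=True))
    p /= p.sum(-1, keepdims=True)
    # Lift: outer product turns each pixel feature into a frustum of weighted points.
    frustum = p[..., None] * feat[:, :, None, :]            # (H, W, D, C)
    fx, fy, cx, cy = intrinsics
    us, vs = np.meshgrid(np.arange(W), np.arange(H))        # pixel coordinates
    bev = np.zeros((*bev_shape, C))
    for d_idx, d in enumerate(depth_bins):
        x = (us - cx) / fx * d                              # lateral offset (m)
        ix = ((x + bev_shape[1] * bev_res / 2) / bev_res).astype(int)
        iz = int(d / bev_res) * np.ones_like(ix)            # forward cell index
        ok = (ix >= 0) & (ix < bev_shape[1]) & (iz < bev_shape[0])
        # Splat: sum-pool the weighted features into their BEV cells.
        np.add.at(bev, (iz[ok], ix[ok]), frustum[vs[ok], us[ok], d_idx])
    return bev
```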
Prediction forecasts the future trajectories of every dynamic agent in the scene over a horizon of 3–8 seconds. Because behaviour is inherently multi-modal (a vehicle at an intersection may turn or proceed straight), outputs are typically distributions over trajectories rather than point estimates: $p(\boldsymbol{\tau}_{1:T} \mid \mathcal{S}, \mathcal{M})$, where $\mathcal{S}$ is scene context and $\mathcal{M}$ is the HD map. Common parameterisations include weighted sets of trajectories (MultiPath), Gaussian mixtures over waypoints, occupancy-flow grids and, most recently, autoregressive language-style decoding over motion tokens (MotionLM), typically built on transformer backbones such as Wayformer.
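As a concrete example of the weighted-trajectory-set parameterisation, the sketch below (hypothetical shapes and variable names, with random stand-ins for network outputs) turns per-mode scores into mode probabilities and derives summary trajectories:

```python
import numpy as np

rng = np.random.default_rng(0)
K, T = 6, 80                        # 6 modes, 8 s horizon at 10 Hz
logits = rng.normal(size=K)         # per-mode scores, stand-in for a network head
trajs = rng.normal(size=(K, T, 2))  # (x, y) waypoints per mode

# Softmax the scores into mode probabilities.
probs = np.exp(logits - logits.max())
probs /= probs.sum()

best = trajs[probs.argmax()]                      # most-likely mode, (T, 2)
expected = (probs[:, None, None] * trajs).sum(0)  # probability-weighted mean, (T, 2)
```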
Planning chooses the ego trajectory the vehicle will follow given the predicted scene. The classical formulation is a constrained optimisation $\min_{\boldsymbol{\tau}} J(\boldsymbol{\tau}) \text{ s.t. } \mathcal{C}_{\text{safety}}, \mathcal{C}_{\text{kinematic}}, \mathcal{C}_{\text{traffic}}$ solved by sampling-and-evaluation or trajectory optimisation. Modern stacks supplement this with learned cost terms or replace the planner outright with a neural policy.
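A hedged sketch of the sampling-and-evaluation approach: candidate ego trajectories are assumed to come from an upstream sampler, a hard safety constraint prunes, and soft cost terms rank the survivors. The weights and the 2 m clearance threshold are illustrative, not taken from any production stack:

```python
import numpy as np

def plan(candidates, predicted_agents, speed_limit, dt=0.1):
    """candidates: (N, T, 2) ego trajectories; predicted_agents: (A, T, 2) forecasts."""
    best_traj, best_cost = None, np.inf
    for traj in candidates:
        # Hard safety constraint: reject trajectories within 2 m of any predicted agent.
        gaps = np.linalg.norm(traj[None] - predicted_agents, axis=-1)  # (A, T)
        if gaps.min() < 2.0:
            continue
        # Finite-difference speed, acceleration and jerk along the trajectory.
        v = np.linalg.norm(np.diff(traj, axis=0), axis=-1) / dt
        a = np.diff(v) / dt
        jerk = np.diff(a) / dt
        cost = (1.0 * np.clip(v - speed_limit, 0, None).sum()  # traffic-rule term
                + 0.1 * (a ** 2).sum()                          # comfort term
                + 0.01 * (jerk ** 2).sum())                     # smoothness term
        if cost < best_cost:
            best_traj, best_cost = traj, cost
    return best_traj
```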
Control turns the chosen trajectory into actuator commands at high frequency (50–200 Hz). Standard tools are PID or model-predictive control for longitudinal speed tracking and a geometric or feedback controller (pure pursuit, Stanley or MPC) for lateral tracking, often built on a kinematic bicycle model.
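For the lateral side, a minimal pure-pursuit sketch on the kinematic bicycle model; the wheelbase and lookahead values are illustrative defaults, not tuned parameters:

```python
import numpy as np

def pure_pursuit_steer(pose, path, wheelbase=2.8, lookahead=6.0):
    """pose: (x, y, yaw) in the world frame; path: (N, 2) waypoints. Returns steering angle."""
    x, y, yaw = pose
    # Pick the first waypoint at least `lookahead` metres away (fall back to the last).
    d = np.linalg.norm(path - np.array([x, y]), axis=1)
    idx = np.argmax(d >= lookahead) if (d >= lookahead).any() else len(path) - 1
    tx, ty = path[idx]
    # Heading error to the target point, expressed in the vehicle frame.
    alpha = np.arctan2(ty - y, tx - x) - yaw
    # Pure-pursuit law for a kinematic bicycle: delta = atan(2 L sin(alpha) / d).
    return np.arctan2(2.0 * wheelbase * np.sin(alpha), d[idx])
```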
The contemporary debate is between modular and end-to-end stacks. Modular pipelines (Waymo, Mobileye, Cruise) preserve interpretable interfaces, allow per-module verification and exploit hand-built priors such as HD maps. End-to-end neural stacks (Tesla FSD v12, Wayve, comma.ai) train a single network from sensor to control, claiming that data and scale beat hand-engineering and that error propagation across modules vanishes when there are no modules. Hybrid designs are emerging: end-to-end planners that consume modular perception outputs, or world-model-based stacks where a generative model (DreamerV3-style, GAIA-1 from Wayve, DriveDreamer) predicts future sensor frames conditioned on candidate actions and the planner picks the action whose imagined future minimises a cost.
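The world-model planning loop can be made concrete in a few lines; `world_model.step` and `cost` below are placeholders standing in for a learned generative model and a cost head, not any published API:

```python
import numpy as np

def plan_by_imagination(world_model, cost, state, candidate_actions, horizon=10):
    """Roll each candidate action sequence through the learned model and
    return the one whose imagined future accumulates the least cost."""
    best_seq, best_cost = None, np.inf
    for actions in candidate_actions:      # each: (horizon, action_dim)
        s, total = state, 0.0
        for a in actions[:horizon]:
            s = world_model.step(s, a)     # imagined next (latent) state
            total += cost(s)               # learned or hand-built cost of that state
        if total < best_cost:
            best_seq, best_cost = actions, total
    return best_seq
```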
Standard datasets and benchmarks include nuScenes, Waymo Open, Argoverse 2, KITTI-360, and the CARLA Leaderboard for closed-loop simulation. The CARLA simulator and MetaDrive are the dominant open environments for training and evaluating end-to-end stacks under controllable conditions.
Related terms: Tesla FSD, Waymo Driver, Convolutional Neural Network, Transformer, Reinforcement Learning, World Model
Discussed in:
- Chapter 17: Applications, Autonomous Driving