17.4 Robotics
The previous section followed AI into the laboratory, where models such as AlphaFold and GNoME accelerate the search for proteins, materials and reactions. We now leave the laboratory and step into the corridor, the kitchen, the warehouse and the operating theatre. Robotics is the frontier where machine learning is forced to confront the physical world: gravity, friction, noisy sensors, brittle plastic clips, slippery vegetables, uncooperative children and the occasional overturned coffee cup. None of these are problems that yield to a clever loss function alone.
For four decades the discipline has lived a curious double life. In structured industrial settings (automotive welding cells, semiconductor wafer handlers, parcel sorters), robots are already a multi-billion-pound industry. In unstructured human environments, the same machines have struggled to fold a towel. The asymmetry has a simple explanation: industrial robots are programmed once, against jigs that hold the world still; everything else demands continual perception, planning and recovery from surprise. Modern AI has begun, finally and unevenly, to close that gap. Learned policies, vision-language-action models and diffusion-based controllers have replaced large parts of the hand-coded stack, while companies such as Boston Dynamics, Tesla, Figure, Unitree, Apptronik and Agility Robotics are racing to ship general-purpose humanoids. Google's RT-2 made it plausible that a single neural network could read a sentence, look at a kitchen, and pour the soup. Whether such systems can be made reliable, safe and cheap enough for everyday deployment is the open question of this decade.
Why robotics is hard
It is tempting to assume that, having solved chess, Go, protein folding and most of language, AI should make short work of loading a dishwasher. The opposite is closer to the truth. The bottleneck is not cognition but embodiment. Real-world inputs are noisy: cameras smear under motion blur, depth sensors fail on transparent and shiny surfaces, and force-torque readings drift with temperature. Real-world dynamics are uncertain: a cardboard box may be empty or full, a cable may snag, a floor tile may be loose. Recovery from failure is expensive: a dropped mug breaks, a slipped scalpel injures, a fallen humanoid bends an actuator costing thousands of pounds.
Three further difficulties compound the picture. The first is the reality gap. Modern policies are typically pretrained in simulation, where billions of timesteps are cheap, and then transferred to physical hardware. Simulators handle rigid-body contact passably but struggle with deformable cloth, granular media, fluids and the intricate friction of a screw entering a thread. A policy that walks confidently in simulation may stagger on real lino. Domain randomisation, training across thousands of perturbed simulators, narrows the gap but does not close it.
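To make domain randomisation concrete, here is a minimal sketch in which every training episode draws a freshly perturbed physics configuration. The parameter ranges are illustrative, and `make_sim` and `train_step` are hypothetical stand-ins for whatever simulator factory and learning update a real pipeline would use.

```python
# A minimal sketch of domain randomisation, not tied to any particular simulator:
# `make_sim` and `train_step` are hypothetical stand-ins for a simulator factory
# and one learning update (for example, a PPO rollout plus gradient step).
import random

def sample_sim_params() -> dict:
    """Draw one perturbed physics configuration; the ranges are illustrative."""
    return {
        "floor_friction": random.uniform(0.4, 1.2),       # lino vs rubber
        "payload_mass_kg": random.uniform(0.0, 2.0),       # empty vs full box
        "camera_latency_s": random.uniform(0.0, 0.05),     # sensor delay
        "motor_strength_scale": random.uniform(0.8, 1.2),  # actuator variation
    }

def train_with_randomisation(policy, make_sim, train_step, episodes=10_000):
    """Train across thousands of perturbed worlds instead of one fixed one."""
    for _ in range(episodes):
        env = make_sim(sample_sim_params())  # a fresh, slightly different world
        policy = train_step(policy, env)     # one update against that world
    return policy
```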
The second is data scarcity. A web-scale language model sees trillions of tokens; a robot collects perhaps a few hundred episodes per day per machine, each with proprioception, vision and forces. Open X-Embodiment, the largest public manipulation dataset assembled so far, contains roughly one million episodes from twenty-two robots, a rounding error compared with what a vision-language model consumes during pretraining.
The third is safety. A language model that hallucinates produces a wrong sentence; a humanoid that hallucinates produces a wrong joint torque next to a child. Certification, fail-safes, force limits and emergency stops are not optional, and they constrain the policies one can deploy. Together these difficulties explain why robotics, despite enormous investment, still lags virtual tasks. The remainder of this section describes how the field is responding.
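As a small illustration of how such constraints sit between a learned policy and the motors, the sketch below clamps commanded joint torques and zeroes them when an emergency stop is pressed. The torque limit and the `estop_pressed` check are placeholders, not values or interfaces from any real platform.

```python
import numpy as np

# Illustrative safety shell around a learned policy: the limit and the
# `estop_pressed` callable are placeholders, not values from a real robot.
TORQUE_LIMIT_NM = 40.0   # per-joint bound enforced regardless of what the policy asks

def safe_command(policy_action: np.ndarray, estop_pressed) -> np.ndarray:
    """Clamp the policy's joint torques and zero them if the e-stop is pressed."""
    if estop_pressed():
        return np.zeros_like(policy_action)   # fail safe: stop commanding motion
    return np.clip(policy_action, -TORQUE_LIMIT_NM, TORQUE_LIMIT_NM)
```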
Vision-language-action models
The most consequential idea of the last three years has been to fold robot control into the same architecture that already understands images and text. A vision-language-action model, or VLA, takes a camera frame and a natural-language instruction as input and produces an action command as output. The breakthrough was to start from a pretrained vision-language backbone, a model that already knows what a "cup" is, what "pick up" means and what a kitchen looks like, and then co-fine-tune it on robot trajectories alongside the original web data.
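In code, the interface amounts to the closed loop sketched below. Here `policy`, `camera` and `robot` are assumed interfaces rather than any particular library, and real deployments wrap far more machinery around this core.

```python
# A schematic VLA control loop. `policy`, `camera` and `robot` are assumed
# interfaces: any VLA exposing a predict_action method, a camera returning
# RGB frames, and a robot accepting low-level commands.
import time

def run_vla(policy, camera, robot, instruction: str, hz: float = 5.0):
    """Closed-loop control: image + instruction in, action out, repeat."""
    period = 1.0 / hz
    while not robot.task_done():
        frame = camera.read()                                # current RGB observation
        action = policy.predict_action(frame, instruction)   # e.g. a 7-DoF arm command
        robot.send(action)                                   # execute on hardware
        time.sleep(period)                                   # crude fixed-rate loop
```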
Google DeepMind's RT-1 (2022) was the first system to bring this recipe to scale. Trained on 130,000 episodes of the Everyday Robots platform performing more than 700 tasks, RT-1 emitted eleven-dimensional action vectors as autoregressive tokens. RT-2, released in 2023, replaced the from-scratch transformer with PaLI-X, a fifty-five-billion-parameter vision-language model. The result was a substantial jump in generalisation: RT-2 could follow novel instructions involving objects and verbs that had never appeared in the robot data. Asked to "pick up the empty drink can", it identified emptiness from visual cues without an explicit training example. Asked to "move the apple to the German national flag colours", it transferred concepts from the web corpus into action.
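The mechanics of emitting actions "as tokens" are simple: each continuous action dimension is discretised into a fixed number of bins so that the transformer's vocabulary can carry it. The sketch below shows the idea, with 256 bins following the RT-1 description and placeholder ranges that a real system would calibrate per dimension.

```python
import numpy as np

# Sketch of RT-style action discretisation: each action dimension is mapped
# to one of 256 bins so the policy can emit it as an ordinary token.
# The per-dimension ranges `low` and `high` are placeholders.
N_BINS = 256

def action_to_tokens(action: np.ndarray, low: np.ndarray, high: np.ndarray) -> np.ndarray:
    """Continuous action vector -> one integer token id per dimension."""
    scaled = (np.clip(action, low, high) - low) / (high - low)   # into [0, 1]
    return np.minimum((scaled * N_BINS).astype(int), N_BINS - 1)

def tokens_to_action(tokens: np.ndarray, low: np.ndarray, high: np.ndarray) -> np.ndarray:
    """Invert the mapping, recovering bin centres as the executed action."""
    return low + (tokens + 0.5) / N_BINS * (high - low)
```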
Open and academic work followed quickly. OpenVLA (Stanford, 2024) is a seven-billion-parameter open-weight VLA built on Llama 2 with SigLIP and DINOv2 vision encoders, trained on Open X-Embodiment. It matched or exceeded RT-2-X on the same benchmark while being roughly seven times smaller and freely available. $\pi_0$ (pi-zero) from Physical Intelligence, released in October 2024, is a three-billion-parameter VLA with a flow-matching action head, trained on 10,000 hours of cross-embodiment data. Demonstrations included folding laundry, bussing tables and assembling cardboard boxes, tasks long considered the hard ceiling of bimanual manipulation. The 2025 successor $\pi_{0.5}$ added hierarchical reasoning over longer horizons.
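A flow-matching action head is trained with a strikingly simple objective: interpolate between Gaussian noise and the demonstrated action chunk, and regress the velocity that carries one to the other. The sketch below shows that loss under illustrative tensor shapes; it follows the generic rectified-flow recipe rather than $\pi_0$'s exact implementation.

```python
import torch

def flow_matching_loss(action_head, obs_embedding, actions):
    """Generic flow-matching loss for an action head (illustrative, not pi-zero's code).

    `action_head(x_t, t, obs_embedding)` is assumed to predict a velocity field
    with the same shape as `actions`: (batch, horizon, action_dim).
    """
    noise = torch.randn_like(actions)            # x_0 ~ N(0, I)
    t = torch.rand(actions.shape[0], 1, 1)       # one interpolation time per sample
    x_t = (1.0 - t) * noise + t * actions        # straight-line path from noise to data
    target_velocity = actions - noise            # d x_t / d t along that path
    predicted = action_head(x_t, t, obs_embedding)
    return torch.mean((predicted - target_velocity) ** 2)
```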
By early 2026 the VLA recipe has solidified. Pretrain a large vision-language model on web data; co-fine-tune on a mixture of robot data drawn from many embodiments; output actions either autoregressively as tokens or via a diffusion or flow-matching head. The lessons are clear. Foundation-model pretraining transfers, surprisingly well, into action. Cross-embodiment data, episodes pooled across many physically different robots, is more valuable than any single platform's data. And language conditioning gives operators a usable interface: instructions, not waypoints. Open questions remain. Can VLAs handle truly long-horizon tasks such as assembling furniture from instructions, or completing a surgical procedure end to end? How much real-robot data is genuinely needed once foundation pretraining and high-fidelity simulation are both available? And how should regulators certify a learned policy whose behaviour cannot be fully enumerated in advance?
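To make the co-fine-tuning mixture in that recipe concrete, the sketch below samples each training batch from a weighted blend of web vision-language data and several robot embodiments. The dataset names and weights are invented for illustration, not any model's published ratios.

```python
import random

# Illustrative co-fine-tuning mixture: per-batch sampling weights over web data
# and several robot embodiments. Names and weights are invented placeholders.
MIXTURE = {
    "web_vision_language": 0.50,
    "single_arm_manipulation": 0.25,
    "bimanual_manipulation": 0.15,
    "mobile_manipulation": 0.10,
}

def sample_source() -> str:
    """Pick which dataset the next training batch is drawn from."""
    names, weights = zip(*MIXTURE.items())
    return random.choices(names, weights=weights, k=1)[0]
```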
Diffusion policies
Behaviour cloning, training a network to imitate human teleoperation, was the workhorse of learned manipulation for a decade. It is simple, but it has a notorious failure mode. Demonstrations of a single task are usually multimodal: there are several reasonable ways to grasp a mug, several reasonable trajectories to wipe a plate. A model that minimises mean-squared error across these alternatives learns to average them, which often produces a trajectory that does neither thing well, reaching halfway between two valid grasps and slipping off the handle.
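The averaging failure is easy to reproduce in a toy setting: if half the demonstrations grasp on one side of the handle and half on the other, the least-squares optimum lands exactly in between, as the few illustrative lines below show.

```python
import numpy as np

# Toy illustration of mode averaging: demonstrations grasp the mug either at
# x = -1 (left of the handle) or x = +1 (right of it), with equal frequency.
demo_grasps = np.array([-1.0, -1.0, +1.0, +1.0])

# The action minimising mean-squared error over these demonstrations is their
# mean: a "grasp" at x = 0.0, exactly between the two valid options.
mse_optimal_action = demo_grasps.mean()
print(mse_optimal_action)   # 0.0 -- reaches between the modes and misses both
```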
The 2023 paper "Diffusion Policy: Visuomotor Policy Learning via Action Diffusion" by Chi and colleagues, presented at Robotics: Science and Systems, recast the problem. Rather than predicting a single action at each timestep, the model is a conditional denoising diffusion model that samples a sequence of actions over a horizon of eight to sixteen steps, conditioned on a short window of observations. Because diffusion can represent multimodal distributions, the policy preserves rather than averages the alternative ways of completing the task. At test time the model denoises a Gaussian into a coherent action sequence; the robot executes a few steps and then re-samples, giving a form of receding-horizon control.
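At inference time the policy is essentially a denoising loop over an action sequence. The sketch below shows a bare DDPM-style reverse loop with a hypothetical `denoiser` network, illustrative shapes and a simple noise schedule; the published method adds observation encoders, receding-horizon execution and other details omitted here.

```python
import torch

def sample_action_sequence(denoiser, obs, horizon=16, action_dim=7, steps=100):
    """DDPM-style reverse loop producing one action sequence (illustrative sketch).

    `denoiser(x, t, obs)` is a hypothetical noise-prediction network conditioned
    on a short window of observations `obs`.
    """
    betas = torch.linspace(1e-4, 0.02, steps)        # simple linear noise schedule
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    x = torch.randn(1, horizon, action_dim)          # start from pure Gaussian noise
    for t in reversed(range(steps)):
        eps = denoiser(x, torch.tensor([t]), obs)    # predicted noise at step t
        mean = (x - betas[t] / torch.sqrt(1 - alpha_bars[t]) * eps) / torch.sqrt(alphas[t])
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + torch.sqrt(betas[t]) * noise      # one denoising step
    return x                                         # shape (1, horizon, action_dim)
```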
Three properties have made diffusion policy the default choice for visuomotor manipulation. First, robustness to demonstration quality: the policy tolerates noisy and inconsistent human teleoperation that ruins simple behaviour-cloning baselines. Second, smoothness: predicting an action sequence rather than a point gives temporal coherence that single-step policies struggle to match. Third, composability: the diffusion head can be bolted onto larger backbones, and most modern VLAs, including $\pi_0$, adopt either a diffusion head or its close cousin, flow-matching.
Diffusion policy is not a panacea. Sampling is more expensive than a single forward pass, which forces a trade-off between control frequency and policy size; researchers commonly run a slower diffusion planner at a few Hertz alongside a fast classical controller at hundreds of Hertz. And diffusion does not solve the data problem: the policy can only sample modes that appeared in training. Nonetheless, the conceptual move, treating action prediction as generative modelling of a trajectory rather than discriminative regression of a point, has proved durable, and its influence on the wider VLA field has been considerable.
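That dual-rate pattern, a slow generative planner feeding a fast tracking controller, looks roughly like the loop below. The `planner`, `controller` and `robot` interfaces and the rates are illustrative assumptions rather than any specific stack.

```python
import time

def dual_rate_control(planner, controller, robot, execute_steps=8, control_hz=200.0):
    """Illustrative receding-horizon loop: replan slowly, track quickly.

    `planner.plan(obs)` returns an action sequence (e.g. from a diffusion policy
    running at a few Hertz); `controller.track(target, state)` turns one target
    into a low-level command at hundreds of Hertz. Both are assumed interfaces.
    """
    while not robot.task_done():
        action_sequence = planner.plan(robot.observe())        # slow replanning
        for target in action_sequence[:execute_steps]:         # execute only a prefix
            command = controller.track(target, robot.state())  # fast inner loop
            robot.send(command)
            time.sleep(1.0 / control_hz)
```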
Humanoid robots in 2026
Humanoids are the headline-grabbing arm of robotics, partly because they are visually compelling and partly because most human environments are built to human dimensions. The state of the art in early 2026 is a crowded field. Tesla Optimus Gen 2 has reached limited internal deployment in Tesla factories for parts handling. Figure AI unveiled Figure 03 in October 2025; the company's robots have logged 1,250 hours at the BMW Spartanburg, South Carolina plant, helping to produce more than 30,000 BMW X3s, and the pilot is scheduled to expand in summer 2026. Unitree has shipped its H1 and the smaller, cheaper G1 in volume, mainly to research laboratories. Apptronik Apollo is in commercial pilot with Mercedes-Benz and GXO Logistics. Agility Robotics' Digit, the longest-running commercial humanoid, performs warehouse tote handling for Amazon and others. Boston Dynamics retired its hydraulic Atlas in April 2024 and replaced it with an all-electric model that is markedly faster, lighter and quieter; commercial deployment in Hyundai factories began in 2025.
Two facts cut against the marketing. First, almost all economically valuable humanoid work in 2026 is still teleoperated, not autonomous. A remote operator in India or Mexico steers the robot through difficult segments while the policy handles routine motion. This is not failure; it is the same recipe that domesticated industrial robotics in the 1980s, only with neural networks rather than fixed jigs. Second, autonomous policies are improving fast. Boston Dynamics, Figure, Unitree and a clutch of academic groups have published autonomous demonstrations of complex bimanual manipulation, dynamic locomotion, recovery from pushes and stair climbing using simulation-trained reinforcement-learning policies and VLA controllers.
The remaining barriers are economic and physical. A humanoid platform costs between £30,000 and £150,000 in 2026, a battery lasts between two and six hours, and reliability over thousands of operating hours is unproven. Whether the form factor is the right answer at all, given that most warehouses would be better served by wheels and most homes by a single-armed countertop unit, is the strategic question that the industry will answer in the next five years.
Self-driving
Self-driving cars are robotics' oldest active deployment programme and its most chastening lesson in optimism. Waymo, the Alphabet subsidiary, runs commercial rider-only operations in Phoenix, San Francisco, Los Angeles, Austin and parts of San Mateo county; it had completed more than ten million paid rides by late 2025 and continues to expand. Its sensor stack (lidar, radar and cameras, backed by high-definition maps) and conservative operational design domain have allowed it to stay ahead on safety statistics, with reported incident rates substantially below those of human drivers in matched conditions.
Tesla FSD takes a different route: vision-only inference at scale, no lidar, no high-definition maps, deployed in supervised mode across the full US fleet. The system improves rapidly with every model release but remains classified by the regulator as Level 2: the human is responsible. Cruise, the GM subsidiary, paused operations after a 2023 incident in San Francisco, restarted limited testing under new leadership, and was wound down by GM in late 2024. Mobileye ships driver-assist and supervised autonomy across many vehicle brands. Wayve in London pursues end-to-end learning without HD maps and has announced commercial pilots with Nissan and Uber. Pony.ai and Baidu Apollo run robotaxi services in Beijing, Shenzhen and Guangzhou.
The field's lesson is that the long tail is real. Closing the gap between 99 per cent and 99.999 per cent reliability has consumed a decade and tens of billions of pounds, and the path to true ubiquity still runs through regulation, insurance and public trust as much as through engineering. Surgical robotics faces the same long tail in a higher-stakes domain.
What you should take away
- Robotics is hard for embodied reasons, not cognitive ones. Noisy sensors, uncertain dynamics, expensive failure and the simulation-to-reality gap make it qualitatively different from virtual tasks.
- Vision-language-action models are now the dominant paradigm. RT-2, OpenVLA and $\pi_0$ show that pretraining on web data and co-fine-tuning on cross-embodiment robot trajectories produces policies that generalise to new objects, verbs and scenes.
- Diffusion policy solved the multimodal-demonstration problem. Generating action sequences via conditional denoising preserves alternative valid trajectories rather than averaging them, and the technique has been absorbed into almost every modern VLA.
- Humanoid robots are commercial in 2026 but mostly teleoperated for valuable work. Tesla Optimus, Figure 03, Unitree, Apptronik Apollo, Agility Digit and Boston Dynamics' electric Atlas are deployed in factories and warehouses; autonomous policies are improving but reliability and cost remain barriers.
- Self-driving has matured unevenly. Waymo runs paid rider-only services in several US cities; Tesla FSD remains supervised; Cruise has been wound down; Mobileye, Wayve and Chinese operators occupy the middle ground. The lesson is that the long tail of safety, regulation and trust dominates the deployment curve.