Decision Transformer (Chen et al., 2021) recasts reinforcement learning (RL) as a conditional sequence-modelling problem: rather than learning a value function or policy gradient, it trains a causal transformer to predict the next action in a trajectory given past states, actions, and a desired return-to-go. At evaluation time, conditioning on a high desired return causes the model to roll out high-reward behaviour. The result is competitive with state-of-the-art offline RL on Atari, OpenAI Gym, and Key-to-Door tasks despite being trained purely by supervised prediction of actions (cross-entropy or mean-squared error).
Trajectory representation. A trajectory of length $T$ is encoded as a sequence of $3T$ tokens, interleaving return-to-go, state, and action:
$$\tau = (\hat{R}_1, s_1, a_1,\; \hat{R}_2, s_2, a_2,\; \ldots,\; \hat{R}_T, s_T, a_T),$$
where the return-to-go at step $t$ is
$$\hat{R}_t = \sum_{t'=t}^{T} r_{t'},$$
the sum of remaining rewards in the trajectory. Each token type is embedded by a separate linear projection plus a learned timestep embedding (one per RL step, shared across the three token types of that step).
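As a concrete sketch of this encoding, the snippet below computes returns-to-go with a reversed cumulative sum and embeds the three token types with separate linear projections plus a shared timestep embedding. The module layout and names (`TrajectoryEmbedding`, `hidden`, `max_ep_len`) are illustrative assumptions, not the reference implementation.

```python
import torch
import torch.nn as nn

def returns_to_go(rewards):
    # rewards: (T,) tensor. R̂_t = r_t + r_{t+1} + ... + r_T,
    # computed as a cumulative sum over the reversed sequence.
    return torch.flip(torch.cumsum(torch.flip(rewards, dims=[0]), dim=0), dims=[0])

class TrajectoryEmbedding(nn.Module):
    def __init__(self, state_dim, act_dim, hidden, max_ep_len):
        super().__init__()
        self.embed_rtg = nn.Linear(1, hidden)               # separate projection
        self.embed_state = nn.Linear(state_dim, hidden)     # per token type
        self.embed_action = nn.Linear(act_dim, hidden)
        self.embed_time = nn.Embedding(max_ep_len, hidden)  # one per RL step

    def forward(self, rtg, states, actions, timesteps):
        # rtg: (T, 1), states: (T, state_dim), actions: (T, act_dim),
        # timesteps: (T,) LongTensor of step indices.
        t = self.embed_time(timesteps)  # shared across the 3 tokens of a step
        r = self.embed_rtg(rtg) + t
        s = self.embed_state(states) + t
        a = self.embed_action(actions) + t
        # Interleave into (R̂_1, s_1, a_1, ..., R̂_T, s_T, a_T): 3T tokens.
        return torch.stack([r, s, a], dim=1).reshape(3 * r.shape[0], -1)
```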
Training objective. Standard supervised learning on offline data: for every action token, predict it from the prefix of preceding tokens. The loss is cross-entropy for discrete actions and mean-squared error for continuous actions:
$$\mathcal{L}(\theta) = \mathbb{E}_{\tau \sim \mathcal{D}} \sum_{t=1}^{T} \mathcal{L}_{\text{action}}\big(a_t,\; \pi_\theta(\hat{R}_{1:t}, s_{1:t}, a_{1:t-1})\big).$$
No bootstrapping, no Bellman targets, no off-policy correction: just imitation conditioned on returns.
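A minimal sketch of this objective for continuous actions, where `model` stands for any causal transformer over the interleaved tokens that returns one action prediction per step (all names here are assumptions):

```python
import torch.nn.functional as F

def dt_loss(model, rtg, states, actions, timesteps):
    # The causal mask guarantees the prediction made at the s_t token only
    # sees (R̂_{1:t}, s_{1:t}, a_{1:t-1}), so ground-truth actions can serve
    # as both inputs and targets (teacher forcing).
    action_preds = model(rtg, states, actions, timesteps)  # (T, act_dim)
    return F.mse_loss(action_preds, actions)  # cross-entropy if discrete
```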
Inference. Choose a target return $\hat{R}_1$ at the start of an episode, observe state $s_1$, query the transformer for $a_1$, execute it in the environment, observe reward $r_1$ and next state $s_2$, update the running return-to-go to $\hat{R}_2 = \hat{R}_1 - r_1$, and continue. The remarkable empirical finding is that conditioning on returns higher than any in the training data can still produce strong behaviour: the model extrapolates along the return axis.
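A minimal sketch of this loop, assuming a Gymnasium-style `env` and a hypothetical `model.predict_action` helper that feeds the recent context through the transformer:

```python
def rollout(env, model, target_return, max_steps=1000):
    state, _ = env.reset()
    rtg = target_return
    context = []                       # (R̂, s, a) triples seen so far
    for t in range(max_steps):
        action = model.predict_action(context, rtg, state, t)
        next_state, reward, terminated, truncated, _ = env.step(action)
        context.append((rtg, state, action))
        rtg -= reward                  # R̂_{t+1} = R̂_t - r_t
        state = next_state
        if terminated or truncated:
            break
    return target_return - rtg         # total reward actually achieved
```

In practice the context fed to the transformer is truncated to its window (the paper conditions on a short recent window of $K$ timesteps rather than the full episode).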
Why it works. Three perspectives:
- Return as goal. Treating the return as a goal token makes Decision Transformer a form of goal-conditioned imitation learning; Hindsight Experience Replay (HER) does something similar by relabelling trajectories with the goals they actually achieved.
- Sequence-model priors. Self-attention over long contexts can assign credit directly, attending from a reward-bearing step back to the action responsible, whereas Bellman backups propagate credit one step at a time; this matters most in long-horizon, sparse-reward tasks such as Key-to-Door.
- Avoidance of bootstrapping. Value-based offline RL must guard against bootstrapping through out-of-distribution actions, the failure mode that CQL and BCQ are designed to mitigate; supervised learning on full trajectories sidesteps the issue entirely.
Variants and extensions.
- Trajectory Transformer (Janner et al., 2021), concurrent work, discretises states, actions, and rewards into tokens and plans with beam search over the predicted sequence.
- Online Decision Transformer (Zheng et al., 2022) fine-tunes an offline-pretrained model with online rollouts while keeping the same sequence-modelling objective.
- Multi-game Decision Transformer (Lee et al., 2022) trains a single 200M-parameter transformer on 41 Atari games, achieving 126% of human-normalised score on average.
Limitations.
- Without online interaction, the performance ceiling is largely set by the best returns in the offline dataset; the extrapolation noted above is limited and not reliable across tasks.
- Conditioning on infeasible returns pushes the model off-distribution and can produce erratic behaviour.
- Return conditioning is unreliable under stochastic dynamics, where a high observed return may reflect luck rather than a repeatable policy; variants with explicit world models address this partially.
Decision Transformer reframed offline RL as a transformer-shaped problem and helped catalyse a wave of "RL is sequence modelling" research; its condition-on-a-desired-outcome framing also echoes later techniques such as chain-of-thought prompting of language models for reasoning tasks.
Related terms: Transformer, Reinforcement Learning, Cross-Entropy Loss, Language Model
Discussed in:
- Chapter 13: Attention & Transformers, Reinforcement Learning