Train your first Decision Transformer
What Happened
Decision Transformer reframes offline RL as sequence modeling. Given a target return, past states, and past actions, a GPT-style model predicts the next action. No value functions. No Bellman backups. Just a causal transformer trained on logged trajectories.
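To make that concrete, here is a minimal sketch of return-conditioned action prediction using the Hugging Face transformers implementation of Decision Transformer. The state and action dimensions and the target return of 300.0 are illustrative placeholders, not values from the original post.

```python
import torch
from transformers import DecisionTransformerConfig, DecisionTransformerModel

# Illustrative dimensions; match these to your observation/action spaces.
config = DecisionTransformerConfig(state_dim=17, act_dim=6)
model = DecisionTransformerModel(config).eval()

# Context so far: one step. Inputs are (return-to-go, state, action) triples per timestep.
returns_to_go = torch.tensor([[[300.0]]])        # (batch, seq, 1): the return we ask the model to achieve
states = torch.randn(1, 1, config.state_dim)     # (batch, seq, state_dim)
actions = torch.zeros(1, 1, config.act_dim)      # placeholder slot for the action being predicted
timesteps = torch.zeros(1, 1, dtype=torch.long)  # absolute timestep indices
attention_mask = torch.ones(1, 1)

with torch.no_grad():
    out = model(states=states, actions=actions, rewards=None,
                returns_to_go=returns_to_go, timesteps=timesteps,
                attention_mask=attention_mask)

next_action = out.action_preds[0, -1]
# Act, then append the new state and the decremented return-to-go, and repeat.
```

Subtracting each observed reward from the return-to-go at every step is what keeps the conditioning consistent as the episode unfolds.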
Our Take
If you have behavioral logs — game replays, robotic trajectories, clickstreams — you can train a policy without a live environment. Most teams building recommendation agents still deploy bandit algorithms when a Decision Transformer on existing interaction logs would outperform them. Training on D4RL benchmarks costs under $50 on one A100.
Teams with 100K+ logged episodes should test this before spinning up RL infrastructure. Pure online RL shops can skip it.
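For scale on the training claim, here is a sketch of the training loop under the same Hugging Face implementation. The random tensors stand in for minibatches sampled from a real offline dataset such as D4RL, and the hyperparameters are assumptions, not a tuned recipe.

```python
import torch
from transformers import DecisionTransformerConfig, DecisionTransformerModel

config = DecisionTransformerConfig(state_dim=17, act_dim=6, max_ep_len=1000)
model = DecisionTransformerModel(config)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

B, K = 64, 20  # batch size and context length; the DT paper uses short windows like K=20
states = torch.randn(B, K, config.state_dim)  # stand-in for logged observations
actions = torch.randn(B, K, config.act_dim)   # stand-in for logged actions
returns_to_go = torch.randn(B, K, 1)          # cumulative future reward at each step
timesteps = torch.arange(K).repeat(B, 1)      # absolute timestep of each token
attention_mask = torch.ones(B, K)

output = model(states=states, actions=actions, rewards=None,
               returns_to_go=returns_to_go, timesteps=timesteps,
               attention_mask=attention_mask)

# The entire objective: regress the logged actions (MSE for continuous control).
loss = ((output.action_preds - actions) ** 2).mean()
optimizer.zero_grad()
loss.backward()
optimizer.step()
```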
What To Do
Train a Decision Transformer on your existing interaction logs instead of standing up a PPO training loop. Offline trajectories already encode your reward signal, and you skip environment simulation costs entirely.
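The main data preparation step is computing the return-to-go column from your logs. A sketch of that conversion, assuming your logs are already segmented into episodes with per-step rewards; the undiscounted sum (gamma=1.0) matches the original paper's setup.

```python
import numpy as np

def returns_to_go(rewards, gamma=1.0):
    """Cumulative future reward at each step; the quantity DT conditions on."""
    rtg = np.zeros_like(rewards, dtype=np.float64)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        rtg[t] = running
    return rtg

# Hypothetical logged episode: per-step rewards from your interaction logs.
episode_rewards = np.array([0.0, 1.0, 0.0, 2.0])
print(returns_to_go(episode_rewards))  # [3. 3. 2. 2.]
```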