NeurIPS ’25 Tutorial: Foundations of Imitation Learning
From Language Modeling to Continuous Control
Adam Block, Dylan Foster, and Max Simchowitz
Overview
This tutorial presents imitation learning (IL) as a unifying framework through which to study the supervised learning paradigm at the heart of many of the most impressive advances in generative AI: training a foundation model by imitating a large corpus of domain-specific demonstrations. Examples include large language model pre-training, robotic behavior foundation models, and foundation models for chemistry and the life sciences. With this lens, the aims of the tutorial are to (1) give an overview of recent theoretical advances that aim to understand when and why imitation learning can succeed with powerful generative models; (2) explain why the field has converged on certain interventions and best practices that are now ubiquitous; and (3) highlight new opportunities for transfer between theory and practice.
A running theme is understanding domain-specific challenges and solutions. We examine how discrete settings (language modeling) and continuous settings (robotics) require different algorithmic interventions, including action chunking, score-matching, and interactive data collection. In parallel, we unify seemingly disparate techniques: next-token prediction in language models becomes behavior cloning with log-loss, while exposure bias in autoregressive generation mirrors the compounding error phenomenon in control.
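To make the second correspondence concrete, here is a minimal sketch (ours, not from the tutorial materials; it assumes PyTorch, and the model, data, and hyperparameters are all toy placeholders). Behavior cloning with log-loss on expert token sequences is literally next-token prediction with cross-entropy, and the short rollout at the end shows where exposure bias enters: at deployment the policy conditions on its own outputs rather than expert prefixes.

```python
# Minimal sketch: behavior cloning with log-loss == next-token prediction.
# Everything here (names, sizes, learning rate) is an illustrative assumption.

import torch
import torch.nn as nn

vocab_size, hidden = 32, 64  # toy "action space" and model width

class Policy(nn.Module):
    """Any autoregressive sequence model works; a GRU keeps the sketch short."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.rnn = nn.GRU(hidden, hidden, batch_first=True)
        self.head = nn.Linear(hidden, vocab_size)

    def forward(self, tokens):
        h, _ = self.rnn(self.embed(tokens))
        return self.head(h)  # logits over the next token / action

policy = Policy()
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)
log_loss = nn.CrossEntropyLoss()

# "Expert demonstrations": a batch of token (state-action) sequences.
demos = torch.randint(0, vocab_size, (16, 20))

# Behavior cloning step: maximize the log-likelihood of the expert's next
# action given the prefix, i.e., standard next-token prediction.
logits = policy(demos[:, :-1])
loss = log_loss(logits.reshape(-1, vocab_size), demos[:, 1:].reshape(-1))
opt.zero_grad()
loss.backward()
opt.step()

# Exposure bias at deployment: the policy now conditions on its *own* past
# outputs rather than expert prefixes, so an early mistake shifts the state
# distribution and errors can compound over the horizon.
tokens = demos[:1, :1]  # seed with an expert's first token
for _ in range(19):
    with torch.no_grad():
        nxt = policy(tokens)[:, -1].argmax(dim=-1, keepdim=True)
    tokens = torch.cat([tokens, nxt], dim=1)
```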
Panel: The Role of Theory in Modern Machine Learning
Featuring Nathan Srebro (TTIC; moderator), Kianté Brantley (Harvard University), Surbhi Goel (University of Pennsylvania), Audrey Huang (UIUC), and Tatsunori Hashimoto (Stanford University).
References
Classical Theory
- Stéphane Ross and J. Andrew Bagnell. Efficient reductions for imitation learning. AISTATS, 2010.
- Stéphane Ross, Geoffrey J. Gordon, and J. Andrew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. AISTATS, 2011.
- Stéphane Ross and J. Andrew Bagnell. Reinforcement and imitation learning via interactive no-regret learning. arXiv preprint arXiv:1406.5979, 2014.
- Jonathan Ho and Stefano Ermon. Generative adversarial imitation learning. NeurIPS, 2016.
- Nived Rajaraman, Lin F. Yang, Jiantao Jiao, and Kannan Ramchandran. Toward the fundamental limits of imitation learning. NeurIPS, 2020.
- Gokul Swamy, Sanjiban Choudhury, J. Andrew Bagnell, and Zhiwei Steven Wu. Of moments and matching: a game-theoretic framework for closing the imitation gap. ICML, 2021.
Imitation Learning in Discrete Settings: Contemporary Results
- Dylan J. Foster, Adam Block, and Dipendra Misra. Is behavior cloning all you need? Understanding horizon in imitation learning. NeurIPS, 2024.
- Dhruv Rohatgi, Adam Block, Audrey Huang, Akshay Krishnamurthy, and Dylan J. Foster. Computational-statistical tradeoffs at the next-token prediction barrier: Autoregressive and imitation learning under misspecification. COLT, 2025.
- Nirmit Joshi, Gal Vardi, Adam Block, Surbhi Goel, Zhiyuan Li, Theodor Misiakiewicz, and Nathan Srebro. A theory of learning with autoregressive chain of thought. COLT, 2025.
Imitation Learning in Continuous Settings: Contemporary Results
- Michael Laskey, Jonathan Lee, Roy Fox, Anca Dragan, and Ken Goldberg. DART: Noise injection for robust imitation learning. CoRL, 2017.
- Daniel Pfrommer, Thomas Zhang, Stephen Tu, and Nikolai Matni. TaSIL: Taylor series imitation learning. NeurIPS, 2022.
- Adam Block, Ali Jadbabaie, Daniel Pfrommer, Max Simchowitz, and Russ Tedrake. Provable guarantees for generative behavior cloning: Bridging low-level stability and high-level behavior. NeurIPS, 2023.
- Max Simchowitz, Daniel Pfrommer, and Ali Jadbabaie. The pitfalls of imitation learning when actions are continuous. Preprint, 2025.
Optimization Perspective
- Boris T. Polyak and Anatoli B. Juditsky. Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, 30(4):838-855, 1992.
- Pavel Izmailov, Dmitrii Podoprikhin, Timur Garipov, Dmitry Vetrov, and Andrew Gordon Wilson. Averaging weights leads to wider optima and better generalization. UAI, 2018.
- Adam Block, Dylan J. Foster, Akshay Krishnamurthy, Max Simchowitz, and Cyril Zhang. Butterfly effects of SGD noise: Error amplification in behavior cloning and autoregression. ICLR, 2024.
- Adam Block and Cyril Zhang. EMA without the lag: Bias-corrected iterate averaging schemes. Preprint, 2025.