NeurIPS ’25 Tutorial: Foundations of Imitation Learning
From Language Modeling to Continuous Control
Overview
This tutorial presents imitation learning (IL) as a unifying framework for studying the supervised learning paradigm at the heart of many of the most impressive advances in generative AI: training a foundation model by imitating a large corpus of domain-specific demonstrations. This paradigm underlies large language model pre-training, robotic behavior foundation models, and foundation models for chemistry and the life sciences. Through this lens, the tutorial aims to (1) give an overview of recent theoretical advances that explain when and why imitation learning can succeed with powerful generative models; (2) explain why the field has converged on certain interventions and best practices that are now ubiquitous; and (3) highlight new opportunities for transfer between theory and practice.
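To make the paradigm concrete, here is a minimal sketch of imitation learning in its simplest form, behavior cloning: supervised learning on (observation, expert action) pairs from a demonstration corpus. The model, dimensions, and synthetic data below are illustrative assumptions, not part of the tutorial material.

```python
# Minimal behavior-cloning sketch: imitation reduces to supervised learning on
# (observation, expert action) pairs. All shapes, hyperparameters, and data
# here are illustrative assumptions.
import torch
import torch.nn as nn

obs_dim, act_dim = 32, 8  # hypothetical observation/action dimensions
policy = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, act_dim))
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

# Stand-in demonstration corpus (in practice: logged expert trajectories).
demo_obs = torch.randn(1024, obs_dim)
demo_act = torch.randn(1024, act_dim)

for step in range(1000):
    idx = torch.randint(0, len(demo_obs), (64,))  # sample a minibatch of demos
    pred = policy(demo_obs[idx])                  # policy's predicted action
    loss = ((pred - demo_act[idx]) ** 2).mean()   # match the expert's action
    opt.zero_grad()
    loss.backward()
    opt.step()
```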
A running theme is understanding domain-specific challenges and solutions. We examine how discrete settings (language modeling) and continuous settings (robotics) require different algorithmic interventions, including action chunking, score-matching, and interactive data collection. In parallel, we unify seemingly disparate techniques: next-token prediction in language models becomes behavior cloning with log-loss, while exposure bias in autoregressive generation mirrors the compounding error phenomenon in control.
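In the discrete case this parallel is exact, as the sketch below illustrates: next-token prediction with the cross-entropy loss is behavior cloning with log-loss, where the "state" is the token prefix and the "action" is the next token. The bigram-style model, vocabulary size, and synthetic data are illustrative assumptions.

```python
# Next-token prediction as behavior cloning with log-loss. State = token
# prefix (summarized here by the previous token, i.e. a bigram model);
# action = next token. Everything below is an illustrative assumption.
import torch
import torch.nn.functional as F

vocab, d = 100, 64
emb = torch.nn.Embedding(vocab, d)
head = torch.nn.Linear(d, vocab)

tokens = torch.randint(0, vocab, (16, 12))  # stand-in demonstration sequences

logits = head(emb(tokens[:, :-1]))          # action logits at each prefix
targets = tokens[:, 1:]                     # the expert's "actions"
bc_log_loss = F.cross_entropy(logits.reshape(-1, vocab), targets.reshape(-1))
# Minimizing bc_log_loss is exactly the next-token-prediction objective used
# in language model pre-training.

# Compounding-error flavor: if the learned policy errs with probability eps at
# each of H generation steps, it reproduces the expert trajectory with
# probability (1 - eps) ** H, so small per-step errors snowball with horizon.
eps, H = 0.01, 300
print((1 - eps) ** H)  # ~0.049: the rollout is almost surely off-distribution
```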
Panel: New Directions in Foundations of Machine Learning
Featuring Nathan Srebro (TTIC; moderator), Kianté Brantley (Harvard University), Surbhi Goel (University of Pennsylvania), Audrey Huang (UIUC), and Tatsunori Hashimoto (Stanford University).

References
Classical Results
- Efficient reductions for imitation learning. Stéphane Ross and J. Andrew Bagnell. AISTATS, 2010.
- A reduction of imitation learning and structured prediction to no-regret online learning. Stéphane Ross, Geoffrey J. Gordon, and J. Andrew Bagnell. AISTATS, 2011.
- Reinforcement and imitation learning via interactive no-regret learning. Stéphane Ross and J. Andrew Bagnell. Preprint, 2014.
- Generative adversarial imitation learning. Jonathan Ho and Stefano Ermon. NeurIPS, 2016.
- Toward the fundamental limits of imitation learning. Nived Rajaraman, Lin F. Yang, Jiantao Jiao, and Kannan Ramchandran. NeurIPS, 2020.
- Of moments and matching: a game-theoretic framework for closing the imitation gap. Gokul Swamy, Sanjiban Choudhury, J. Andrew Bagnell, and Zhiwei Steven Wu. ICML, 2021.
Imitation Learning in Discrete Settings: Contemporary Results
- Is behavior cloning all you need? Understanding horizon in imitation learning. Dylan J. Foster, Adam Block, and Dipendra Misra. NeurIPS, 2024.
- Computational-statistical tradeoffs at the next-token prediction barrier: Autoregressive and imitation learning under misspecification. Dhruv Rohatgi, Adam Block, Audrey Huang, Akshay Krishnamurthy, and Dylan J. Foster. COLT, 2025.
- A theory of learning with autoregressive chain of thought. Nirmit Joshi, Gal Vardi, Adam Block, Surbhi Goel, Zhiyuan Li, Theodor Misiakiewicz, and Nathan Srebro. COLT, 2025.
- The coverage principle: How pre-training enables post-training. Fan Chen, Audrey Huang, Noah Golowich, Sadhika Malladi, Adam Block, Jordan T. Ash, Akshay Krishnamurthy, and Dylan J. Foster. Preprint, 2025.
Imitation Learning in Continuous Settings: Contemporary Results
- DART: Noise injection for robust imitation learning. Michael Laskey, Jonathan Lee, Roy Fox, Anca Dragan, and Ken Goldberg. CoRL, 2017.
- TaSIL: Taylor series imitation learning. Daniel Pfrommer, Thomas Zhang, Stephen Tu, and Nikolai Matni. NeurIPS, 2022.
- Provable guarantees for generative behavior cloning: Bridging low-level stability and high-level behavior. Adam Block, Ali Jadbabaie, Daniel Pfrommer, Max Simchowitz, and Russ Tedrake. NeurIPS, 2023.
- The pitfalls of imitation learning when actions are continuous. Max Simchowitz, Daniel Pfrommer, and Ali Jadbabaie. COLT, 2025.
Optimization Perspective
- Acceleration of stochastic approximation by averaging. Boris T. Polyak and Anatoli B. Juditsky. SIAM Journal on Control and Optimization, 30(4):838–855, 1992.
- Averaging weights leads to wider optima and better generalization. Pavel Izmailov, Dmitrii Podoprikhin, Timur Garipov, Dmitry Vetrov, and Andrew Gordon Wilson. UAI, 2018.
- Butterfly effects of SGD noise: Error amplification in behavior cloning and autoregression. Adam Block, Dylan J. Foster, Akshay Krishnamurthy, Max Simchowitz, and Cyril Zhang. ICLR, 2024.
- EMA without the lag: Bias-corrected iterate averaging schemes. Adam Block and Cyril Zhang. Preprint, 2025.


