ICML ’22 Tutorial: Bridging Learning and Decision Making
Overview
This tutorial will give an overview of the theoretical foundations of interactive decision making (high-dimensional/contextual bandits, reinforcement learning, and beyond), a promising paradigm for developing AI systems capable of intelligently exploring unknown environments. The tutorial will focus on connections and parallels between supervised learning/estimation and decision making, and will build on recent research that provides (i) sample complexity measures for interactive decision making that are necessary and sufficient for sample-efficient learning, and (ii) unified algorithm design principles that achieve optimal sample complexity. Building on this unified approach as a foundation, the main aim of the tutorial will be to give a bird's-eye view of the statistical landscape of reinforcement learning (e.g., which modeling assumptions lead to sample-efficient algorithms). Topics covered will range from basic challenges and solutions (exploration in tabular RL, contextual bandits) to the current frontier of understanding. We will also highlight practical algorithms.
Recording (ICML)
Slides
References
Primary References
- Peter Auer. Using Confidence Bounds for Exploitation-Exploration Trade-offs. JMLR, 2002.
- Dylan J. Foster and Alexander Rakhlin. Beyond UCB: Optimal and Efficient Contextual Bandits with Regression Oracles. ICML, 2020.
- Yasin Abbasi-Yadkori, Csaba Szepesvari, and David Pal. Improved algorithms for linear stochastic bandits. NeurIPS, 2011.
- Dylan J. Foster, Sham M. Kakade, Jian Qian, and Alexander Rakhlin. The statistical complexity of interactive decision making. Preprint, 2021.
- Chi Jin, Zhuoran Yang, Zhaoran Wang, and Michael I. Jordan. Provably efficient reinforcement learning with linear function approximation. COLT, 2020.
- Daniel Russo and Benjamin Van Roy. Eluder dimension and the sample complexity of optimistic exploration. NeurIPS, 2013.
- Nan Jiang, Akshay Krishnamurthy, Alekh Agarwal, John Langford, and Robert E. Schapire. Contextual decision processes with low Bellman rank are PAC learnable. ICML, 2017.
Additional References
Decision-Estimation Coefficient
- Dylan J. Foster, Sham M. Kakade, Jian Qian, and Alexander Rakhlin. The statistical complexity of interactive decision making. Preprint, 2021.
- Dylan J. Foster, Alexander Rakhlin, Ayush Sekhari, and Karthik Sridharan. On the complexity of adversarial decision making. Preprint, 2022.
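For orientation, the complexity measure studied in the two preprints above, the Decision-Estimation Coefficient, is a minimax quantity of roughly the following form (the notation follows the 2021 preprint; details such as localization of the model class and the choice of divergence vary between versions, so this should be read as a sketch rather than the papers' exact definition):

```latex
% Sketch of the Decision-Estimation Coefficient for a model class \mathcal{M},
% reference model \bar{M}, and scale parameter \gamma > 0:
\mathrm{dec}_{\gamma}(\mathcal{M}, \bar{M})
  = \inf_{p \in \Delta(\Pi)} \sup_{M \in \mathcal{M}}
    \mathbb{E}_{\pi \sim p}\!\left[
        f^{M}(\pi_{M}) - f^{M}(\pi)
        - \gamma \cdot D_{\mathrm{H}}^{2}\!\bigl(M(\pi), \bar{M}(\pi)\bigr)
    \right]
```

Here Π is the decision space, f^M(π) is the expected reward of decision π under model M, π_M is the optimal decision for M, M(π) is the distribution over observations induced by playing π in M, and D_H^2 is the squared Hellinger distance.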
Multi-Armed Bandits
- William R. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 1933.
- Herbert Robbins. Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society, 1952.
- T.L. Lai and Herbert Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 1985.
- Peter Auer. Using Confidence Bounds for Exploitation-Exploration Trade-offs. JMLR, 2002.
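To make the classical exploration-exploitation trade-off concrete, below is a minimal Python sketch (not from the tutorial materials) of the upper-confidence-bound index strategy analyzed in the Auer (2002) paper above: play each arm once, then repeatedly pull the arm maximizing its empirical mean plus a confidence bonus. The bonus constant and the `pull` interface are illustrative choices, not the paper's exact algorithm.

```python
import math
import random


def ucb1(pull, n_arms, horizon):
    """Minimal UCB1-style bandit loop.

    `pull(a)` returns a stochastic reward in [0, 1] for arm `a`.
    Returns the total reward collected over `horizon` rounds.
    """
    counts = [0] * n_arms           # number of pulls per arm
    means = [0.0] * n_arms          # empirical mean reward per arm
    total = 0.0
    for t in range(1, horizon + 1):
        if t <= n_arms:
            arm = t - 1             # initialization: play each arm once
        else:
            # index = empirical mean + confidence bonus
            arm = max(
                range(n_arms),
                key=lambda a: means[a] + math.sqrt(2.0 * math.log(t) / counts[a]),
            )
        r = pull(arm)
        counts[arm] += 1
        means[arm] += (r - means[arm]) / counts[arm]   # running-mean update
        total += r
    return total


# Example: three Bernoulli arms with unknown means.
if __name__ == "__main__":
    true_means = [0.3, 0.5, 0.7]
    reward = ucb1(lambda a: float(random.random() < true_means[a]), 3, 10_000)
    print(f"average reward: {reward / 10_000:.3f}")
```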
Contextual Bandits
- Naoki Abe and Philip M. Long. Associative Reinforcement Learning using Linear Probabilistic Concepts. ICML, 1999.
- Peter Auer, Nicolo Cesa-Bianchi, Yoav Freund, and Robert E. Schapire. The non-stochastic multi-armed bandit problem. SIAM Journal on Computing, 2002.
- John Langford and Tong Zhang. The epoch-greedy algorithm for contextual multi-armed bandits. NeurIPS, 2007.
- Lihong Li, Wei Chu, John Langford, and Robert E. Schapire. A contextual-bandit approach to personalized news article recommendation. WWW, 2010.
- Alekh Agarwal, Daniel Hsu, Satyen Kale, John Langford, Lihong Li, and Robert E. Schapire. Taming the monster: A fast and simple algorithm for contextual bandits. ICML, 2014.
- Dylan J. Foster and Alexander Rakhlin. Beyond UCB: Optimal and Efficient Contextual Bandits with Regression Oracles. ICML, 2020.
- David Simchi-Levi and Yunzong Xu. Bypassing the Monster: A Faster and Simpler Optimal Algorithm for Contextual Bandits under Realizability. Mathematics of Operations Research, 2022.
- Alberto Bietti, Alekh Agarwal, and John Langford. A contextual bandit bake-off. JMLR, 2021.
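The Foster-Rakhlin and Simchi-Levi-Xu papers above reduce contextual bandits to regression: given a regression oracle's predicted rewards for the current context, actions are sampled from an inverse-gap-weighted distribution that spreads exploration according to how suboptimal each action looks. A minimal sketch of that action-selection step follows; the exploration parameter `gamma` and the interface are illustrative, and the learning-rate schedules from the papers are omitted.

```python
import random


def inverse_gap_weighting(predicted_rewards, gamma):
    """Inverse-gap-weighted action distribution (SquareCB-style).

    Given predicted rewards for each action and an exploration parameter
    gamma > 0, put probability ~ 1 / (A + gamma * gap) on each suboptimal
    action and the remaining mass on the greedy action.
    """
    n = len(predicted_rewards)
    best = max(range(n), key=lambda a: predicted_rewards[a])
    probs = [0.0] * n
    for a in range(n):
        if a != best:
            gap = predicted_rewards[best] - predicted_rewards[a]
            probs[a] = 1.0 / (n + gamma * gap)
    probs[best] = 1.0 - sum(probs)
    return probs


def sample_action(probs):
    """Draw an action index from a probability vector."""
    return random.choices(range(len(probs)), weights=probs, k=1)[0]


# Example: four actions whose rewards were predicted by some regression model.
if __name__ == "__main__":
    preds = [0.1, 0.4, 0.35, 0.8]
    p = inverse_gap_weighting(preds, gamma=20.0)
    print(p, "sum =", sum(p))
    print("sampled action:", sample_action(p))
```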
Structured Bandits
- Baruch Awerbuch and Robert Kleinberg. Online linear optimization and adaptive routing. JCSS, 2008.
- Varsha Dani, Thomas P. Hayes, and Sham M. Kakade. Stochastic linear optimization under bandit feedback. COLT, 2008.
- Wei Chu, Lihong Li, Lev Reyzin, and Robert E. Schapire. Contextual bandits with linear payoff functions. AISTATS, 2011.
- Yasin Abbasi-Yadkori, Csaba Szepesvari, and David Pal. Improved algorithms for linear stochastic bandits. NeurIPS, 2011.
- Elad Hazan, Zohar Karnin, and Raghu Meka. Volumetric spanners: An efficient exploration basis for learning. COLT, 2014.
- Robert Kleinberg. Nearly tight bounds for the continuum-armed bandit problem. NeurIPS, 2004.
- Abraham D. Flaxman, Adam Tauman Kalai, and H. Brendan McMahan. Online convex optimization in the bandit setting: Gradient descent without a gradient. SODA, 2005.
- Alekh Agarwal, Dean P Foster, Daniel Hsu, Sham M. Kakade, and Alexander Rakhlin. Stochastic convex optimization with bandit feedback. SIAM Journal on Optimization, 2013.
- Sebastien Bubeck, Yin Tat Lee, and Ronen Eldan. Kernel-based methods for bandit convex optimization. STOC, 2017.
- Peter Auer, Ronald Ortner, and Csaba Szepesvari. Improved rates for the stochastic continuum-armed bandit problem. COLT, 2007.
- Robert Kleinberg, Aleksandrs Slivkins, and Eli Upfal. Bandits and experts in metric spaces. JACM, 2019.
- Daniel Russo and Benjamin Van Roy. Eluder dimension and the sample complexity of optimistic exploration. NeurIPS, 2013.
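For the linear bandit references above (Dani-Hayes-Kakade; Abbasi-Yadkori, Szepesvari, and Pal; Chu et al.), the shared algorithmic template is optimism over a ridge-regression confidence set: estimate the unknown reward vector by regularized least squares and add an exploration bonus measuring the estimate's uncertainty in each action's direction. The following LinUCB/OFUL-style sketch uses a fixed confidence multiplier `beta` rather than the time-dependent confidence radius from the papers, and the `contexts`/`pull` interface is illustrative.

```python
import numpy as np


def linucb(contexts, pull, horizon, dim, lam=1.0, beta=1.0):
    """LinUCB/OFUL-style loop with a fixed confidence multiplier `beta`.

    `contexts(t)` returns a list of feature vectors (one per available action)
    at round t; `pull(x)` returns the stochastic reward of playing the action
    with features x. `lam` is the ridge regularizer.
    """
    A = lam * np.eye(dim)      # regularized design matrix: sum_s x_s x_s^T + lam * I
    b = np.zeros(dim)          # sum_s r_s x_s
    total = 0.0
    for t in range(horizon):
        A_inv = np.linalg.inv(A)
        theta_hat = A_inv @ b                  # ridge-regression estimate
        xs = contexts(t)
        # optimistic index: predicted reward + beta * ellipsoidal bonus
        scores = [x @ theta_hat + beta * np.sqrt(x @ A_inv @ x) for x in xs]
        x = xs[int(np.argmax(scores))]
        r = pull(x)
        A += np.outer(x, x)
        b += r * x
        total += r
    return total


# Example: rewards are linear in 5-dimensional features plus noise.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    theta_star = rng.normal(size=5)
    ctx = lambda t: [rng.normal(size=5) for _ in range(10)]
    rew = lambda x: float(x @ theta_star + 0.1 * rng.normal())
    print("average reward:", linucb(ctx, rew, horizon=2000, dim=5) / 2000)
```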
Reinforcement Learning
- Mohammad Gheshlaghi Azar, Ian Osband, and Rémi Munos. Minimax regret bounds for reinforcement learning. ICML, 2017.
- Nan Jiang, Akshay Krishnamurthy, Alekh Agarwal, John Langford, and Robert E. Schapire. Contextual decision processes with low Bellman rank are PAC learnable. ICML, 2017.
- Wen Sun, Nan Jiang, Akshay Krishnamurthy, Alekh Agarwal, and John Langford. Model-based RL in contextual decision processes: PAC bounds and exponential improvements over model-free approaches. COLT, 2019.
- Chi Jin, Zhuoran Yang, Zhaoran Wang, and Michael I. Jordan. Provably efficient reinforcement learning with linear function approximation. COLT, 2020.
- Ruosong Wang, Ruslan Salakhutdinov, and Lin F. Yang. Reinforcement learning with general value function approximation: Provably efficient approach via bounded eluder dimension. NeurIPS, 2020.
- Simon S. Du, Sham M. Kakade, Ruosong Wang, and Lin F. Yang. Is a good representation sufficient for sample efficient reinforcement learning? ICLR, 2020.
- Simon S. Du, Sham M. Kakade, Jason D. Lee, Shachar Lovett, Gaurav Mahajan, Wen Sun, and Ruosong Wang. Bilinear Classes: A Structural Framework for Provable Generalization in RL. ICML, 2021.
- Chi Jin, Qinghua Liu, and Sobhan Miryoosefi. Bellman Eluder Dimension: New Rich Classes of RL Problems, and Sample-Efficient Algorithms. NeurIPS, 2021.
- Gellert Weisz, Philip Amortila, and Csaba Szepesvári. Exponential lower bounds for planning in MDPs with linearly-realizable optimal action-value functions. ALT, 2021.
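In the tabular episodic setting of the Azar-Osband-Munos paper above, the same optimism principle appears as value iteration on the empirical model with count-based bonuses. The sketch below is a heavily simplified UCBVI-style loop: the generic `c / sqrt(count)` bonus and the `env_reset`/`env_step` interface are illustrative stand-ins, not the paper's Bernstein-style construction.

```python
import numpy as np


def ucbvi(env_reset, env_step, S, A, H, episodes, c=1.0):
    """Simplified UCBVI-style tabular RL with count-based bonuses.

    `env_reset()` returns an initial state in {0, ..., S-1}; `env_step(s, a)`
    returns (reward in [0, 1], next_state). The bonus c / sqrt(n) is generic.
    """
    counts = np.zeros((S, A))            # visit counts N(s, a)
    counts_sas = np.zeros((S, A, S))     # transition counts N(s, a, s')
    rew_sum = np.zeros((S, A))           # cumulative observed reward per (s, a)
    for _ in range(episodes):
        # Optimistic value iteration on the empirical model.
        n = np.maximum(counts, 1.0)
        p_hat = counts_sas / n[:, :, None]     # empirical transition kernel
        r_hat = rew_sum / n                    # empirical mean reward
        bonus = c / np.sqrt(n)                 # exploration bonus
        Q = np.zeros((H + 1, S, A))
        for h in reversed(range(H)):
            V_next = Q[h + 1].max(axis=1)
            Q[h] = np.minimum(r_hat + bonus + p_hat @ V_next, H)   # clip at H
        # Roll out the greedy (optimistic) policy for one episode.
        s = env_reset()
        for h in range(H):
            a = int(np.argmax(Q[h, s]))
            r, s_next = env_step(s, a)
            counts[s, a] += 1
            counts_sas[s, a, s_next] += 1
            rew_sum[s, a] += r
            s = s_next
    return counts
```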
Posterior Sampling
- William R. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 1933.
- Shipra Agrawal and Navin Goyal. Analysis of Thompson Sampling for the Multi-armed Bandit Problem. COLT, 2012.
- Daniel Russo and Benjamin Van Roy. An Information-Theoretic Analysis of Thompson Sampling. JMLR, 2016.
- Daniel Russo and Benjamin Van Roy. Learning to Optimize via Information-Directed Sampling. Operations Research, 2018.
- Tor Lattimore and Andras Gyorgy. Mirror Descent and the Information Ratio. COLT, 2021.
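For the posterior-sampling line of work above, the Thompson (1933) strategy is especially simple for Bernoulli rewards with independent Beta priors: sample a mean for each arm from its posterior and play the argmax. A minimal sketch, assuming Beta(1, 1) priors and binary rewards; the `pull` interface is illustrative.

```python
import random


def thompson_bernoulli(pull, n_arms, horizon):
    """Thompson sampling for Bernoulli bandits with Beta(1, 1) priors.

    `pull(a)` returns a reward in {0, 1} for arm `a`.
    """
    alpha = [1] * n_arms   # 1 + number of observed successes per arm
    beta = [1] * n_arms    # 1 + number of observed failures per arm
    total = 0
    for _ in range(horizon):
        # Sample a plausible mean for each arm from its Beta posterior,
        # then play the arm whose sampled mean is largest.
        samples = [random.betavariate(alpha[a], beta[a]) for a in range(n_arms)]
        arm = max(range(n_arms), key=lambda a: samples[a])
        r = pull(arm)
        alpha[arm] += r
        beta[arm] += 1 - r
        total += r
    return total


# Example: three Bernoulli arms.
if __name__ == "__main__":
    means = [0.2, 0.5, 0.6]
    avg = thompson_bernoulli(lambda a: int(random.random() < means[a]), 3, 10_000) / 10_000
    print(f"average reward: {avg:.3f}")
```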