Provably Efficient Reinforcement Learning with Linear Function Approximation

COLT, pp. 2137–2143, 2020.


Abstract:

Modern Reinforcement Learning (RL) is commonly applied to practical problems with an enormous number of states, where function approximation must be deployed to approximate either the value function or the policy. The introduction of function approximation raises a fundamental set of challenges involving computational and statistical efficiency. […]

Introduction
  • Reinforcement Learning (RL) is a control-theoretic problem in which an agent tries to maximize its expected cumulative reward by interacting with an unknown environment over time [41].
  • Function approximation introduces a bias, even in the limit of infinite training data, given that the optimal value function and policy may not be linear [see, e.g., 10, 11, 43].
  • Both in theory and in practice, the design of RL systems must cope with fundamental statistical problems of sparsity and misspecification, all in the context of a dynamical system.
  • The following fundamental question remains open: Is it possible to design provably efficient RL algorithms in the function approximation setting?
Highlights
  • Reinforcement Learning (RL) is a control-theoretic problem in which an agent tries to maximize its expected cumulative reward by interacting with an unknown environment over time [41]
  • Deep neural networks serve as essential components of generic deep Reinforcement Learning algorithms, including Deep Q-Network (DQN) [30], Asynchronous Advantage Actor-Critic (A3C) [31], and Trust Region Policy Optimization (TRPO) [36]
  • We first lay out our algorithm (Algorithm 1)—an optimistic modification of Least-Squares Value Iteration (LSVI), where the optimism is realized by Upper-Confidence Bounds (UCB); a Python sketch of this scheme follows this list
  • We have presented the first provable Reinforcement Learning algorithm with both polynomial runtime and polynomial sample complexity for linear Markov Decision Processes (MDPs), without requiring a “simulator” or additional assumptions
  • The algorithm is Least-Squares Value Iteration—a classical Reinforcement Learning algorithm commonly studied in the setting of linear function approximation—with an Upper-Confidence Bounds (UCB) bonus
  • We hope that our work may serve as a first step towards a better understanding of efficient Reinforcement Learning with function approximation
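
To make this concrete, the following is a minimal Python sketch of optimistic least-squares value iteration with a UCB bonus, in the spirit of Algorithm 1 but not the authors' code. The environment interface (env.reset(), env.step()), the feature map phi, the action list, and the constants lam and beta are hypothetical placeholders; the paper's analysis would set beta on the order of dH times a logarithmic factor.

import numpy as np

def lsvi_ucb(env, phi, d, H, K, actions, lam=1.0, beta=1.0):
    # Sketch of optimistic LSVI: each episode fits regularized least-squares
    # Q-estimates backward over the horizon and adds a UCB exploration bonus.
    data = [[] for _ in range(H)]          # (s, a, r, s') tuples seen at each step h

    def q_value(w, Lam_inv, s, a):
        f = phi(s, a)
        bonus = beta * np.sqrt(f @ Lam_inv @ f)   # UCB bonus beta * sqrt(phi' Lam^{-1} phi)
        return min(float(w @ f) + bonus, H)       # optimistic value, clipped at H

    for _ in range(K):
        w = [np.zeros(d) for _ in range(H + 1)]
        Lam_inv = [np.eye(d) / lam for _ in range(H + 1)]
        # Backward pass: ridge regression of (reward + next-step optimistic value) on features.
        for h in reversed(range(H)):
            Lam = lam * np.eye(d)
            target = np.zeros(d)
            for (s, a, r, s_next) in data[h]:
                f = phi(s, a)
                Lam += np.outer(f, f)
                v_next = 0.0
                if h + 1 < H:
                    v_next = max(q_value(w[h + 1], Lam_inv[h + 1], s_next, b) for b in actions)
                target += f * (r + v_next)
            Lam_inv[h] = np.linalg.inv(Lam)
            w[h] = Lam_inv[h] @ target
        # Forward pass: act greedily with respect to the optimistic Q-estimates.
        s = env.reset()
        for h in range(H):
            a = max(actions, key=lambda b: q_value(w[h], Lam_inv[h], s, b))
            s_next, r = env.step(a)
            data[h].append((s, a, r, s_next))
            s = s_next
    return w, Lam_inv

Each episode runs a backward ridge-regression pass over the horizon and then acts greedily with respect to the optimistic Q-estimates, which mirrors the exploration mechanism described in the bullets above.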
Results
  • The authors present the main results, which provide sample complexity guarantees for Algorithm 1 in the linear MDP setting (Theorem 3.1) and in a misspecified setting (Theorem 3.2).

    The authors first lay out the algorithm (Algorithm 1)—an optimistic modification of Least-Squares Value Iteration (LSVI), where the optimism is realized by Upper-Confidence Bounds (UCB).
  • The authors first present a definition for an approximate linear model.
  • Assumption B (ζ-Approximate Linear MDP). For any ζ ≤ 1, the authors say that MDP(S, A, H, P, r) is a ζ-approximate linear MDP with a feature map φ : S × A → Rd if, for any h ∈ [H], there exist d unknown measures μh = (μh(1), . . . , μh(d)) over S and an unknown vector θh ∈ Rd such that, for every (s, a) ∈ S × A, the transition kernel Ph(·|s, a) and the reward rh(s, a) are ζ-close to ⟨φ(s, a), μh(·)⟩ and ⟨φ(s, a), θh⟩, respectively (written out in the display after this list).
  • In other words, an MDP is a ζ-approximate linear MDP if there exists a linear MDP whose Markov transition dynamics and reward functions are close to its own.
  • The closeness between transition dynamics is measured in terms of total variation distance.
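
In display form, the condition in Assumption B reads as follows. This is a reconstruction from the bullets above (the measures μh and vectors θh are those of the underlying linear model), not a verbatim quote of the paper:

\[
\bigl\| P_h(\cdot \mid s,a) - \langle \phi(s,a), \mu_h(\cdot) \rangle \bigr\|_{\mathrm{TV}} \le \zeta
\quad \text{and} \quad
\bigl| r_h(s,a) - \langle \phi(s,a), \theta_h \rangle \bigr| \le \zeta
\qquad \text{for all } (s,a) \in \mathcal{S} \times \mathcal{A},\; h \in [H].
\]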
Conclusion
  • The authors have presented the first provable RL algorithm with both polynomial runtime and polynomial sample complexity for linear MDPs, without requiring a “simulator” or additional assumptions.
  • The algorithm is Least-Squares Value Iteration—a classical RL algorithm commonly studied in the setting of linear function approximation—with a UCB bonus.
  • On the optimal dependencies on d and H.
  • Theorem 3.1 claims the total regret to be upper bounded by Õ(√(d³H³T)).
  • One immediate question is what the optimal dependencies on d and H are.
  • We believe the √H difference between this lower bound and our upper bound is expected because the exploration bonus used in this paper is intrinsically “Hoeffding-type.” Using a “Bernstein-type” bonus can potentially help shave off one √H factor (the bonus form is written out in the display below)
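
For reference, the regret bound and the Hoeffding-type bonus behind it can be written as below; the exact constant in β is a sketch recalled from the paper's Algorithm 1 rather than a verbatim quote:

\[
\mathrm{Regret}(K) \le \widetilde{O}\bigl(\sqrt{d^3 H^3 T}\bigr),
\qquad
\mathrm{bonus}_h(s,a) = \beta \sqrt{\phi(s,a)^{\top} \Lambda_h^{-1} \phi(s,a)},
\qquad
\beta = O\bigl(dH\sqrt{\log(dT/p)}\bigr),
\]

where \(\Lambda_h\) is the regularized Gram matrix of the features observed at step \(h\) and \(p\) is the failure probability.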
Related work
  • Tabular RL: Tabular RL is well studied in both model-based [20, 33, 8, 17] and model-free settings [39, 22]. See also [24, 6, 7, 25, 37, 45] for a simplified setting with access to a “simulator” (also called a generative model), which is a strong oracle that allows the algorithm to query arbitrary state-action pairs and return the reward and the next state. The “simulator” significantly alleviates the difficulty of exploration, since a naive exploration strategy which queries all state-action pairs uniformly at random already leads to the most efficient algorithm for finding an optimal policy [7].

    In the episodic setting with nonstationary dynamics and no “simulators,” the best regrets achieved by existing model-based and model-free algorithms are Õ(√(H²SAT)) [8] and Õ(√(H³SAT)) [22], respectively, both of which (nearly) attain the minimax lower bound Ω(√(H²SAT)) [20, 32, 22]. Here S and A denote the numbers of states and actions, respectively. Although these algorithms are (nearly) minimax-optimal, they cannot cope with large state spaces, as their regret scales linearly in √S, where S is often exponentially large in practice [see, e.g., 30, 38, 23, 27]. Moreover, the minimax lower bound suggests that, information-theoretically, a large state space cannot be handled efficiently unless further problem-specific structure is exploited. Compared with this line of work, in the current paper we exploit the linear structure of the reward and transition functions and show that the regret of optimistic LSVI scales polynomially in the ambient dimension d rather than the number of states S (the bounds are compared side by side in the display below).
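
To make the comparison concrete, the tabular bounds above and the bound in this paper differ only in which structural quantity enters the rate (a side-by-side restatement of the numbers already quoted, not a new result):

\[
\underbrace{\widetilde{O}\bigl(\sqrt{H^2 S A T}\bigr)}_{\text{model-based, tabular [8]}}
\quad\text{and}\quad
\underbrace{\widetilde{O}\bigl(\sqrt{H^3 S A T}\bigr)}_{\text{model-free, tabular [22]}}
\qquad\text{versus}\qquad
\underbrace{\widetilde{O}\bigl(\sqrt{d^3 H^3 T}\bigr)}_{\text{linear MDP, this paper}},
\]

so the linear-MDP bound has no dependence on the number of states S or actions A.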
Funding
  • This work was supported in part by the DARPA program on Lifelong Learning Machines
References
  • [1] Y. Abbasi-Yadkori and C. Szepesvari. Regret bounds for the adaptive control of linear quadratic systems. In Conference on Learning Theory, pages 1–26, 2011.
  • [2] Y. Abbasi-Yadkori, D. Pal, and C. Szepesvari. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, pages 2312–2320, 2011.
  • [3] Y. Abbasi-Yadkori, N. Lazic, and C. Szepesvari. Model-free linear quadratic control via reduction to expert prediction. In International Conference on Artificial Intelligence and Statistics, pages 3108–3117, 2019.
  • [4] M. Abeille and A. Lazaric. Improved regret bounds for Thompson sampling in linear quadratic control problems. In International Conference on Machine Learning, pages 1–9, 2018.
  • [5] P. Auer. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3(Nov):397–422, 2002.
  • [6] M. G. Azar, R. Munos, M. Ghavamzadaeh, and H. J. Kappen. Speedy Q-learning. In Advances in Neural Information Processing Systems, 2011.
  • [7] M. G. Azar, R. Munos, and B. Kappen. On the sample complexity of reinforcement learning with a generative model. arXiv preprint arXiv:1206.6461, 2012.
  • [8] M. G. Azar, I. Osband, and R. Munos. Minimax regret bounds for reinforcement learning. In International Conference on Machine Learning, pages 263–272, 2017.
  • [9] K. Azizzadenesheli, E. Brunskill, and A. Anandkumar. Efficient exploration through Bayesian deep Q-networks. In 2018 Information Theory and Applications Workshop (ITA). IEEE, 2018.
  • [10] L. Baird. Residual algorithms: Reinforcement learning with function approximation. In International Conference on Machine Learning, pages 30–37, 1995.
  • [11] J. A. Boyan and A. W. Moore. Generalization in reinforcement learning: Safely approximating the value function. In Advances in Neural Information Processing Systems, pages 369–376, 1995.
  • [12] S. J. Bradtke and A. G. Barto. Linear least-squares algorithms for temporal difference learning. Machine Learning, 22(1-3):33–57, 1996.
  • [13] S. Bubeck and N. Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5(1):1–122, 2012.
  • [14] W. Chu, L. Li, L. Reyzin, and R. Schapire. Contextual bandits with linear payoff functions. In International Conference on Artificial Intelligence and Statistics, pages 208–214, 2011.
  • [15] A. Cohen, T. Koren, and Y. Mansour. Learning linear-quadratic regulators efficiently with only √T regret. arXiv preprint arXiv:1902.06223, 2019.
  • [16] V. Dani, T. P. Hayes, and S. M. Kakade. Stochastic linear optimization under bandit feedback. In Conference on Learning Theory, 2008.
  • [17] C. Dann, T. Lattimore, and E. Brunskill. Unifying PAC and regret: Uniform PAC bounds for episodic reinforcement learning. In Advances in Neural Information Processing Systems, pages 5713–5723, 2017.
  • [18] S. Dean, H. Mania, N. Matni, B. Recht, and S. Tu. Regret bounds for robust adaptive control of the linear quadratic regulator. In Advances in Neural Information Processing Systems, pages 4188–4197, 2018.
  • [19] S. S. Du, Y. Luo, R. Wang, and H. Zhang. Provably efficient Q-learning with function approximation via distribution shift error checking oracle. arXiv preprint arXiv:1906.06321, 2019.
  • [20] T. Jaksch, R. Ortner, and P. Auer. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11(4):1563–1600, 2010.
  • [21] N. Jiang, A. Krishnamurthy, A. Agarwal, J. Langford, and R. E. Schapire. Contextual decision processes with low Bellman rank are PAC-learnable. In International Conference on Machine Learning, pages 1704–1713, 2017.
  • [22] C. Jin, Z. Allen-Zhu, S. Bubeck, and M. I. Jordan. Is Q-learning provably efficient? In Advances in Neural Information Processing Systems, pages 4863–4873, 2018.
  • [23] J. Kober and J. Peters. Reinforcement learning in robotics: A survey. In Reinforcement Learning, pages 579–610.
  • [24] S. Koenig and R. G. Simmons. Complexity analysis of real-time reinforcement learning. In Association for the Advancement of Artificial Intelligence, pages 99–107, 1993.
  • [25] T. Lattimore and M. Hutter. PAC bounds for discounted MDPs. In International Conference on Algorithmic Learning Theory, pages 320–334, 2012.
  • [26] T. Lattimore and C. Szepesvari. Bandit Algorithms. Preprint, 2018.
  • [27] J. Li, W. Monroe, A. Ritter, M. Galley, J. Gao, and D. Jurafsky. Deep reinforcement learning for dialogue generation. arXiv preprint arXiv:1606.01541, 2016.
  • [28] L. Li, W. Chu, J. Langford, and R. E. Schapire. A contextual-bandit approach to personalized news article recommendation. In International Conference on World Wide Web, pages 661–670, 2010.
  • [29] F. S. Melo and M. I. Ribeiro. Q-learning with linear function approximation. In International Conference on Computational Learning Theory, pages 308–322.
  • [30] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
  • [31] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pages 1928–1937, 2016.
  • [32] I. Osband and B. Van Roy. On lower bounds for regret in reinforcement learning. arXiv preprint arXiv:1608.02732, 2016.
  • [33] I. Osband, B. Van Roy, and Z. Wen. Generalization and exploration via randomized value functions. arXiv preprint arXiv:1402.0635, 2014.
  • [34] M. L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, 2014.
  • [35] P. Rusmevichientong and J. N. Tsitsiklis. Linearly parameterized bandits. Mathematics of Operations Research, 35(2):395–411, 2010.
  • [36] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz. Trust region policy optimization. In International Conference on Machine Learning, pages 1889–1897, 2015.
  • [37] A. Sidford, M. Wang, X. Wu, and Y. Ye. Variance reduced value iteration and faster algorithms for solving Markov decision processes. In ACM-SIAM Symposium on Discrete Algorithms, pages 770–787, 2018.
  • [38] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.
  • [39] A. L. Strehl, L. Li, E. Wiewiora, J. Langford, and M. L. Littman. PAC model-free reinforcement learning. In International Conference on Machine Learning, pages 881–888, 2006.
  • [40] R. S. Sutton. Learning to predict by the methods of temporal differences. Machine Learning, 3(1):9–44, 1988.
  • [41] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, 2011.
  • [42] C. Szepesvari. Algorithms for reinforcement learning. Synthesis Lectures on Artificial Intelligence and Machine Learning, 4(1):1–103, 2010.
  • [43] J. N. Tsitsiklis and B. Van Roy. Analysis of temporal-difference learning with function approximation. In Advances in Neural Information Processing Systems, pages 1075–1081, 1997.
  • [44] R. Vershynin. Introduction to the non-asymptotic analysis of random matrices. arXiv preprint arXiv:1011.3027, 2010.
  • [45] M. J. Wainwright. Variance-reduced Q-learning is minimax optimal. arXiv preprint arXiv:1906.04697, 2019.
  • [46] T. Wang, W. Ye, D. Geng, and C. Rudin. Towards practical Lipschitz stochastic bandits. arXiv preprint arXiv:1901.09277, 2019.
  • [47] Z. Wen and B. Van Roy. Efficient exploration and value function generalization in deterministic systems. In Advances in Neural Information Processing Systems, pages 3021–3029, 2013.
  • [48] Z. Wen and B. Van Roy. Efficient reinforcement learning in deterministic systems with value function generalization. Mathematics of Operations Research, 42(3):762–782, 2017.
  • [49] L. Yang and M. Wang. Sample-optimal parametric Q-learning using linearly additive features. In International Conference on Machine Learning, pages 6995–7004, 2019.
  • [50] L. F. Yang and M. Wang. Reinforcement learning in feature space: Matrix bandit, kernels, and regret bound. arXiv preprint arXiv:1905.10389, 2019.
  • [51] X. Zhu and D. B. Dunson. Lipschitz bandit optimization with improved efficiency. arXiv preprint arXiv:1904.11131, 2019.