# Provably Efficient Reinforcement Learning with Linear Function Approximation

COLT, pp. 2137-2143, 2019.

Abstract:

Modern Reinforcement Learning (RL) is commonly applied to practical problems with an enormous number of states, where function approximation must be deployed to approximate either the value function or the policy. The introduction of function approximation raises a fundamental set of challenges involving computational and statistical efficiency.

Introduction

- Reinforcement Learning (RL) is a control-theoretic problem in which an agent tries to maximize its expected cumulative reward by interacting with an unknown environment over time [41].
- Restricting to linear function approximation introduces a bias, even in the limit of infinite training data, given that the optimal value function and policy may not be linear [see, e.g., 10, 11, 43]
- Both in theory and in practice, the design of RL systems must cope with fundamental statistical problems of sparsity and misspecification, all in the context of a dynamical system.
- The following fundamental question remains open: Is it possible to design provably efficient RL algorithms in the function approximation setting?

Highlights

- Reinforcement Learning (RL) is a control-theoretic problem in which an agent tries to maximize its expected cumulative reward by interacting with an unknown environment over time [41]
- Deep neural networks serve as essential components of generic deep Reinforcement Learning algorithms, including Deep Q-Network (DQN) [30], Asynchronous Advantage Actor-Critic (A3C) [31], and Trust Region Policy Optimization (TRPO) [36]
- We first lay out our algorithm (Algorithm 1)—an optimistic modification of Least-Squares Value Iteration (LSVI), where the optimism is realized by Upper-Confidence Bounds (UCB)
- We have presented the first provable Reinforcement Learning algorithm with both polynomial runtime and polynomial sample complexity for linear Markov Decision Process, without requiring a “simulator” or additional assumptions
- The algorithm is Least-Squares Value Iteration—a classical Reinforcement Learning algorithm commonly studied in the setting of linear function approximation—with an Upper-Confidence Bound (UCB) bonus
- We hope that our work may serve as a first step towards a better understanding of efficient Reinforcement Learning with function approximation
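The backward pass of optimistic LSVI described above—regularized least squares on observed transitions, plus a UCB bonus, truncated at H—can be sketched as follows. This is a minimal NumPy sketch, not the authors' implementation; the function interface and the tabular feature map used in the test are illustrative assumptions.

```python
import numpy as np

def lsvi_ucb_episode(data, phi, S, A, H, beta, lam=1.0):
    """One planning pass of optimistic LSVI (a sketch of Algorithm 1).

    data[h] is a list of observed transitions (s, a, r, s_next) at step h;
    phi(s, a) returns the d-dimensional feature vector.
    """
    d = phi(0, 0).shape[0]
    V_next = np.zeros(S)            # V_{H+1} = 0
    Q = np.zeros((H, S, A))
    for h in reversed(range(H)):
        Lam = lam * np.eye(d)       # regularized Gram matrix Lambda_h
        target = np.zeros(d)
        for (s, a, r, s_next) in data[h]:
            f = phi(s, a)
            Lam += np.outer(f, f)
            target += f * (r + V_next[s_next])
        Lam_inv = np.linalg.inv(Lam)
        w = Lam_inv @ target        # least-squares weights w_h
        for s in range(S):
            for a in range(A):
                f = phi(s, a)
                bonus = beta * np.sqrt(f @ Lam_inv @ f)  # UCB exploration bonus
                Q[h, s, a] = min(f @ w + bonus, H)       # optimistic, truncated at H
        V_next = Q[h].max(axis=1)   # greedy value V_h
    return Q
```

The bonus β·√(φᵀΛ⁻¹φ) is large exactly where the feature directions have been visited rarely, which is what drives exploration without enumerating states.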

Results

- The authors present the main results, which provide sample complexity guarantees for Algorithm 1 in the linear MDP setting (Theorem 3.1) and in a misspecified setting (Theorem 3.2).

The authors first lay out the algorithm (Algorithm 1)—an optimistic modification of Least-Squares Value Iteration (LSVI), where the optimism is realized by Upper-Confidence Bounds (UCB).

- They then present a definition of an approximate linear model.
- Assumption B (ζ-Approximate Linear MDP).
- For any ζ ≤ 1, the authors say that MDP(S, A, H, P, r) is a ζ-approximate linear MDP with a feature map φ : S × A → R^d if, for any h ∈ [H], there exist d unknown measures μ_h = (μ_h^(1), …, μ_h^(d)) over S and an unknown vector θ_h ∈ R^d under which the transition dynamics and rewards are ζ-close to those of a linear MDP.
- An MDP is a ζ-approximate linear MDP if there exists a linear MDP such that their Markov transition dynamics and reward functions are close.
- The closeness between transition dynamics is measured in terms of total variation distance
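The closeness condition on the transition dynamics can be made concrete on a finite state space. The sketch below checks the transition half of the ζ-closeness condition; the function names and the one-hot setup in the test are illustrative assumptions, not the paper's notation.

```python
import numpy as np

def tv_distance(p, q):
    """Total variation distance between two distributions on a finite state space."""
    return 0.5 * np.abs(np.asarray(p) - np.asarray(q)).sum()

def is_zeta_approximate(P, phi, mu, zeta):
    """Check that every transition distribution is within zeta (in TV) of
    its linear surrogate.

    P[s][a] is the true next-state distribution; phi(s, a) is the feature
    vector; mu is a (d, S) matrix whose rows are the measures mu^(1..d),
    so the linear surrogate is phi(s, a) @ mu.
    """
    for s in range(len(P)):
        for a in range(len(P[s])):
            if tv_distance(P[s][a], phi(s, a) @ mu) > zeta:
                return False
    return True
```

An exact linear MDP passes with ζ = 0; perturbing any transition row by mass ε makes the TV distance ε and fails the check for ζ < ε.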

Conclusion

- The authors have presented the first provable RL algorithm with both polynomial runtime and polynomial sample complexity for linear MDPs, without requiring a “simulator” or additional assumptions.
- The algorithm is Least-Squares Value Iteration—a classical RL algorithm commonly studied in the setting of linear function approximation—with a UCB bonus.
- On the optimal dependencies on d and H: Theorem 3.1 claims the total regret to be upper bounded by Õ(√(d³H³T)).
- One immediate question is what the optimal dependencies on d and H are.
- We believe the √H difference between this lower bound and our upper bound is expected, because the exploration bonus used in this paper is intrinsically “Hoeffding-type.” Using a “Bernstein-type” bonus can potentially help shave off one √H factor.
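In symbols, the regret bound and the Hoeffding-type bonus discussed above take the following form (reconstructed from the statements in this summary; the precise form of the confidence width β is our reading of the usual self-normalized concentration argument, not quoted from the paper):

```latex
\[
\mathrm{Regret}(T) \;\le\; \tilde{O}\!\left(\sqrt{d^3 H^3 T}\right),
\qquad
b_h(s,a) \;=\; \beta \sqrt{\phi(s,a)^\top \Lambda_h^{-1}\, \phi(s,a)},
\qquad
\beta \;=\; O\!\left(dH\sqrt{\log(dT/p)}\right),
\]
```

where Λ_h is the regularized Gram matrix of the features observed at step h and p is the failure probability. A Bernstein-type bonus would replace the worst-case range H in β with a variance term, which is the source of the potential √H saving.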


Related work

- Tabular RL: Tabular RL is well studied in both model-based [20, 33, 8, 17] and model-free settings [39, 22]. See also [24, 6, 7, 25, 37, 45] for a simplified setting with access to a “simulator” (also called a generative model), which is a strong oracle that allows the algorithm to query arbitrary state-action pairs and return the reward and the next state. The “simulator” significantly alleviates the difficulty of exploration, since a naive exploration strategy which queries all state-action pairs uniformly at random already leads to the most efficient algorithm for finding an optimal policy [7].

In the episodic setting with nonstationary dynamics and no “simulators,” the best regrets achieved by existing model-based and model-free algorithms are Õ(√(H²SAT)) [8] and Õ(√(H³SAT)) [22], respectively, both of which (nearly) attain the minimax lower bound Ω(√(H²SAT)) [20, 32, 22]. Here S and A denote the numbers of states and actions, respectively. Although these algorithms are (nearly) minimax-optimal, they cannot cope with large state spaces, as their regret scales linearly in √S, where S is often exponentially large in practice [see, e.g., 30, 38, 23, 27]. Moreover, the minimax lower bound suggests that, information-theoretically, a large state space cannot be handled efficiently unless further problem-specific structure is exploited. Compared with this line of work, in the current paper we exploit the linear structure of the reward and transition functions and show that the regret of optimistic LSVI scales polynomially in the ambient dimension d rather than the number of states S.
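The contrast drawn above—regret polynomial in the feature dimension d rather than the state count S—can be illustrated with back-of-the-envelope numbers. This is a sketch with logarithmic factors and constants dropped; the problem sizes are illustrative, not taken from the paper.

```python
import math

def tabular_regret(S, A, H, T):
    """Minimax tabular rate, sqrt(H^2 * S * A * T), log factors dropped."""
    return math.sqrt(H**2 * S * A * T)

def linear_mdp_regret(d, H, T):
    """Linear-MDP rate from Theorem 3.1, sqrt(d^3 * H^3 * T), log factors dropped."""
    return math.sqrt(d**3 * H**3 * T)

# Illustrative sizes: a huge state space but a modest feature dimension.
S, A, H, T, d = 10**9, 10, 100, 10**6, 100
print(tabular_regret(S, A, H, T) > linear_mdp_regret(d, H, T))
```

With these sizes the tabular rate is on the order of 10¹⁰ while the feature-based rate is on the order of 10⁹, and the gap widens as S grows while d stays fixed.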

Funding

- This work was supported in part by the DARPA program on Lifelong Learning Machines

References

- [1] Y. Abbasi-Yadkori and C. Szepesvari. Regret bounds for the adaptive control of linear quadratic systems. In Conference on Learning Theory, pages 1–26, 2011.
- [2] Y. Abbasi-Yadkori, D. Pal, and C. Szepesvari. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, pages 2312–2320, 2011.
- [3] Y. Abbasi-Yadkori, N. Lazic, and C. Szepesvari. Model-free linear quadratic control via reduction to expert prediction. In International Conference on Artificial Intelligence and Statistics, pages 3108–3117, 2019.
- [4] M. Abeille and A. Lazaric. Improved regret bounds for Thompson sampling in linear quadratic control problems. In International Conference on Machine Learning, pages 1–9, 2018.
- [5] P. Auer. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3(Nov):397–422, 2002.
- [6] M. G. Azar, R. Munos, M. Ghavamzadeh, and H. J. Kappen. Speedy Q-learning. In Advances in Neural Information Processing Systems, 2011.
- [7] M. G. Azar, R. Munos, and B. Kappen. On the sample complexity of reinforcement learning with a generative model. arXiv preprint arXiv:1206.6461, 2012.
- [8] M. G. Azar, I. Osband, and R. Munos. Minimax regret bounds for reinforcement learning. In International Conference on Machine Learning, pages 263–272, 2017.
- [9] K. Azizzadenesheli, E. Brunskill, and A. Anandkumar. Efficient exploration through Bayesian deep Q-networks. In 2018 Information Theory and Applications Workshop (ITA). IEEE, 2018.
- [10] L. Baird. Residual algorithms: Reinforcement learning with function approximation. In International Conference on Machine Learning, pages 30–37, 1995.
- [11] J. A. Boyan and A. W. Moore. Generalization in reinforcement learning: Safely approximating the value function. In Advances in Neural Information Processing Systems, pages 369–376, 1995.
- [12] S. J. Bradtke and A. G. Barto. Linear least-squares algorithms for temporal difference learning. Machine Learning, 22(1-3):33–57, 1996.
- [13] S. Bubeck and N. Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5(1):1–122, 2012.
- [14] W. Chu, L. Li, L. Reyzin, and R. Schapire. Contextual bandits with linear payoff functions. In International Conference on Artificial Intelligence and Statistics, pages 208–214, 2011.
- [15] A. Cohen, T. Koren, and Y. Mansour. Learning linear-quadratic regulators efficiently with only √T regret. arXiv preprint arXiv:1902.06223, 2019.
- [16] V. Dani, T. P. Hayes, and S. M. Kakade. Stochastic linear optimization under bandit feedback. In Conference on Learning Theory, 2008.
- [17] C. Dann, T. Lattimore, and E. Brunskill. Unifying PAC and regret: Uniform PAC bounds for episodic reinforcement learning. In Advances in Neural Information Processing Systems, pages 5713–5723, 2017.
- [18] S. Dean, H. Mania, N. Matni, B. Recht, and S. Tu. Regret bounds for robust adaptive control of the linear quadratic regulator. In Advances in Neural Information Processing Systems, pages 4188–4197, 2018.
- [19] S. S. Du, Y. Luo, R. Wang, and H. Zhang. Provably efficient Q-learning with function approximation via distribution shift error checking oracle. arXiv preprint arXiv:1906.06321, 2019.
- [20] T. Jaksch, R. Ortner, and P. Auer. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11(4):1563–1600, 2010.
- [21] N. Jiang, A. Krishnamurthy, A. Agarwal, J. Langford, and R. E. Schapire. Contextual decision processes with low Bellman rank are PAC-learnable. In International Conference on Machine Learning, pages 1704–1713, 2017.
- [22] C. Jin, Z. Allen-Zhu, S. Bubeck, and M. I. Jordan. Is Q-learning provably efficient? In Advances in Neural Information Processing Systems, pages 4863–4873, 2018.
- [23] J. Kober and J. Peters. Reinforcement learning in robotics: A survey. In Reinforcement Learning, pages 579–610.
- [24] S. Koenig and R. G. Simmons. Complexity analysis of real-time reinforcement learning. In Association for the Advancement of Artificial Intelligence, pages 99–107, 1993.
- [25] T. Lattimore and M. Hutter. PAC bounds for discounted MDPs. In International Conference on Algorithmic Learning Theory, pages 320–334, 2012.
- [26] T. Lattimore and C. Szepesvari. Bandit algorithms. preprint, 2018.
- [27] J. Li, W. Monroe, A. Ritter, M. Galley, J. Gao, and D. Jurafsky. Deep reinforcement learning for dialogue generation. arXiv preprint arXiv:1606.01541, 2016.
- [28] L. Li, W. Chu, J. Langford, and R. E. Schapire. A contextual-bandit approach to personalized news article recommendation. In International Conference on World Wide Web, pages 661–670, 2010.
- [29] F. S. Melo and M. I. Ribeiro. Q-learning with linear function approximation. In International Conference on Computational Learning Theory, pages 308–322.
- [30] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
- [31] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pages 1928–1937, 2016.
- [32] I. Osband and B. Van Roy. On lower bounds for regret in reinforcement learning. arXiv preprint arXiv:1608.02732, 2016.
- [33] I. Osband, B. Van Roy, and Z. Wen. Generalization and exploration via randomized value functions. arXiv preprint arXiv:1402.0635, 2014.
- [34] M. L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, 2014.
- [35] P. Rusmevichientong and J. N. Tsitsiklis. Linearly parameterized bandits. Mathematics of Operations Research, 35(2):395–411, 2010.
- [36] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz. Trust region policy optimization. In International Conference on Machine Learning, pages 1889–1897, 2015.
- [37] A. Sidford, M. Wang, X. Wu, and Y. Ye. Variance reduced value iteration and faster algorithms for solving Markov decision processes. In ACM-SIAM Symposium on Discrete Algorithms, pages 770–787, 2018.
- [38] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.
- [39] A. L. Strehl, L. Li, E. Wiewiora, J. Langford, and M. L. Littman. PAC model-free reinforcement learning. In International Conference on Machine Learning, pages 881–888, 2006.
- [40] R. S. Sutton. Learning to predict by the methods of temporal differences. Machine Learning, 3(1): 9–44, 1988.
- [41] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, 2011.
- [42] C. Szepesvari. Algorithms for reinforcement learning. Synthesis Lectures on Artificial Intelligence and Machine Learning, 4(1):1–103, 2010.
- [43] J. N. Tsitsiklis and B. Van Roy. Analysis of temporal-difference learning with function approximation. In Advances in Neural Information Processing Systems, pages 1075–1081, 1997.
- [44] R. Vershynin. Introduction to the non-asymptotic analysis of random matrices. arXiv preprint arXiv:1011.3027, 2010.
- [45] M. J. Wainwright. Variance-reduced Q-learning is minimax optimal. arXiv preprint arXiv:1906.04697, 2019.
- [46] T. Wang, W. Ye, D. Geng, and C. Rudin. Towards practical Lipschitz stochastic bandits. arXiv preprint arXiv:1901.09277, 2019.
- [47] Z. Wen and B. Van Roy. Efficient exploration and value function generalization in deterministic systems. In Advances in Neural Information Processing Systems, pages 3021–3029, 2013.
- [48] Z. Wen and B. Van Roy. Efficient reinforcement learning in deterministic systems with value function generalization. Mathematics of Operations Research, 42(3):762–782, 2017.
- [49] L. Yang and M. Wang. Sample-optimal parametric Q-learning using linearly additive features. In International Conference on Machine Learning, pages 6995–7004, 2019.
- [50] L. F. Yang and M. Wang. Reinforcement leaning in feature space: Matrix bandit, kernels, and regret bound. arXiv preprint arXiv:1905.10389, 2019.
- [51] X. Zhu and D. B. Dunson. Lipschitz bandit optimization with improved efficiency. arXiv preprint arXiv:1904.11131, 2019.
