# Provably Efficient Reinforcement Learning with Kernel and Neural Function Approximations

NeurIPS 2020.


Abstract:

Reinforcement learning (RL) algorithms combined with modern function approximators such as kernel functions and deep neural networks have achieved significant empirical successes in large-scale application problems with a massive number of states. From a theoretical perspective, however, RL with function approximation poses a fundamental...


Introduction

- Reinforcement learning (RL) algorithms combined with modern function approximators such as kernel functions and deep neural networks have achieved tremendous empirical successes in a variety of application problems [e.g., 27, 60, 61, 72, 70].
- Most existing provably efficient RL algorithms are applicable only to the tabular setting [see, e.g., 33, 52, 6, 35, 50, 56], where both the state and action spaces are discrete and the value function can be represented as a table.
- The authors introduce the optimistic least-squares value iteration algorithm, where the action-value functions are estimated using a class of functions defined on Z = S × A.
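As a purely illustrative sketch of this template in the linear special case (where the function class consists of linear functions of a feature map φ(s, a)), a single optimistic least-squares value estimate can be formed by ridge regression plus an exploration bonus. All data, dimensions, and parameter values below are synthetic stand-ins, not the paper's exact algorithm:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins: features phi(s, a) in R^d for visited state-action
# pairs, with regression targets r + V_{h+1}(s') (all made up for illustration).
d, n = 4, 50
Phi = rng.normal(size=(n, d))        # feature matrix of visited (s, a) pairs
y = rng.uniform(size=n)              # targets r + V_{h+1}(s')
lam, beta = 1.0, 1.0                 # ridge parameter and bonus scale

# Least-squares value fit: w = (Phi^T Phi + lam I)^{-1} Phi^T y
Lambda = Phi.T @ Phi + lam * np.eye(d)
w = np.linalg.solve(Lambda, Phi.T @ y)

def q_optimistic(phi_z):
    """Optimistic action-value: ridge estimate plus an exploration bonus
    beta * sqrt(phi^T Lambda^{-1} phi), truncated at the maximum value 1."""
    bonus = beta * np.sqrt(phi_z @ np.linalg.solve(Lambda, phi_z))
    return min(float(phi_z @ w) + float(bonus), 1.0)

q = q_optimistic(rng.normal(size=d))
```

Running this fit once per step h, backwards from h = H to h = 1, and acting greedily with respect to the resulting optimistic Q-functions gives the overall value-iteration structure.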

Highlights

- Reinforcement learning (RL) algorithms combined with modern function approximators such as kernel functions and deep neural networks have achieved tremendous empirical successes in a variety of application problems [e.g., 27, 60, 61, 72, 70]
- From a theoretical perspective, function approximation brings statistical estimation into the scope of RL, requiring the algorithm to simultaneously balance the bias-variance tradeoff innate to statistical estimation and the exploration-exploitation tradeoff inherent to RL
- Most existing provably efficient RL algorithms are applicable only to the tabular setting [see, e.g., 33, 52, 6, 35, 50, 56], where both the state and action spaces are discrete and the value function can be represented as a table
- We have presented the algorithmic framework of optimistic least-squares value iteration for RL with general function approximation, in which a bonus term is added to the solution of each least-squares value-estimation problem to promote exploration
- Reinforcement learning is a tool that is increasingly used in practical machine learning applications, especially in the setting where nonlinear function approximation is involved
- Theoretical explorations related to reinforcement learning with function approximation may help provide frameworks through which to reason about, and design safer and more reliable practical systems

Results

- Algorithm 1 Optimistic Least-Squares Value Iteration with Function Approximation
- The authors consider the case where function class F is an RKHS H with kernel K.
- KOVI reduces to the LSVI-UCB algorithm proposed in [36] for linear value functions.
- The authors further propose the Neural Optimistic Least-Squares Value Iteration (NOVI) algorithm.
- In §4, on the theory of Kernel Optimistic Least-Squares Value Iteration, the authors prove that KOVI achieves an O(δ_H · H²√T) regret bound, where δ_H characterizes the intrinsic complexity of the RKHS H that is used to approximate {Q*_h}_{h∈[H]}.
- This complexity is measured with respect to the ℓ∞-norm on Z; it is determined by the spectral structure of H and characterizes the complexity of the value functions constructed by KOVI.
- It hinges on (i) Assumption 4.1, which postulates that the RKHS-norm ball {f ∈ H : ‖f‖_H ≤ R_Q H} contains the image of the Bellman operator, and (ii) the requirement that the inequality in (4.5) admits a solution B_T, which is the value used in Algorithm 2.
- The bonus parameter is set sufficiently large so as to dominate the uncertainty of the estimate Q̂ᵗ_h; to quantify this uncertainty, the authors utilize uniform concentration over the value function class Q^ucb(h + 1, R_T, B_T), whose complexity metric, the ℓ∞-covering number, in turn depends on B_T.
- Under the γ-exponential eigenvalue-decay condition, as shown in §H, the log-covering number and the effective dimension are bounded by (log T)^{1+2/γ} and (log T)^{1+1/γ}, respectively.
- Such a regret is a d^{1/2} factor worse than that in [62] for kernel contextual bandits, which is due to bounding the log-covering number.
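In the kernel setting, the least-squares fit becomes kernel ridge regression and the bonus scales with the posterior-variance-style uncertainty k(z, z) − k_z^T (K + λI)^{-1} k_z. The sketch below is only illustrative: the RBF kernel, the synthetic data, and the bonus scale β are stand-ins, and the paper's Algorithm 2 sets the actual constants and truncation:

```python
import numpy as np

def rbf(X, Y, s=1.0):
    """RBF kernel matrix between the rows of X and the rows of Y."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * s ** 2))

rng = np.random.default_rng(1)
Z = rng.uniform(size=(30, 2))   # visited state-action pairs z = (s, a)
y = np.sin(Z.sum(axis=1))       # regression targets r + V_{h+1}(s') (synthetic)
lam, beta = 1.0, 1.0            # ridge parameter and bonus scale (illustrative)

K = rbf(Z, Z)
Kinv = np.linalg.inv(K + lam * np.eye(len(Z)))  # (K + lam I)^{-1}

def q_ucb(z):
    """Kernel ridge estimate plus an uncertainty bonus beta * sigma(z),
    where sigma^2(z) = k(z, z) - k_z^T (K + lam I)^{-1} k_z."""
    kz = rbf(z[None, :], Z)[0]
    mean = kz @ Kinv @ y
    var = rbf(z[None, :], z[None, :])[0, 0] - kz @ Kinv @ kz
    return mean + beta * np.sqrt(max(var, 0.0))

q = q_ucb(np.array([0.5, 0.5]))
```

The bonus shrinks near well-explored state-action pairs and grows far from the data, which is exactly the optimism mechanism the regret analysis exploits.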

Conclusion

- The authors have presented the algorithmic framework of optimistic least-squares value iteration for RL with general function approximation, in which a bonus term is added to the solution of each least-squares value-estimation problem to promote exploration.
- To the best of the authors' knowledge, KOVI and NOVI are the first provably efficient RL algorithms under the settings of kernel and neural function approximation, respectively.
- Theoretical explorations related to reinforcement learning with function approximation may help provide frameworks through which to reason about, and design safer and more reliable practical systems

Summary


- Table 1: Summary of the main results. Here H is the length of each episode, T is the total number of episodes, and 2m is the number of neurons of the overparameterized networks in the neural setting. For the kernel and neural settings, d_eff denotes the effective dimension of the RKHS H and of the neural tangent kernel, respectively, and N_∞(ε*) is the ℓ∞-covering number of the value function class, where ε* = H/T. To obtain concrete bounds, the general result is also applied to RKHSs with two specific eigenvalue-decay conditions: d-finite spectrum and γ-exponential decay.
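The effective dimension in Table 1 is governed by the eigenvalue decay of the kernel (or neural tangent kernel) Gram matrix. As an illustrative numerical check, assuming the common definitions d_eff = Σ_j λ_j/(λ_j + λ) and information gain ½ log det(I + K/λ) (which may differ in constants and normalization from the paper's exact definition), both can be computed directly:

```python
import numpy as np

def rbf(X, s=1.0):
    """RBF kernel Gram matrix of the rows of X."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * s ** 2))

rng = np.random.default_rng(2)
K = rbf(rng.uniform(size=(100, 2)))   # Gram matrix on T = 100 synthetic points
lam = 1.0                             # regularization parameter (illustrative)
eig = np.linalg.eigvalsh(K)           # eigenvalues of the Gram matrix

# Effective dimension: sum_j lambda_j / (lambda_j + lam). Fast eigenvalue
# decay makes this far smaller than the number of data points T.
d_eff = float((eig / (eig + lam)).sum())

# Closely related information-gain quantity: (1/2) log det(I + K / lam),
# computed via the eigenvalues (clipped at 0 for numerical safety).
info_gain = 0.5 * float(np.log1p(np.clip(eig, 0.0, None) / lam).sum())
```

Smooth kernels such as the RBF kernel have rapidly decaying spectra, so d_eff stays far below T, which is what makes the regret bounds in Table 1 nontrivial.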

Related work

- Due to the space limit, the discussion of related work is deferred to §A in the appendix.

Funding

- Mengdi Wang gratefully acknowledges funding from the U.S. National Science Foundation (NSF) grant CMMI-1653435, Air Force Office of Scientific Research (AFOSR) grant FA9550-19-1-020, and C3.ai DTI.

References

- [1] Y. Abbasi-Yadkori, D. Pal, and C. Szepesvari. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, pages 2312–2320, 2011.
- [2] Z. Allen-Zhu, Y. Li, and Y. Liang. Learning and generalization in overparameterized neural networks, going beyond two layers. arXiv preprint arXiv:1811.04918, 2018.
- [3] Z. Allen-Zhu, Y. Li, and Z. Song. A convergence theory for deep learning via overparameterization. arXiv preprint arXiv:1811.03962, 2018.
- [4] S. Arora, S. S. Du, W. Hu, Z. Li, and R. Wang. Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. arXiv preprint arXiv:1901.08584, 2019.
- [5] A. Ayoub, Z. Jia, C. Szepesvari, M. Wang, and L. F. Yang. Model-based reinforcement learning with value-targeted regression. arXiv preprint arXiv:2006.01107, 2020.
- [6] M. G. Azar, I. Osband, and R. Munos. Minimax regret bounds for reinforcement learning. In International Conference on Machine Learning, pages 263–272, 2017.
- [7] K. Azuma. Weighted sums of certain dependent random variables. Tohoku Mathematical Journal, Second Series, 19(3):357–367, 1967.
- [8] F. Bach. Breaking the curse of dimensionality with convex neural networks. The Journal of Machine Learning Research, 18(1):629–681, 2017.
- [9] Y. Bai and J. D. Lee. Beyond linearization: On quadratic and higher-order approximation of wide neural networks. arXiv preprint arXiv:1910.01619, 2019.
- [10] S. J. Bradtke and A. G. Barto. Linear least-squares algorithms for temporal difference learning. Machine Learning, 22(1-3):33–57, 1996.
- [11] S. Bubeck and N. Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5(1):1–122, 2012.
- [12] Q. Cai, Z. Yang, C. Jin, and Z. Wang. Provably efficient exploration in policy optimization. arXiv preprint arXiv:1912.05830, 2019.
- [13] Q. Cai, Z. Yang, J. D. Lee, and Z. Wang. Neural temporal-difference learning converges to global optima. In Advances in Neural Information Processing Systems, pages 11312–11322, 2019.
- [14] D. Calandriello, L. Carratino, A. Lazaric, M. Valko, and L. Rosasco. Gaussian process optimization with adaptive sketching: Scalable and no regret. arXiv preprint arXiv:1903.05594, 2019.
- [15] Y. Cao and Q. Gu. Generalization bounds of stochastic gradient descent for wide and deep neural networks. arXiv preprint arXiv:1905.13210, 2019.
- [16] Y. Cao and Q. Gu. A generalization theory of gradient descent for learning over-parameterized deep ReLU networks. arXiv preprint arXiv:1902.01384, 2019.
- [17] L. Chizat and F. Bach. A note on lazy training in supervised differentiable programming. arXiv preprint arXiv:1812.07956, 2018.
- [18] S. R. Chowdhury and A. Gopalan. On kernelized multi-armed bandits. In Proceedings of the 34th International Conference on Machine Learning, pages 844–853, 2017.
- [19] A. Daniely. SGD learns the conjugate kernel class of the network. In Advances in Neural Information Processing Systems, 2017.
- [20] C. Dann, N. Jiang, A. Krishnamurthy, A. Agarwal, J. Langford, and R. E. Schapire. On oracle-efficient PAC RL with rich observations. In Advances in Neural Information Processing Systems, pages 1422–1432, 2018.
- [21] C. Dann, T. Lattimore, and E. Brunskill. Unifying PAC and regret: Uniform PAC bounds for episodic reinforcement learning. In Advances in Neural Information Processing Systems, pages 5713–5723, 2017.
- [22] K. Dong, J. Peng, Y. Wang, and Y. Zhou. √n-regret for learning in Markov decision processes with function approximation and low Bellman rank. arXiv preprint arXiv:1909.02506, 2019.
- [23] S. S. Du, S. M. Kakade, R. Wang, and L. F. Yang. Is a good representation sufficient for sample efficient reinforcement learning? arXiv preprint arXiv:1910.03016, 2019.
- [24] S. S. Du, A. Krishnamurthy, N. Jiang, A. Agarwal, M. Dudik, and J. Langford. Provably efficient RL with rich observations via latent state decoding. arXiv preprint arXiv:1901.09018, 2019.
- [25] S. S. Du, J. D. Lee, H. Li, L. Wang, and X. Zhai. Gradient descent finds global minima of deep neural networks. arXiv preprint arXiv:1811.03804, 2018.
- [26] S. S. Du, X. Zhai, B. Poczos, and A. Singh. Gradient descent provably optimizes over-parameterized neural networks. arXiv preprint arXiv:1810.02054, 2018.
- [27] Y. Duan, X. Chen, R. Houthooft, J. Schulman, and P. Abbeel. Benchmarking deep reinforcement learning for continuous control. In International Conference on Machine Learning, pages 1329–1338, 2016.
- [28] A. Durand, O.-A. Maillard, and J. Pineau. Streaming kernel regression with provably adaptive mean, variance, and regularization. The Journal of Machine Learning Research, 19(1):650–683, 2018.
- [29] Y. Efroni, L. Shani, A. Rosenberg, and S. Mannor. Optimistic policy optimization with bandit feedback. arXiv preprint arXiv:2002.08243, 2020.
- [30] R. Gao, T. Cai, H. Li, C.-J. Hsieh, L. Wang, and J. D. Lee. Convergence of adversarial training in overparametrized neural networks. In Advances in Neural Information Processing Systems, pages 13009–13020, 2019.
- [31] T. Hofmann, B. Scholkopf, and A. J. Smola. Kernel methods in machine learning. The Annals of Statistics, pages 1171–1220, 2008.
- [32] A. Jacot, F. Gabriel, and C. Hongler. Neural tangent kernel: Convergence and generalization in neural networks. In Advances in Neural Information Processing Systems, 2018.
- [33] T. Jaksch, R. Ortner, and P. Auer. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11(4):1563–1600, 2010.
- [34] N. Jiang, A. Krishnamurthy, A. Agarwal, J. Langford, and R. E. Schapire. Contextual decision processes with low Bellman rank are PAC-learnable. In Proceedings of the 34th International Conference on Machine Learning, pages 1704–1713, 2017.
- [35] C. Jin, Z. Allen-Zhu, S. Bubeck, and M. I. Jordan. Is Q-learning provably efficient? In Advances in Neural Information Processing Systems, pages 4863–4873, 2018.
- [36] C. Jin, Z. Yang, Z. Wang, and M. I. Jordan. Provably efficient reinforcement learning with linear function approximation. arXiv preprint arXiv:1907.05388, 2019.
- [37] S. Kakade, A. Krishnamurthy, K. Lowrey, M. Ohnishi, and W. Sun. Information theoretic regret bounds for online nonlinear control. arXiv preprint arXiv:2006.12466, 2020.
- [38] A. Krause and C. S. Ong. Contextual Gaussian process bandit optimization. In Advances in Neural Information Processing Systems, pages 2447–2455, 2011.
- [39] A. Krishnamurthy, A. Agarwal, and J. Langford. PAC reinforcement learning with rich observations. In Advances in Neural Information Processing Systems, pages 1840–1848, 2016.
- [40] J. Lafferty and G. Lebanon. Diffusion kernels on statistical manifolds. Journal of Machine Learning Research, 6(Jan):129–163, 2005.
- [41] T. Lattimore and C. Szepesvari. Bandit Algorithms. Preprint, 2018.
- [42] T. Lattimore and C. Szepesvari. Learning with good feature representations in bandits and in RL with a generative model. arXiv preprint arXiv:1911.07676, 2019.
- [43] J. Lee, L. Xiao, S. S. Schoenholz, Y. Bahri, J. Sohl-Dickstein, and J. Pennington. Wide neural networks of any depth evolve as linear models under gradient descent. arXiv preprint arXiv:1902.06720, 2019.
- [44] Y. Li and Y. Liang. Learning overparameterized neural networks via stochastic gradient descent on structured data. In Advances in Neural Information Processing Systems, 2018.
- [45] J. Lu, G. Cheng, and H. Liu. Nonparametric heterogeneity testing for massive data. arXiv preprint arXiv:1601.06212, 2016.
- [46] S. Mendelson and J. Neeman. Regularization in kernel learning. The Annals of Statistics, 38(1):526–565, 2010.
- [47] H. Q. Minh, P. Niyogi, and Y. Yao. Mercer's theorem, feature maps, and smoothing. In International Conference on Computational Learning Theory, pages 154–168.
- [49] B. Neyshabur and Z. Li. Towards understanding the role of over-parametrization in generalization of neural networks. In International Conference on Learning Representations (ICLR), 2019.
- [50] I. Osband, J. Aslanides, and A. Cassirer. Randomized prior functions for deep reinforcement learning. In Advances in Neural Information Processing Systems, pages 8617–8629, 2018.
- [51] I. Osband and B. Van Roy. Model-based reinforcement learning and the eluder dimension. In Advances in Neural Information Processing Systems, pages 1466–1474, 2014.
- [52] I. Osband, B. Van Roy, and Z. Wen. Generalization and exploration via randomized value functions. arXiv preprint arXiv:1402.0635, 2014.
- [53] M. L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, 2014.
- [54] A. Rahimi and B. Recht. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems, pages 1177–1184, 2008.
- [56] D. Russo. Worst-case regret bounds for exploration via randomized value functions. In Advances in Neural Information Processing Systems, pages 14410–14420, 2019.
- [57] D. Russo and B. Van Roy. Eluder dimension and the sample complexity of optimistic exploration. In Advances in Neural Information Processing Systems, pages 2256–2264, 2013.
- [58] P. G. Sessa, I. Bogunovic, M. Kamgarpour, and A. Krause. No-regret learning in unknown games with correlated payoffs. In Advances in Neural Information Processing Systems, pages 13602–13611, 2019.
- [59] Z. Shang, G. Cheng, et al. Local and global asymptotic inference in smoothing spline models. Annals of Statistics, 41(5):2608–2638, 2013.
- [60] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.
- [61] D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton, et al. Mastering the game of Go without human knowledge. Nature, 550(7676):354, 2017.
- [62] N. Srinivas, A. Krause, S. M. Kakade, and M. Seeger. Gaussian process optimization in the bandit setting: No regret and experimental design. arXiv preprint arXiv:0912.3995, 2009.
- [63] N. Srinivas, A. Krause, S. M. Kakade, and M. W. Seeger. Information-theoretic regret bounds for gaussian process optimization in the bandit setting. IEEE Transactions on Information Theory, 58(5):3250–3265, 2012.
- [64] I. Steinwart and A. Christmann. Support vector machines. Springer Science & Business Media, 2008.
- [65] A. L. Strehl, L. Li, E. Wiewiora, J. Langford, and M. L. Littman. PAC model-free reinforcement learning. In International Conference on Machine Learning, pages 881–888, 2006.
- [66] R. S. Sutton and A. G. Barto. Reinforcement learning: An introduction. MIT press, 2018.
- [67] M. Valko, N. Korda, R. Munos, I. Flaounas, and N. Cristianini. Finite-time analysis of kernelised contextual bandits. arXiv preprint arXiv:1309.6869, 2013.
- [68] B. Van Roy and S. Dong. Comments on the Du-Kakade-Wang-Yang lower bounds. arXiv preprint arXiv:1911.07910, 2019.
- [69] R. Vershynin. High-dimensional probability: An introduction with applications in data science, volume 47. Cambridge university press, 2018.
- [70] O. Vinyals, I. Babuschkin, W. M. Czarnecki, M. Mathieu, A. Dudzik, J. Chung, D. H. Choi, R. Powell, T. Ewalds, P. Georgiev, et al. Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature, 575(7782):350–354, 2019.
- [71] R. Wang, R. Salakhutdinov, and L. F. Yang. Provably efficient reinforcement learning with general value function approximation. arXiv preprint arXiv:2005.10804, 2020.
- [72] W. Y. Wang, J. Li, and X. He. Deep reinforcement learning for NLP. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics: Tutorial Abstracts, pages 19–21, 2018.
- [73] Y. Wang, R. Wang, S. S. Du, and A. Krishnamurthy. Optimism in reinforcement learning with generalized linear function approximation. arXiv preprint arXiv:1912.04136, 2019.
- [74] Z. Wen and B. Van Roy. Efficient exploration and value function generalization in deterministic systems. In Advances in Neural Information Processing Systems, pages 3021–3029, 2013.
- [75] Z. Wen and B. Van Roy. Efficient reinforcement learning in deterministic systems with value function generalization. Mathematics of Operations Research, 42(3):762–782, 2017.
- [76] L. Wu, C. Ma, and E. Weinan. How SGD selects the global minima in over-parameterized learning: A dynamical stability perspective. In Advances in Neural Information Processing Systems, 2018.
- [77] L. Yang and M. Wang. Sample-optimal parametric Q-learning using linearly additive features. In International Conference on Machine Learning, pages 6995–7004, 2019.
- [78] L. F. Yang and M. Wang. Reinforcement learning in feature space: Matrix bandit, kernels, and regret bound. arXiv preprint arXiv:1905.10389, 2019.
- [79] Y. Yang, A. Bhattacharya, and D. Pati. Frequentist coverage and sup-norm convergence rate in Gaussian process regression. arXiv preprint arXiv:1708.04753, 2017.
- [80] A. Zanette, D. Brandfonbrener, E. Brunskill, M. Pirotta, and A. Lazaric. Frequentist regret bounds for randomized least-squares value iteration. In International Conference on Artificial Intelligence and Statistics, pages 1954–1964, 2020.
- [81] A. Zanette, A. Lazaric, M. Kochenderfer, and E. Brunskill. Learning near optimal policies with low inherent bellman error. arXiv preprint arXiv:2003.00153, 2020.
- [82] Y. Zhang, J. Duchi, and M. Wainwright. Divide and conquer kernel ridge regression: A distributed algorithm with minimax optimal rates. Journal of Machine Learning Research, 16(1):3299–3340, 2015.
- [83] D. Zhou, J. He, and Q. Gu. Provably efficient reinforcement learning for discounted mdps with feature mapping. arXiv preprint arXiv:2006.13165, 2020.
- [84] D. Zhou, L. Li, and Q. Gu. Neural contextual bandits with upper confidence bound-based exploration. arXiv preprint arXiv:1911.04462, 2019.
- [85] D. Zou, Y. Cao, D. Zhou, and Q. Gu. Stochastic gradient descent optimizes over-parameterized deep ReLU networks. arXiv preprint arXiv:1811.08888, 2018.
