Provably Efficient Reinforcement Learning with Kernel and Neural Function Approximations

NeurIPS 2020.

Abstract:

Reinforcement learning (RL) algorithms combined with modern function approximators such as kernel functions and deep neural networks have achieved significant empirical successes in large-scale application problems with a massive number of states. From a theoretical perspective, however, RL with function approximation poses a fundamental...

Introduction
  • Reinforcement learning (RL) algorithms combined with modern function approximators such as kernel functions and deep neural networks have achieved tremendous empirical successes in a variety of application problems [e.g., 27, 60, 61, 72, 70].
  • Most existing provably efficient RL algorithms are applicable only to the tabular setting [see, e.g., 33, 52, 6, 35, 50, 56], where both the state and action spaces are discrete and the value function can be represented as a table.
  • The authors introduce the optimistic least-squares value iteration algorithm, in which the action-value functions are estimated using a class of functions defined on Z = S × A (the episodic regret notion used throughout is recalled below).
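For reference, the performance measure analyzed throughout is the standard episodic regret against the optimal policy, which the regret bounds quoted below control; with T episodes of length H it reads:

    \mathrm{Regret}(T) \;=\; \sum_{t=1}^{T} \left[ V_1^{\star}(s_1^t) - V_1^{\pi_t}(s_1^t) \right],

where π_t is the policy executed in episode t and s_1^t is its initial state.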
Highlights
  • Reinforcement learning (RL) algorithms combined with modern function approximators such as kernel functions and deep neural networks have achieved tremendous empirical successes in a variety of application problems [e.g., 27, 60, 61, 72, 70]
  • From a theoretical perspective, function approximation brings statistical estimation into the scope of RL, requiring the bias-variance tradeoff that is innate in statistical estimation to be balanced simultaneously with the exploration-exploitation tradeoff that is inherent in RL
  • Most existing provably efficient RL algorithms are applicable only to the tabular setting [see, e.g., 33, 52, 6, 35, 50, 56], where both the state and action spaces are discrete and the value function can be represented as a table
  • We have presented the algorithmic framework of optimistic least-squares value iteration for RL with general function approximation, in which an additional bonus term is added to the solution of each least-squares value estimation problem to promote exploration (a minimal sketch of such a bonus in the kernel setting follows this list)
  • Reinforcement learning is a tool that is increasingly used in practical machine learning applications, especially in the setting where nonlinear function approximation is involved
  • Theoretical explorations related to reinforcement learning with function approximation may help provide frameworks through which to reason about, and design safer and more reliable practical systems
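The bonus term mentioned above can be made concrete in the kernel setting. Below is a minimal sketch, not the authors' implementation: it assumes an illustrative RBF kernel and hypothetical helper names, and computes a UCB-style bonus from the Gram matrix of previously observed state-action pairs.

    import numpy as np

    def rbf_kernel(Z1, Z2, bandwidth=1.0):
        # Illustrative RBF kernel on state-action vectors; the paper allows a general kernel K.
        sq_dists = ((Z1[:, None, :] - Z2[None, :, :]) ** 2).sum(-1)
        return np.exp(-sq_dists / (2.0 * bandwidth ** 2))

    def ucb_bonus(Z_query, Z_data, lam=1.0, beta=1.0, bandwidth=1.0):
        # UCB-style bonus b(z) = beta * sqrt( k(z,z) - k_D(z)^T (K_D + lam*I)^{-1} k_D(z) ),
        # i.e. beta times the posterior standard deviation of kernel ridge regression.
        K_D = rbf_kernel(Z_data, Z_data, bandwidth)          # Gram matrix of the observed data
        k_q = rbf_kernel(Z_data, Z_query, bandwidth)         # cross-kernel, shape (n, m)
        k_qq = np.ones(Z_query.shape[0])                     # k(z, z) = 1 for the RBF kernel
        sol = np.linalg.solve(K_D + lam * np.eye(len(Z_data)), k_q)
        var = np.maximum(k_qq - np.einsum("nm,nm->m", k_q, sol), 0.0)
        return beta * np.sqrt(var)

In KOVI this quantity plays the role of the width of the confidence region around the least-squares estimate, with the multiplier β chosen large enough to dominate the estimation uncertainty (cf. the bullets in Results below).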
Results
  • The authors present Algorithm 1, Optimistic Least-Squares Value Iteration with general function approximation (a sketch of one episode of its kernel instantiation appears after this list).
  • The authors first consider the case where the function class F is an RKHS H with kernel K, which yields the Kernel Optimistic Least-Squares Value Iteration (KOVI) algorithm.
  • When the kernel is linear, KOVI reduces to the LSVI-UCB algorithm proposed in [36] for linear value functions.
  • For the neural setting, the authors propose the Neural Optimistic Least-Squares Value Iteration (NOVI) algorithm, whose details are stated in the appendix.
  • In §4, Theory of Kernel Optimistic Least-Squares Value Iteration, the authors prove that KOVI achieves an Õ(δ_H · H² · √T) regret bound, where δ_H characterizes the intrinsic complexity of the RKHS H that is used to approximate {Q*_h}_{h∈[H]}.
  • In particular, δ_H involves the covering number of the value function class with respect to the ℓ∞-norm on Z, which is determined by the spectral structure of H and characterizes the complexity of the value functions constructed by KOVI.
  • The bound hinges on (i) Assumption 4.1, which postulates that the RKHS-norm ball {f ∈ H : ‖f‖_H ≤ R_Q·H} contains the image of the Bellman operator, and (ii) the requirement that the inequality in (4.5) admits a solution B_T, which is the value to which the bonus multiplier β is set in Algorithm 2.
  • The authors set β to be sufficiently large so as to dominate the uncertainty of the estimate Q̂_h^t; to quantify such uncertainty, they utilize uniform concentration over the value function class Q^ucb(h+1, R_T, B_T), whose complexity metric, the ℓ∞-covering number, in turn depends on β.
  • Under the γ-exponential eigenvalue decay condition, as the authors show in §H, the log-covering number and the effective dimension are both polylogarithmic in T, with exponents 1 + 2/γ and 1 + 1/γ, respectively.
  • Such a regret is a factor of d^{1/2} worse than that in [62] for kernel contextual bandits, which is due to bounding the log-covering number.
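As a concrete illustration of the bullets above, here is a minimal sketch of one round of optimistic backward value iteration in the kernel instantiation (KOVI-style). It is not the paper's pseudocode: it reuses the hypothetical rbf_kernel and ucb_bonus helpers from the earlier sketch, assumes a small finite action set with states and actions represented as numeric vectors, and omits the specific choices of λ and β analyzed in §4.

    import numpy as np

    def kovi_backward_pass(history, H, actions, lam=1.0, beta=1.0):
        # history[h]: list of (z, r, s_next) transitions observed at step h, where
        # z = np.concatenate([s, a]) is the state-action vector. Returns optimistic
        # action-value functions Q[0..H-1], each mapping a state s to a vector of
        # values over `actions`.
        Q = [None] * (H + 1)
        Q[H] = lambda s: np.zeros(len(actions))               # Q_{H+1} = 0 by convention

        for h in reversed(range(H)):
            Z = np.array([z for (z, r, s_next) in history[h]])
            # Regression targets: immediate reward plus optimistic value at the next step.
            y = np.array([r + Q[h + 1](s_next).max() for (z, r, s_next) in history[h]])
            alpha = np.linalg.solve(rbf_kernel(Z, Z) + lam * np.eye(len(Z)), y)

            def Q_h(s, Z=Z, alpha=alpha):
                zs = np.array([np.concatenate([s, a]) for a in actions])
                mean = rbf_kernel(zs, Z) @ alpha              # kernel least-squares estimate
                bonus = ucb_bonus(zs, Z, lam=lam, beta=beta)  # exploration bonus
                return np.minimum(mean + bonus, H)            # truncate at H

            Q[h] = Q_h
        return Q[:H]

    # The greedy policy then plays pi_h(s) = actions[np.argmax(Q[h](s))] in the next episode.

The kernel ridge regression step is where the statistical estimation enters, while the added bonus drives optimism; NOVI replaces the kernel estimator with an overparameterized neural network whose analysis relies on the neural tangent kernel.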
Conclusion
  • The authors have presented the algorithmic framework of optimistic least-squares value iteration for RL with general function approximation, in which an additional bonus term is added to the solution of each least-squares value estimation problem to promote exploration.
  • Instantiating this framework with kernel and neural function approximations, the authors develop the KOVI and NOVI algorithms, respectively, both of which provably achieve sublinear regret; to the best of their knowledge, these are the first provably efficient RL algorithms under the settings of kernel and neural function approximation.
  • Theoretical explorations related to reinforcement learning with function approximation may help provide frameworks through which to reason about, and design safer and more reliable practical systems
Tables
  • Table 1: Summary of the main results. Here H is the length of each episode, T is the total number of episodes, and 2m is the number of neurons of the overparameterized networks in the neural setting. For the kernel and neural settings, deff denotes the effective dimension of the RKHS H and of the neural tangent kernel, respectively, and N∞(ε*) is the ℓ∞-covering number of the value function class, where ε* = H/T. To obtain concrete bounds, we also apply the general result to RKHSs with two specific eigenvalue-decay conditions: γ-finite spectrum and γ-exponential decay
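To give a sense of the effective dimension in the caption: a common definition in this kernel-bandit literature (an assumption on our part; see the paper for the exact definition used) is a maximal-information-gain-style quantity, the log-determinant of the regularized Gram matrix. A minimal numerical sketch, reusing the hypothetical rbf_kernel helper from above:

    import numpy as np

    def effective_dimension(Z, kernel, lam=1.0):
        # Information-gain-style complexity: log det(I + K / lam) over observed state-action pairs.
        # Under gamma-finite spectrum this scales with the number of nonzero eigenvalues of K;
        # under gamma-exponential eigenvalue decay it grows only polylogarithmically in T.
        K = kernel(Z, Z)
        _, logdet = np.linalg.slogdet(np.eye(len(Z)) + K / lam)
        return logdet

    # Example: effective_dimension(np.random.rand(500, 4), rbf_kernel)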
Related work
  • Due to the space limit, we defer the discussion of related work to §A in the appendix.
Funding
  • Mengdi Wang gratefully acknowledges funding from the U.S. National Science Foundation (NSF) grant CMMI-1653435, Air Force Office of Scientific Research (AFOSR) grant FA9550-19-1-020, and C3.ai DTI.
Reference
  • [1] Y. Abbasi-Yadkori, D. Pal, and C. Szepesvari. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, pages 2312–2320, 2011.
  • [2] Z. Allen-Zhu, Y. Li, and Y. Liang. Learning and generalization in overparameterized neural networks, going beyond two layers. arXiv preprint arXiv:1811.04918, 2018.
  • [3] Z. Allen-Zhu, Y. Li, and Z. Song. A convergence theory for deep learning via over-parameterization. arXiv preprint arXiv:1811.03962, 2018.
  • [4] S. Arora, S. S. Du, W. Hu, Z. Li, and R. Wang. Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. arXiv preprint arXiv:1901.08584, 2019.
  • [5] A. Ayoub, Z. Jia, C. Szepesvari, M. Wang, and L. F. Yang. Model-based reinforcement learning with value-targeted regression. arXiv preprint arXiv:2006.01107, 2020.
  • [6] M. G. Azar, I. Osband, and R. Munos. Minimax regret bounds for reinforcement learning. In International Conference on Machine Learning, pages 263–272, 2017.
  • [7] K. Azuma. Weighted sums of certain dependent random variables. Tohoku Mathematical Journal, Second Series, 19(3):357–367, 1967.
  • [8] F. Bach. Breaking the curse of dimensionality with convex neural networks. The Journal of Machine Learning Research, 18(1):629–681, 2017.
  • [9] Y. Bai and J. D. Lee. Beyond linearization: On quadratic and higher-order approximation of wide neural networks. arXiv preprint arXiv:1910.01619, 2019.
  • [10] S. J. Bradtke and A. G. Barto. Linear least-squares algorithms for temporal difference learning. Machine Learning, 22(1-3):33–57, 1996.
  • [11] S. Bubeck and N. Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5(1):1–122, 2012.
  • [12] Q. Cai, Z. Yang, C. Jin, and Z. Wang. Provably efficient exploration in policy optimization. arXiv preprint arXiv:1912.05830, 2019.
  • [13] Q. Cai, Z. Yang, J. D. Lee, and Z. Wang. Neural temporal-difference learning converges to global optima. In Advances in Neural Information Processing Systems, pages 11312–11322, 2019.
  • [14] D. Calandriello, L. Carratino, A. Lazaric, M. Valko, and L. Rosasco. Gaussian process optimization with adaptive sketching: Scalable and no regret. arXiv preprint arXiv:1903.05594, 2019.
  • [15] Y. Cao and Q. Gu. Generalization bounds of stochastic gradient descent for wide and deep neural networks. arXiv preprint arXiv:1905.13210, 2019.
  • [16] Y. Cao and Q. Gu. A generalization theory of gradient descent for learning over-parameterized deep ReLU networks. arXiv preprint arXiv:1902.01384, 2019.
  • [17] L. Chizat and F. Bach. A note on lazy training in supervised differentiable programming. arXiv preprint arXiv:1812.07956, 2018.
  • [18] S. R. Chowdhury and A. Gopalan. On kernelized multi-armed bandits. In Proceedings of the 34th International Conference on Machine Learning, pages 844–853, 2017.
  • [19] A. Daniely. SGD learns the conjugate kernel class of the network. In Advances in Neural Information Processing Systems, 2017.
  • [20] C. Dann, N. Jiang, A. Krishnamurthy, A. Agarwal, J. Langford, and R. E. Schapire. On oracle-efficient PAC RL with rich observations. In Advances in Neural Information Processing Systems, pages 1422–1432, 2018.
  • [21] C. Dann, T. Lattimore, and E. Brunskill. Unifying PAC and regret: Uniform PAC bounds for episodic reinforcement learning. In Advances in Neural Information Processing Systems, pages 5713–5723, 2017.
  • [22] K. Dong, J. Peng, Y. Wang, and Y. Zhou. √n-regret for learning in Markov decision processes with function approximation and low Bellman rank. arXiv preprint arXiv:1909.02506, 2019.
  • [23] S. S. Du, S. M. Kakade, R. Wang, and L. F. Yang. Is a good representation sufficient for sample efficient reinforcement learning? arXiv preprint arXiv:1910.03016, 2019.
  • [24] S. S. Du, A. Krishnamurthy, N. Jiang, A. Agarwal, M. Dudík, and J. Langford. Provably efficient RL with rich observations via latent state decoding. arXiv preprint arXiv:1901.09018, 2019.
  • [25] S. S. Du, J. D. Lee, H. Li, L. Wang, and X. Zhai. Gradient descent finds global minima of deep neural networks. arXiv preprint arXiv:1811.03804, 2018.
  • [26] S. S. Du, X. Zhai, B. Poczos, and A. Singh. Gradient descent provably optimizes over-parameterized neural networks. arXiv preprint arXiv:1810.02054, 2018.
  • [27] Y. Duan, X. Chen, R. Houthooft, J. Schulman, and P. Abbeel. Benchmarking deep reinforcement learning for continuous control. In International Conference on Machine Learning, pages 1329–1338, 2016.
  • [28] A. Durand, O.-A. Maillard, and J. Pineau. Streaming kernel regression with provably adaptive mean, variance, and regularization. The Journal of Machine Learning Research, 19(1):650–683, 2018.
  • [29] Y. Efroni, L. Shani, A. Rosenberg, and S. Mannor. Optimistic policy optimization with bandit feedback. arXiv preprint arXiv:2002.08243, 2020.
  • [30] R. Gao, T. Cai, H. Li, C.-J. Hsieh, L. Wang, and J. D. Lee. Convergence of adversarial training in overparametrized neural networks. In Advances in Neural Information Processing Systems, pages 13009–13020, 2019.
  • [31] T. Hofmann, B. Scholkopf, and A. J. Smola. Kernel methods in machine learning. The Annals of Statistics, pages 1171–1220, 2008.
  • [32] A. Jacot, F. Gabriel, and C. Hongler. Neural tangent kernel: Convergence and generalization in neural networks. In Advances in Neural Information Processing Systems, 2018.
  • [33] T. Jaksch, R. Ortner, and P. Auer. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11(4):1563–1600, 2010.
  • [34] N. Jiang, A. Krishnamurthy, A. Agarwal, J. Langford, and R. E. Schapire. Contextual decision processes with low Bellman rank are PAC-learnable. In Proceedings of the 34th International Conference on Machine Learning, pages 1704–1713, 2017.
  • [35] C. Jin, Z. Allen-Zhu, S. Bubeck, and M. I. Jordan. Is Q-learning provably efficient? In Advances in Neural Information Processing Systems, pages 4863–4873, 2018.
  • [36] C. Jin, Z. Yang, Z. Wang, and M. I. Jordan. Provably efficient reinforcement learning with linear function approximation. arXiv preprint arXiv:1907.05388, 2019.
  • [37] S. Kakade, A. Krishnamurthy, K. Lowrey, M. Ohnishi, and W. Sun. Information theoretic regret bounds for online nonlinear control. arXiv preprint arXiv:2006.12466, 2020.
  • [38] A. Krause and C. S. Ong. Contextual Gaussian process bandit optimization. In Advances in Neural Information Processing Systems, pages 2447–2455, 2011.
  • [39] A. Krishnamurthy, A. Agarwal, and J. Langford. PAC reinforcement learning with rich observations. In Advances in Neural Information Processing Systems, pages 1840–1848, 2016.
  • [40] J. Lafferty and G. Lebanon. Diffusion kernels on statistical manifolds. Journal of Machine Learning Research, 6(Jan):129–163, 2005.
  • [41] T. Lattimore and C. Szepesvari. Bandit Algorithms. Preprint, 2018.
  • [42] T. Lattimore and C. Szepesvari. Learning with good feature representations in bandits and in RL with a generative model. arXiv preprint arXiv:1911.07676, 2019.
  • [43] J. Lee, L. Xiao, S. S. Schoenholz, Y. Bahri, J. Sohl-Dickstein, and J. Pennington. Wide neural networks of any depth evolve as linear models under gradient descent. arXiv preprint arXiv:1902.06720, 2019.
  • [44] Y. Li and Y. Liang. Learning overparameterized neural networks via stochastic gradient descent on structured data. In Advances in Neural Information Processing Systems, 2018.
  • [45] J. Lu, G. Cheng, and H. Liu. Nonparametric heterogeneity testing for massive data. arXiv preprint arXiv:1601.06212, 2016.
  • [46] S. Mendelson, J. Neeman, et al. Regularization in kernel learning. The Annals of Statistics, 38(1):526–565, 2010.
  • [47] H. Q. Minh, P. Niyogi, and Y. Yao. Mercer's theorem, feature maps, and smoothing. In International Conference on Computational Learning Theory, pages 154–168.
  • [49] B. Neyshabur and Z. Li. Towards understanding the role of over-parametrization in generalization of neural networks. In International Conference on Learning Representations, 2019.
  • [50] I. Osband, J. Aslanides, and A. Cassirer. Randomized prior functions for deep reinforcement learning. In Advances in Neural Information Processing Systems, pages 8617–8629, 2018.
  • [51] I. Osband and B. Van Roy. Model-based reinforcement learning and the eluder dimension. In Advances in Neural Information Processing Systems, pages 1466–1474, 2014.
  • [52] I. Osband, B. Van Roy, and Z. Wen. Generalization and exploration via randomized value functions. arXiv preprint arXiv:1402.0635, 2014.
  • [53] M. L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, 2014.
  • [54] A. Rahimi and B. Recht. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems, pages 1177–1184, 2008.
  • [56] D. Russo. Worst-case regret bounds for exploration via randomized value functions. In Advances in Neural Information Processing Systems, pages 14410–14420, 2019.
  • [57] D. Russo and B. Van Roy. Eluder dimension and the sample complexity of optimistic exploration. In Advances in Neural Information Processing Systems, pages 2256–2264, 2013.
  • [58] P. G. Sessa, I. Bogunovic, M. Kamgarpour, and A. Krause. No-regret learning in unknown games with correlated payoffs. In Advances in Neural Information Processing Systems, pages 13602–13611, 2019.
  • [59] Z. Shang, G. Cheng, et al. Local and global asymptotic inference in smoothing spline models. Annals of Statistics, 41(5):2608–2638, 2013.
  • [60] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.
  • [61] D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton, et al. Mastering the game of Go without human knowledge. Nature, 550(7676):354, 2017.
  • [62] N. Srinivas, A. Krause, S. M. Kakade, and M. Seeger. Gaussian process optimization in the bandit setting: No regret and experimental design. arXiv preprint arXiv:0912.3995, 2009.
  • [63] N. Srinivas, A. Krause, S. M. Kakade, and M. W. Seeger. Information-theoretic regret bounds for Gaussian process optimization in the bandit setting. IEEE Transactions on Information Theory, 58(5):3250–3265, 2012.
  • [64] I. Steinwart and A. Christmann. Support Vector Machines. Springer Science & Business Media, 2008.
  • [65] A. L. Strehl, L. Li, E. Wiewiora, J. Langford, and M. L. Littman. PAC model-free reinforcement learning. In International Conference on Machine Learning, pages 881–888, 2006.
  • [66] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, 2018.
  • [67] M. Valko, N. Korda, R. Munos, I. Flaounas, and N. Cristianini. Finite-time analysis of kernelised contextual bandits. arXiv preprint arXiv:1309.6869, 2013.
  • [68] B. Van Roy and S. Dong. Comments on the Du-Kakade-Wang-Yang lower bounds. arXiv preprint arXiv:1911.07910, 2019.
  • [69] R. Vershynin. High-Dimensional Probability: An Introduction with Applications in Data Science, volume 47. Cambridge University Press, 2018.
  • [70] O. Vinyals, I. Babuschkin, W. M. Czarnecki, M. Mathieu, A. Dudzik, J. Chung, D. H. Choi, R. Powell, T. Ewalds, P. Georgiev, et al. Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature, 575(7782):350–354, 2019.
  • [71] R. Wang, R. Salakhutdinov, and L. F. Yang. Provably efficient reinforcement learning with general value function approximation. arXiv preprint arXiv:2005.10804, 2020.
  • [72] W. Y. Wang, J. Li, and X. He. Deep reinforcement learning for NLP. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics: Tutorial Abstracts, pages 19–21, 2018.
  • [73] Y. Wang, R. Wang, S. S. Du, and A. Krishnamurthy. Optimism in reinforcement learning with generalized linear function approximation. arXiv preprint arXiv:1912.04136, 2019.
  • [74] Z. Wen and B. Van Roy. Efficient exploration and value function generalization in deterministic systems. In Advances in Neural Information Processing Systems, pages 3021–3029, 2013.
  • [75] Z. Wen and B. Van Roy. Efficient reinforcement learning in deterministic systems with value function generalization. Mathematics of Operations Research, 42(3):762–782, 2017.
  • [76] L. Wu, C. Ma, and E. Weinan. How SGD selects the global minima in over-parameterized learning: A dynamical stability perspective. In Advances in Neural Information Processing Systems, 2018.
  • [77] L. Yang and M. Wang. Sample-optimal parametric Q-learning using linearly additive features. In International Conference on Machine Learning, pages 6995–7004, 2019.
  • [78] L. F. Yang and M. Wang. Reinforcement learning in feature space: Matrix bandit, kernels, and regret bound. arXiv preprint arXiv:1905.10389, 2019.
  • [79] Y. Yang, A. Bhattacharya, and D. Pati. Frequentist coverage and sup-norm convergence rate in Gaussian process regression. arXiv preprint arXiv:1708.04753, 2017.
  • [80] A. Zanette, D. Brandfonbrener, E. Brunskill, M. Pirotta, and A. Lazaric. Frequentist regret bounds for randomized least-squares value iteration. In International Conference on Artificial Intelligence and Statistics, pages 1954–1964, 2020.
  • [81] A. Zanette, A. Lazaric, M. Kochenderfer, and E. Brunskill. Learning near optimal policies with low inherent Bellman error. arXiv preprint arXiv:2003.00153, 2020.
  • [82] Y. Zhang, J. Duchi, and M. Wainwright. Divide and conquer kernel ridge regression: A distributed algorithm with minimax optimal rates. Journal of Machine Learning Research, 16(1):3299–3340, 2015.
  • [83] D. Zhou, J. He, and Q. Gu. Provably efficient reinforcement learning for discounted MDPs with feature mapping. arXiv preprint arXiv:2006.13165, 2020.
  • [84] D. Zhou, L. Li, and Q. Gu. Neural contextual bandits with upper confidence bound-based exploration. arXiv preprint arXiv:1911.04462, 2019.
  • [85] D. Zou, Y. Cao, D. Zhou, and Q. Gu. Stochastic gradient descent optimizes over-parameterized deep ReLU networks. arXiv preprint arXiv:1811.08888, 2018.