A new convergent variant of Q-learning with linear function approximation

NeurIPS 2020


Abstract

In this work, we identify a novel set of conditions that ensure convergence with probability 1 of Q-learning with linear function approximation, by proposing a two time-scale variation thereof. In the faster time scale, the algorithm features an update similar to that of DQN, where the impact of bootstrapping is attenuated by using a Q-va…
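
To make the two time-scale idea concrete, the snippet below is a minimal sketch, not the paper's exact update rule: a fast weight vector is adjusted with a DQN-like temporal-difference step whose bootstrap target is computed from a second weight vector that is updated on a slower time scale, playing a role analogous to DQN's target network. The feature map `phi`, the step sizes `alpha` and `beta`, and the tracking rule for the slow weights are illustrative assumptions.

```python
import numpy as np

def two_timescale_q_update(w_fast, w_slow, phi_sa, phi_next_all, r, gamma, alpha, beta):
    """One illustrative two time-scale step for linear Q-learning.

    w_fast, w_slow : weight vectors, shape (d,)
    phi_sa         : features of the sampled state-action pair, shape (d,)
    phi_next_all   : features of the next state for every action, shape (num_actions, d)
    """
    # Fast time scale: DQN-like TD step that bootstraps from the slow weights,
    # attenuating the effect of bootstrapping on the weights being updated.
    target = r + gamma * np.max(phi_next_all @ w_slow)
    w_fast = w_fast + alpha * (target - phi_sa @ w_fast) * phi_sa

    # Slow time scale: the slow weights track the fast ones with step size
    # beta << alpha (beta_t / alpha_t -> 0), much like DQN's target network.
    w_slow = w_slow + beta * (phi_sa @ w_fast - phi_sa @ w_slow) * phi_sa

    return w_fast, w_slow
```

Because `beta` is much smaller than `alpha`, the bootstrap target changes slowly relative to the fast weights, which is what attenuates the destabilizing effect of bootstrapping in this sketch.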

Introduction
  • The authors investigate the convergence of reinforcement learning with linear function approximation in control settings.
  • The authors analyze the convergence of Q-learning when combined with linear function approximation.
  • The function R : X × A → R is the expected reward for performing action a in state x.
  • The authors write r_t to denote the reward at time step t.
  • It is a random variable with expected value R(x_t, a_t), and the authors assume throughout that |r_t| ≤ ρ for some ρ > 0.
  • γ is a discount factor taking values in [0, 1); the standard update that the paper builds on is sketched below this list.
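
For reference, Q-learning with linear function approximation represents the Q-function as Q_θ(x, a) = θ^⊤ φ(x, a) and adjusts θ with a temporal-difference step; the feature map φ and step sizes α_t are standard notation introduced here for illustration rather than taken from the excerpt above:

```latex
\theta_{t+1} = \theta_t + \alpha_t\,\phi(x_t, a_t)
  \Big( r_t + \gamma \max_{a' \in A} \theta_t^{\top} \phi(x_{t+1}, a')
        - \theta_t^{\top} \phi(x_t, a_t) \Big)
```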
Highlights
  • We investigate the convergence of reinforcement learning with linear function approximation in control settings
  • We analyze the convergence of Q-learning when combined with linear function approximation
  • We address the problem of control in reinforcement learning with function approximation, where the optimal Q-function, Q∗, cannot be represented exactly and some form of approximation must be used
  • We evaluated the CQL algorithm on three domains with increasing complexity
  • By proposing a two time-scale variant of Q-learning able to combine linear function approximation and off-policy sampling of trajectories, and establishing its convergence under general assumptions, we revived the discussion of convergence for this broadly employed algorithm and introduced a theoretical foundation regarding the use of DQN
Results
  • The authors evaluated the CQL algorithm on three domains with increasing complexity.
  • The first was the θ → 2θ example [21] and the second was the 7-star version of the star counterexample [2].
  • Both problems are known to cause divergence of Q-learning with linear function approximation (a minimal illustration of the first is sketched after this list).
  • The authors performed online learning on the second and third tests, showing that the use of a replay buffer satisfying Assumption (I) is not necessary for convergence
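
As an illustration of why such divergence can occur, below is a minimal Python sketch of the classic θ → 2θ example [21], written as expected TD(0) updates of a single linear weight (with one action per state, Q-learning reduces to this case). The constants γ, ε, and the step size are illustrative assumptions, not values from the paper.

```python
import numpy as np

# theta -> 2theta example: two states with features phi(1) = 1 and phi(2) = 2,
# zero rewards, state 1 moves to state 2, and state 2 self-loops with
# probability 1 - eps (terminating otherwise).
gamma, eps, alpha = 0.99, 0.01, 0.01  # illustrative constants
phi = np.array([1.0, 2.0])
w = 1.0  # single linear weight: the approximate value of state i is w * phi[i]

for _ in range(2000):
    # Expected (synchronous) TD(0) update over both states, with zero reward.
    td_state1 = gamma * w * phi[1] - w * phi[0]              # bootstrap from state 2
    td_state2 = (1 - eps) * gamma * w * phi[1] - w * phi[1]  # self-loop bootstrap
    w += alpha * (phi[0] * td_state1 + phi[1] * td_state2)

print(w)  # grows without bound for gamma close to 1: the weight diverges
```

Each expected update multiplies w by 1 + α(6γ − 4γε − 5), which exceeds 1 whenever γ is close enough to 1, so the iterates blow up even though every individual step is an ordinary TD update.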
Conclusion
  • Conclusions and future work

    By proposing a two time-scale variant of Q-learning able to combine linear function approximation and off-policy sampling of trajectories, and establishing its convergence under general assumptions, the authors revived the discussion of convergence for this broadly employed algorithm and introduced a theoretical foundation regarding the use of DQN.
  • Within classical application domains, such as robotics, natural language processing, computer vision, predictive models, and others, AI algorithms are part of daily life.
  • In some of those applications, AI and ML-driven algorithms can surpass human-level performance.
  • Deep learning algorithms are being used to predict poverty from satellite images [10], and to predict and manage traffic patterns to avoid pollution and congestion in cities [11]
Tables
  • Table 1: Results on the mountain car problem. For each architecture, the best result is shown in bold
Funding
  • This work was partially supported by national funds through Fundação para a Ciência e Tecnologia under project SLICE, with reference PTDC/CCI-COM/30787/2017, and by INESC-ID multi-annual funding with reference UIDB/50021/2020
References
  • J. Achiam, E. Knight, and P. Abbeel. Towards characterizing divergence in deep Q-learning. CoRR, abs/1903.08894, 2019.
  • L. Baird. Residual algorithms: Reinforcement learning with function approximation. In Proceedings of the 12th International Conference on Machine Learning, pages 30–37, 1995.
  • D. Bertsekas and J. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, 1996.
  • V. Borkar. Stochastic Approximation: A Dynamical Systems Viewpoint. Cambridge University Press, 2008.
  • V. Borkar and S. Meyn. The O.D.E. method for convergence of stochastic approximation and reinforcement learning. SIAM J. Control and Optimization, 38(2):447–469, 2000.
  • J. Boyan and A. Moore. Generalization in reinforcement learning: Safely approximating the value function. In Advances in Neural Information Processing Systems 7, pages 369–376, 1995.
  • J. Chen and N. Jiang. Information-theoretic considerations in batch reinforcement learning. In Proceedings of the 36th International Conference on Machine Learning, pages 1042–1051, 2019.
  • Z. Chen, S. Zhang, T. T. Doan, S. T. Maguluri, and J. Clarke. Performance of Q-learning with linear function approximation: Stability and finite-time analysis. arXiv preprint arXiv:1905.11425, 2019.
  • D. Ernst, P. Geurts, and L. Wehenkel. Tree-based batch mode reinforcement learning. Journal of Machine Learning Research, 6(Apr):503–556, 2005.
  • N. Jean, M. Burke, M. Xie, W. M. Davis, D. B. Lobell, and S. Ermon. Combining satellite imagery and machine learning to predict poverty. Science, 353(6301):790–794, 2016.
  • L. Li, Y. Lv, and F. Wang. Traffic signal timing via deep reinforcement learning. IEEE/CAA Journal of Automatica Sinica, 3(3):247–254, 2016.
  • H. R. Maei, C. Szepesvári, S. Bhatnagar, and R. S. Sutton. Toward off-policy learning control with function approximation. In Proceedings of the 27th International Conference on Machine Learning, pages 719–726, 2010.
  • F. S. Melo, S. Meyn, and M. I. Ribeiro. An analysis of reinforcement learning with function approximation. In Proceedings of the 25th International Conference on Machine Learning, pages 664–671, 2008.
  • V. Mnih, K. Kavukcuoglu, D. Silver, A. Rusu, J. Veness, M. Bellemare, A. Graves, M. Riedmiller, A. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis. Human-level control through deep reinforcement learning. Nature, 518:529–533, 2015.
  • A. Moore. Efficient memory-based learning for robot control. Technical Report UCAM-CL-TR-209, University of Cambridge, 1990.
  • R. Munos, T. Stepleton, A. Harutyunyan, and M. Bellemare. Safe and efficient off-policy reinforcement learning. In Advances in Neural Information Processing Systems, pages 1054–1062, 2016.
  • T. J. Perkins and D. Precup. A convergent form of approximate policy iteration. In Advances in Neural Information Processing Systems, pages 1627–1634, 2003.
  • R. Sutton and A. Barto. Reinforcement Learning: An Introduction. MIT Press, 2nd edition, 2018.
  • C. Szepesvári and R. Munos. Finite time bounds for sampling based fitted value iteration. In Proceedings of the 22nd International Conference on Machine Learning, pages 880–887, 2005.
  • C. Szepesvári and W. Smart. Interpolation-based Q-learning. In Proceedings of the 21st International Conference on Machine Learning, pages 100–107, 2004.
  • J. Tsitsiklis and B. Van Roy. Feature-based methods for large scale dynamic programming. Machine Learning, 22:59–94, 1996.
  • J. Tsitsiklis and B. Van Roy. An analysis of temporal-difference learning with function approximation. IEEE Trans. Automatic Control, 42(5):674–690, 1996.
  • H. van Hasselt. Double Q-learning. In Advances in Neural Information Processing Systems, pages 2613–2621, 2010.
  • H. van Hasselt, Y. Doron, F. Strub, M. Hessel, N. Sonnerat, and J. Modayil. Deep reinforcement learning and the deadly triad. CoRR, abs/1812.02648, 2018.
  • Z. Yang, Y. Xie, and Z. Wang. A theoretical analysis of deep Q-learning. arXiv preprint arXiv:1901.00137, 2019.
  • S. Zou, T. Xu, and Y. Liang. Finite-sample analysis for SARSA with linear function approximation. In Advances in Neural Information Processing Systems, pages 8665–8675, 2019.