A new convergent variant of Q-learning with linear function approximation
NeurIPS 2020 (2020)
In this work, we identify a novel set of conditions that ensure convergence with probability 1 of Q-learning with linear function approximation, by proposing a two time-scale variation thereof. In the faster time scale, the algorithm features an update similar to that of DQN, where the impact of bootstrapping is attenuated by using a Q-value […]
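To make the two time-scale structure described in the abstract concrete, the following is a minimal sketch of one update step, assuming linear features, a fast step size alpha for the main weights, and a much smaller step size beta for the target weights used in bootstrapping. The function name `coupled_q_update` and the exact form of the slow update are illustrative assumptions, not the paper's verbatim algorithm.

```python
import numpy as np

def coupled_q_update(theta, theta_bar, phi, phi_next_all, r, gamma, alpha, beta):
    """One step of a two time-scale Q-learning update with linear features.

    theta        -- fast weights (main Q estimate), shape (d,)
    theta_bar    -- slow weights (target used for bootstrapping), shape (d,)
    phi          -- features of the sampled (state, action) pair, shape (d,)
    phi_next_all -- features of every action at the next state, shape (|A|, d)
    alpha, beta  -- step sizes, with beta much smaller than alpha
    """
    # Fast time scale: a DQN-like update in which the bootstrap term
    # uses the slow-moving target weights rather than the current ones.
    td_target = r + gamma * np.max(phi_next_all @ theta_bar)
    theta = theta + alpha * (td_target - phi @ theta) * phi

    # Slow time scale: the target estimate tracks the main estimate
    # (a simple interpolation here; the paper's slow update may differ).
    theta_bar = theta_bar + beta * (theta - theta_bar)
    return theta, theta_bar
```

Decoupling the bootstrap target from the rapidly changing main estimate is what attenuates the feedback loop that drives divergence in ordinary Q-learning with function approximation.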
- The authors investigate the convergence of reinforcement learning with linear function approximation in control settings.
- The authors analyze the convergence of Q-learning when combined with linear function approximation.
- The function R : X × A → ℝ gives the expected reward for performing action a in state x.
- The authors write r_t to denote the reward at time step t.
- It is a random variable with expected value R(x_t, a_t), and the authors assume throughout that |r_t| ≤ ρ for some ρ > 0.
- γ is a discount factor taking values in [0, 1).
- We address the problem of control in reinforcement learning with function approximation, where the optimal Q-function, Q∗, cannot be represented exactly and some form of approximation must be used (the standard update rule for this setting is written out after this list).
- The authors evaluated the CQL algorithm on three domains of increasing complexity.
- The first was the θ → 2θ example, the second was the 7-star version of Baird's star counterexample, and the third was the mountain car problem.
- The first two problems are known to cause divergence of Q-learning with linear function approximation (a minimal simulation of the first appears after this list).
- The authors performed online learning in the second and third tests, showing that a replay buffer satisfying Assumption (I) is not necessary for convergence.
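For reference, with a linear approximation Q_θ(x, a) = φ(x, a)ᵀθ, the Q-learning update analyzed here takes the following textbook form (notation as in the bullets above; this is the standard rule, not a verbatim excerpt from the paper):

```latex
% Linear approximation: Q_\theta(x, a) = \phi(x, a)^\top \theta
\theta_{t+1} = \theta_t + \alpha_t \, \phi(x_t, a_t)
  \Bigl( r_t + \gamma \max_{b \in A} \phi(x_{t+1}, b)^\top \theta_t
         - \phi(x_t, a_t)^\top \theta_t \Bigr)
```

The two time-scale variant replaces the θ_t inside the max with a separate, slowly updated weight vector.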
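As a concrete illustration of the divergence the first test probes, here is a minimal simulation of the θ → 2θ example under off-policy sampling; the discount factor, step size, and sampling scheme are illustrative choices, not the paper's experimental settings.

```python
# theta -> 2theta example: two states whose approximate values are
# theta and 2*theta (features 1 and 2), a single transition x1 -> x2
# with reward 0, and off-policy sampling that only updates state x1.
gamma = 0.9    # discount factor; the update contracts only when gamma < 1/2
alpha = 0.01   # constant step size (illustrative)
theta = 1.0    # single weight: Q(x1) = theta, Q(x2) = 2 * theta

for t in range(500):
    # TD error for the x1 -> x2 transition:
    # delta = r + gamma * Q(x2) - Q(x1) = (2 * gamma - 1) * theta
    delta = 0.0 + gamma * (2.0 * theta) - theta
    theta += alpha * delta * 1.0  # feature of x1 is 1

# Each step multiplies theta by 1 + alpha * (2 * gamma - 1) > 1,
# so theta grows without bound instead of converging to 0.
print(theta)
```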
- Conclusions and future work
By proposing a two time-scale variant of Q-learning able to combine linear function approximation and off-policy sampling of trajectories, and establishing its convergence under general assumptions, the authors revived the discussion of convergence for this broadly employed algorithm and introduced a theoretical foundation regarding the use of DQN.
- In classical application domains such as robotics, natural language processing, computer vision, and predictive modeling, AI algorithms are part of daily life.
- In some of those applications, AI and ML-driven algorithms can surpass human-level performance.
- Deep learning algorithms are being used to predict poverty from satellite images, and to predict and manage traffic patterns to avoid pollution and congestion in cities.
- Table 1: Results on the mountain car problem. For each architecture, the best result is shown in bold.
- Acknowledgments and Disclosure of Funding: This work was partially supported by national funds through Fundação para a Ciência e Tecnologia, under project SLICE (reference PTDC/CCI-COM/30787/2017) and INESC-ID multi-annual funding (reference UIDB/50021/2020).
- J. Achiam, E. Knight, and P. Abbeel. Towards characterizing divergence in deep Q-learning. CoRR, abs/1903.08894, 2019.
- L. Baird. Residual algorithms: Reinforcement learning with function approximation. In Proceedings of the 12th International Conference on Machine Learning, pages 30–37, 1995.
- D. Bertsekas and J. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, 1996.
- V. Borkar. Stochastic Approximation: A Dynamical Systems Viewpoint. Cambridge University Press, 2008.
- V. Borkar and S. Meyn. The O.D.E. method for convergence of stochastic approximation and reinforcement learning. SIAM J. Control and Optimization, 38(2):447–469, 2000.
- J. Boyan and A. Moore. Generalization in reinforcement learning: Safely approximating the value function. In Advances in Neural Information Processing Systems 7, pages 369–376, 1995.
- J. Chen and N. Jiang. Information-theoretic considerations in batch reinforcement learning. In Proceedings of the 36th International Conference on Machine Learning, pages 1042–1051, 2019.
- Z. Chen, S. Zhang, T. T. Doan, S. T. Maguluri, and J. Clarke. Performance of Q-learning with linear function approximation: Stability and finite-time analysis. arXiv preprint arXiv:1905.11425, 2019.
- D. Ernst, P. Geurts, and L. Wehenkel. Tree-based batch mode reinforcement learning. Journal of Machine Learning Research, 6(Apr):503–556, 2005.
- N. Jean, M. Burke, M. Xie, W. M. Davis, D. B. Lobell, and S. Ermon. Combining satellite imagery and machine learning to predict poverty. Science, 353(6301):790–794, 2016.
- L. Li, Y. Lv, and F. Wang. Traffic signal timing via deep reinforcement learning. IEEE/CAA Journal of Automatica Sinica, 3(3):247–254, 2016.
- H. R. Maei, C. Szepesvári, S. Bhatnagar, and R. S. Sutton. Toward off-policy learning control with function approximation. In Proceedings of the 27th International Conference on Machine Learning, pages 719–726, 2010.
- F. S. Melo, S. Meyn, and M. I. Ribeiro. An analysis of reinforcement learning with function approximation. In Proceedings of the 25th International Conference on Machine Learning, pages 664–671, 2008.
- V. Mnih, K. Kavukcuoglu, D. Silver, A. Rusu, J. Veness, M. Bellemare, A. Graves, M. Riedmiller, A. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis. Human-level control through deep reinforcement learning. Nature, 518:529–533, 2015.
- A. Moore. Efficient memory-based learning for robot control. Technical Report UCAM-CL-TR-209, University of Cambridge, 1990.
- R. Munos, T. Stepleton, A. Harutyunyan, and M. Bellemare. Safe and efficient off-policy reinforcement learning. In Advances in Neural Information Processing Systems, pages 1054–1062, 2016.
- T. J. Perkins and D. Precup. A convergent form of approximate policy iteration. In Advances in Neural Information Processing Systems, pages 1627–1634, 2003.
- R. Sutton and A. Barto. Reinforcement Learning: An Introduction. MIT Press, 2nd edition, 2018.
- C. Szepesvári and R. Munos. Finite time bounds for sampling based fitted value iteration. In Proceedings of the 22nd International Conference on Machine Learning, pages 880–887, 2005.
- C. Szepesvári and W. Smart. Interpolation-based Q-learning. In Proceedings of the 21st International Conference on Machine Learning, pages 100–107, 2004.
- J. Tsitsiklis and B. Van Roy. Feature-based methods for large scale dynamic programming. Machine Learning, 22:59–94, 1996.
- J. Tsitsiklis and B. Van Roy. An analysis of temporal-difference learning with function approximation. IEEE Trans. Automatic Control, 42(5):674–690, 1997.
- H. van Hasselt. Double Q-learning. In Advances in Neural Information Processing Systems, pages 2613–2621, 2010.
- H. van Hasselt, Y. Doron, F. Strub, M. Hessel, N. Sonnerat, and J. Modayil. Deep reinforcement learning and the deadly triad. CoRR, abs/1812.02648, 2018.
- Z. Yang, Y. Xie, and Z. Wang. A theoretical analysis of deep Q-learning. arXiv preprint arXiv:1901.00137, 2019.
- S. Zou, T. Xu, and Y. Liang. Finite-sample analysis for sarsa with linear function approximation. In Advances in Neural Information Processing Systems, pages 8665–8675, 2019.