Gradient Temporal-Difference Learning with Regularized Corrections

ICML, pp. 3524-3534 (2020)

Abstract

It is still common to use Q-learning and temporal difference (TD) learning-even though they have divergence issues and sound Gradient TD alternatives exist-because divergence seems rare and they typically perform well. However, recent work with large neural network learning systems reveals that instability is more common than previously...

Introduction
  • Off-policy learning—the ability to learn the policy or value function for one policy while following another—underlies many practical implementations of reinforcement learning.
  • Many systems use experience replay, where the value function is updated using previous experiences under many different policies.
  • One of the most widely-used algorithms, Q-learning—a temporal difference (TD) algorithm—is off-policy by design: it updates toward the maximum-value action in the current state, regardless of which action the agent selected (a minimal sketch of this update follows this list).
  • Based on the agent’s action At and the transition dynamics, P : S × A × S → [0, 1], the environment transitions into a new state, St+1, and emits a scalar reward Rt+1.
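As a concrete illustration of the update described above, the following is a minimal sketch of tabular Q-learning. The environment interface (env.reset, env.step) and the epsilon-greedy behaviour policy are hypothetical stand-ins, not the paper's experimental setup.

    import numpy as np

    def q_learning_episode(env, Q, alpha=0.1, gamma=0.99, epsilon=0.1,
                           rng=np.random.default_rng(0)):
        """One episode of tabular Q-learning. The behaviour policy is
        epsilon-greedy, but the update bootstraps off the greedy (max) action,
        so the target policy differs from the behaviour policy (off-policy)."""
        s = env.reset()
        done = False
        while not done:
            # Behaviour policy: epsilon-greedy over the current estimates.
            if rng.random() < epsilon:
                a = int(rng.integers(Q.shape[1]))
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)
            # Off-policy TD target: bootstrap with max_a' Q(s', a'),
            # regardless of which action the behaviour policy selects next.
            target = r + (0.0 if done else gamma * np.max(Q[s_next]))
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
        return Q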
Highlights
  • Off-policy learning—the ability to learn the policy or value function for one policy while following another—underlies many practical implementations of reinforcement learning
  • In this paper we introduce a new gradient temporal difference method, called TD with Regularized Corrections (TDRC)
  • We demonstrate that TD with Corrections (TDC) frequently outperforms the saddlepoint variant of gradient TD, motivating both our choice to build on TDC and the utility of being able to shift between TD and TDC by setting the regularization parameter (a sketch of the update follows this list)
  • We introduced a simple modification of the TD with Corrections algorithm that achieves performance much closer to that of temporal difference
  • TD with Regularized Corrections is built on TD with Corrections, and, as we prove, inherits its soundness guarantees
  • With extensions to non-linear function approximation, we find that the resulting algorithm, QRC, performs as well as Q-learning and in some cases notably better
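The following is a hedged sketch of the linear TDRC prediction update as we read it from the points above: the TDC update augmented with an L2 penalty of strength beta on the secondary weights h. Setting beta = 0 recovers TDC, while a large beta drives h toward zero so the update behaves like plain TD. The step-size conventions and the placement of the importance-sampling ratio are our assumptions, not necessarily the paper's exact formulation.

    import numpy as np

    def tdrc_update(w, h, x, r, x_next, gamma, alpha, beta, rho=1.0):
        """One linear TDRC prediction step (sketch).

        w         : primary weights, value estimate v(s) = w @ x(s)
        h         : secondary (correction) weights
        x, x_next : feature vectors for S_t and S_{t+1}
        rho       : importance-sampling ratio (1.0 when on-policy)
        beta      : regularization strength; beta = 0 gives TDC, large beta pushes h -> 0
        """
        delta = r + gamma * (w @ x_next) - (w @ x)                      # TD error
        # TD step plus the gradient-correction term from TDC.
        w = w + alpha * rho * (delta * x - gamma * (h @ x) * x_next)
        # Secondary weights track an estimate of the expected TD error given x;
        # the extra -beta * h term is the regularized correction of TDRC.
        h = h + alpha * ((rho * delta - h @ x) * x - beta * h)
        return w, h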
Methods
  • Experiments in the Prediction Setting: the authors first establish the performance of TDRC across several small linear prediction tasks where the authors can carefully sweep hyper-parameters, analyze sensitivity, and average over many runs.
  • The first problem, Boyan’s chain (Boyan, 2002), is a 13 state Markov chain where each state is represented by a compact feature representation.
  • Like TD, TDRC was developed for prediction, under linear function approximation.
  • Like TD, there are natural— though in some cases heuristic—extensions to the control setting and to non-linear function approximation.
  • The authors first investigate TDRC in control with linear function approximation, where the extension is more straightforward (a sketch of a control-style update follows this list).
  • The authors show, for the first time, that gradient TD methods can outperform Q-learning when using neural networks, in two classic control domains and two visual games
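To make the control extension concrete, here is a hedged sketch of a QRC-style update with linear function approximation: a Q-learning step plus a regularized, TDC-style correction, keeping one secondary weight vector per action. The per-action bookkeeping, the greedy bootstrap, and the shared step size are our assumptions and need not match the paper's exact formulation.

    import numpy as np

    def qrc_linear_update(W, H, x, a, r, x_next, gamma, alpha, beta, done=False):
        """One QRC-style control step with linear features (sketch).

        W : (num_actions, num_features) primary weights,   q(s, a) = W[a] @ x(s)
        H : (num_actions, num_features) secondary weights, one correction vector per action
        """
        if done:
            q_next, a_next = 0.0, None
        else:
            a_next = int(np.argmax(W @ x_next))      # greedy action at S_{t+1}
            q_next = W[a_next] @ x_next
        delta = r + gamma * q_next - W[a] @ x        # TD error with a Q-learning (max) target

        # Q-learning step on the action that was taken ...
        W[a] += alpha * delta * x
        # ... plus a TDC-style correction applied to the bootstrapped (greedy) action.
        if not done:
            W[a_next] -= alpha * gamma * (H[a] @ x) * x_next
        # Regularized secondary weights; beta = 0 leaves the plain gradient correction.
        H[a] += alpha * ((delta - H[a] @ x) * x - beta * H[a])
        return W, H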
Results
  • The authors introduced a simple modification of the TDC algorithm that achieves performance much closer to that of TD.
Conclusion
  • The authors introduced a simple modification of the TDC algorithm that achieves performance much closer to that of TD.
  • With extensions to non-linear function approximation, the authors find that the resulting algorithm, QRC, performs as well as Q-learning and in some cases notably better.
  • This constitutes the first demonstration of Gradient-TD methods outperforming Q-learning, and suggests that this simple modification to the standard Q-learning update—to give QRC—could provide a more general-purpose algorithm
Tables
  • Table 1: Average area under the RMSPBE learning curve for each problem using the Adagrad stepsize-selection algorithm. Bolded values highlight the lowest RMSPBE obtained for a given problem. TD, HTD, and VTrace all appear to converge very slowly with Adagrad; HTD still exhibits oscillating behavior, and TD and VTrace show significant bias in final performance. These values correspond to the bar graphs in Figure 1
  • Table 2: Average area under the RMSPBE learning curve for each problem using the Adam stepsize-selection algorithm. Bolded values highlight the lowest RMSPBE obtained for a given problem. TD, HTD, and VTrace all appear to converge while using Adam, though convergence is very slow and not monotonic. These values correspond to the bar graphs in Figure 8
  • Table 3: Average area under the RMSPBE learning curve for each problem using a constant stepsize. Bolded values highlight the lowest RMSPBE obtained for a given problem. These values correspond to the bar graphs in Figure 13 (the RMSPBE metric itself is sketched below)
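These tables report the average area under the RMSPBE learning curve. As a hedged illustration (not code from the paper), the root mean squared projected Bellman error of a linear predictor can be computed from the standard expectation matrices A = E[x (x - gamma x')^T], b = E[R x], and C = E[x x^T] as below; the paper's exact estimation procedure may differ.

    import numpy as np

    def rmspbe(w, A, b, C):
        """Root mean squared projected Bellman error of linear weights w:
        MSPBE(w) = (b - A w)^T C^{-1} (b - A w), with
        A = E[x (x - gamma x')^T], b = E[R x], C = E[x x^T]."""
        err = b - A @ w
        return float(np.sqrt(err @ np.linalg.solve(C, err)))

    def average_rmspbe(weights_over_time, A, b, C):
        """Average RMSPBE over a run, i.e. the area under the learning curve
        normalized by the number of recorded time steps."""
        return float(np.mean([rmspbe(w, A, b, C) for w in weights_over_time]))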
Funding
  • This work was funded by NSERC and CIFAR, particularly through funding the Alberta Machine Intelligence Institute (Amii) and the CCAI Chair program
  • The authors also gratefully acknowledge funding from JPMorgan Chase & Co. and Google DeepMind
References
  • Glorot, X., Bengio, Y. (2010). Understanding the difficulty of training deep feedforward neural networks. In International Conference on Artificial Intelligence and Statistics, pp. 249-256.
  • Hackman, L. (2012). Faster gradient TD algorithms. M.Sc. thesis, University of Alberta, Edmonton.
  • Baird, L. C. (1995). Residual algorithms: Reinforcement learning with function approximation. In International Conference on Machine Learning, pp. 30-37. Morgan Kaufmann, San Francisco.
  • Barto, A. G., Sutton, R. S., Anderson, C. W. (1983). Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics, 5, 834-846.
  • Bellemare, M. G., Naddaf, Y., Veness, J., Bowling, M. (2013). The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47, 253-279.
  • Juditsky, A., Nemirovski, A. (2011). Optimization for Machine Learning.
  • Kingma, D. P., Ba, J. (2014). Adam: A method for stochastic optimization. arXiv:1412.6980.
  • Liu, B., Liu, J., Ghavamzadeh, M., Mahadevan, S., Petrik, M. (2015). Finite-sample analysis of proximal gradient TD algorithms. In International Conference on Uncertainty in Artificial Intelligence, pp. 504-513.
  • Liu, B., Liu, J., Ghavamzadeh, M., Mahadevan, S., Petrik, M. (2016). Proximal gradient temporal difference learning algorithms. In International Joint Conference on Artificial Intelligence, pp. 4195-4199.
  • Borkar, V. S., Meyn, S. P. (2000). The O.D.E. method for convergence of stochastic approximation and reinforcement learning. SIAM Journal on Control and Optimization.
  • Boyan, J. A. (2002). Technical update: Least-squares temporal difference learning. Machine Learning.
  • Du, S. S., Chen, J., Li, L., Xiao, L., Zhou, D. (2017). Stochastic variance reduction methods for policy evaluation. In International Conference on Machine Learning.
  • Duchi, J., Hazan, E., Singer, Y. (2011). Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12, 2121-2159.
  • Espeholt, L., Soyer, H., Munos, R., Simonyan, K., Mnih, V., Ward, T., Doron, Y., Firoiu, V., Harley, T., Dunning, I., Legg, S. (2018). IMPALA: Scalable distributed Deep-RL with importance weighted actor-learner architectures. In International Conference on Machine Learning.
  • Feng, Y., Li, L., Liu, Q. (2019). A kernel loss for solving the Bellman equation. In Advances in Neural Information Processing Systems, pp. 15430-15441.
  • Mahadevan, S., Liu, B., Thomas, P., Dabney, W., Giguere, S., Jacek, N., Gemp, I., Liu, J. (2014). Proximal reinforcement learning: A new theory of sequential decision making in primal-dual spaces. arXiv:1405.6757.
  • Mahmood, A. R., Yu, H., Sutton, R. S. (2017). Multi-step off-policy learning without importance sampling ratios. arXiv:1702.03006.
  • Maei, H. R. (2011). Gradient temporal-difference learning algorithms. Ph.D. thesis, University of Alberta, Edmonton.
  • Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., Silver, D., Kavukcuoglu, K. (2016). Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pp. 1928-1937.
  • Munos, R., Stepleton, T., Harutyunyan, A., Bellemare, M. (2016). Safe and efficient off-policy reinforcement learning. In Advances in Neural Information Processing Systems 29, pp. 1046-1054.
  • Reddi, S. J., Kale, S., Kumar, S. (2019). On the convergence of Adam and beyond. arXiv:1904.09237.
  • Schaul, T., Quan, J., Antonoglou, I., Silver, D. (2016). Prioritized experience replay. In International Conference on Learning Representations.
  • Sutton, R. S., Maei, H. R., Precup, D., Bhatnagar, S., Silver, D., Szepesvari, Cs., Wiewiora, E. (2009). Fast gradient-descent methods for temporal-difference learning with linear function approximation. In International Conference on Machine Learning, pp. 993-1000. ACM.
  • Sutton, R. S., Mahmood, A. R., White, M. (2016). An emphatic approach to the problem of off-policy temporal-difference learning. The Journal of Machine Learning Research.
  • Sutton, R. S., Barto, A. G. (2018). Reinforcement Learning: An Introduction, Second Edition. MIT Press.
  • Touati, A., Bacon, P. L., Precup, D., Vincent, P. (2018). Convergent tree-backup and retrace with function approximation. arXiv:1705.09322.
  • van Hasselt, H. P., Guez, A., Hessel, M., Mnih, V., Silver, D. (2016). Learning values across many orders of magnitude. In Advances in Neural Information Processing Systems, pp. 4287-4295.
  • van Hasselt, H., Doron, Y., Strub, F., Hessel, M., Sonnerat, N., Modayil, J. (2018). Deep reinforcement learning and the deadly triad. arXiv:1812.02648.
  • White, A., White, M. (2016). Investigating practical linear temporal difference learning. In International Conference on Autonomous Agents & Multiagent Systems.
  • Young, K., Tian, T. (2019). MinAtar: An Atari-inspired testbed for thorough and reproducible reinforcement learning experiments. arXiv:1903.03176.
  • The proofs of convergence for many of the methods require independent samples for the updates. This condition is not generally met in the fully online learning setting that we consider throughout the rest of the paper. In Figure 7 we show results for all methods in the fully offline batch setting, demonstrating that—on the small problems that we consider—the conclusions do not change when transferring from the batch setting to the online setting. We include two additional methods in the batch setting, the Kernel Residual Gradient methods (Feng et al., 2019), which do not have a clear fully online implementation.
  • The Residual Gradient (RG) family of algorithms provides an alternative gradient-based strategy for performing temporal difference learning. The RG methods minimize the Mean Squared Bellman Error (MSBE), while the gradient TD family of algorithms minimizes a particular form of the MSBE, the Mean Squared Projected Bellman Error (MSPBE); both objectives are written out after this list. The RG family generally suffers from the difficulty of obtaining independent samples from the environment, leading to stochastic optimization algorithms that find a biased solution (Sutton & Barto, 2018). However, very recent work has generalized the MSBE and proposed an algorithmic strategy to perform unbiased stochastic updates (Feng et al., 2019). Because our results suggest that RG methods generally underperform the gradient TD family of methods, we choose to focus our extension on gradient TD methods for this paper.
  • … TD fixed point under very similar conditions as TDC (Maei, 2011). We show the key steps here (for details, see Maei (2011) or Appendix G). The G matrix for TDC++ is G = …
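For reference, the two objectives compared in the note above can be written for a linear value estimate v_w(s) = w^T x(s) under state weighting d as follows. These are the standard textbook definitions, not formulas reproduced from this paper:

    MSBE(w)  = \sum_s d(s) \big( \mathbb{E}[\, \delta_t \mid S_t = s \,] \big)^2
    MSPBE(w) = \mathbb{E}[\delta_t x_t]^\top \, \mathbb{E}[x_t x_t^\top]^{-1} \, \mathbb{E}[\delta_t x_t]

    where \delta_t = R_{t+1} + \gamma\, w^\top x_{t+1} - w^\top x_t is the TD error.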
Authors
Sina Ghiassian
Andrew Patterson
Shivam Garg
Dhawal Gupta