Momentum-Based Policy Gradient Methods

ICML, pp. 4422–4433, 2020


Abstract

In this paper, we propose a class of efficient momentum-based policy gradient methods for model-free reinforcement learning, which use adaptive learning rates and do not require any large batches. Specifically, we propose a fast important-sampling momentum-based policy gradient (IS-MBPG) method based on a new momentum-based variance ...

Introduction
  • Reinforcement Learning (RL) has achieved great success in solving many sequential decision-making problems such as autonomous driving (Shalev-Shwartz et al, 2016), robot manipulation (Deisenroth et al, 2013), the game of Go (Silver et al, 2017) and natural language processing (Wang et al, 2018).
  • RL involves a Markov decision process (MDP), where an agent takes actions dictated by a policy in a stochastic environment over a sequence of time steps, and maximizes the long-term cumulative rewards to obtain an optimal policy.
  • To obtain the optimal policy, policy gradient methods directly maximize the expected total reward (called the performance function J(θ)) by using a stochastic first-order gradient of the cumulative rewards (a minimal sketch of such an estimator follows this list).
  • The policy π(a|s) at state s is represented by a conditional probability distribution πθ(a|s) associated with the parameter θ ∈ R^d
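To make these points concrete, here is a minimal NumPy sketch of a single-trajectory REINFORCE/GPOMDP-style estimate of ∇J(θ), assuming the per-step gradients of log πθ(a_t|s_t) and the rewards are given; the function and argument names are illustrative, not the authors' implementation.

```python
import numpy as np

def policy_gradient_estimate(grad_log_probs, rewards, gamma=0.99):
    """Single-trajectory REINFORCE/GPOMDP-style estimate of grad J(theta).

    grad_log_probs: list of arrays, grad_theta log pi_theta(a_t | s_t) for each step t.
    rewards: list of scalar rewards r_t collected along the same trajectory.
    """
    H = len(rewards)
    # Discounted reward-to-go from step t: sum_{h >= t} gamma^(h - t) * r_h
    reward_to_go = np.zeros(H)
    running = 0.0
    for t in reversed(range(H)):
        running = rewards[t] + gamma * running
        reward_to_go[t] = running
    # Each grad-log-prob term is weighted by the discounted return that follows it.
    return sum((gamma ** t) * reward_to_go[t] * grad_log_probs[t] for t in range(H))
```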
Highlights
  • Reinforcement Learning (RL) has achieved great success in solving many sequential decision-making problems such as autonomous driving (Shalev-Shwartz et al, 2016), robot manipulation (Deisenroth et al, 2013), the game of Go (Silver et al, 2017) and natural language processing (Wang et al, 2018)
  • Since the classic policy gradient methods (e.g., REINFORCE (Williams, 1992), PGT (Sutton et al, 2000), GPOMDP (Baxter & Bartlett, 2001) and TRPO (Schulman et al, 2015a)) approximate the gradient of the expected total reward based on a batch of sampled trajectories, they generally suffer from large variance in the estimated gradients, which results in poor convergence
  • Our main contributions are summarized as follows: 1) We propose a fast important-sampling momentum-based policy gradient (IS-MBPG) method with an adaptive learning rate, which builds on a new momentum-based variance reduction technique of STORM/Hybrid-SGD (Cutkosky & Orabona, 2019; Tran-Dinh et al, 2019) and the importance sampling technique (see the sketch after this list)
  • Hessian-aided momentum-based policy gradient (HA-MBPG) performs similarly to Stochastic Recursive Variance Reduced Policy Gradient (SRVR-PG) and Hessian Aided Policy Gradient (HAPG), though it has an advantage at the beginning
  • We proved that the important-sampling momentum-based policy gradient variant IS-MBPG* reaches the best known sample complexity of O(ε^{-3}) while requiring only one trajectory at each iteration
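The sketch below illustrates the combination named in the second bullet: a STORM/Hybrid-SGD-style momentum estimator with an importance-sampling correction so that a single trajectory per iteration suffices. It is a hedged illustration under assumed callables (`sample_traj`, `grad_est`, `importance_weight`) and a fixed step size, not the authors' exact IS-MBPG update or its adaptive learning-rate schedule.

```python
import numpy as np

def is_mbpg_style_sketch(theta0, sample_traj, grad_est, importance_weight,
                         num_iters=1000, beta=0.2, eta=0.01):
    """STORM-style momentum with importance sampling (sketch, not the paper's algorithm).

    sample_traj(theta): samples ONE trajectory under pi_theta.
    grad_est(theta, tau): single-trajectory policy-gradient estimate at theta.
    importance_weight(tau, theta_old, theta_new): pi_{theta_old}(tau) / pi_{theta_new}(tau).
    """
    theta = np.asarray(theta0, dtype=float)
    tau = sample_traj(theta)
    u = grad_est(theta, tau)  # plain estimate at the first iteration
    for _ in range(num_iters):
        theta_prev, theta = theta, theta + eta * u   # gradient ASCENT on J(theta)
        tau = sample_traj(theta)                     # only one trajectory per iteration
        g_new = grad_est(theta, tau)
        # Reuse the same trajectory to estimate the gradient at theta_prev:
        # the importance weight corrects for tau being sampled under pi_theta.
        g_old = importance_weight(tau, theta_prev, theta) * grad_est(theta_prev, tau)
        # Momentum-based variance reduction: unbiased term plus recursive correction.
        u = beta * g_new + (1.0 - beta) * (u + g_new - g_old)
    return theta
```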
Methods
  • The authors demonstrate the performance of the algorithms on four standard reinforcement learning tasks, which are CartPole, Walker, HalfCheetah and Hopper.
  • The first is a discrete task from classic control, and the latter three are continuous RL tasks in popular MuJoCo environments (Todorov et al, 2012) (a setup sketch follows this list)
  • A detailed description of these environments is shown in Fig. 1.
  • The authors' code is publicly available at https://github.com/gaosh/MBPG
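As a usage note, the four tasks can be instantiated through OpenAI Gym (Brockman et al., 2016). The environment IDs and the classic 4-tuple step API below are assumptions about a typical setup; the authors' repository builds on the garage toolkit and may pin different environment versions.

```python
import gym

# Hypothetical IDs for the four benchmark tasks; versions may differ in the authors' setup.
env_ids = ["CartPole-v1", "Walker2d-v2", "HalfCheetah-v2", "Hopper-v2"]

for env_id in env_ids:
    env = gym.make(env_id)
    obs = env.reset()
    # One random step, just to show the interaction loop.
    obs, reward, done, info = env.step(env.action_space.sample())
    print(env_id, env.observation_space.shape, env.action_space)
    env.close()
```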
Results
  • The results of experiments are presented in Fig. 2.
  • In the CartPole environment, the IS-MBPG and HA-MBPG algorithms perform better than the other methods.
  • The authors' IS-MBPG algorithm achieves the best final performance by a clear margin.
  • HA-MBPG performs similarly to SRVR-PG and HAPG, though it has an advantage at the beginning.
  • In the Hopper environment, the IS-MBPG and HA-MBPG algorithms are significantly faster than all other methods, while the final average rewards are similar across algorithms.
  • In the HalfCheetah environment, IS-MBPG, HA-MBPG and SRVR-PG perform similarly at the beginning.
  • One possible reason for this observation is that the authors use an estimated Hessian-vector product instead of the exact Hessian-vector product in the HA-MBPG algorithm, which introduces additional estimation error (see the sketch after this list)
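For context on that last point, a Hessian-vector product of a sampled objective can be obtained by differentiating through the gradient (double backprop); the PyTorch sketch below shows this generic technique. The stochasticity comes from the sampled loss itself, and the authors' HA-MBPG estimator may differ in detail.

```python
import torch

def hessian_vector_product(loss, params, vec):
    """Compute (d^2 loss / d params^2) @ vec without forming the full Hessian.

    loss: scalar tensor built from sampled trajectories.
    params: list of parameter tensors with requires_grad=True.
    vec: flat tensor whose length equals the total number of parameters.
    """
    grads = torch.autograd.grad(loss, params, create_graph=True)
    flat_grad = torch.cat([g.reshape(-1) for g in grads])
    grad_dot_v = torch.dot(flat_grad, vec)
    hvp = torch.autograd.grad(grad_dot_v, params, retain_graph=True)
    return torch.cat([h.reshape(-1) for h in hvp])
```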
Conclusion
  • The authors proposed a class of efficient momentum-based policy gradient methods (i.e., IS-MBPG and HA-MBPG), which use adaptive learning rates and do not require any large batches.
  • The authors proved that both the IS-MBPG and HA-MBPG methods reach the best known sample complexity of O(ε^{-3}) while requiring only one trajectory at each iteration.
  • The authors also proved that IS-MBPG* reaches the best known sample complexity of O(ε^{-3}) with only one trajectory at each iteration
Tables
  • Table 1: Convergence properties of representative variance-reduced policy gradient algorithms on the non-oblivious model-free RL problem of finding an ε-stationary point of the nonconcave performance function J(θ), i.e., E‖∇J(θ)‖ ≤ ε. Our algorithms (IS-MBPG, IS-MBPG* and HA-MBPG) and REINFORCE are single-loop algorithms, while the other algorithms are double-loop and need both outer-loop and inner-loop mini-batch sizes. Note that Papini et al. (2018) only remarked that applying the ADAM algorithm (Kingma & Ba, 2014) to the SVRPG algorithm yields an adaptive learning rate, but did not provide any theoretical analysis of this learning rate.
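For reference, the convergence criterion and the sample complexity stated in the caption can be written compactly as follows (this merely restates the caption in LaTeX; it adds no results beyond it):

```latex
% \epsilon-stationary point of the nonconcave performance function J(\theta):
\mathbb{E}\,\bigl\|\nabla J(\theta)\bigr\| \le \epsilon,
\qquad
% sample complexity of IS-MBPG, IS-MBPG$^{*}$ and HA-MBPG,
% with a single trajectory per iteration:
O\!\bigl(\epsilon^{-3}\bigr).
```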
Funding
  • This work was partially supported by U.S. NSF IIS 1836945, IIS 1836938, IIS 1845666, IIS 1852606, IIS 1838627, and IIS 1837956
References
  • Allen-Zhu, Z. and Hazan, E. Variance reduction for faster non-convex optimization. In ICML, pp. 699–707, 2016.
  • Arjevani, Y., Carmon, Y., Duchi, J. C., Foster, D. J., Srebro, N., and Woodworth, B. Lower bounds for non-convex stochastic optimization. arXiv preprint arXiv:1912.02365, 2019.
  • Baxter, J. and Bartlett, P. L. Infinite-horizon policy-gradient estimation. Journal of Artificial Intelligence Research, 15:319–350, 2001.
  • Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. OpenAI Gym, 2016.
  • Cheng, C.-A., Yan, X., and Boots, B. Trajectory-wise control variates for variance reduction in policy gradient methods. arXiv preprint arXiv:1908.03263, 2019a.
  • Cheng, C.-A., Yan, X., Ratliff, N., and Boots, B. Predictor-corrector policy optimization. In International Conference on Machine Learning, pp. 1151–1161, 2019b.
  • Cutkosky, A. and Orabona, F. Momentum-based variance reduction in non-convex SGD. In Advances in Neural Information Processing Systems, pp. 15210–15219, 2019.
  • Defazio, A., Bach, F., and Lacoste-Julien, S. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems, pp. 1646–1654, 2014.
  • Deisenroth, M. P., Neumann, G., Peters, J., et al. A survey on policy search for robotics. Foundations and Trends in Robotics, 2(1–2):1–142, 2013.
  • Du, S. S., Chen, J., Li, L., Xiao, L., and Zhou, D. Stochastic variance reduction methods for policy evaluation. In Proceedings of the 34th International Conference on Machine Learning, pp. 1049–1058, 2017.
  • Fang, C., Li, C. J., Lin, Z., and Zhang, T. SPIDER: Near-optimal non-convex optimization via stochastic path-integrated differential estimator. In Advances in Neural Information Processing Systems, pp. 689–699, 2018.
  • Fellows, M., Ciosek, K., and Whiteson, S. Fourier policy gradients. In International Conference on Machine Learning, pp. 1486–1495, 2018.
  • Fujimoto, S., Hoof, H., and Meger, D. Addressing function approximation error in actor-critic methods. In ICML, pp. 1587–1596, 2018.
  • Furmston, T., Lever, G., and Barber, D. Approximate Newton methods for policy search in Markov decision processes. The Journal of Machine Learning Research, 17(1):8055–8105, 2016.
  • The garage contributors. Garage: A toolkit for reproducible reinforcement learning research. https://github.com/rlworkgroup/garage, 2019.
  • Ghadimi, S. and Lan, G. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341–2368, 2013.
  • Greensmith, E., Bartlett, P. L., and Baxter, J. Variance reduction techniques for gradient estimates in reinforcement learning. Journal of Machine Learning Research, 5(Nov):1471–1530, 2004.
  • Gu, S., Lillicrap, T., Ghahramani, Z., Turner, R. E., and Levine, S. Q-Prop: Sample-efficient policy gradient with an off-policy critic. arXiv preprint arXiv:1611.02247, 2016.
  • Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International Conference on Machine Learning, pp. 1861–1870, 2018.
  • Johnson, R. and Zhang, T. Accelerating stochastic gradient descent using predictive variance reduction. In NIPS, pp. 315–323, 2013.
  • Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • Li, Y. Deep reinforcement learning: An overview. arXiv preprint arXiv:1701.07274, 2017.
  • Mao, H., Venkatakrishnan, S. B., Schwarzkopf, M., and Alizadeh, M. Variance reduction for reinforcement learning in input-driven environments. arXiv preprint arXiv:1807.02264, 2018.
  • Metelli, A. M., Papini, M., Faccio, F., and Restelli, M. Policy optimization via importance sampling. In Advances in Neural Information Processing Systems, pp. 5442–5454, 2018.
  • Nguyen, L. M., Liu, J., Scheinberg, K., and Takac, M. SARAH: A novel method for machine learning problems using stochastic recursive gradient. In ICML, pp. 2613–2621, 2017.
  • Palaniappan, B. and Bach, F. Stochastic variance reduction methods for saddle-point problems. In Advances in Neural Information Processing Systems, pp. 1416–1424, 2016.
  • Papini, M., Binaghi, D., Canonaco, G., Pirotta, M., and Restelli, M. Stochastic variance-reduced policy gradient. In 35th International Conference on Machine Learning, volume 80, pp. 4026–4035, 2018.
  • Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, pp. 8024–8035, 2019.
  • Peters, J. and Schaal, S. Reinforcement learning of motor skills with policy gradients. Neural Networks, 21(4):682–697, 2008.
  • Pham, N. H., Nguyen, L. M., Phan, D. T., Nguyen, P. H., van Dijk, M., and Tran-Dinh, Q. A hybrid stochastic policy gradient algorithm for reinforcement learning. arXiv preprint arXiv:2003.00430, 2020.
  • Pirotta, M., Restelli, M., and Bascetta, L. Adaptive step-size for policy gradient methods. In Advances in Neural Information Processing Systems, pp. 1394–1402, 2013.
  • Reddi, S. J., Hefny, A., Sra, S., Poczos, B., and Smola, A. Stochastic variance reduction for nonconvex optimization. In International Conference on Machine Learning, pp. 314–323, 2016.
  • Robbins, H. and Monro, S. A stochastic approximation method. The Annals of Mathematical Statistics, pp. 400–407, 1951.
  • Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P. Trust region policy optimization. In International Conference on Machine Learning, pp. 1889–1897, 2015a.
  • Schulman, J., Moritz, P., Levine, S., Jordan, M., and Abbeel, P. High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438, 2015b.
  • Shalev-Shwartz, S., Shammah, S., and Shashua, A. Safe, multi-agent, reinforcement learning for autonomous driving. arXiv preprint arXiv:1610.03295, 2016.
  • Shen, Z., Ribeiro, A., Hassani, H., Qian, H., and Mi, C. Hessian aided policy gradient. In International Conference on Machine Learning, pp. 5729–5738, 2019.
  • Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A., et al. Mastering the game of Go without human knowledge. Nature, 550(7676):354–359, 2017.
  • Sutton, R. S., McAllester, D. A., Singh, S. P., and Mansour, Y. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems, pp. 1057–1063, 2000.
  • Todorov, E., Erez, T., and Tassa, Y. MuJoCo: A physics engine for model-based control. In IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 5026–5033, 2012.
  • Tran-Dinh, Q., Pham, N. H., Phan, D. T., and Nguyen, L. M. A hybrid stochastic optimization framework for stochastic composite nonconvex optimization. arXiv preprint arXiv:1907.03793, 2019.
  • Wai, H.-T., Hong, M., Yang, Z., Wang, Z., and Tang, K. Variance reduced policy evaluation with smooth function approximation. In Advances in Neural Information Processing Systems, pp. 5776–5787, 2019.
  • Wang, L., Cai, Q., Yang, Z., and Wang, Z. Neural policy gradient methods: Global optimality and rates of convergence. arXiv preprint arXiv:1909.01150, 2019a.
  • Wang, W. Y., Li, J., and He, X. Deep reinforcement learning for NLP. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics: Tutorial Abstracts, pp. 19–21, 2018.
  • Wang, Z., Ji, K., Zhou, Y., Liang, Y., and Tarokh, V. SpiderBoost and momentum: Faster variance reduction algorithms. In Advances in Neural Information Processing Systems, pp. 2403–2413, 2019b.
  • Williams, R. J. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229–256, 1992.
  • Wu, C., Rajeswaran, A., Duan, Y., Kumar, V., Bayen, A. M., Kakade, S., Mordatch, I., and Abbeel, P. Variance reduction for policy gradient with action-dependent factorized baselines. arXiv preprint arXiv:1803.07246, 2018.
  • Xiong, H., Xu, T., Liang, Y., and Zhang, W. Non-asymptotic convergence of Adam-type reinforcement learning algorithms under Markovian sampling. arXiv preprint arXiv:2002.06286, 2020.
  • Xu, P., Gao, F., and Gu, Q. An improved convergence analysis of stochastic variance-reduced policy gradient. In Proceedings of the Thirty-Fifth Conference on Uncertainty in Artificial Intelligence, pp. 191, 2019a.
  • Xu, P., Gao, F., and Gu, Q. Sample efficient policy gradient methods with recursive variance reduction. arXiv preprint arXiv:1909.08610, 2019b.
  • Xu, T., Liu, Q., and Peng, J. Stochastic variance reduction for policy gradient estimation. arXiv preprint arXiv:1710.06034, 2017.
  • Yuan, H., Lian, X., Liu, J., and Zhou, Y. Stochastic recursive momentum for policy gradient methods. arXiv preprint arXiv:2003.04302, 2020.