# Momentum-Based Policy Gradient Methods

ICML, pp. 4422-4433, 2020.

Abstract:

In this paper, we propose a class of efficient momentum-based policy gradient methods for model-free reinforcement learning, which use adaptive learning rates and do not require any large batches. Specifically, we propose a fast important-sampling momentum-based policy gradient (IS-MBPG) method based on a new momentum-based variance …

Introduction

- Reinforcement Learning (RL) has achieved great success in solving many sequential decision-making problems such as autonomous driving (Shalev-Shwartz et al., 2016), robot manipulation (Deisenroth et al., 2013), the game of Go (Silver et al., 2017) and natural language processing (Wang et al., 2018).
- RL involves a Markov decision process (MDP), where an agent takes actions dictated by a policy in a stochastic environment over a sequence of time steps, and maximizes the long-term cumulative rewards to obtain an optimal policy.
- To obtain the optimal policy, policy gradient methods directly maximize the expected total reward (called the performance function J(θ)) using a stochastic first-order gradient of the cumulative rewards.
- The policy π(a|s) at state s is represented by a conditional probability distribution πθ(a|s) parameterized by θ ∈ R^d.
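A minimal sketch of such a parameterized policy: a linear-softmax πθ(a|s) over discrete actions, together with its score function ∇θ log πθ(a|s), the quantity that policy gradient estimators such as REINFORCE average over sampled trajectories. The linear parameterization and the function names are illustrative, not from the paper.

```python
import numpy as np

def softmax_policy(theta, state):
    """pi_theta(a|s): categorical distribution from a linear score per action."""
    # theta has shape (n_actions, d); state has shape (d,)
    logits = theta @ state
    z = np.exp(logits - logits.max())   # subtract max for numerical stability
    return z / z.sum()

def grad_log_pi(theta, state, action):
    """Score function grad_theta log pi_theta(a|s) for the softmax policy."""
    probs = softmax_policy(theta, state)
    grad = -np.outer(probs, state)      # -pi(a'|s) * s for every action a'
    grad[action] += state               # +s for the action actually taken
    return grad

# Sanity check: the score function has zero mean under the policy,
# i.e. E_{a ~ pi}[grad_theta log pi_theta(a|s)] = 0.
rng = np.random.default_rng(0)
theta = rng.normal(size=(3, 4))
s = rng.normal(size=4)
probs = softmax_policy(theta, s)
mean_score = sum(p * grad_log_pi(theta, s, a) for a, p in enumerate(probs))
print(np.allclose(mean_score, 0.0))  # True
```

The zero-mean property of the score function is what makes REINFORCE-style estimators unbiased, while their variance is the problem the momentum-based methods in this paper target.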

Highlights

- Since the classic policy gradient methods (e.g., REINFORCE (Williams, 1992), PGT (Sutton et al., 2000), GPOMDP (Baxter & Bartlett, 2001) and TRPO (Schulman et al., 2015a)) approximate the gradient of the expected total reward based on a batch of sampled trajectories, they generally suffer from large variance in the estimated gradients, which results in poor convergence.
- Our main contributions are summarized as follows: 1) We propose a fast important-sampling momentum-based policy gradient (IS-MBPG) method with an adaptive learning rate, which builds on a new momentum-based variance-reduction technique of STORM/Hybrid-SGD (Cutkosky & Orabona, 2019; Tran-Dinh et al., 2019) and the importance sampling technique.
- Hessian-aided momentum-based policy gradient (HA-MBPG) performs similarly to Stochastic Recursive Variance-Reduced Policy Gradient (SRVR-PG) and Hessian-Aided Policy Gradient (HAPG), though it has an advantage at the beginning.
- We proved that the important-sampling momentum-based policy gradient variant IS-MBPG* reaches the best-known sample complexity of O(ε⁻³) while requiring only one trajectory at each iteration.
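The momentum-based variance-reduction recursion behind these methods (STORM; Cutkosky & Orabona, 2019) can be sketched on a toy stochastic problem. Everything below (the quadratic objective, the step size, and the schedule for the momentum weight `alpha`) is illustrative; IS-MBPG additionally applies an importance weight to the previous gradient term to correct for the change in trajectory distribution, which is omitted here.

```python
import numpy as np

rng = np.random.default_rng(1)

def noisy_grad(x, noise):
    # Stochastic gradient of f(x) = 0.5 * ||x||^2 with additive noise.
    return x + noise

x = rng.normal(size=5)
x_prev = x.copy()
d = np.zeros_like(x)
for t in range(1, 2001):
    noise = rng.normal(scale=0.5, size=5)   # one shared sample xi_t per step
    alpha = min(1.0, 2.0 / t)               # momentum weight a_t -> 0
    # STORM update: d_t = g(x_t; xi_t) + (1 - a_t) * (d_{t-1} - g(x_{t-1}; xi_t)),
    # evaluating both gradients on the SAME sample xi_t so their noise cancels.
    d = noisy_grad(x, noise) + (1 - alpha) * (d - noisy_grad(x_prev, noise))
    x_prev = x.copy()
    x = x - 0.05 * d                        # fixed step for simplicity
final_gap = np.linalg.norm(x)               # distance to the minimizer 0
print(final_gap < 0.05)
```

Because both gradient evaluations in the correction term share the same sample, the recursion reduces the variance of the estimator with only one sample per iteration, which is why no large batches are needed.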

Methods

- The authors demonstrate the performance of the algorithms on four standard reinforcement learning tasks: CartPole, Walker, HalfCheetah and Hopper.
- The first is a discrete task from classic control, and the latter three are continuous RL tasks from the popular MuJoCo environments (Todorov et al., 2012).
- Detailed descriptions of these environments are given in Fig. 1.
- The authors' code is publicly available on https://github.com/gaosh/MBPG

Results

- The results of experiments are presented in Fig. 2.
- In the CartPole environment, the IS-MBPG and HA-MBPG algorithms perform better than the other methods.
- The authors' IS-MBPG algorithm achieves the best final performance by a clear margin.
- HA-MBPG performs similarly to SRVR-PG and HAPG, though it has an advantage at the beginning.
- In the Hopper environment, the IS-MBPG and HA-MBPG algorithms are significantly faster than all other methods, while the final average rewards are similar across algorithms.
- In the HalfCheetah environment, IS-MBPG, HA-MBPG and SRVR-PG perform similarly at the beginning.
- One possible reason for this observation is that the authors use an estimated Hessian-vector product instead of the exact Hessian-vector product in the HA-MBPG algorithm, which introduces additional estimation error.
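One standard way to estimate a Hessian-vector product without forming the Hessian is a finite difference of gradients, Hv ≈ (∇f(θ + δv) − ∇f(θ))/δ, which introduces a small bias of the kind the observation above alludes to. A minimal sketch on a quadratic with a known Hessian (the quadratic and the name `hvp_fd` are illustrative; the paper's estimator operates on sampled policy gradients rather than exact ones):

```python
import numpy as np

def hvp_fd(grad, x, v, delta=1e-5):
    # Hv ~ (grad(x + delta*v) - grad(x)) / delta: one extra gradient
    # evaluation instead of an O(d^2) Hessian.
    return (grad(x + delta * v) - grad(x)) / delta

rng = np.random.default_rng(2)
A = rng.normal(size=(4, 4))
A = A + A.T                      # symmetric matrix: the exact Hessian
grad = lambda x: A @ x           # exact gradient of f(x) = 0.5 * x^T A x
x, v = rng.normal(size=4), rng.normal(size=4)

approx = hvp_fd(grad, x, v)
print(np.allclose(approx, A @ v, atol=1e-3))  # True: matches the exact product
```

For this quadratic the finite difference is exact up to floating-point error; with sampled gradients, the estimation noise is amplified by 1/δ, which is one plausible source of the extra error mentioned above.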

Conclusion

- The authors proposed a class of efficient momentum-based policy gradient methods (i.e., IS-MBPG and HA-MBPG), which use adaptive learning rates and do not require any large batches.
- The authors proved that both IS-MBPG and HA-MBPG reach the best-known sample complexity of O(ε⁻³) while requiring only one trajectory at each iteration.
- The authors also proved that IS-MBPG* reaches the best-known sample complexity of O(ε⁻³) with only one trajectory per iteration.


- Table 1: Convergence properties of representative variance-reduced policy gradient algorithms on the non-oblivious model-free RL problem of finding an ε-stationary point of the nonconcave performance function J(θ), i.e., E‖∇J(θ)‖ ≤ ε. Our algorithms (IS-MBPG, IS-MBPG* and HA-MBPG) and REINFORCE are single-loop algorithms, while the other algorithms are double-loop and need both outer-loop and inner-loop mini-batch sizes. Note that Papini et al. (2018) only remarked that applying the ADAM algorithm (Kingma & Ba, 2014) to the SVRPG algorithm yields an adaptive learning rate, but did not provide any theoretical analysis of this learning rate.

Funding

- This work was partially supported by U.S. NSF IIS 1836945, IIS 1836938, IIS 1845666, IIS 1852606, IIS 1838627, and IIS 1837956.

References

- Allen-Zhu, Z. and Hazan, E. Variance reduction for faster non-convex optimization. In ICML, pp. 699–707, 2016.
- Arjevani, Y., Carmon, Y., Duchi, J. C., Foster, D. J., Srebro, N., and Woodworth, B. Lower bounds for non-convex stochastic optimization. arXiv preprint arXiv:1912.02365, 2019.
- Baxter, J. and Bartlett, P. L. Infinite-horizon policy-gradient estimation. Journal of Artificial Intelligence Research, 15:319–350, 2001.
- Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. Openai gym, 2016.
- Cheng, C.-A., Yan, X., and Boots, B. Trajectory-wise control variates for variance reduction in policy gradient methods. arXiv preprint arXiv:1908.03263, 2019a.
- Cheng, C.-A., Yan, X., Ratliff, N., and Boots, B. Predictorcorrector policy optimization. In International Conference on Machine Learning, pp. 1151–1161, 2019b.
- Cutkosky, A. and Orabona, F. Momentum-based variance reduction in non-convex sgd. In Advances in Neural Information Processing Systems, pp. 15210–15219, 2019.
- Defazio, A., Bach, F., and Lacoste-Julien, S. Saga: A fast incremental gradient method with support for nonstrongly convex composite objectives. In Advances in neural information processing systems, pp. 1646–1654, 2014.
- Deisenroth, M. P., Neumann, G., Peters, J., et al. A survey on policy search for robotics. Foundations and Trends in Robotics, 2(1–2):1–142, 2013.
- Du, S. S., Chen, J., Li, L., Xiao, L., and Zhou, D. Stochastic variance reduction methods for policy evaluation. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 1049–1058. JMLR. org, 2017.
- Fang, C., Li, C. J., Lin, Z., and Zhang, T. Spider: Nearoptimal non-convex optimization via stochastic pathintegrated differential estimator. In Advances in Neural Information Processing Systems, pp. 689–699, 2018.
- Fellows, M., Ciosek, K., and Whiteson, S. Fourier policy gradients. In International Conference on Machine Learning, pp. 1486–1495, 2018.
- Fujimoto, S., Hoof, H., and Meger, D. Addressing function approximation error in actor-critic methods. In ICML, pp. 1587–1596, 2018.
- Furmston, T., Lever, G., and Barber, D. Approximate newton methods for policy search in markov decision processes. The Journal of Machine Learning Research, 17 (1):8055–8105, 2016.
- The garage contributors. Garage: A toolkit for reproducible reinforcement learning research. https://github.com/rlworkgroup/garage, 2019.
- Ghadimi, S. and Lan, G. Stochastic first-and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341–2368, 2013.
- Greensmith, E., Bartlett, P. L., and Baxter, J. Variance reduction techniques for gradient estimates in reinforcement learning. Journal of Machine Learning Research, 5(Nov): 1471–1530, 2004.
- Gu, S., Lillicrap, T., Ghahramani, Z., Turner, R. E., and Levine, S. Q-prop: Sample-efficient policy gradient with an off-policy critic. arXiv preprint arXiv:1611.02247, 2016.
- Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International Conference on Machine Learning, pp. 1861–1870, 2018.
- Johnson, R. and Zhang, T. Accelerating stochastic gradient descent using predictive variance reduction. In NIPS, pp. 315–323, 2013.
- Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- Li, Y. Deep reinforcement learning: An overview. arXiv preprint arXiv:1701.07274, 2017.
- Mao, H., Venkatakrishnan, S. B., Schwarzkopf, M., and Alizadeh, M. Variance reduction for reinforcement learning in input-driven environments. arXiv preprint arXiv:1807.02264, 2018.
- Metelli, A. M., Papini, M., Faccio, F., and Restelli, M. Policy optimization via importance sampling. In Advances in Neural Information Processing Systems, pp. 5442–5454, 2018.
- Nguyen, L. M., Liu, J., Scheinberg, K., and Takac, M. Sarah: A novel method for machine learning problems using stochastic recursive gradient. In ICML, pp. 2613–2621, 2017.
- Palaniappan, B. and Bach, F. Stochastic variance reduction methods for saddle-point problems. In Advances in Neural Information Processing Systems, pp. 1416–1424, 2016.
- Papini, M., Binaghi, D., Canonaco, G., Pirotta, M., and Restelli, M. Stochastic variance-reduced policy gradient. In 35th International Conference on Machine Learning, volume 80, pp. 4026–4035, 2018.
- Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al. Pytorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, pp. 8024–8035, 2019.
- Peters, J. and Schaal, S. Reinforcement learning of motor skills with policy gradients. Neural networks, 21(4):682– 697, 2008.
- Pham, N. H., Nguyen, L. M., Phan, D. T., Nguyen, P. H., van Dijk, M., and Tran-Dinh, Q. A hybrid stochastic policy gradient algorithm for reinforcement learning. arXiv preprint arXiv:2003.00430, 2020.
- Pirotta, M., Restelli, M., and Bascetta, L. Adaptive stepsize for policy gradient methods. In Advances in Neural Information Processing Systems, pp. 1394–1402, 2013.
- Reddi, S. J., Hefny, A., Sra, S., Poczos, B., and Smola, A. Stochastic variance reduction for nonconvex optimization. In International conference on machine learning, pp. 314– 323, 2016.
- Robbins, H. and Monro, S. A stochastic approximation method. The annals of mathematical statistics, pp. 400– 407, 1951.
- Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P. Trust region policy optimization. In International conference on machine learning, pp. 1889–1897, 2015a.
- Schulman, J., Moritz, P., Levine, S., Jordan, M., and Abbeel, P. High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438, 2015b.
- Shalev-Shwartz, S., Shammah, S., and Shashua, A. Safe, multi-agent, reinforcement learning for autonomous driving. arXiv preprint arXiv:1610.03295, 2016.
- Shen, Z., Ribeiro, A., Hassani, H., Qian, H., and Mi, C. Hessian aided policy gradient. In International Conference on Machine Learning, pp. 5729–5738, 2019.
- Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A., et al. Mastering the game of go without human knowledge. nature, 550(7676):354–359, 2017.
- Sutton, R. S., McAllester, D. A., Singh, S. P., and Mansour, Y. Policy gradient methods for reinforcement learning with function approximation. In Advances in neural information processing systems, pp. 1057–1063, 2000.
- Todorov, E., Erez, T., and Tassa, Y. Mujoco: A physics engine for model-based control. In IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 5026–5033, 2012.
- Tran-Dinh, Q., Pham, N. H., Phan, D. T., and Nguyen, L. M. A hybrid stochastic optimization framework for stochastic composite nonconvex optimization. arXiv preprint arXiv:1907.03793, 2019.
- Wai, H.-T., Hong, M., Yang, Z., Wang, Z., and Tang, K. Variance reduced policy evaluation with smooth function approximation. In Advances in Neural Information Processing Systems, pp. 5776–5787, 2019.
- Wang, L., Cai, Q., Yang, Z., and Wang, Z. Neural policy gradient methods: Global optimality and rates of convergence. arXiv preprint arXiv:1909.01150, 2019a.
- Wang, W. Y., Li, J., and He, X. Deep reinforcement learning for nlp. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics: Tutorial Abstracts, pp. 19–21, 2018.
- Wang, Z., Ji, K., Zhou, Y., Liang, Y., and Tarokh, V. Spiderboost and momentum: Faster variance reduction algorithms. In Advances in Neural Information Processing Systems, pp. 2403–2413, 2019b.
- Williams, R. J. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8(3-4):229–256, 1992.
- Wu, C., Rajeswaran, A., Duan, Y., Kumar, V., Bayen, A. M., Kakade, S., Mordatch, I., and Abbeel, P. Variance reduction for policy gradient with action-dependent factorized baselines. arXiv preprint arXiv:1803.07246, 2018.
- Xiong, H., Xu, T., Liang, Y., and Zhang, W. Non-asymptotic convergence of adam-type reinforcement learning algorithms under markovian sampling. arXiv preprint arXiv:2002.06286, 2020.
- Xu, P., Gao, F., and Gu, Q. An improved convergence analysis of stochastic variance-reduced policy gradient. In Proceedings of the Thirty-Fifth Conference on Uncertainty in Artificial Intelligence, pp. 191, 2019a.
- Xu, P., Gao, F., and Gu, Q. Sample efficient policy gradient methods with recursive variance reduction. arXiv preprint arXiv:1909.08610, 2019b.
- Xu, T., Liu, Q., and Peng, J. Stochastic variance reduction for policy gradient estimation. arXiv preprint arXiv:1710.06034, 2017.
- Yuan, H., Lian, X., Liu, J., and Zhou, Y. Stochastic recursive momentum for policy gradient methods. arXiv preprint arXiv:2003.04302, 2020.
