# Deterministic Policy Gradient Algorithms

ICML 2014, pp. 387–395.

Abstract

In this paper we consider deterministic policy gradient algorithms for reinforcement learning with continuous actions. The deterministic policy gradient has a particularly appealing form: it is the expected gradient of the action-value function. This simple form means that the deterministic policy gradient can be estimated much more effic...

Introduction

- Policy gradient algorithms are widely used in reinforcement learning problems with continuous action spaces.
- Policy gradient algorithms typically represent the policy as a parameterised probability distribution, and proceed by sampling this stochastic policy and adjusting the policy parameters in the direction of greater cumulative reward.
- It was previously believed that the deterministic policy gradient did not exist, or could only be obtained when using a model (Peters, 2010).
- The authors show that the deterministic policy gradient does exist, and it has a simple model-free form that follows the gradient of the action-value function.
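As a sanity check of this simple form, the identity grad_θ J = E_s[grad_θ μ_θ(s) · grad_a Q(s, a)|a=μ_θ(s)] can be verified numerically on a toy problem. The scalar linear policy and toy action-value function below are illustrative assumptions, not anything from the paper:

```python
import numpy as np

# Toy numerical check of the deterministic policy gradient identity
#   grad_theta J = E_s[ grad_theta mu_theta(s) * grad_a Q(s, a) |_{a = mu_theta(s)} ]
# using an assumed scalar linear policy mu_theta(s) = theta * s and a known
# action-value Q(s, a) = -(a - s)^2 (per-state optimal action a* = s).

rng = np.random.default_rng(0)
states = rng.normal(size=100_000)      # samples from the state distribution
theta = 0.3                            # current policy parameter

def mu(theta, s):                      # deterministic policy
    return theta * s

def q(s, a):                           # known action-value function
    return -(a - s) ** 2

# Analytic form: grad_theta mu = s, and grad_a Q = -2 * (a - s)
a = mu(theta, states)
dpg = np.mean(states * (-2.0 * (a - states)))

# Finite-difference check of grad_theta E_s[Q(s, mu_theta(s))]
eps = 1e-5
fd = (np.mean(q(states, mu(theta + eps, states)))
      - np.mean(q(states, mu(theta - eps, states)))) / (2 * eps)

assert abs(dpg - fd) < 1e-4            # the two gradient estimates agree
```

Note that the gradient here needs only an expectation over states, not over actions, which is the source of the efficiency gain the abstract alludes to.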

Highlights

- Policy gradient algorithms are widely used in reinforcement learning problems with continuous action spaces
- To ensure that our deterministic policy gradient algorithms continue to explore satisfactorily, we introduce an off-policy learning algorithm
- We study reinforcement learning and control problems in which an agent acts in a stochastic environment by sequentially choosing actions over a sequence of time steps, in order to maximise a cumulative reward
- We model the problem as a Markov decision process (MDP) which comprises: a state space S, an action space A, an initial state distribution with density p1(s1), a stationary transition dynamics distribution with conditional density p(s_{t+1} | s_t, a_t) satisfying the Markov property p(s_{t+1} | s_1, a_1, ..., s_t, a_t) = p(s_{t+1} | s_t, a_t) for any trajectory s_1, a_1, s_2, a_2, ..., s_T, a_T in state-action space, and a reward function r : S × A → R
- We develop an actor-critic algorithm that updates the policy in the direction of the off-policy deterministic policy gradient
- We have presented a framework for deterministic policy gradient algorithms
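The off-policy deterministic actor-critic updates highlighted above can be sketched as follows. The toy environment, feature choice, and step sizes are illustrative assumptions, not the paper's experimental setup:

```python
import numpy as np

# Sketch of one-step off-policy deterministic actor-critic updates:
#   delta  = r + gamma * Q_w(s', mu_theta(s')) - Q_w(s, a)   # critic TD error
#   w     += alpha_w * delta * grad_w Q_w(s, a)              # critic step
#   theta += alpha_t * grad_theta mu_theta(s)                # actor step along the
#                    * grad_a Q_w(s, a)|_{a = mu_theta(s)}   # critic's action gradient

rng = np.random.default_rng(1)

def step(s, a):                         # toy environment: quadratic reward,
    return -(a - s) ** 2, rng.normal()  # next state independent of the action

def phi(s, a):                          # linear critic Q_w(s, a) = phi(s, a) @ w
    return np.array([a * a, a * s, s * s, 1.0])

theta, w = 0.0, np.zeros(4)             # actor mu_theta(s) = theta * s
gamma, alpha_w, alpha_t = 0.9, 0.01, 0.001

s = rng.normal()
for _ in range(20_000):
    a = theta * s + 0.5 * rng.normal()  # exploratory behaviour policy
    r, s_next = step(s, a)
    # Critic: TD error bootstraps with the *target* policy's action mu(s').
    delta = r + gamma * (phi(s_next, theta * s_next) @ w) - phi(s, a) @ w
    w += alpha_w * delta * phi(s, a)
    # Actor: grad_a Q_w(s, a) = 2*w[0]*a + w[1]*s, evaluated at a = theta*s.
    theta += alpha_t * s * (2 * w[0] * theta * s + w[1] * s)
    s = s_next
```

Because this critic can represent the true action-value exactly, theta settles near the per-state optimum theta ≈ 1; with cruder approximators the paper's compatible function approximation condition becomes relevant.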

Methods

- The authors' first experiment focuses on a direct comparison between the stochastic policy gradient and the deterministic policy gradient.
- The problem is a continuous bandit problem with a high-dimensional quadratic cost function, −r(a) = (a − a∗)ᵀ C (a − a∗).
- The authors consider action dimensions of m = 10, 25, 50.
- Although this problem could be solved analytically given full knowledge of the quadratic, the authors are interested here in the relative performance of model-free stochastic and deterministic policy gradient algorithms.
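An illustrative reconstruction of this comparison (the cost matrix, policy width σ, and sample count are assumptions; the paper's actual experimental settings are not reproduced here):

```python
import numpy as np

# Continuous bandit with quadratic cost  -r(a) = (a - a_star)^T C (a - a_star):
# compare the exact deterministic gradient of the mean action with a noisy
# score-function (REINFORCE-style) estimate from a Gaussian policy N(mu, sigma^2 I).

rng = np.random.default_rng(2)
m = 10                                   # action dimension (paper: 10, 25, 50)
a_star = np.ones(m)
C = np.diag(rng.uniform(0.5, 1.5, m))    # assumed positive-definite cost matrix

def r(a):                                # reward = negative quadratic cost
    d = a - a_star
    return -d @ C @ d

mu, sigma, n = np.zeros(m), 0.1, 5_000

# Stochastic (score-function) estimate of grad_mu E[r(a)], a ~ N(mu, sigma^2 I)
samples = mu + sigma * rng.normal(size=(n, m))
score_grads = (np.array([r(a) for a in samples])[:, None]
               * (samples - mu) / sigma**2)
sg = score_grads.mean(axis=0)

# Deterministic gradient at the mean action, available in closed form:
dg = -2.0 * C @ (mu - a_star)

noise = np.linalg.norm(sg - dg)          # gap between the two estimates
```

The deterministic gradient is exact here, while the score-function estimate fluctuates around it with per-sample variance that grows as σ shrinks.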

Conclusion

**Discussion and Related Work**

- Using a stochastic policy gradient algorithm, the policy becomes more deterministic as the algorithm homes in on a good strategy.
- The variance of the stochastic policy gradient for a Gaussian policy N(μ, σ²) is proportional to 1/σ² (Zhao et al., 2012), which grows to infinity as the policy becomes deterministic.
- This problem is compounded in high dimensions, as illustrated by the continuous bandit task.
- The deterministic policy gradient can be computed immediately in closed form.
- The authors have presented a framework for deterministic policy gradient algorithms.
- These gradients can be estimated more efficiently than their stochastic counterparts, avoiding a problematic integral over the action space.
- The deterministic actor-critic significantly outperformed its stochastic counterpart by several orders of magnitude in a bandit with 50 continuous action dimensions, and solved a challenging reinforcement learning problem with 20 continuous action dimensions and 50 state dimensions
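The 1/σ² variance growth cited above can be seen numerically. This is an illustrative sketch (a simple 1-D quadratic reward of my choosing, not the paper's task): halving σ should roughly quadruple the per-sample variance of the score-function gradient estimator.

```python
import numpy as np

# Empirical check that the score-function gradient estimator for a Gaussian
# policy N(mu, sigma^2) has per-sample variance scaling like 1/sigma^2.

rng = np.random.default_rng(3)
mu, a_star = 0.0, 1.0

def estimator_variance(sigma, n=200_000):
    a = mu + sigma * rng.normal(size=n)
    reward = -(a - a_star) ** 2              # simple quadratic reward
    g = reward * (a - mu) / sigma**2         # score-function gradient samples
    return g.var()

ratio = estimator_variance(0.025) / estimator_variance(0.05)
assert 3.0 < ratio < 5.0                     # consistent with var ∝ 1/sigma^2
```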

Funding

- This work was supported by the European Community Seventh Framework Programme (FP7/2007-2013) under grant agreement 270327 (CompLACS), the Gatsby Charitable Foundation, the Royal Society, the ANR MACSi project, INRIA Bordeaux SudOuest, Mesocentre de Calcul Intensif Aquitain, and the French National Grid Infrastructure via France Grille

References

- Bagnell, J. A. D. and Schneider, J. (2003). Covariant policy search. In Proceedings of the International Joint Conference on Artificial Intelligence.
- Bhatnagar, S., Sutton, R. S., Ghavamzadeh, M., and Lee, M. (2007). Incremental natural actor-critic algorithms. In Neural Information Processing Systems 21.
- Degris, T., Pilarski, P. M., and Sutton, R. S. (2012a). Model-free reinforcement learning with continuous action in practice. In American Control Conference.
- Degris, T., White, M., and Sutton, R. S. (2012b). Linear off-policy actor-critic. In 29th International Conference on Machine Learning.
- Engel, Y., Szabo, P., and Volkinshtein, D. (2005). Learning to control an octopus arm with Gaussian process temporal difference methods. In Neural Information Processing Systems 18.
- Hafner, R. and Riedmiller, M. (2011). Reinforcement learning in feedback control. Machine Learning, 84(1–2):137–169.
- Heess, N., Silver, D., and Teh, Y. (2012). Actor-critic reinforcement learning with energy-based policies. JMLR Workshop and Conference Proceedings: EWRL 2012, 24:43–58.
- Sutton, R. S., McAllester, D. A., Singh, S. P., and Mansour, Y. (1999). Policy gradient methods for reinforcement learning with function approximation. In Neural Information Processing Systems 12, pages 1057–1063.
- (2000). Comparing policy-gradient algorithms. http://webdocs.cs.ualberta.ca/
- Toussaint, M. (2012). Some notes on gradient descent. http://ipvs.informatik.uni-stuttgart.de/mlr/marc/notes/gradientDescent.pdf.
- Watkins, C. and Dayan, P. (1992). Q-learning. Machine Learning, 8(3):279–292.
- Werbos, P. J. (1990). A menu of designs for reinforcement learning over time. In Neural networks for control, pages 67–95. Bradford.
- Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8:229–256.
- Zhao, T., Hachiya, H., Niu, G., and Sugiyama, M. (2012). Analysis and improvement of policy gradient estimation. Neural Networks, 26:118–129.
- Kakade, S. (2001). A natural policy gradient. In Neural Information Processing Systems 14, pages 1531–1538.
- Lagoudakis, M. G. and Parr, R. (2003). Least-squares policy iteration. Journal of Machine Learning Research, 4:1107–1149.
- Maei, H. R., Szepesvari, C., Bhatnagar, S., and Sutton, R. S. (2010). Toward off-policy learning control with function approximation. In 27th International Conference on Machine Learning, pages 719–726.
- Peters, J. (2010). Policy gradient methods. Scholarpedia, 5(11):3698.
- Peters, J., Vijayakumar, S., and Schaal, S. (2005). Natural actor-critic. In 16th European Conference on Machine Learning, pages 280–291.
- Sutton, R. and Barto, A. (1998). Reinforcement Learning: an Introduction. MIT Press.
- Sutton, R. S., Maei, H. R., Precup, D., Bhatnagar, S., Silver, D., Szepesvari, C., and Wiewiora, E. (2009). Fast gradient-descent methods for temporal-difference learning with linear function approximation. In 26th International Conference on Machine Learning, page 125.
