Deterministic Policy Gradient Algorithms.
ICML, pp.387-395, (2014)
In this paper we consider deterministic policy gradient algorithms for reinforcement learning with continuous actions. The deterministic policy gradient has a particularly appealing form: it is the expected gradient of the action-value function. This simple form means that the deterministic policy gradient can be estimated much more efficiently than the usual stochastic policy gradient.
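The central result summarised in the abstract, the deterministic policy gradient theorem, can be written out explicitly (following the paper's notation, with μ_θ the deterministic policy, Q^μ its action-value function, and ρ^μ the discounted state distribution under μ_θ):

```latex
\nabla_\theta J(\mu_\theta)
  = \mathbb{E}_{s \sim \rho^{\mu}}\!\left[
      \nabla_\theta \mu_\theta(s)\,
      \nabla_a Q^{\mu}(s,a)\big|_{a=\mu_\theta(s)}
    \right]
```

Note that the expectation is over states only: unlike the stochastic policy gradient, there is no integral over the action space.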
- Policy gradient algorithms are widely used in reinforcement learning problems with continuous action spaces.
- Policy gradient algorithms typically proceed by sampling a stochastic policy and adjusting the policy parameters in the direction of greater cumulative reward.
- It was previously believed that the deterministic policy gradient did not exist, or could only be obtained when using a model (Peters, 2010).
- The authors show that the deterministic policy gradient does exist, and it has a simple model-free form that follows the gradient of the action-value function.
- To ensure that our deterministic policy gradient algorithms continue to explore satisfactorily, we introduce an off-policy learning algorithm
- We study reinforcement learning and control problems in which an agent acts in a stochastic environment by sequentially choosing actions over a sequence of time steps, in order to maximise a cumulative reward
- We model the problem as a Markov decision process (MDP) which comprises: a state space S, an action space A, an initial state distribution with density p1(s1), a stationary transition dynamics distribution with conditional density p(s_{t+1} | s_t, a_t) satisfying the Markov property p(s_{t+1} | s_1, a_1, ..., s_t, a_t) = p(s_{t+1} | s_t, a_t) for any trajectory s_1, a_1, s_2, a_2, ..., s_T, a_T in state-action space, and a reward function r : S × A → R
- We develop an actor-critic algorithm that updates the policy in the direction of the off-policy deterministic policy gradient
- We have presented a framework for deterministic policy gradient algorithms
- The authors' first experiment focuses on a direct comparison between the stochastic policy gradient and the deterministic policy gradient.
- The problem is a continuous bandit problem with a high-dimensional quadratic cost function, −r(a) = (a − a∗)ᵀC(a − a∗).
- The authors consider action dimensions of m = 10, 25, 50.
- Although this problem could be solved analytically given full knowledge of the quadratic, the authors are interested here in the relative performance of model-free stochastic and deterministic policy gradient algorithms
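The bandit comparison can be sketched in a few lines. This is an illustrative reconstruction, not the paper's actual setup: the cost matrix C is taken to be the identity, the optimum a∗ is drawn at random, and the deterministic update uses the true action gradient in place of the learned compatible critic the authors use.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 10                              # action dimension (the paper uses m = 10, 25, 50)
a_star = rng.normal(size=m)         # unknown optimum (illustrative choice)
C = np.eye(m)                       # cost matrix, assumed identity here

def reward(a):
    d = a - a_star
    return -(d @ C @ d)             # -r(a) = (a - a*)^T C (a - a*)

def spg_step(theta, sigma=0.1, lr=1e-3, n=10):
    """Stochastic policy gradient: sample a ~ N(theta, sigma^2 I) and use the
    score-function (REINFORCE) estimator, with r(theta) as a variance-reducing baseline."""
    g = np.zeros(m)
    for _ in range(n):
        a = theta + sigma * rng.normal(size=m)
        g += (reward(a) - reward(theta)) * (a - theta) / sigma**2
    return theta + lr * g / n

def dpg_step(theta, lr=1e-2):
    """Deterministic policy gradient: move the action along grad_a Q directly.
    Here grad_a Q = -2 C (a - a*) is known exactly; the paper learns it with a critic."""
    return theta + lr * (-2.0 * C @ (theta - a_star))

th_s = np.zeros(m)
th_d = np.zeros(m)
for _ in range(2000):
    th_s = spg_step(th_s)
    th_d = dpg_step(th_d)
print(reward(th_s), reward(th_d))   # both improve; the deterministic update gets closer to 0 here
```

The key structural difference is visible in the two step functions: the stochastic estimator must average noisy samples whose variance depends on sigma, while the deterministic update follows a single gradient evaluation per step.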
- Discussion and Related Work
- Using a stochastic policy gradient algorithm, the policy becomes more deterministic as the algorithm homes in on a good strategy.
- The variance of the stochastic policy gradient for a Gaussian policy N(μ, σ²) is proportional to 1/σ² (Zhao et al., 2012), which grows to infinity as the policy becomes deterministic
- This problem is compounded in high dimensions, as illustrated by the continuous bandit task.
- The deterministic policy gradient can be computed immediately in closed form.
- These gradients can be estimated more efficiently than their stochastic counterparts, avoiding a problematic integral over the action space.
- The deterministic actor-critic significantly outperformed its stochastic counterpart by several orders of magnitude in a bandit with 50 continuous action dimensions, and solved a challenging reinforcement learning problem with 20 continuous action dimensions and 50 state dimensions
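The 1/σ² variance blow-up cited above is easy to verify numerically. A minimal sketch, using a one-dimensional bandit with r(a) = −a² and policy mean μ = 1 (all specific choices here are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)

def score_gradient_samples(sigma, mu=1.0, n=100_000):
    """REINFORCE estimator of d/dmu E[r(a)] for r(a) = -a^2, a ~ N(mu, sigma^2):
    g = r(a) * (a - mu) / sigma^2   (score-function form)."""
    a = mu + sigma * rng.normal(size=n)
    return (-a**2) * (a - mu) / sigma**2

for sigma in (1.0, 0.1, 0.01):
    g = score_gradient_samples(sigma)
    print(f"sigma={sigma}: mean={g.mean():+.2f}  var={g.var():.1f}")
# The sample mean stays near the true gradient -2*mu, while the variance
# scales roughly as 1/sigma^2, exploding as the policy becomes deterministic.
```

This is exactly the failure mode the deterministic gradient sidesteps: its update needs no sampling in action space, so there is no σ-dependent estimator variance to control.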
- This work was supported by the European Community Seventh Framework Programme (FP7/2007-2013) under grant agreement 270327 (CompLACS), the Gatsby Charitable Foundation, the Royal Society, the ANR MACSi project, INRIA Bordeaux Sud-Ouest, Mésocentre de Calcul Intensif Aquitain, and the French National Grid Infrastructure via France Grille
- Bagnell, J. A. D. and Schneider, J. (2003). Covariant policy search. In Proceedings of the International Joint Conference on Artificial Intelligence.
- Bhatnagar, S., Sutton, R. S., Ghavamzadeh, M., and Lee, M. (2007). Incremental natural actor-critic algorithms. In Neural Information Processing Systems 21.
- Degris, T., Pilarski, P. M., and Sutton, R. S. (2012a). Model-free reinforcement learning with continuous action in practice. In American Control Conference.
- Degris, T., White, M., and Sutton, R. S. (2012b). Linear off-policy actor-critic. In 29th International Conference on Machine Learning.
- Engel, Y., Szabo, P., and Volkinshtein, D. (2005). Learning to control an octopus arm with Gaussian process temporal difference methods. In Neural Information Processing Systems 18.
- Hafner, R. and Riedmiller, M. (2011). Reinforcement learning in feedback control. Machine Learning, 84(1-2):137–169.
- Heess, N., Silver, D., and Teh, Y. (2012). Actor-critic reinforcement learning with energy-based policies. JMLR Workshop and Conference Proceedings: EWRL 2012, 24:43–58.
- Kakade, S. (2001). A natural policy gradient. In Neural Information Processing Systems 14, pages 1531–1538.
- Lagoudakis, M. G. and Parr, R. (2003). Least-squares policy iteration. Journal of Machine Learning Research, 4:1107–1149.
- Maei, H. R., Szepesvari, C., Bhatnagar, S., and Sutton, R. S. (2010). Toward off-policy learning control with function approximation. In 27th International Conference on Machine Learning, pages 719–726.
- Peters, J. (2010). Policy gradient methods. Scholarpedia, 5(11):3698.
- Peters, J., Vijayakumar, S., and Schaal, S. (2005). Natural actor-critic. In 16th European Conference on Machine Learning, pages 280–291.
- Sutton, R. and Barto, A. (1998). Reinforcement Learning: An Introduction. MIT Press.
- Sutton, R. S., McAllester, D. A., Singh, S. P., and Mansour, Y. (1999). Policy gradient methods for reinforcement learning with function approximation. In Neural Information Processing Systems 12, pages 1057–1063.
- (2000). Comparing policy-gradient algorithms.
- Sutton, R. S., Maei, H. R., Precup, D., Bhatnagar, S., Silver, D., Szepesvari, C., and Wiewiora, E. (2009). Fast gradient-descent methods for temporal-difference learning with linear function approximation. In 26th International Conference on Machine Learning, page 125.
- Toussaint, M. (2012). Some notes on gradient descent. http://ipvs.informatik.uni-stuttgart.de/mlr/marc/notes/gradientDescent.pdf.
- Watkins, C. and Dayan, P. (1992). Q-learning. Machine Learning, 8(3):279–292.
- Werbos, P. J. (1990). A menu of designs for reinforcement learning over time. In Neural networks for control, pages 67–95. Bradford.
- Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8:229–256.
- Zhao, T., Hachiya, H., Niu, G., and Sugiyama, M. (2012). Analysis and improvement of policy gradient estimation. Neural Networks, 26:118–129.