Deterministic Policy Gradient Algorithms

ICML, pp. 387–395 (2014)

Cited by: 2799

Abstract

In this paper we consider deterministic policy gradient algorithms for reinforcement learning with continuous actions. The deterministic policy gradient has a particularly appealing form: it is the expected gradient of the action-value function. This simple form means that the deterministic policy gradient can be estimated much more efficiently than the usual stochastic policy gradient. To ensure adequate exploration, we introduce an off-policy actor-critic algorithm that learns a deterministic target policy from an exploratory behaviour policy. We demonstrate that deterministic policy gradient algorithms can significantly outperform their stochastic counterparts in high-dimensional action spaces.


Introduction
  • Policy gradient algorithms are widely used in reinforcement learning problems with continuous action spaces.
  • Policy gradient algorithms typically proceed by sampling a stochastic policy and adjusting the policy parameters in the direction of greater cumulative reward.
  • It was previously believed that the deterministic policy gradient did not exist, or could only be obtained when using a model (Peters, 2010).
  • The authors show that the deterministic policy gradient does exist, and that it has a simple model-free form that follows the gradient of the action-value function (see the expression below).
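
The central result referred to above can be written compactly. The following is the standard statement of the deterministic policy gradient theorem, restated here for reference (Q^μ denotes the action-value function of the deterministic policy μ_θ and ρ^μ its discounted state distribution; the notation is chosen for readability and may differ slightly from the paper's):

    \nabla_\theta J(\mu_\theta)
      = \mathbb{E}_{s \sim \rho^{\mu}}\!\left[
          \nabla_\theta \mu_\theta(s)\,
          \nabla_a Q^{\mu}(s, a)\big|_{a = \mu_\theta(s)}
        \right]

The expectation is taken over states only, with no integral over the action space, which is the source of the efficiency gain discussed in the Conclusion.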
Highlights
  • Policy gradient algorithms are widely used in reinforcement learning problems with continuous action spaces
  • To ensure that our deterministic policy gradient algorithms continue to explore satisfactorily, we introduce an off-policy learning algorithm
  • We study reinforcement learning and control problems in which an agent acts in a stochastic environment by sequentially choosing actions over a sequence of time steps, in order to maximise a cumulative reward
  • We model the problem as a Markov decision process (MDP) which comprises: a state space S; an action space A; an initial state distribution with density p1(s1); a stationary transition dynamics distribution with conditional density p(s_{t+1} | s_t, a_t) satisfying the Markov property p(s_{t+1} | s_1, a_1, ..., s_t, a_t) = p(s_{t+1} | s_t, a_t) for any trajectory s_1, a_1, s_2, a_2, ..., s_T, a_T in state-action space; and a reward function r : S × A → R
  • We develop an actor-critic algorithm that updates the policy in the direction of the off-policy deterministic policy gradient (sketched in the code after this list)
  • We have presented a framework for deterministic policy gradient algorithms
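
To make the update direction in the highlights concrete, here is a minimal Python sketch of a single off-policy deterministic actor-critic step. The linear critic, the parameter shapes, the stand-in transition, and the learning rates are illustrative assumptions for readability; this is not the paper's exact COPDAC algorithm.

    # Minimal sketch of one off-policy deterministic actor-critic step
    # (illustrative; not the paper's exact COPDAC algorithm).
    import numpy as np

    n, m = 4, 2                              # state and action dimensions (arbitrary)
    rng = np.random.default_rng(0)

    Theta = 0.1 * rng.normal(size=(m, n))    # actor parameters: mu(s) = Theta @ s
    w = np.zeros(n + m)                      # critic weights: Q(s, a) = w @ [s, a]
    alpha_theta, alpha_w, gamma = 1e-3, 1e-2, 0.99

    def mu(s):                               # deterministic target policy
        return Theta @ s

    def q(s, a):                             # linear critic (a simplifying assumption)
        return w @ np.concatenate([s, a])

    # One transition gathered by an exploratory behaviour policy (noisy actions).
    s = rng.normal(size=n)
    a = mu(s) + 0.3 * rng.normal(size=m)     # behaviour action = target action + noise
    r = -np.sum(a ** 2)                      # stand-in reward for illustration
    s_next = rng.normal(size=n)              # stand-in next state

    # Critic: Q-learning-style temporal-difference update.
    delta = r + gamma * q(s_next, mu(s_next)) - q(s, a)
    w += alpha_w * delta * np.concatenate([s, a])

    # Actor: move parameters along grad_theta mu(s) * grad_a Q(s, a)|_{a = mu(s)}.
    grad_a_q = w[n:]                         # for the linear critic, dQ/da is the action-weight block
    Theta += alpha_theta * np.outer(grad_a_q, s)

Because the policy is deterministic, the actor update needs no integral over actions; exploration comes entirely from the behaviour policy's noise, matching the off-policy scheme described above.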
Methods
  • The authors' first experiment focuses on a direct comparison between the stochastic policy gradient and the deterministic policy gradient.
  • The problem is a continuous bandit problem with a high-dimensional quadratic cost function, −r(a) = (a − a∗)ᵀC(a − a∗).
  • The authors consider action dimensions of m = 10, 25, 50.
  • Although this problem could be solved analytically given full knowledge of the quadratic, the authors are interested here in the relative performance of model-free stochastic and deterministic policy gradient algorithms (a sketch of the comparison follows this list)
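
The comparison referenced above can be sketched in a few lines of Python. The specific cost matrix, optimum, policy width, and sample count below are illustrative assumptions, not the paper's experimental settings.

    # Stochastic vs. deterministic policy gradient on a quadratic bandit
    # (illustrative sketch; sizes and constants are arbitrary choices).
    import numpy as np

    m = 10                                    # action dimension
    rng = np.random.default_rng(1)
    C = np.eye(m)                             # cost matrix (assumed identity here)
    a_star = np.ones(m)                       # unknown optimum of the cost

    def reward(a):                            # r(a) = -(a - a*)^T C (a - a*)
        d = a - a_star
        return -d @ C @ d

    theta = np.zeros(m)                       # mean action / deterministic action
    sigma = 0.1                               # width of the Gaussian exploration policy

    # Stochastic (REINFORCE-style) estimate: score function times reward,
    # averaged over sampled actions a ~ N(theta, sigma^2 I).
    samples = theta + sigma * rng.normal(size=(100, m))
    scores = (samples - theta) / sigma**2     # grad_theta log N(a; theta, sigma^2 I)
    rewards = np.array([reward(a) for a in samples])
    stochastic_grad = np.mean(scores * rewards[:, None], axis=0)

    # Deterministic estimate: gradient of the reward at the deterministic action,
    # grad_a r(a)|_{a = theta} = -2 C (theta - a*); no sampling over actions needed.
    deterministic_grad = -2.0 * C @ (theta - a_star)

    print(np.linalg.norm(stochastic_grad - deterministic_grad))

With a modest sample budget the stochastic estimate is typically far noisier than the closed-form deterministic gradient, and the gap widens as the action dimension grows, which is the effect the experiment measures.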
Conclusion
  • Discussion and Related Work: under a stochastic policy gradient algorithm, the policy becomes more deterministic as the algorithm homes in on a good strategy.
  • The variance of the stochastic policy gradient for a Gaussian policy N(μ, σ²) is proportional to 1/σ² (Zhao et al., 2012), which grows to infinity as the policy becomes deterministic (illustrated numerically after this list)
  • This problem is compounded in high dimensions, as illustrated by the continuous bandit task.
  • The deterministic policy gradient can be computed immediately in closed form. The authors have presented a framework for deterministic policy gradient algorithms
  • These gradients can be estimated more efficiently than their stochastic counterparts, avoiding a problematic integral over the action space.
  • The deterministic actor-critic significantly outperformed its stochastic counterpart by several orders of magnitude in a bandit with 50 continuous action dimensions, and solved a challenging reinforcement learning problem with 20 continuous action dimensions and 50 state dimensions
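
The 1/σ² variance behaviour cited from Zhao et al. (2012) above can be checked numerically. The one-dimensional bandit and the constants in this sketch are illustrative assumptions only.

    # Empirical variance of the REINFORCE gradient estimate as the Gaussian
    # policy narrows (illustrative one-dimensional check of the 1/sigma^2 trend).
    import numpy as np

    rng = np.random.default_rng(2)
    theta, a_star = 0.0, 1.0
    reward = lambda a: -(a - a_star) ** 2             # simple quadratic cost

    for sigma in (1.0, 0.3, 0.1, 0.03):
        a = theta + sigma * rng.normal(size=100_000)  # actions from N(theta, sigma^2)
        g = reward(a) * (a - theta) / sigma**2        # per-sample REINFORCE estimate
        print(f"sigma={sigma:5.2f}  gradient variance ~ {g.var():.1f}")

The estimated variance grows roughly as 1/σ² as the policy narrows, consistent with the behaviour noted in the conclusion, whereas the deterministic gradient is unaffected by σ.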
Funding
  • This work was supported by the European Community Seventh Framework Programme (FP7/2007-2013) under grant agreement 270327 (CompLACS), the Gatsby Charitable Foundation, the Royal Society, the ANR MACSi project, INRIA Bordeaux SudOuest, Mesocentre de Calcul Intensif Aquitain, and the French National Grid Infrastructure via France Grille
References
  • Bagnell, J. A. D. and Schneider, J. (2003). Covariant policy search. In Proceedings of the International Joint Conference on Artificial Intelligence.
  • Bhatnagar, S., Sutton, R. S., Ghavamzadeh, M., and Lee, M. (2007). Incremental natural actor-critic algorithms. In Neural Information Processing Systems 21.
  • Degris, T., Pilarski, P. M., and Sutton, R. S. (2012a). Model-free reinforcement learning with continuous action in practice. In American Control Conference.
  • Degris, T., White, M., and Sutton, R. S. (2012b). Linear off-policy actor-critic. In 29th International Conference on Machine Learning.
  • Engel, Y., Szabo, P., and Volkinshtein, D. (2005). Learning to control an octopus arm with Gaussian process temporal difference methods. In Neural Information Processing Systems 18.
  • Hafner, R. and Riedmiller, M. (2011). Reinforcement learning in feedback control. Machine Learning, 84(1–2):137–169.
  • Heess, N., Silver, D., and Teh, Y. (2012). Actor-critic reinforcement learning with energy-based policies. JMLR Workshop and Conference Proceedings: EWRL 2012, 24:43–58.
  • Sutton, R. S., McAllester, D. A., Singh, S. P., and Mansour, Y. (1999). Policy gradient methods for reinforcement learning with function approximation. In Neural Information Processing Systems 12, pages 1057–1063.
  • (2000). Comparing policy-gradient algorithms. http://webdocs.cs.ualberta.ca/
  • Toussaint, M. (2012). Some notes on gradient descent. http://ipvs.informatik.uni-stuttgart.de/mlr/marc/notes/gradientDescent.pdf.
  • Watkins, C. and Dayan, P. (1992). Q-learning. Machine Learning, 8(3):279–292.
  • Werbos, P. J. (1990). A menu of designs for reinforcement learning over time. In Neural Networks for Control, pages 67–95. Bradford.
  • Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8:229–256.
  • Zhao, T., Hachiya, H., Niu, G., and Sugiyama, M. (2012). Analysis and improvement of policy gradient estimation. Neural Networks, 26:118–129.
  • Kakade, S. (2001). A natural policy gradient. In Neural Information Processing Systems 14, pages 1531–1538.
  • Lagoudakis, M. G. and Parr, R. (2003). Least-squares policy iteration. Journal of Machine Learning Research, 4:1107–1149.
  • Maei, H. R., Szepesvari, C., Bhatnagar, S., and Sutton, R. S. (2010). Toward off-policy learning control with function approximation. In 27th International Conference on Machine Learning, pages 719–726.
  • Peters, J. (2010). Policy gradient methods. Scholarpedia, 5(11):3698.
  • Peters, J., Vijayakumar, S., and Schaal, S. (2005). Natural actor-critic. In 16th European Conference on Machine Learning, pages 280–291.
  • Sutton, R. and Barto, A. (1998). Reinforcement Learning: An Introduction. MIT Press.
  • Sutton, R. S., Maei, H. R., Precup, D., Bhatnagar, S., Silver, D., Szepesvari, C., and Wiewiora, E. (2009). Fast gradient-descent methods for temporal-difference learning with linear function approximation. In 26th International Conference on Machine Learning, page 125.