Reward-Free Exploration for Reinforcement Learning

ICML 2020, pp. 4870–4879


Abstract

Exploration is widely regarded as one of the most challenging aspects of reinforcement learning (RL), with many naive approaches succumbing to exponential sample complexity. To isolate the challenges of exploration, we propose a new "reward-free RL" framework. In the exploration phase, the agent first collects trajectories from an MDP $\mathcal{M}$ …

Introduction
  • In reinforcement learning (RL), an agent repeatedly interacts with an unknown environment with the goal of maximizing its cumulative reward.
  • Reward functions are often engineered iteratively, via trial and error, to encourage desired behavior
  • In such cases, repeatedly invoking the same reinforcement learning algorithm with each new reward function can be quite sample-inefficient
Highlights
  • In reinforcement learning (RL), an agent repeatedly interacts with an unknown environment with the goal of maximizing its cumulative reward
  • Our planning procedure can be instantiated by any black-box approximate planner, such as value iteration or natural policy gradient
  • While reinforcement learning has seen a tremendous surge of recent research activity, essentially all of the standard algorithms deployed in practice employ simple randomization or its variants for exploration, and can incur extremely high sample complexity
  • We propose a new “reward-free reinforcement learning” framework, comprising two phases
  • In the second phase, the learner is no longer allowed to interact with the MDP; instead, she is tasked with computing near-optimal policies under M for a collection of given reward functions
  • This paper provides an efficient algorithm that conducts $O(S^2 A\,\mathrm{poly}(H)/\epsilon^2)$ episodes of exploration and returns $\epsilon$-suboptimal policies for an arbitrary number of adaptively chosen reward functions (a minimal sketch of the two-phase protocol follows this list)
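The two-phase protocol described in these highlights can be sketched in a few lines of code. The sketch below is an illustration under stated assumptions, not the authors' algorithm: it assumes a tabular finite-horizon MDP with S states, A actions, and horizon H; the exploration phase here simply follows a uniformly random policy to build an empirical transition model (a deliberately naive placeholder, since this page does not reproduce the paper's exploration strategy); and the planning phase answers each supplied reward function with finite-horizon value iteration on that empirical model, i.e., one possible black-box planner.

```python
import numpy as np


def explore(sample_next_state, S, A, H, num_episodes, rng):
    """Exploration phase (placeholder): roll out a uniformly random policy,
    record transition counts, and return an empirical model P_hat[s, a, s'].
    NOTE: this naive exploration is for illustration only; the paper's
    provably efficient exploration scheme is not reproduced here."""
    counts = np.zeros((S, A, S))
    for _ in range(num_episodes):
        s = 0  # assume a fixed initial state for this sketch
        for _ in range(H):
            a = rng.integers(A)
            s_next = sample_next_state(s, a)
            counts[s, a, s_next] += 1
            s = s_next
    visits = counts.sum(axis=-1, keepdims=True)
    # Unvisited (s, a) pairs default to a uniform estimate.
    return np.where(visits > 0, counts / np.maximum(visits, 1), 1.0 / S)


def plan(P_hat, reward, H):
    """Planning phase: finite-horizon value iteration on the empirical model
    for a reward array of shape (H, S, A); returns a policy pi[h, s]."""
    S, A, _ = P_hat.shape
    V = np.zeros(S)
    pi = np.zeros((H, S), dtype=int)
    for h in reversed(range(H)):
        Q = reward[h] + P_hat @ V  # Q[s, a] = r(s, a) + E_{s' ~ P_hat}[V(s')]
        pi[h] = Q.argmax(axis=1)
        V = Q.max(axis=1)
    return pi, V


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    S, A, H = 5, 3, 4
    # A small random MDP standing in for the unknown environment.
    true_P = rng.dirichlet(np.ones(S), size=(S, A))
    simulator = lambda s, a: rng.choice(S, p=true_P[s, a])

    # Phase 1: reward-free exploration (no reward function is used here).
    P_hat = explore(simulator, S, A, H, num_episodes=2000, rng=rng)

    # Phase 2: once exploration is done, many reward functions can be
    # handled without any further interaction with the environment.
    for k in range(3):
        reward = rng.uniform(0.0, 1.0, size=(H, S, A))
        _, V = plan(P_hat, reward, H)
        print(f"reward function {k}: estimated value at the initial state = {V[0]:.3f}")
```

Value iteration is used here only because it is the simplest planner to write down; as the highlight above notes, any black-box approximate planner (e.g., natural policy gradient) could be substituted in the planning phase.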
Results
  • The authors are ready to state the main theorem.
  • It asserts that the algorithm, which the authors describe in the subsequent sections, is a reward-free exploration algorithm with sample complexity $O(H^5 S^2 A/\epsilon^2)$, ignoring lower-order terms.
  • The theorem demonstrates that the sample complexity of reward-free exploration is at most $O(H^5 S^2 A/\epsilon^2)$, which the authors show to be near-optimal via the $\Omega(S^2 A H^2/\epsilon^2)$ lower bound.
  • The number of episodes collected in the exploration phase is bounded by $c \cdot H^5 S^2 A/\epsilon^2$ for an absolute constant $c$, up to logarithmic and lower-order terms (a schematic restatement of the guarantee follows this list)
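To make the bullet points above concrete, the guarantee can be written schematically as follows. This is only a restatement of what the summary quotes, not the paper's exact theorem: constants, logarithmic factors, lower-order terms, and the failure probability are omitted, and the value notation $V^{\pi}(r)$, $V^{\star}(r)$ is assumed rather than taken from this page.

```latex
% Schematic form of the reward-free guarantee summarized above.
% Assumed notation: V^{\pi}(r) is the value of policy \pi under reward r,
% V^{\star}(r) = \max_{\pi} V^{\pi}(r); S, A, H are the numbers of states,
% actions, and the horizon; \epsilon is the target accuracy.
% Constants, log factors, lower-order terms, and the failure probability are omitted.
\[
  N_{\text{explore}} \;=\; O\!\left(\frac{H^{5} S^{2} A}{\epsilon^{2}}\right)
  \quad\Longrightarrow\quad
  V^{\star}(r) \,-\, V^{\pi_r}(r) \;\le\; \epsilon
  \quad \text{for every reward function } r \text{ supplied in the planning phase.}
\]
```

Here $\pi_r$ denotes the policy returned by the planning phase for reward $r$; "$\epsilon$-suboptimal for adaptively chosen reward functions" in the highlights corresponds exactly to the right-hand condition.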
Conclusion
  • The authors propose a new “reward-free RL” framework, comprising two phases.
  • In the second phase, the learner is no longer allowed to interact with the MDP; instead, she is tasked with computing near-optimal policies under M for a collection of given reward functions.
  • This framework is suitable when there are many reward functions of interest, or when the authors are interested in learning the transition operator directly.
  • The authors give a nearly-matching $\Omega(S^2 A H^2/\epsilon^2)$ lower bound, demonstrating the near-optimality of the algorithm in this setting (the two bounds are compared after this list)
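Placed side by side, the upper bound from the Results summary and the lower bound quoted above show where the remaining gap lies; the display below only restates the two bounds already given on this page, with constants, logarithmic factors, and lower-order terms omitted.

```latex
% Upper bound (achieved by the paper's algorithm) vs. the stated lower bound;
% constants, log factors, and lower-order terms are omitted.
\[
  \Omega\!\left(\frac{S^{2} A H^{2}}{\epsilon^{2}}\right)
  \;\le\;
  N_{\text{explore}}
  \;\le\;
  O\!\left(\frac{H^{5} S^{2} A}{\epsilon^{2}}\right).
\]
```

The dependence on $S$, $A$, and $\epsilon$ matches on both sides; the two bounds differ only in the power of the horizon $H$, which is why the summary calls them nearly matching.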
Tables
  • Table 1: A comparison between the three MDPs involved.
Related work
  • For reward-free exploration in the tabular setting, we are aware of only a few prior approaches. First, when one runs a PAC-RL algorithm such as R-max with no reward function [Brafman and Tennenholtz, 2002], it does visit the entire state space and can be shown to provide a coverage guarantee. However, for R-max in particular the resulting sample complexity is quite poor, and significantly worse than our near-optimal guarantee (see Appendix A for a detailed calculation). We expect similar behavior from other PAC algorithms, because reward-dependent exploration is typically suboptimal for the reward-free setting.

    Second, one can extract the exploration component of recent results for RL with function approximation [Du et al., 2019, Misra et al., 2019]. Specifically, the former employs a model-based approach in which a model is iteratively refined by planning to visit unexplored states, while the latter uses model-free dynamic programming to identify and reach all states. While these papers address a more difficult setting, it is relatively straightforward to specialize their results to the tabular setting. In this case, both methods guarantee coverage, but they have suboptimal sample complexity and require that all states can be visited with significant probability. In contrast, our approach requires no visitation-probability assumptions and achieves the optimal sample complexity.
References
  • Joshua Achiam, David Held, Aviv Tamar, and Pieter Abbeel. Constrained policy optimization. In Proceedings of the 34th International Conference on Machine Learning, pages 22–31. JMLR.org, 2017.
  • Alekh Agarwal, Sham M. Kakade, Jason D. Lee, and Gaurav Mahajan. Optimality and approximation with policy gradient methods in Markov decision processes. arXiv preprint arXiv:1908.00261, 2019.
  • Eitan Altman. Constrained Markov Decision Processes, volume 7. CRC Press, 1999.
  • András Antos, Csaba Szepesvári, and Rémi Munos. Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path. Machine Learning, 71(1):89–129, 2008.
  • Mohammad Gheshlaghi Azar, Ian Osband, and Rémi Munos. Minimax regret bounds for reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning, pages 263–272. JMLR.org, 2017.
  • Ronen I. Brafman and Moshe Tennenholtz. R-max - a general polynomial time algorithm for near-optimal reinforcement learning. Journal of Machine Learning Research, 3(Oct):213–231, 2002.
  • Qi Cai, Zhuoran Yang, Chi Jin, and Zhaoran Wang. Provably efficient exploration in policy optimization. arXiv preprint arXiv:1912.05830, 2019.
  • Jinglin Chen and Nan Jiang. Information-theoretic considerations in batch reinforcement learning. arXiv preprint arXiv:1905.00360, 2019.
  • Xi Chen, Adityanand Guntuboyina, and Yuchen Zhang. On Bayes risk lower bounds. The Journal of Machine Learning Research, 17(1):7687–7744, 2016.
  • Christoph Dann and Emma Brunskill. Sample complexity of episodic fixed-horizon reinforcement learning. In Advances in Neural Information Processing Systems, pages 2818–2826, 2015.
  • Christoph Dann, Tor Lattimore, and Emma Brunskill. Unifying PAC and regret: uniform PAC bounds for episodic reinforcement learning. In Advances in Neural Information Processing Systems, pages 5713–5723, 2017.
  • Simon S. Du, Akshay Krishnamurthy, Nan Jiang, Alekh Agarwal, Miroslav Dudík, and John Langford. Provably efficient RL with rich observations via latent state decoding. arXiv preprint arXiv:1901.09018, 2019.
  • Elad Hazan, Sham M. Kakade, Karan Singh, and Abby Van Soest. Provably efficient maximum entropy exploration. arXiv preprint arXiv:1812.02690, 2018.
  • Chi Jin, Zeyuan Allen-Zhu, Sébastien Bubeck, and Michael I. Jordan. Is Q-learning provably efficient? In Advances in Neural Information Processing Systems, pages 4863–4873, 2018.
  • Sham Kakade and John Langford. Approximately optimal approximate reinforcement learning. In ICML, volume 2, pages 267–274, 2002.
  • Sham Machandranath Kakade. On the Sample Complexity of Reinforcement Learning. PhD thesis, University of London, London, England, 2003.
  • Emilie Kaufmann, Olivier Cappé, and Aurélien Garivier. On the complexity of best-arm identification in multi-armed bandit models. The Journal of Machine Learning Research, 17(1):1–42, 2016.
  • Michael Kearns and Satinder Singh. Near-optimal reinforcement learning in polynomial time. Machine Learning, 49(2-3):209–232, 2002.
  • Sobhan Miryoosefi, Kianté Brantley, Hal Daumé III, Miroslav Dudík, and Robert Schapire. Reinforcement learning with convex constraints. arXiv preprint arXiv:1906.09323, 2019.
  • Dipendra Misra, Mikael Henaff, Akshay Krishnamurthy, and John Langford. Kinematic state abstraction and provably efficient rich-observation reinforcement learning. arXiv preprint arXiv:1911.05815, 2019.
  • Rémi Munos and Csaba Szepesvári. Finite-time bounds for fitted value iteration. Journal of Machine Learning Research, 9(May):815–857, 2008.
  • Max Simchowitz and Kevin G. Jamieson. Non-asymptotic gap-dependent regret bounds for tabular MDPs. In Advances in Neural Information Processing Systems, pages 1151–1160, 2019.
  • Chen Tessler, Daniel J. Mankowitz, and Shie Mannor. Reward constrained policy optimization. arXiv preprint arXiv:1805.11074, 2018.
  • Andrea Zanette and Emma Brunskill. Tighter problem-dependent regret bounds in reinforcement learning without domain knowledge using value function bounds. arXiv preprint arXiv:1901.00210, 2019.