Provably Good Batch Off-Policy Reinforcement Learning Without Great Exploration

NeurIPS 2020


Abstract

Batch reinforcement learning (RL) is important to apply RL algorithms to many high stakes tasks. Doing batch RL in a way that yields a reliable new policy in large domains is challenging: a new decision policy may visit states and actions outside the support of the batch data, and function approximation and optimization with limited samples […]

Introduction
  • A key question in Reinforcement Learning is how to learn good policies from off-policy batch data in large or infinite state spaces.
  • This problem is not limited to the pure batch setting; many online RL algorithms also use a growing batch of data, such as a replay buffer [24, 28].
  • One particular issue is that the max in the Bellman operator may pick actions at (s, a) pairs with few but rewarding samples, which can lead to overly optimistic value function estimates and under-performing policies [26].
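The optimism issue above can be reproduced in a few lines. The toy Monte-Carlo sketch below (an illustration added here, not from the paper; the action count, per-action sample size, and noise level are hypothetical) shows that taking a max over a handful of noisy per-action value estimates is biased upward even when every action has the same true value:

    import numpy as np

    rng = np.random.default_rng(0)
    n_actions, n_samples, noise_sd = 10, 3, 1.0   # hypothetical: few samples per action
    true_q = np.zeros(n_actions)                  # every action has true value 0

    bias = []
    for _ in range(10_000):
        # sample mean of a few noisy returns for each action
        q_hat = rng.normal(true_q, noise_sd / np.sqrt(n_samples))
        bias.append(q_hat.max() - true_q.max())

    print(f"average optimism of max_a Q_hat(s, a): {np.mean(bias):.3f}")  # clearly > 0

In batch RL this bias propagates through bootstrapped Bellman backups, which is why the filtered backups discussed below restrict the max to well-covered actions.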
Highlights
  • A key question in Reinforcement Learning is how to learn good policies from off-policy batch data in large or infinite state spaces
  • We focus on the algorithm families based on Approximate Policy Iteration (API) and Approximate Value Iteration (AVI), which form the prototype of many model-free online and offline reinforcement learning (RL) algorithms
  • We present several experiments in which the collected data ranges from being inadequate for batch RL to providing complete coverage
  • The results show that our algorithm performs well across all of these settings and, unlike other baselines, is always better than or close to both behavior cloning and vanilla fitted Q iteration (FQI)
  • We study a key assumption in the analysis of batch value-based RL, concentrability, and provide policy iteration and Q iteration algorithms with minor modifications that can be agnostic to this assumption
  • Several state-of-the-art methods [17, 18] in batch RL leverage pessimism based on the conditional action distribution, which is composable with our proposal of state-action pessimism
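One way to make the state-action pessimism in the highlights concrete (a sketch consistent with the summary above, not the paper's exact notation; b is a visitation threshold and μ̂ an estimate of the batch state-action distribution) is to zero out Bellman backups wherever the data has too little mass:

    \zeta(s,a) \;=\; \mathbf{1}\{\hat{\mu}(s,a) \ge b\}, \qquad
    (\widehat{\mathcal{T}} Q)(s,a) \;=\; r(s,a) + \gamma \,
        \mathbb{E}_{s' \sim P(\cdot \mid s,a)}\Big[\max_{a'} \zeta(s',a')\, Q(s',a')\Big].

Multiplying by ζ replaces the value at poorly-covered (s', a') pairs with zero, which is pessimistic whenever returns are normalized to be nonnegative; a policy iteration variant would apply the same filter in the policy evaluation backup.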
Results
  • The key innovation in the algorithm is the use of an estimated filter ζ to exclude Bellman backups through poorly-supported state-action pairs.
  • The authors can run a post-hoc diagnostic on the choice of b by computing the average of ζ(s, π(s)) for the resulting policy π over the batch dataset.
  • If this quantity is too small, the authors can conclude that ζ filters out too many Bellman backups and rerun the procedure with a lower b.
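A minimal code sketch of such a filtered backup and of the post-hoc diagnostic follows (an illustration for a small discrete problem, assuming a precomputed estimate mu_hat of the batch state-action distribution and a threshold b; the authors' algorithm works with general function approximation, which this sketch does not attempt to reproduce):

    import numpy as np

    def filtered_fqi(batch, n_states, n_actions, mu_hat, b, gamma=0.99, n_iters=200):
        """Fitted Q iteration in which backups through poorly-covered pairs are zeroed out."""
        zeta = (mu_hat >= b).astype(float)      # 1 where (s, a) has enough data, else 0
        q = np.zeros((n_states, n_actions))
        for _ in range(n_iters):
            q_new = np.zeros_like(q)
            for (s, a, r, s_next) in batch:     # transitions from the batch dataset
                # pessimistic target: unsupported successor actions contribute value 0
                # (for simplicity, duplicate (s, a) transitions just overwrite the target)
                q_new[s, a] = r + gamma * np.max(zeta[s_next] * q[s_next])
            q = q_new
        return q

    def zeta_diagnostic(batch, q, mu_hat, b):
        """Average of zeta(s, pi(s)) over the batch for the greedy policy pi derived from q."""
        zeta = (mu_hat >= b).astype(float)
        pi = np.argmax(zeta * q, axis=1)        # greedy policy restricted to supported actions
        return float(np.mean([zeta[s, pi[s]] for (s, _, _, _) in batch]))

If zeta_diagnostic returns a value close to zero, ζ is filtering out most Bellman backups and the procedure can be rerun with a smaller b, exactly as the diagnostic above suggests.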
Conclusion
  • The authors study a key assumption in the analysis of batch value-based RL, concentrability, and provide policy iteration and Q iteration algorithms with minor modifications that can be agnostic to this assumption.
  • This work can provide some intuition for designing practical deep RL algorithms by leveraging pessimism based on the marginalized state-action distribution.
  • Toward designing more practical algorithms, estimating state-action visitation distributions is an active research area (e.g., [39]), and the algorithmic framework is composable with better estimators of μ(s, a) or other uncertainty measurements.
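As a rough sketch of the kind of estimator such a framework could plug in (a simple count-based μ̂ for a discretized state-action space, added here for illustration; it is not one of the estimators studied in the works cited above):

    from collections import Counter
    import numpy as np

    def estimate_mu(batch, n_states, n_actions, smoothing=1e-8):
        """Empirical state-action visitation frequencies from a batch of (s, a, r, s') tuples."""
        counts = Counter((s, a) for (s, a, _, _) in batch)
        mu_hat = np.full((n_states, n_actions), smoothing)   # small prior mass avoids division by zero
        for (s, a), c in counts.items():
            mu_hat[s, a] += c
        return mu_hat / mu_hat.sum()                         # normalize to a distribution over (s, a)

Better estimators of μ(s, a), such as the stationary distribution estimators referenced in [39], could be dropped in without changing the rest of the procedure.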
Summary
  • Introduction:

    A key question in Reinforcement Learning is how to learn good policies from off-policy batch data in large or infinite state spaces.
  • This problem is not limited to the pure batch setting; many online RL algorithms also use a growing batch of data, such as a replay buffer [24, 28].
  • One particular issue is that the max in the Bellman operator may pick actions at (s, a) pairs with few but rewarding samples, which can lead to overly optimistic value function estimates and under-performing policies [26].
  • Objectives:

    The authors' aim is to create algorithms that are guaranteed to find an approximately optimal policy among all policies that only visit states and actions with sufficient visitation under μ.
  • The authors' goal is to relax such assumptions to make the resulting algorithms more practically useful
  • Results:

    The key innovation in the algorithm is the use of an estimated filter ζ to exclude Bellman backups through poorly-supported state-action pairs.
  • The authors can run a post-hoc diagnostic on the choice of b by computing the average of ζ(s, π(s)) for the resulting policy π over the batch dataset.
  • If this quantity is too small, the authors can conclude that ζ filters out too many Bellman backups and rerun the procedure with a lower b.
  • Conclusion:

    The authors study a key assumption in the analysis of batch value-based RL, concentrability, and provide policy iteration and Q iteration algorithms with minor modifications that can be agnostic to this assumption.
  • This work can provide some intuition for designing practical deep RL algorithms by leveraging pessimism based on the marginalized state-action distribution.
  • Toward designing more practical algorithms, estimating state-action visitation distributions is an active research area (e.g., [39]), and the algorithmic framework is composable with better estimators of μ(s, a) or other uncertainty measurements.
Tables
  • Table 1: The final policy after 500K training steps on 3 D4RL tasks. The values are normalized with respect to the random policy (0) and the expert policy (100). The results of our algorithm are averaged over 5 random seeds; the results of the other algorithms are taken from the D4RL evaluations.
Related work
  • Research in batch RL focuses on deriving the best possible policy from the available data [20]. For practical settings that necessitate function approximation, fitted value iteration [6, 33] and fitted policy iteration [19] provide an empirical foundation that has spawned many successful modern deep RL algorithms. Many prior works provide error bounds as a function of the violation of realizability and completeness assumptions, e.g., [40]. In the online RL setting, concentrability can be side-stepped [41] but can still pose a significant challenge (e.g., the hardness of exploration in [5]). A commonly used, equivalent form of the concentrability assumption bounds a discounted summation of the ratios between the state-action distributions induced by any policy and the data-generating distribution [3, 23]. Our goal is to relax such assumptions to make the resulting algorithms more practically useful.
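For reference, one commonly used way to write the concentrability assumption mentioned above (a sketch; the exact constants and weighting differ across [3, 23] and related analyses, with d_t^π denoting the state-action distribution reached at step t by following policy π from the start distribution):

    c(t) \;=\; \sup_{\pi} \left\| \frac{d_t^{\pi}}{\mu} \right\|_{\infty}, \qquad
    C_{\mu} \;=\; (1-\gamma) \sum_{t \ge 0} \gamma^{t} \, c(t) \;<\; \infty.

The paper's contribution, as summarized above, is to avoid requiring such a bound to hold for all policies by instead restricting attention to policies supported by the data.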
Funding
  • This work was supported in part by an NSF CAREER award and an ONR Young Investigator Award
References
  • Alekh Agarwal, Sham M. Kakade, Jason D. Lee, and Gaurav Mahajan. Optimality and approximation with policy gradient methods in Markov decision processes. arXiv preprint arXiv:1908.00261, 2019.
  • Rishabh Agarwal, Dale Schuurmans, and Mohammad Norouzi. An optimistic perspective on offline reinforcement learning. In ICML, 2020.
  • András Antos, Csaba Szepesvári, and Rémi Munos. Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path. Machine Learning, 71:89–129, 2008.
  • Dimitri P. Bertsekas. Approximate policy iteration: A survey and some new methods. Journal of Control Theory and Applications, 9:310–335, 2011.
  • Jinglin Chen and Nan Jiang. Information-theoretic considerations in batch reinforcement learning. In ICML, pages 1042–1051, 2019.
  • Damien Ernst, Pierre Geurts, and Louis Wehenkel. Tree-based batch mode reinforcement learning. Journal of Machine Learning Research, 6:503–556, 2005.
  • Amir-massoud Farahmand. Regularization in reinforcement learning. PhD thesis, University of Alberta, 2011.
  • Amir-massoud Farahmand, Mohammad Ghavamzadeh, Csaba Szepesvári, and Shie Mannor. Regularized policy iteration with nonparametric function spaces. Journal of Machine Learning Research, 17:4809–4874, 2016.
  • Amir-massoud Farahmand, Doina Precup, André M. S. Barreto, and Mohammad Ghavamzadeh. Classification-based approximate policy iteration. IEEE Transactions on Automatic Control, 60:2989–2993, 2015.
  • Amir-massoud Farahmand, Csaba Szepesvári, and Rémi Munos. Error propagation for approximate policy and value iteration. In NeurIPS, pages 568–576, 2010.
  • Justin Fu, Aviral Kumar, Ofir Nachum, George Tucker, and Sergey Levine. D4RL: Datasets for deep data-driven reinforcement learning, 2020.
  • Scott Fujimoto, David Meger, and Doina Precup. Off-policy deep reinforcement learning without exploration. In ICML, pages 2052–2062, 2019.
  • Matthieu Geist, Bruno Scherrer, and Olivier Pietquin. A theory of regularized Markov decision processes. arXiv preprint arXiv:1901.11275, 2019.
  • Roland Hafner and Martin Riedmiller. Reinforcement learning in feedback control. Machine Learning, 84:137–169, 2011.
  • Sham Kakade and John Langford. Approximately optimal approximate reinforcement learning. In ICML, pages 267–274, 2002.
  • Sham M. Kakade. A natural policy gradient. In NeurIPS, pages 1531–1538, 2002.
  • Aviral Kumar, Justin Fu, Matthew Soh, George Tucker, and Sergey Levine. Stabilizing off-policy Q-learning via bootstrapping error reduction. In NeurIPS, pages 11761–11771, 2019.
  • Aviral Kumar, Aurick Zhou, George Tucker, and Sergey Levine. Conservative Q-learning for offline reinforcement learning, 2020.
  • Michail G. Lagoudakis and Ronald Parr. Least-squares policy iteration. Journal of Machine Learning Research, 4:1107–1149, 2003.
  • Sascha Lange, Thomas Gabel, and Martin Riedmiller. Batch reinforcement learning. In Reinforcement Learning, pages 45–73. Springer, 2012.
  • Romain Laroche, Paul Trichelair, and Rémi Tachet des Combes. Safe policy improvement with baseline bootstrapping. In ICML, pages 3652–3661, 2019.
  • Alessandro Lazaric, Mohammad Ghavamzadeh, and Rémi Munos. Finite-sample analysis of least-squares policy iteration. Journal of Machine Learning Research, 13:3041–3074, 2012.
  • Hoang Le, Cameron Voloshin, and Yisong Yue. Batch policy learning under constraints. In ICML, pages 3703–3712, 2019.
  • Long-Ji Lin. Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine Learning, 8:293–321, 1992.
  • Clive R. Loader. Local likelihood density estimation. The Annals of Statistics, 24:1602–1618, 1996.
  • Tyler Lu, Dale Schuurmans, and Craig Boutilier. Non-delusional Q-learning and value-iteration. In NeurIPS, pages 9949–9959, 2018.
  • Peter Malec and Melanie Schienle. Nonparametric kernel density estimation near the boundary. Computational Statistics & Data Analysis, 72:57–76, 2014.
  • Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
  • Rémi Munos. Error bounds for approximate policy iteration. In ICML, pages 560–567, 2003.
  • Rémi Munos. Error bounds for approximate value iteration. In Proceedings of the National Conference on Artificial Intelligence, page 1006, 2005.
  • Rémi Munos and Csaba Szepesvári. Finite-time bounds for fitted value iteration. Journal of Machine Learning Research, 9:815–857, 2008.
  • Bernardo Avila Pires and Csaba Szepesvári. Statistical linear estimation with penalized estimators: an application to reinforcement learning. arXiv preprint arXiv:1206.6444, 2012.
  • Martin Riedmiller. Neural fitted Q iteration – first experiences with a data efficient neural reinforcement learning method. In European Conference on Machine Learning, pages 317–328, 2005.
  • Bruno Scherrer. Approximate policy iteration schemes: a comparison. In ICML, pages 1314–1322, 2014.
  • Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 2018.
  • Csaba Szepesvári and Rémi Munos. Finite time bounds for sampling based fitted value iteration. In ICML, pages 880–887, 2005.
  • Philip Thomas, Georgios Theocharous, and Mohammad Ghavamzadeh. High confidence policy improvement. In ICML, pages 2380–2388, 2015.
  • Philip S. Thomas, Bruno Castro da Silva, Andrew G. Barto, Stephen Giguere, Yuriy Brun, and Emma Brunskill. Preventing undesirable behavior of intelligent machines. Science, 366:999–1004, 2019.
  • Junfeng Wen, Bo Dai, Lihong Li, and Dale Schuurmans. Batch stationary distribution estimation. arXiv preprint arXiv:2003.00722, 2020.
  • Tengyang Xie and Nan Jiang. Q* approximation schemes for batch reinforcement learning: A theoretical comparison. In UAI, 2020.
  • Ming Yu, Zhuoran Yang, Mengdi Wang, and Zhaoran Wang. Provable Q-iteration with ℓ∞ guarantees and function approximation. In Workshop on Optimization and RL, NeurIPS, 2019.
  • Tong Zhang. From epsilon-entropy to KL-entropy: Analysis of minimum information complexity density estimation. The Annals of Statistics, 34:2180–2210, 2006.