Deployment-Efficient Reinforcement Learning via Model-Based Offline Optimization

Matsushima Tatsuya
Furuta Hiroki

ICLR 2021.

TL;DR: We propose a novel method that achieves both high sample-efficiency in offline RL and "deployment-efficiency" in online RL.

Abstract:

Most reinforcement learning (RL) algorithms assume online access to the environment, in which one may readily interleave updates to the policy with experience collection using that policy. However, in many real-world applications such as health, education, dialogue agents, and robotics, the cost or potential risk of deploying a new data-collection policy is sufficiently high that it can become prohibitive to update the data-collection policy more than a few times during learning.
Introduction
Highlights
  • Reinforcement learning (RL) algorithms have recently demonstrated impressive success in learning behaviors for a variety of sequential decision-making tasks (Barth-Maron et al, 2018; Hessel et al, 2018; Nachum et al, 2019). All of these demonstrations have relied on highly frequent online access to the environment, with the RL algorithms often interleaving each policy update with additional experience collection by that policy acting in the environment
  • We introduced deployment efficiency, a novel measure for RL performance that counts the number of changes in the data-collection policy during learning
  • We proposed a novel model-based offline algorithm, Behavior-Regularized Model-ENsemble (BREMEN), combining model ensembles with trust-region updates from the model-based RL literature (Kurutach et al, 2018) and policy initialization with behavior cloning from the offline RL literature (Fujimoto et al, 2019; Wu et al, 2019); a minimal sketch of the resulting loop follows this list
  • BREMEN can improve policies offline in a sample-efficient manner even when the batch size is 10-20 times smaller than in prior works, allowing BREMEN to achieve impressive results in limited deployment settings, obtaining successful policies from scratch in only 5-10 deployments. Not only can this help alleviate costs and risks in real-world applications, but it can also reduce the amount of communication required during distributed learning and could form the basis for communication-efficient large-scale RL, in contrast to prior works (Nair et al, 2015; Espeholt et al, 2018; 2019)
  • We hope our work can steer the research community toward valuing deployment efficiency as an important criterion for RL algorithms, and toward eventually achieving sample efficiency and asymptotic performance similar to state-of-the-art algorithms like SAC (Haarnoja et al, 2018) while retaining the deployment efficiency needed for safe and practical real-world reinforcement learning
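For concreteness, here is a minimal, self-contained sketch of the deployment-efficient loop described in the highlights above. It is a toy illustration, not the authors' implementation: the environment, the linear policy and dynamics models, and the random-search hill-climbing used as a stand-in for the trust-region (ME-TRPO-style) update are all simplifying assumptions. Only the outer structure follows the paper: (1) deploy the current policy to collect one batch, (2) fit an ensemble of dynamics models, (3) re-initialize the policy by behavior cloning on the batch, and (4) improve the policy offline with conservative updates on imaginary rollouts from the ensemble.

```python
import numpy as np

rng = np.random.default_rng(0)
OBS_DIM, ACT_DIM, ENSEMBLE_SIZE, HORIZON = 3, 1, 5, 20


def env_step(s, a):
    """Toy real environment: noisy linear dynamics, reward for staying near the origin."""
    s_next = 0.9 * s + 0.1 * np.tanh(a) + 0.01 * rng.normal(size=OBS_DIM)
    return s_next, -float(np.sum(s_next ** 2))


class LinearPolicy:
    def __init__(self):
        self.W = np.zeros((ACT_DIM, OBS_DIM))

    def act(self, s, noise=0.1):
        return self.W @ s + noise * rng.normal(size=ACT_DIM)


def collect_transitions(policy, n=1000):
    """One deployment: the only place the real environment is queried."""
    s, batch = rng.normal(size=OBS_DIM), []
    for _ in range(n):
        a = policy.act(s)
        s_next, r = env_step(s, a)
        batch.append((s, a, r, s_next))
        s = s_next
    return batch


def fit_dynamics_ensemble(batch):
    """Fit ENSEMBLE_SIZE least-squares models s' ~ [s, a] on bootstrapped resamples."""
    X = np.array([np.concatenate([s, a]) for s, a, _, _ in batch])
    Y = np.array([s_next for _, _, _, s_next in batch])
    models = []
    for _ in range(ENSEMBLE_SIZE):
        idx = rng.integers(len(batch), size=len(batch))
        W, *_ = np.linalg.lstsq(X[idx], Y[idx], rcond=None)
        models.append(W)
    return models


def behavior_clone(batch):
    """Re-initialize the policy as the least-squares behavior-cloning fit a ~ W s."""
    S = np.array([s for s, _, _, _ in batch])
    A = np.array([a for _, a, _, _ in batch])
    policy = LinearPolicy()
    policy.W = np.linalg.lstsq(S, A, rcond=None)[0].T
    return policy


def imagined_return(policy, models, starts):
    """Average return of HORIZON-step rollouts inside the learned ensemble
    (reward function assumed known, as for the paper's imaginary rollouts)."""
    total = 0.0
    for s0 in starts:
        W = models[rng.integers(len(models))]  # one random ensemble member per rollout
        s = s0.copy()
        for _ in range(HORIZON):
            s = np.concatenate([s, policy.act(s, noise=0.0)]) @ W
            total += -np.sum(s ** 2)
    return total / len(starts)


def conservative_update(policy, models, batch, iters=200, step=0.05, radius=1.0):
    """Stand-in for the trust-region step: hill-climb on the imagined return while
    staying inside a parameter-space ball around the behavior-cloned initialization."""
    W_bc = policy.W.copy()
    starts = [s for s, _, _, _ in batch[:: max(1, len(batch) // 32)]]
    best = imagined_return(policy, models, starts)
    for _ in range(iters):
        cand = LinearPolicy()
        cand.W = policy.W + step * rng.normal(size=policy.W.shape)
        if np.linalg.norm(cand.W - W_bc) > radius:  # crude trust region around BC init
            continue
        val = imagined_return(cand, models, starts)
        if val > best:
            best, policy = val, cand
    return policy


# Outer loop: only n_deployments distinct data-collection policies ever touch the
# real environment; all policy improvement happens offline between deployments.
policy, n_deployments = LinearPolicy(), 5
for deployment in range(n_deployments):
    batch = collect_transitions(policy)                  # 1 deployment
    models = fit_dynamics_ensemble(batch)                # dynamics model ensemble
    policy = behavior_clone(batch)                       # BC re-initialization
    policy = conservative_update(policy, models, batch)  # offline improvement
    print(f"deployment {deployment}: collected return {sum(r for _, _, r, _ in batch):.1f}")
```

In the actual method the dynamics models are neural networks, the policy is a Gaussian neural-network policy, and the conservative update is TRPO on rollouts branched from states in the deployed batch; the sketch above only preserves that overall structure.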
Methods
  • The methods are compared against BC, BCQ, BRAC, and ME-TRPO baselines on static datasets of 1,000,000 (1M), 100K, and 50,000 (50K) transitions, collected with eps- and gaussian-noise exploration policies (see Tables 1 and 7 for the full comparisons).
Results
  • The authors evaluate BREMEN on standard offline RL benchmarks of high-dimensional continuous control tasks, where only a single static dataset is used.
  • In this fixed-batch setting, the experiments show that BREMEN can not only achieve performance competitive with the state of the art when using standard dataset sizes but also learn from 10-20 times smaller datasets, which previous methods are unable to do
Conclusion
  • IMPORTANCE OF DEPLOYMENT EFFICIENCY IN REAL-WORLD APPLICATIONS

    The authors' notion of deployment efficiency is necessitated by the cost and safety constraints typical of many real-world scenarios.
  • BREMEN can improve policies offline in a sample-efficient manner even when the batch size is 10-20 times smaller than in prior works, allowing BREMEN to achieve impressive results in limited deployment settings, obtaining successful policies from scratch in only 5-10 deployments
  • Not only can this help alleviate costs and risks in real-world applications, but it can also reduce the amount of communication required during distributed learning and could form the basis for communication-efficient large-scale RL, in contrast to prior works (Nair et al, 2015; Espeholt et al, 2018; 2019).
  • The authors hope the work can steer the research community toward valuing deployment efficiency as an important criterion for RL algorithms, and toward eventually achieving sample efficiency and asymptotic performance similar to state-of-the-art algorithms like SAC (Haarnoja et al, 2018) while retaining the deployment efficiency needed for safe and practical real-world reinforcement learning
Summary
  • Introduction:

    Reinforcement learning (RL) algorithms have recently demonstrated impressive success in learning behaviors for a variety of sequential decision-making tasks (Barth-Maron et al, 2018; Hessel et al, 2018; Nachum et al, 2019).
  • Even when the data efficiency is high, the deployment efficiency could be low, since many on-policy and off-policy algorithms alternate data collection with each policy update (Schulman et al, 2015; Lillicrap et al, 2016; Gu et al, 2016; Haarnoja et al, 2018)
  • Such dependence on high-frequency policy deployments is best illustrated in the recent works in offline RL (Fujimoto et al, 2019; Jaques et al, 2019; Kumar et al, 2019; Levine et al, 2020; Wu et al, 2019), where baseline off-policy algorithms exhibited poor performance when trained on a static dataset.
  • In contrast to those prior works, the authors aim to learn successful policies from scratch in a manner that is both sample and deployment-efficient
  • Objectives:

    In contrast to those prior works, the authors aim to learn successful policies from scratch in a manner that is both sample and deployment-efficient.
  • The authors aim to have both high deployment efficiency and sample efficiency by developing an algorithm that can solve the tasks with minimal policy deployments as well as transition samples
  • Methods:

    The methods are compared against BC, BCQ, BRAC, and ME-TRPO baselines on static datasets of 1,000,000 (1M), 100K, and 50,000 (50K) transitions, collected with eps- and gaussian-noise exploration policies (see Tables 1 and 7 for the full comparisons).
  • Results:

    The authors evaluate BREMEN on standard offline RL benchmarks of high-dimensional continuous control tasks, where only a single static dataset is used.
  • In this fixed-batch setting, the experiments show that BREMEN can not only achieve performance competitive with the state of the art when using standard dataset sizes but also learn from 10-20 times smaller datasets, which previous methods are unable to do
  • Conclusion:

    IMPORTANCE OF DEPLOYMENT EFFICIENCY IN REAL-WORLD APPLICATIONS

    The authors' notion of deployment efficiency is necessitated by the cost and safety constraints typical of many real-world scenarios.
  • BREMEN can improve policies offline in a sample-efficient manner even when the batch size is 10-20 times smaller than in prior works, allowing BREMEN to achieve impressive results in limited deployment settings, obtaining successful policies from scratch in only 5-10 deployments
  • Not only can this help alleviate costs and risks in real-world applications, but it can also reduce the amount of communication required during distributed learning and could form the basis for communication-efficient large-scale RL, in contrast to prior works (Nair et al, 2015; Espeholt et al, 2018; 2019).
  • The authors hope the work can steer the research community toward valuing deployment efficiency as an important criterion for RL algorithms, and toward eventually achieving sample efficiency and asymptotic performance similar to state-of-the-art algorithms like SAC (Haarnoja et al, 2018) while retaining the deployment efficiency needed for safe and practical real-world reinforcement learning
Tables
  • Table 1: (top) shows that BREMEN can achieve performance competitive with state-of-the-art model-free offline RL algorithms when using the standard dataset size of 1M. We also test BREMEN on the more recent D4RL benchmarks (Fu et al, 2020) and compare its performance with existing model-free and model-based methods; see Appendix D for the results. Comparison of BREMEN to existing offline methods on static datasets. Each cell shows the average cumulative reward and its standard deviation, where the number of samples is 1M, 100K, and 50K, respectively. The maximum number of steps per episode is 1,000. BRAC applies a primal form of the KL value penalty, and BRAC (max Q) denotes its variant that samples multiple actions and takes the maximum according to the learned Q function
  • Table 2: Evaluation on D4RL MuJoCo locomotion datasets (the normalization convention is noted after this list). The normalized scores of BREMEN are averaged over 4 random seeds. We cite the scores of MOPO (Yu et al, 2020) and CQL (Kumar et al, 2020) from their original papers; the other results are taken from Fu et al (2020). BREMEN achieves the best or a competitive score in several domains, while no single algorithm beats all the others
  • Table 3: Reward function and termination conditions for rollouts in the experiments. We remove all contact information from the observation of Ant, basically following Wang et al (2019)
  • Table 4: Hyper-parameters of BREMEN in deployment-efficient settings
  • Table 5: Hyper-parameters of BCQ
  • Table 6: Hyper-parameters of BRAC
  • Table 7: Comparison of BREMEN to existing offline methods in offline settings, namely BC, BCQ (Fujimoto et al, 2019), and BRAC (Wu et al, 2019). Each cell shows the average cumulative reward and its standard deviation over 5 seeds. The maximum number of steps per episode is 1,000. Five different types of exploration noise are used during data collection: eps1, eps3, gaussian1, gaussian3, and random. BRAC applies a primal form of the KL value penalty, and BRAC (max Q) denotes sampling multiple actions and taking the maximum according to the learned Q function
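As a note on the metric in Table 2: D4RL normalized scores follow the convention of Fu et al (2020), rescaling the raw return of a policy against per-environment random and expert reference returns provided by the benchmark (the symbols below are generic, not the paper's notation):

    normalized score = 100 x (return_policy - return_random) / (return_expert - return_random)

A score of 0 thus corresponds to a random policy and 100 to an expert-level policy.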
Related work
  • Deployment Efficiency and Offline RL Although we are not aware of any previous work that explicitly proposed the concept of deployment efficiency, its necessity in many real-world applications has been generally known. One may consider previously proposed semi-batch RL algorithms (Ernst et al, 2005; Lange et al, 2012; Singh et al, 1994; Roux, 2016) or theoretical analyses of RL with low switching cost (Bai et al, 2019) as related precedents.

    [Figure: cumulative reward versus number of deployments on Ant, HalfCheetah, and Hopper, comparing BREMEN ablations: BREMEN without BC re-initialization, ME-TRPO, and explicit-KL-penalty variants with coefficients 1e-2, 1e-4, and 1e-6; a generic form of the explicit-KL objective is noted below.]
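For context on the explicit-KL variants in the figure above: BREMEN regularizes toward the data-collection behavior implicitly, via behavior-cloning re-initialization followed by conservative trust-region updates, whereas the explicit variants add a divergence penalty against an estimated behavior policy, in the spirit of BRAC (Wu et al, 2019). A generic form of such a behavior-regularized objective (the coefficient alpha and the KL estimator here are illustrative, not the paper's exact formulation) is

    maximize_pi  E[ sum_t r(s_t, a_t) ]  -  alpha * KL( pi(.|s) || pi_hat_b(.|s) )

where pi_hat_b is the behavior policy estimated from the offline data and alpha controls how strongly the learned policy is kept close to it.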
Funding
  • We evaluate BREMEN on standard offline RL benchmarks of high-dimensional continuous control tasks, where only a single static dataset is used. In this fixed-batch setting, our experiments show that BREMEN can not only achieve performance competitive with state-of-the-art when using standard dataset sizes but also learn with 10-20 times smaller datasets, which previous methods are unable to attain
Reference
  • Fabian Abel, Yashar Deldjoo, Mehdi Elahi, and Daniel Kohlsdorf. Recsys challenge 2017: Offline and online evaluation. In ACM Conference on Recommender Systems, 2017.
  • Rishabh Agarwal, Dale Schuurmans, and Mohammad Norouzi. An optimistic perspective on offline reinforcement learning. arXiv preprint arXiv:1907.04543, 2019.
  • Christopher G. Atkeson, Benzun P. Wisely Babu, Nandan Banerjee, Dmitry Berenson, Christoper P. Bove, Xiongyi Cui, Mathew DeDonato, Ruixiang Du, Siyuan Feng, Perry Franklin, Michael Gennert, Joshua P. Graff, Peng He, Aaron Jaeger, Joohyung Kim, Kevin Knoedler, Lening Li, Chenggang Liu, Xianchao Long, Taskin Padir, Felipe Polido, G. G. Tighe, and X Xinjilefu. No falls, no resets: Reliable humanoid behavior in the darpa robotics challenge. In International Conference on Humanoid Robots, 2015.
  • Yu Bai, Tengyang Xie, Nan Jiang, and Yu-Xiang Wang. Provably efficient q-learning with low switching cost. In Advances in Neural Information Processing Systems, 2019.
  • Gabriel Barth-Maron, Matthew Hoffman, David Budden, Will Dabney, Dan Horgan, Dhruva TB, Alistair Muldal, Nicolas Heess, and Timothy Lillicrap. Distributed distributional deterministic policy gradients. In International Conference on Learning Representations, 2018.
  • Rinu Boney, Juho Kannala, and Alexander Ilin. Regularizing model-based planning with energy-based models. In Conference on Robot Learning, 2019.
  • Serkan Cabi, Sergio Gómez Colmenarejo, Alexander Novikov, Ksenia Konyushova, Scott Reed, Rae Jeong, Konrad Zolna, Yusuf Aytar, David Budden, Mel Vecerik, Oleg Sushkov, David Barker, Jonathan Scholz, Misha Denil, Nando de Freitas, and Ziyu Wang. Scaling data-driven robotics with reward sketching and batch reinforcement learning. In Robotics: Science and Systems, 2020.
  • Yinlam Chow, Aviv Tamar, Shie Mannor, and Marco Pavone. Risk-sensitive and robust decisionmaking: a cvar optimization approach. In Advances in Neural Information Processing Systems, 2015.
  • Yinlam Chow, Ofir Nachum, Edgar Duenez-Guzman, and Mohammad Ghavamzadeh. A lyapunovbased approach to safe reinforcement learning. In Advances in neural information processing systems, 2018.
  • Yinlam Chow, Ofir Nachum, Aleksandra Faust, Edgar Duenez-Guzman, and Mohammad Ghavamzadeh. Lyapunov-based safe policy optimization for continuous control. arXiv preprint arXiv:1901.10031, 2019.
  • Kurtland Chua, Roberto Calandra, Rowan McAllister, and Sergey Levine. Deep reinforcement learning in a handful of trials using probabilistic dynamics models. In Advances in Neural Information Processing Systems, 2018.
  • Ignasi Clavera, Jonas Rothfuss, John Schulman, Yasuhiro Fujita, Tamim Asfour, and Pieter Abbeel. Model-based reinforcement learning via meta-policy optimization. In Conference on Robot Learning, 2018.
  • Thomas Degris, Martha White, and Richard S Sutton. Off-policy actor-critic. arXiv preprint arXiv:1205.4839, 2012.
  • Marc Deisenroth and Carl E Rasmussen. PILCO: A model-based and data-efficient approach to policy search. In International Conference on Machine Learning, 2011.
  • Gabriel Dulac-Arnold, Daniel Mankowitz, and Todd Hester. Challenges of real-world reinforcement learning. arXiv preprint arXiv:1904.12901, 2019.
  • Damien Ernst, Pierre Geurts, and Louis Wehenkel. Tree-based batch mode reinforcement learning. Journal of Machine Learning Research, 2005.
  • Lasse Espeholt, Hubert Soyer, Remi Munos, Karen Simonyan, Vlad Mnih, Tom Ward, Yotam Doron, Vlad Firoiu, Tim Harley, Iain Dunning, et al. IMPALA: Scalable distributed deep-rl with importance weighted actor-learner architectures. In International Conference on Machine Learning, 2018.
  • Lasse Espeholt, Raphaël Marinier, Piotr Stanczyk, Ke Wang, and Marcin Michalski. SEED RL: Scalable and efficient deep-rl with accelerated central inference. arXiv preprint arXiv:1910.06591, 2019.
  • Benjamin Eysenbach, Shixiang Gu, Julian Ibarz, and Sergey Levine. Leave no trace: Learning to reset for safe and autonomous reinforcement learning. International Conference on Learning Representations, 2018.
  • Rasool Fakoor, Pratik Chaudhari, and Alexander J. Smola. P3o: Policy-on policy-off policy optimization. arXiv preprint arXiv:1905.01756, 2019.
  • Justin Fu, Aviral Kumar, Ofir Nachum, George Tucker, and Sergey Levine. D4RL: Datasets for deep data-driven reinforcement learning. arXiv preprint arXiv:2004.07219, 2020.
  • Scott Fujimoto, David Meger, and Doina Precup. Off-policy deep reinforcement learning without exploration. In International Conference on Machine Learning, 2019.
  • Omer Gottesman, Fredrik Johansson, Joshua Meier, Jack Dent, Donghun Lee, Srivatsan Srinivasan, Linying Zhang, Yi Ding, David Wihl, Xuefeng Peng, Jiayu Yao, Isaac Lage, Christopher Mosch, Li wei H. Lehman, Matthieu Komorowski, Matthieu Komorowski, Aldo Faisal, Leo Anthony Celi, David Sontag, and Finale Doshi-Velez. Evaluating reinforcement learning algorithms in observational health settings. arXiv preprint arXiv:1805.12298, 2018.
  • Shixiang Gu, Timothy Lillicrap, Ilya Sutskever, and Sergey Levine. Continuous deep q-learning with model-based acceleration. In International Conference on Machine Learning, 2016.
  • Shixiang Gu, Ethan Holly, Timothy Lillicrap, and Sergey Levine. Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates. In International Conference on Robotics and Automation, 2017a.
  • Shixiang Gu, Timothy Lillicrap, Zoubin Ghahramani, Richard E. Turner, and Sergey Levine. Q-Prop: Sample-efficient policy gradient with an off-policy critic. In International Conference on Learning Representations, 2017b.
  • Zhaohan Guo and Emma Brunskill. Concurrent pac rl. In AAAI Conference on Artificial Intelligence, 2015.
  • Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International Conference on Machine Learning, 2018.
  • Nicolas Heess, Gregory Wayne, David Silver, Timothy Lillicrap, Tom Erez, and Yuval Tassa. Learning continuous control policies by stochastic value gradients. In Advances in Neural Information Processing Systems, 2015.
  • Matteo Hessel, Joseph Modayil, Hado Van Hasselt, Tom Schaul, Georg Ostrovski, Will Dabney, Dan Horgan, Bilal Piot, Mohammad Azar, and David Silver. Rainbow: Combining improvements in deep reinforcement learning. In AAAI Conference on Artificial Intelligence, 2018.
  • Michael Janner, Justin Fu, Marvin Zhang, and Sergey Levine. When to trust your model: Model-based policy optimization. In Advances in Neural Information Processing Systems, 2019.
  • Natasha Jaques, Asma Ghandeharioun, Judy Hanwen Shen, Craig Ferguson, Agata Lapedriza, Noah Jones, Shixiang Gu, and Rosalind Picard. Way off-policy batch deep reinforcement learning of implicit human preferences in dialog. arXiv preprint arXiv:1907.00456, 2019.
  • Dmitry Kalashnikov, Alex Irpan, Peter Pastor, Julian Ibarz, Alexander Herzog, Eric Jang, Deirdre Quillen, Ethan Holly, Mrinal Kalakrishnan, Vincent Vanhoucke, and Sergey Levine. QT-Opt: Scalable deep reinforcement learning for vision-based robotic manipulation. In Conference on Robot Learning, 2018.
  • Rahul Kidambi, Aravind Rajeswaran, Praneeth Netrapalli, and Thorsten Joachims. MOReL: Modelbased offline reinforcement learning. arXiv preprint arXiv:2005.05951, 2020.
  • D. Kingma and J. Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2014.
  • Aviral Kumar, Justin Fu, Matthew Soh, George Tucker, and Sergey Levine. Stabilizing off-policy q-learning via bootstrapping error reduction. In Advances in Neural Information Processing Systems, 2019.
  • Aviral Kumar, Aurick Zhou, George Tucker, and Sergey Levine. Conservative q-learning for offline reinforcement learning. arXiv preprint arXiv:2006.04779, 2020.
  • Thanard Kurutach, Ignasi Clavera, Yan Duan, Aviv Tamar, and Pieter Abbeel. Model-Ensemble Trust-Region Policy Optimization. In International Conference on Learning Representations, 2018.
  • Sascha Lange, Thomas Gabel, and Martin Riedmiller. Batch reinforcement learning. In Reinforcement learning. Springer, 2012.
  • Sergey Levine, Aviral Kumar, George Tucker, and Justin Fu. Offline reinforcement learning: Tutorial, review, and perspectives on open problems. arXiv preprint arXiv:2005.01643, 2020.
  • Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. In International Conference on Learning Representations, 2016.
  • Long-Ji Lin. Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning, 1992.
  • Yao Liu, Adith Swaminathan, Alekh Agarwal, and Emma Brunskill. Off-policy policy gradient with state distribution correction. arXiv preprint arXiv:1904.08473, 2019.
  • Travis Mandel, Yun-En Liu, Sergey Levine, Emma Brunskill, and Zoran Popovic. Offline policy evaluation across representations with applications to educational games. In International Conference on Autonomous Agents and Multiagent Systems, 2014.
  • Ajay Mandlekar, Fabio Ramos, Byron Boots, Li Fei-Fei, Animesh Garg, and Dieter Fox. IRIS: Implicit reinforcement without interaction at scale for learning control from offline robot manipulation data. arXiv preprint arXiv:1911.05321, 2019.
  • Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 2015.
  • Susan A Murphy, Mark J van der Laan, James M Robins, and Conduct Problems Prevention Research Group. Marginal mean models for dynamic regimes. Journal of the American Statistical Association, 2001.
  • Ofir Nachum, Shixiang Shane Gu, Honglak Lee, and Sergey Levine. Data-efficient hierarchical reinforcement learning. In Advances in Neural Information Processing Systems, 2018.
  • Ofir Nachum, Michael Ahn, Hugo Ponte, Shixiang Gu, and Vikash Kumar. Multi-agent manipulation via locomotion using hierarchical sim2real. In Conference on Robot Learning, 2019.
  • Anusha Nagabandi, Gregory Kahn, Ronald S. Fearing, and Sergey Levine. Neural Network Dynamics for Model-Based Deep Reinforcement Learning with Model-Free Fine-Tuning. International Conference on Robotics and Automation, 2018.
  • Vaishnavh Nagarajan and J Zico Kolter. Generalization in deep networks: The role of distance from initialization. arXiv preprint arXiv:1901.01672, 2019.
  • Arun Nair, Praveen Srinivasan, Sam Blackwell, Cagdas Alcicek, Rory Fearon, Alessandro De Maria, Vedavyas Panneershelvam, Mustafa Suleyman, Charles Beattie, Stig Petersen, et al. Massively parallel methods for deep reinforcement learning. arXiv preprint arXiv:1507.04296, 2015.
  • Ashvin Nair, Murtaza Dalal, Abhishek Gupta, and Sergey Levine. Accelerating online reinforcement learning with offline datasets. arXiv preprint arXiv:2006.09359, 2020.
  • Xue Bin Peng, Aviral Kumar, Grace Zhang, and Sergey Levine. Advantage-weighted regression: Simple and scalable off-policy reinforcement learning. arXiv preprint arXiv:1910.00177, 2019.
  • Matthias Plappert, Marcin Andrychowicz, Alex Ray, Bob McGrew, Bowen Baker, Glenn Powell, Jonas Schneider, Josh Tobin, Maciek Chociej, Peter Welinder, et al. Multi-goal reinforcement learning: Challenging robotics environments and request for research. arXiv preprint arXiv:1802.09464, 2018.
  • Doina Precup, Richard S Sutton, and Sanjoy Dasgupta. Off-policy temporal-difference learning with function approximation. In International Conference on Machine Learning, 2001.
  • Aravind Rajeswaran, Vikash Kumar, Abhishek Gupta, Giulia Vezzani, John Schulman, Emanuel Todorov, and Sergey Levine. Learning complex dexterous manipulation with deep reinforcement learning and demonstrations. arXiv preprint arXiv:1709.10087, 2017.
  • Aravind Rajeswaran, Igor Mordatch, and Vikash Kumar. A game theoretic framework for model based reinforcement learning. arXiv preprint arXiv:2004.07804, 2020.
  • Alex Ray, Joshua Achiam, and Dario Amodei. Benchmarking safe exploration in deep reinforcement learning. arXiv preprint arXiv:1910.01708, 2019.
  • Stéphane Ross, Geoffrey Gordon, and Drew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In International conference on artificial intelligence and statistics, 2011.
  • Nicolas Le Roux. Efficient iterative policy optimization. arXiv preprint arXiv:1612.08967, 2016.
  • John Schulman, Sergey Levine, Philipp Moritz, Michael Jordan, and Pieter Abbeel. Trust region policy optimization. In International Conference on Machine Learning, 2015.
  • John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
  • Noah Y. Siegel, Jost Tobias Springenberg, Felix Berkenkamp, Abbas Abdolmaleki, Michael Neunert, Thomas Lampe, Roland Hafner, and Martin A. Riedmiller. Keep doing what worked: Behavioral modelling priors for offline reinforcement learning. In International Conference on Learning Representations, 2020.
  • Satinder P Singh, Tommi Jaakkola, and Michael I Jordan. Learning without state-estimation in partially observable markovian decision processes. In Machine Learning Proceedings. Elsevier, 1994.
  • Satinder P. Singh, Tommi Jaakkola, and Michael I. Jordan. Reinforcement learning with soft state aggregation. In Advances in Neural Information Processing Systems, 1995.
  • Sungryull Sohn, Yinlam Chow, Jayden Ooi, Ofir Nachum, Honglak Lee, Ed Chi, and Craig Boutilier. BRPO: Batch residual policy optimization. arXiv preprint arXiv:2002.05522, 2020.
  • Richard S Sutton. Dyna, an integrated architecture for learning, planning, and reacting. ACM Sigart Bulletin, 1991.
  • Emanuel Todorov, Tom Erez, and Yuval Tassa. Mujoco: A physics engine for model-based control. In International Conference on Intelligent Robots and Systems, 2012.
  • Tingwu Wang, Xuchan Bao, Ignasi Clavera, Jerrick Hoang, Yeming Wen, Eric Langlois, Shunshi Zhang, Guodong Zhang, Pieter Abbeel, and Jimmy Ba. Benchmarking model-based reinforcement learning. arXiv preprint arXiv:1907.02057, 2019.
  • Yifan Wu, George Tucker, and Ofir Nachum. Behavior Regularized Offline Reinforcement Learning. arXiv preprint arXiv:1911.11361, 2019.
  • Tianhe Yu, Garrett Thomas, Lantao Yu, Stefano Ermon, James Zou, Sergey Levine, Chelsea Finn, and Tengyu Ma. MOPO: Model-based offline policy optimization. arXiv preprint arXiv:2005.13239, 2020.