OPAL: Offline Primitive Discovery for Accelerating Offline Reinforcement Learning

ICLR 2021.

An effective way to leverage multimodal offline behavioral data is to extract a continuous space of primitives, and use it for downstream task learning.

Abstract:

Reinforcement learning (RL) has achieved impressive performance in a variety of online settings in which an agent's ability to query the environment for transitions and rewards is effectively unlimited. However, in many practical applications, the situation is reversed: an agent may have access to large amounts of undirected offline experience...

Introduction
  • Reinforcement Learning (RL) systems have achieved impressive performance in a variety of online settings such as games (Silver et al, 2016; Tesauro, 1995; Brown & Sandholm, 2019) and robotics (Levine et al, 2016; Dasari et al, 2019; Peters et al, 2010; Parmas et al, 2019; Pinto & Gupta, 2016), where the agent can act in the environment and sample as many transitions and rewards as needed.
  • A robot learning through trial and error in the real world requires costly human supervision, safety checks, and resets (Atkeson et al, 2015), rendering many standard online RL algorithms inapplicable (Matsushima et al, 2020)
  • In such settings, the authors might instead have access to large amounts of previously logged data, which could come from a baseline hand-engineered policy or even from other related tasks.
  • While these offline datasets are often undirected and unlabelled, this data is still useful in that it can inform the algorithm about what is possible to do in the real world, without the need for active exploration
Highlights
  • Reinforcement Learning (RL) systems have achieved impressive performance in a variety of online settings such as games (Silver et al, 2016; Tesauro, 1995; Brown & Sandholm, 2019) and robotics (Levine et al, 2016; Dasari et al, 2019; Peters et al, 2010; Parmas et al, 2019; Pinto & Gupta, 2016), where the agent can act in the environment and sample as many transitions and rewards as needed
  • While OPAL is related to these works, we mainly focus on leveraging the learned primitives for asymptotically improving the performance of offline RL; i.e., both the primitive learning and the downstream task must be solved using a single static dataset
  • Baseline and Results: For online RL, we use HIRO (Nachum et al, 2018b), a state-of-the-art hierarchical RL method; SAC (Haarnoja et al, 2018) with behavior cloning (BC) pre-training on D; and DDCO (Discovery of Deep Continuous Options; Krishnan et al, 2017), which uses D to learn a discrete set of primitives and then learns a task policy in the space of those primitives with online RL (Double DQN (DDQN); Van Hasselt et al, 2015)
  • We proposed Offline Primitives for Accelerating Offline RL (OPAL) as a preprocessing step for extracting recurring primitive behaviors from an undirected and unlabelled dataset of diverse behaviors
  • We derived theoretical statements which describe under what conditions OPAL can improve learning of downstream offline RL tasks and showed how these improvements manifest in practice, leading to significant improvements in complex manipulation tasks
  • We further showed empirical demonstrations of OPAL’s application to few-shot imitation learning, online RL, and online multi-task transfer learning
Results
  • As shown in Table 1, CQL+OPAL outperforms most of the baselines on the antmaze and kitchen tasks, with the exception of EMAQ, which achieves similar performance on kitchen mixed.
  • As shown in Table 3, SAC+OPAL outperforms all the baselines, demonstrating (1) the importance of D for exploration, (2) the role of temporal abstraction, and (3) the quality of the learned primitives.
  • The decoder is πθ({at, . . . , at+c−1} | zt), which decodes the entire action sub-trajectory from the latent vector zt and is represented by a GRU (see the sketch after this list)
  • With these state-agnostic primitives in hand, the authors learn a policy πψ(z|s, i) using any off-the-shelf online RL method.
  • Applying Chebyshev's inequality to the gap between the empirical objective J(θ, φ, ω) and its expectation Eπ∼Π,τ∼π,z∼qφ(z|τ)[·] over length-c sub-trajectories, the authors obtain a concentration bound that holds with high probability 1 − δ
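For concreteness, below is a minimal PyTorch sketch of the kind of auto-encoding model these bullets describe: a bidirectional-GRU encoder qφ(z|τ) over a length-c sub-trajectory, a GRU decoder that reconstructs the whole action sub-trajectory from z (the state-agnostic variant mentioned above), a state-conditioned Gaussian prior, and a β-VAE-style loss. All module names, sizes, and the MSE stand-in for the decoder log-likelihood are illustrative assumptions, not the paper's exact architecture or objective.

```python
# Illustrative OPAL-style sub-trajectory auto-encoder (a sketch, not the authors' code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SubTrajEncoder(nn.Module):
    """q_phi(z | tau): bidirectional GRU over (s_t, a_t) pairs -> Gaussian over z."""
    def __init__(self, s_dim, a_dim, z_dim, hidden=256):
        super().__init__()
        self.gru = nn.GRU(s_dim + a_dim, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, 2 * z_dim)           # -> mean and log-std of z

    def forward(self, states, actions):                        # (B, c, s_dim), (B, c, a_dim)
        h, _ = self.gru(torch.cat([states, actions], dim=-1))
        mu, log_std = self.head(h.mean(dim=1)).chunk(2, dim=-1)
        return torch.distributions.Normal(mu, log_std.clamp(-5, 2).exp())

class ActionSeqDecoder(nn.Module):
    """State-agnostic decoder: a GRU unrolls the entire action sub-trajectory from z."""
    def __init__(self, a_dim, z_dim, c, hidden=256):
        super().__init__()
        self.c = c
        self.init = nn.Linear(z_dim, hidden)
        self.gru = nn.GRU(z_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, a_dim)

    def forward(self, z):                                      # (B, z_dim) -> (B, c, a_dim)
        h0 = torch.tanh(self.init(z)).unsqueeze(0)             # initial hidden state from z
        z_seq = z.unsqueeze(1).repeat(1, self.c, 1)            # feed z at every step
        h, _ = self.gru(z_seq, h0)
        return self.out(h)

class StatePrior(nn.Module):
    """Gaussian prior over z conditioned on the first state of the sub-trajectory."""
    def __init__(self, s_dim, z_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(s_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 2 * z_dim))

    def forward(self, s0):
        mu, log_std = self.net(s0).chunk(2, dim=-1)
        return torch.distributions.Normal(mu, log_std.clamp(-5, 2).exp())

def beta_vae_loss(enc, dec, prior, states, actions, beta=0.1):
    """Action reconstruction plus a beta-weighted KL between q_phi(z|tau) and the prior."""
    q_z = enc(states, actions)
    z = q_z.rsample()                                          # reparameterized sample of z
    recon = F.mse_loss(dec(z), actions)                        # stand-in for -log-likelihood
    kl = torch.distributions.kl_divergence(q_z, prior(states[:, 0])).sum(-1).mean()
    return recon + beta * kl
```

Once such a model is trained, only the encoder and decoder are needed downstream: the encoder labels offline sub-trajectories with latents, and the decoder turns a chosen z back into c low-level actions at execution time.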
Conclusion
  • The authors proposed Offline Primitives for Accelerating Offline RL (OPAL) as a preprocessing step for extracting recurring primitive behaviors from an undirected and unlabelled dataset of diverse behaviors.
  • The authors derived theoretical statements which describe under what conditions OPAL can improve learning of downstream offline RL tasks and showed how these improvements manifest in practice, leading to significant improvements in complex manipulation tasks.
  • The authors focused on simple auto-encoding models for representing OPAL, and an interesting avenue for future work is scaling up this basic paradigm to image-based tasks
Summary
  • Objectives:

    The authors aim to use a large, unlabeled, and undirected experience dataset D := {τ1, . . . , τN} of length-c sub-trajectories τi = (s0, a0, . . . , sc−1, ac−1), associated with E, to extract primitives and improve offline RL for downstream task learning (a data-chunking sketch follows this list).
  • The authors' goal is then to use a reward-labeled version of this dataset, Dr, in which each sub-trajectory additionally carries the downstream task's rewards (r0, . . . , rc−1), to learn a near-optimal policy for that task.
  • How to use it with OPAL?
  • In the multi-task setting, the authors aim to learn near-optimal behavior policies on M MDPs M1, . . . , MM, where each Mi = (Si, A, Pi, ri, γ)
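To make these datasets concrete, the small NumPy helper below slices logged trajectories into length-c sub-trajectories and optionally keeps rewards to form the reward-labeled variant. The function name, dictionary layout, non-overlapping windowing, and placeholder dimensions are assumptions for illustration, not the paper's preprocessing code.

```python
# Hypothetical helper: slice logged trajectories into length-c sub-trajectories,
# optionally keeping rewards to form the reward-labeled dataset Dr.
import numpy as np

def chunk_into_subtrajectories(trajectories, c=10, keep_rewards=False):
    """trajectories: list of dicts with 'states' (T, s_dim), 'actions' (T, a_dim),
    and optionally 'rewards' (T,). Returns a list of length-c sub-trajectories."""
    dataset = []
    for traj in trajectories:
        T = len(traj["actions"])
        for start in range(0, T - c + 1, c):        # non-overlapping windows of length c
            sub = {"states": traj["states"][start:start + c],
                   "actions": traj["actions"][start:start + c]}
            if keep_rewards:
                sub["rewards"] = traj["rewards"][start:start + c]
            dataset.append(sub)
    return dataset

# Example with random placeholder data (dimensions are arbitrary):
rng = np.random.default_rng(0)
demo = [{"states": rng.normal(size=(100, 29)),
         "actions": rng.normal(size=(100, 8)),
         "rewards": rng.normal(size=(100,))}]
D = chunk_into_subtrajectories(demo, c=10)                       # unlabeled dataset D
Dr = chunk_into_subtrajectories(demo, c=10, keep_rewards=True)   # reward-labeled dataset Dr
```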
Tables
  • Table1: Average success rate (%) (over 4 seeds) of offline RL methods: BC, BEAR (Kumar et al, 2019), EMAQ (Ghasemipour et al, 2020), CQL (Kumar et al, 2020b), and CQL+OPAL (ours)
  • Table2: Average success rate (%) (over 4 seeds) of few-shot IL methods: BC, BC+OPAL, and BC+SVAE (Wang et al, 2017)
  • Table3: Average success rate (%) (over 4 seeds) of online RL methods: HIRO (Nachum et al, 2018b), SAC+BC, SAC+OPAL, and DDQN+DDCO (Krishnan et al, 2017). These methods were run for 2.5e6 steps on the antmaze medium environments and 17.5e6 steps on the antmaze large environments
  • Table4: Due to improved exploration, PPO+OPAL outperforms PPO and SAC on MT10 and MT50 in terms of average success rate (%) (over 4 seeds)
  • Table5: Average success rate (%) (over 4 seeds) of CQL+OPAL for different values of dim(Z). We fix c = 10
  • Table6: Average success rate on antmaze medium (diverse) (%) (over 4 seeds) of CQL combined with offline DADS and offline CARML for different values of k
  • Table7: Average success rate (%), cumulative dense reward, and cumulative dense reward (last 5 steps) (over 4 seeds) of CQL combined with different offline skill discovery methods on antmaze medium (diverse). For CQL + (Offline) DADS and CQL + (Offline) CARML, we use k = 10. Note that CQL+OPAL outperforms both other methods for unsupervised skill discovery on all of these different evaluation metrics
Related work
  • Offline RL. Offline RL presents the problem of learning a policy from a fixed prior dataset of transitions and rewards. Recent works in offline RL (Kumar et al, 2019; Levine et al, 2020; Wu et al, 2019; Ghasemipour et al, 2020; Jaques et al, 2019; Fujimoto et al, 2018) constrain the policy to be close to the data distribution to avoid the use of out-of-distribution actions (Kumar et al, 2019; Levine et al, 2020). To constrain the policy, some methods use distributional penalties, as measured by KL divergence (Levine et al, 2020; Jaques et al, 2019), MMD (Kumar et al, 2019), or Wasserstein distance (Wu et al, 2019). Other methods first sample actions from the behavior policy and then either clip the maximum deviation from those actions (Fujimoto et al, 2018) or just use those actions (Ghasemipour et al, 2020) during the value backup to stay within the support of the offline data. In contrast to these works, OPAL uses an offline dataset for unsupervised learning of a continuous space of primitives. The use of these primitives for downstream tasks implicitly constrains a learned primitive-directing policy to stay close to the offline data distribution. As we demonstrate in our experiments, the use of OPAL in conjunction with an off-the-shelf offline RL algorithm in this way can yield significant improvement compared to applying offline RL to the dataset directly.
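To illustrate how this preprocessing can plug into an off-the-shelf offline RL algorithm, the sketch below relabels reward-labeled sub-trajectories into temporally extended transitions (s0, z, c-step return, next state) and feeds them to a black-box learner. The encoder interface, the next_state field, the discount handling, and the offline_rl_update callable are all assumptions for illustration rather than the paper's pipeline.

```python
# Illustrative sketch: turn reward-labeled length-c sub-trajectories into latent-action
# transitions (s0, z, c-step return, next state) and hand them to a generic offline RL learner.
import numpy as np

def relabel_with_latents(sub_trajs, encode, gamma=0.99):
    """sub_trajs: reward-labeled sub-trajectories, each assumed to also carry 'next_state',
    the state reached after the sub-trajectory. encode(states, actions) -> latent z
    from the trained OPAL-style encoder (assumed given)."""
    transitions = []
    for sub in sub_trajs:
        z = encode(sub["states"], sub["actions"])
        discounts = gamma ** np.arange(len(sub["rewards"]))
        ret = float(np.sum(discounts * sub["rewards"]))    # c-step discounted return
        transitions.append((sub["states"][0], z, ret, sub["next_state"]))
    return transitions

def train_latent_policy(transitions, offline_rl_update, num_steps=1000, batch_size=256):
    """Run a black-box offline RL update (e.g., a CQL-style step) over latent actions."""
    rng = np.random.default_rng(0)
    for _ in range(num_steps):
        idx = rng.choice(len(transitions), size=batch_size)
        offline_rl_update([transitions[i] for i in idx])   # the learner treats z as the action
```

Because every z comes from encoding behavior that actually occurs in D, the high-level policy trained this way is implicitly kept close to the offline data distribution, which is the constraint effect described above.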
Reference
  • Joshua Achiam, David Held, Aviv Tamar, and Pieter Abbeel. Constrained policy optimization. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pp. 22–31. JMLR.org, 2017.
  • Christopher G Atkeson, Benzun P Wisely Babu, Nandan Banerjee, Dmitry Berenson, Christoper P Bove, Xiongyi Cui, Mathew DeDonato, Ruixiang Du, Siyuan Feng, Perry Franklin, et al. No falls, no resets: Reliable humanoid behavior in the DARPA robotics challenge. In 2015 IEEE-RAS 15th International Conference on Humanoid Robots (Humanoids), pp. 623–630. IEEE, 2015.
  • Pierre-Luc Bacon, Jean Harb, and Doina Precup. The option-critic architecture. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.
  • Noam Brown and Tuomas Sandholm. Superhuman AI for multiplayer poker. Science, 365(6456):885–890, 2019. ISSN 0036-8075. doi: 10.1126/science.aay2400. URL https://science.sciencemag.org/content/365/6456/885.
  • Sudeep Dasari, Frederik Ebert, Stephen Tian, Suraj Nair, Bernadette Bucher, Karl Schmeckpeper, Siddharth Singh, Sergey Levine, and Chelsea Finn. RoboNet: Large-scale multi-robot learning. arXiv preprint arXiv:1910.11215, 2019.
  • Gabriel Dulac-Arnold, Daniel Mankowitz, and Todd Hester. Challenges of real-world reinforcement learning. arXiv preprint arXiv:1904.12901, 2019.
  • Benjamin Eysenbach, Abhishek Gupta, Julian Ibarz, and Sergey Levine. Diversity is all you need: Learning skills without a reward function. arXiv preprint arXiv:1802.06070, 2018.
  • J. Fu, A. Kumar, O. Nachum, G. Tucker, and S. Levine. D4RL: Datasets for deep data-driven reinforcement learning. arXiv preprint arXiv:2004.07219, 2020. URL https://arxiv.org/pdf/2004.07219.
  • Scott Fujimoto, David Meger, and Doina Precup. Off-policy deep reinforcement learning without exploration. arXiv preprint arXiv:1812.02900, 2018.
  • Seyed Kamyar Seyed Ghasemipour, Dale Schuurmans, and Shixiang Shane Gu. EMAQ: Expected-max Q-learning operator for simple yet effective offline and online RL. arXiv preprint arXiv:2007.11091, 2020.
  • Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290, 2018.
  • K. Hausman, J. T. Springenberg, Z. Wang, N. Heess, and M. Riedmiller. Learning an embedding space for transferable robot skills. In International Conference on Learning Representations (ICLR), 2018.
  • Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. beta-VAE: Learning basic visual concepts with a constrained variational framework. 2016.
  • Allan Jabri, Kyle Hsu, Abhishek Gupta, Ben Eysenbach, Sergey Levine, and Chelsea Finn. Unsupervised curricula for visual meta-reinforcement learning. In Advances in Neural Information Processing Systems, pp. 10519–10531, 2019.
  • Natasha Jaques, Asma Ghandeharioun, Judy Hanwen Shen, Craig Ferguson, Agata Lapedriza, Noah Jones, Shixiang Gu, and Rosalind Picard. Way off-policy batch deep reinforcement learning of implicit human preferences in dialog. arXiv preprint arXiv:1907.00456, 2019.
  • Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • Sanjay Krishnan, Roy Fox, Ion Stoica, and Ken Goldberg. DDCO: Discovery of deep continuous options for robot learning from demonstrations. arXiv preprint arXiv:1710.05421, 2017.
  • Ashish Kumar, Saurabh Gupta, and Jitendra Malik. Learning navigation subroutines from egocentric videos. In Conference on Robot Learning, pp. 617–626. PMLR, 2020a.
  • Aviral Kumar, Justin Fu, Matthew Soh, George Tucker, and Sergey Levine. Stabilizing off-policy Q-learning via bootstrapping error reduction. In Neural Information Processing Systems (NeurIPS), 2019.
  • Aviral Kumar, Aurick Zhou, George Tucker, and Sergey Levine. Conservative Q-learning for offline reinforcement learning. arXiv preprint arXiv:2006.04779, 2020b.
  • Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. End-to-end training of deep visuomotor policies. The Journal of Machine Learning Research, 17(1):1334–1373, 2016.
  • Sergey Levine, Aviral Kumar, George Tucker, and Justin Fu. Offline reinforcement learning: Tutorial, review, and perspectives on open problems. arXiv preprint arXiv:2005.01643, 2020.
  • Corey Lynch, Mohi Khansari, Ted Xiao, Vikash Kumar, Jonathan Tompson, Sergey Levine, and Pierre Sermanet. Learning latent plans from play. In Conference on Robot Learning, pp. 1113–1132, 2020.
  • Tatsuya Matsushima, Hiroki Furuta, Yutaka Matsuo, Ofir Nachum, and Shixiang Gu. Deployment-efficient reinforcement learning via model-based offline optimization. arXiv preprint arXiv:2006.03647, 2020.
  • Josh Merel, Leonard Hasenclever, Alexandre Galashov, Arun Ahuja, Vu Pham, Greg Wayne, Yee Whye Teh, and Nicolas Heess. Neural probabilistic motor primitives for humanoid control. arXiv preprint arXiv:1811.11711, 2018.
  • Ofir Nachum, Shixiang Gu, Honglak Lee, and Sergey Levine. Near-optimal representation learning for hierarchical reinforcement learning. arXiv preprint arXiv:1810.01257, 2018a.
  • Ofir Nachum, Shixiang Shane Gu, Honglak Lee, and Sergey Levine. Data-efficient hierarchical reinforcement learning. In Advances in Neural Information Processing Systems, pp. 3303–3313, 2018b.
  • Ofir Nachum, Michael Ahn, Hugo Ponte, Shixiang Gu, and Vikash Kumar. Multi-agent manipulation via locomotion using hierarchical sim2real. arXiv preprint arXiv:1908.05224, 2019a.
  • Ofir Nachum, Haoran Tang, Xingyu Lu, Shixiang Gu, Honglak Lee, and Sergey Levine. Why does hierarchy (sometimes) work so well in reinforcement learning? arXiv preprint arXiv:1909.10618, 2019b.
  • Paavo Parmas, Carl Edward Rasmussen, Jan Peters, and Kenji Doya. PIPPS: Flexible model-based policy search robust to the curse of chaos. arXiv preprint arXiv:1902.01240, 2019.
  • Xue Bin Peng, Michael Chang, Grace Zhang, Pieter Abbeel, and Sergey Levine. MCP: Learning composable hierarchical control with multiplicative compositional policies. In Advances in Neural Information Processing Systems, pp. 3686–3697, 2019.
  • Jan Peters, Katharina Mulling, and Yasemin Altun. Relative entropy policy search. In Twenty-Fourth AAAI Conference on Artificial Intelligence, 2010.
  • Lerrel Pinto and Abhinav Gupta. Supersizing self-supervision: Learning to grasp from 50k tries and 700 robot hours. In 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 3406–3413. IEEE, 2016.
  • Martin L Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. 1994.
  • John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
  • Tanmay Shankar and Abhinav Gupta. Learning robot skills with temporal variational inference. arXiv preprint arXiv:2006.16232, 2020.
  • Archit Sharma, Shixiang Gu, Sergey Levine, Vikash Kumar, and Karol Hausman. Dynamics-aware unsupervised discovery of skills. arXiv preprint arXiv:1907.01657, 2019.
  • David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.
  • Martin Stolle and Doina Precup. Learning options in reinforcement learning. Volume 2371, pp. 212–223, 2002. doi: 10.1007/3-540-45622-8_16.
  • G. Tesauro. Temporal difference learning and TD-Gammon. J. Int. Comput. Games Assoc., 18:88, 1995.
  • Hado Van Hasselt, Arthur Guez, and David Silver. Deep reinforcement learning with double Q-learning. arXiv preprint arXiv:1509.06461, 2015.
  • Alexander Sasha Vezhnevets, Simon Osindero, Tom Schaul, Nicolas Heess, Max Jaderberg, David Silver, and Koray Kavukcuoglu. FeUdal networks for hierarchical reinforcement learning. arXiv preprint arXiv:1703.01161, 2017.
  • Ziyu Wang, Josh S Merel, Scott E Reed, Nando de Freitas, Gregory Wayne, and Nicolas Heess. Robust imitation of diverse behaviors. In Advances in Neural Information Processing Systems, pp. 5320–5329, 2017.
  • Yifan Wu, George Tucker, and Ofir Nachum. Behavior regularized offline reinforcement learning. arXiv preprint arXiv:1911.11361, 2019.
  • Tianhe Yu, Deirdre Quillen, Zhanpeng He, Ryan Julian, Karol Hausman, Chelsea Finn, and Sergey Levine. Meta-World: A benchmark and evaluation for multi-task and meta reinforcement learning. In Conference on Robot Learning, pp. 1094–1100, 2020.