Memory Based Trajectory-conditioned Policies for Learning from Sparse Rewards

NeurIPS 2020

Abstract

Reinforcement learning with sparse rewards is challenging because an agent can rarely obtain non-zero rewards, and hence gradient-based optimization of parameterized policies can be incremental and slow. Recent work demonstrated that using a memory buffer of previous successful trajectories can result in more effective policies. However, ...

Introduction
  • Deep reinforcement learning (DRL) algorithms with parameterized policy and value function have achieved remarkable success in various complex domains [32, 49, 48].
  • Tasks that require reasoning over long horizons with sparse rewards remain exceedingly challenging for these parametric approaches.
  • In these tasks, a positive reward is received only after a long sequence of appropriate actions.
  • Many parametric approaches rely on recent samples and do not explore the state space systematically.
  • They may also forget positive-reward trajectories unless such trajectories are collected frequently
Highlights
  • Deep reinforcement learning (DRL) algorithms with parameterized policy and value function have achieved remarkable success in various complex domains [32, 49, 48]
  • We show that in a variety of stochastic environments with local optima, our method significantly improves over count-based exploration methods and self-imitation learning
  • Diverse Trajectory-conditioned Self-Imitation Learning (DTSIL) is likely to be useful in real-world RL applications, such as robotics-related tasks
  • We believe RL researchers and practitioners can benefit from DTSIL to solve RL application problems requiring efficient exploration
  • DTSIL helps avoid the cost of collecting human demonstrations and the manual engineering burden of designing complicated reward functions
  • As we discussed in Sec. 5, when deployed on more problems in the future, DTSIL has good potential to perform robustly and avoid local optima in various stochastic environments when combined with other state representation learning approaches
Methods
  • 2.1 Background and Notation for DTSIL

    In the standard RL setting, at each time step t, an agent observes a state s_t, selects an action a_t ∈ A, and receives a reward r_t when transitioning to a state s_{t+1} ∈ S, where S and A are the state and action spaces, respectively. A minimal sketch of a DTSIL-style loop under this notation is given after this list.
  • It is worth noting that even with the ground-truth location of the agent, on the two infamously difficult games Montezuma’s Revenge and Pitfall, it is highly non-trivial to explore efficiently and avoid local optima without relying on expert demonstrations or being able to reset to arbitrary states
  • Many complicated elements, such as moving entities, traps, and the agent’s inventory, must be considered in the decision-making process.
  • As summarized in Tab. 1, even the previous SOTA baselines that use the agent’s ground-truth location information fail to achieve high scores
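
To make the above concrete, here is a minimal, hedged sketch of how a DTSIL-style outer loop could be organized under this notation: a buffer keeps trajectories that end in diverse (embedded) states, a demonstration is sampled with a count-based preference for rarely visited end states, and a trajectory-conditioned policy is rolled out to imitate it. All names here (`TrajectoryBuffer`, `embed`, `policy.act`, `policy.learn`, the Gym-like `env` interface) are illustrative assumptions, not the authors' implementation.

```python
import random
from collections import defaultdict

# Hypothetical sketch of a DTSIL-style outer loop (not the authors' code).
# Assumptions: `env` is a Gym-like environment, `embed(s)` maps a state to a
# discrete embedding (e.g., the agent's location), and `policy.act(obs, demo, t)`
# is a trajectory-conditioned policy that tries to follow the demonstration.

def embed(state):
    # Placeholder state embedding; the paper summary mentions e.g. the agent's location.
    return tuple(state) if hasattr(state, "__iter__") else state

def total_reward(traj):
    return sum(r for (_, _, r) in traj)

class TrajectoryBuffer:
    def __init__(self):
        self.trajs = {}                 # end-state embedding -> best trajectory seen so far
        self.counts = defaultdict(int)  # visit counts per end-state embedding

    def update(self, traj):
        # Embedding of the last stored state (a simplification of "end state").
        e_end = embed(traj[-1][0])
        self.counts[e_end] += 1
        # Keep one trajectory per diverse end state, preferring higher return.
        if e_end not in self.trajs or total_reward(traj) > total_reward(self.trajs[e_end]):
            self.trajs[e_end] = traj

    def sample_demonstration(self):
        # Prefer trajectories whose end states have been visited rarely
        # (one plausible count-based weighting).
        keys = list(self.trajs.keys())
        weights = [1.0 / (self.counts[k] ** 0.5) for k in keys]
        return self.trajs[random.choices(keys, weights=weights)[0]]

def train(env, policy, buffer, num_episodes):
    for _ in range(num_episodes):
        demo = buffer.sample_demonstration() if buffer.trajs else None
        traj, obs, done, t = [], env.reset(), False, 0
        while not done:
            action = policy.act(obs, demo, t)        # condition on the sampled demo
            next_obs, reward, done, _ = env.step(action)
            traj.append((obs, action, reward))
            obs, t = next_obs, t + 1
        buffer.update(traj)                          # grow the set of diverse trajectories
        policy.learn(traj, demo)                     # self-imitation / policy-gradient update
```

The design choice reflected here, as the paper summary emphasizes, is to retain trajectories ending in diverse states rather than only the highest-reward ones, since trajectories with low short-term reward may lead to high reward in the long term.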
Results
  • The authors empirically show that the approach significantly outperforms count-based exploration methods and self-imitation learning on various complex tasks with local optima.
  • The authors achieve performance superior to the state of the art within 5 billion frames on the hard-exploration Atari games Montezuma’s Revenge and Pitfall, without using expert demonstrations or resetting to arbitrary states.
  • The authors show that in a variety of stochastic environments with local optima, the method significantly improves over count-based exploration methods and self-imitation learning
Conclusion
  • This paper proposes to learn diverse policies by imitating diverse trajectory-level demonstrations through count-based exploration over these trajectories (one plausible count-based weighting is sketched after this list).
  • The authors show that in a variety of stochastic environments with local optima, the method significantly improves over count-based exploration methods and self-imitation learning.
  • As the authors discussed in Sec. 5, when deployed on more problems in the future, DTSIL has good potential to perform robustly and avoid local optima in various stochastic environments when combined with other state representation learning approaches
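
As one concrete illustration of "count-based exploration over these trajectories" (an assumed instantiation, not necessarily the paper's exact formula), the probability of selecting a stored trajectory τ as the demonstration can be weighted inversely by how often its end-state embedding has been reached:

```latex
P(\tau) \;\propto\; \frac{1}{\sqrt{n(e_\tau)} + 1}
```

where n(e_τ) is the visit count of the end-state embedding e_τ of trajectory τ. Under such a weighting, rarely reached end states are imitated more often, which is what drives the count-based exploration described above.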
Tables
  • Table 1: Comparison with the state-of-the-art results. The top-2 scores for each game are in bold. Abstract-HRL [29] and NGU* (i.e., NGU with hand-crafted controllable states) [5] assume more high-level state information, including the agent’s location, inventory, etc. DTSIL, PPO+EXP [13], and SmartHash [53] only make use of the agent’s location information from RAM. IDF [10], A2C+SIL [36], PPO+CoEX [13], RND [11], NGU [5] and Agent57 [4] (a contemporaneous work) do not use RAM information. The score is averaged over multiple runs, gathered from each paper, except PPO+EXP from our implementation
Related work
  • Imitation Learning. The goal of imitation learning is to train a policy to mimic a given demonstration. Many previous works achieve good results on hard-exploration Atari games by imitating human demonstrations [23, 41]. Aytar et al. [3] learn embeddings from a variety of demonstration videos and propose the one-shot imitation learning reward, which inspires the design of rewards in our method. All of these successful attempts rely on the availability of human demonstrations. In contrast, our method treats the agent’s past trajectories as demonstrations.
  • Memory-Based RL. An external memory buffer enables the storage and usage of past experiences to improve RL algorithms. Episodic reinforcement learning methods [43, 22, 28] typically store and update a look-up table to memorize the best episodic experiences and retrieve the episodic memory in the agent’s decision-making process. Oh et al. [36] and Gangwani et al. [19] train a parameterized policy to imitate only the high-reward trajectories with the SIL or GAIL objective. Unlike the previous work focusing on high-reward trajectories, we store past trajectories ending with diverse states in the buffer, because trajectories with low reward in the short term might lead to high reward in the long term. Badia et al. [5] train a range of directed exploratory policies based on episodic memory. Gangwani et al. [19] propose to learn multiple diverse policies in a SIL framework, but their exploration can be limited by the number of policies learned simultaneously and the exploration performance of every single policy, as shown in the supplementary material.
  • Learning Diverse Policies. Previous works [20, 17, 42] seek a diversity of policies by maximizing state coverage, the entropy of mixture skill policies, or the entropy of the goal state distribution. Zhang et al. [56] learn a variety of policies, each performing novel action sequences, where the novelty is measured by a learned autoencoder. However, these methods focus on tasks with relatively simple state spaces and dense rewards, while DTSIL performs well on long-horizon, sparse-reward environments with a rich observation space such as Atari games.
  • Exploration. Many exploration methods [46, 2, 12, 50] in RL award a bonus to encourage an agent to visit novel states. Recently this idea was scaled up to large state spaces [53, 7, 38, 11, 39, 10]. Intrinsic curiosity uses the prediction error or pseudo-count as an intrinsic reward signal to incentivize visiting novel states. We propose that, instead of directly taking a quantification of novelty as an intrinsic reward, one can encourage exploration by rewarding the agent when it successfully imitates demonstrations that would lead to novel states (a minimal sketch of such an imitation-based bonus follows this section). Ecoffet et al. [16] also show the benefit of exploration by returning to promising states. Our method can be viewed in general as an extension of [16], though we do not rely on the assumption that the environment can be reset to arbitrary states. Similar to previous off-policy methods, we use experience replay to enhance exploration. Many off-policy methods [25, 36, 1] tend to discard old experiences with low rewards and hence may prematurely converge to sub-optimal behaviors, but DTSIL, by keeping these diverse experiences, has a better chance of finding higher rewards in the long term. Contemporaneous works [5, 4], also off-policy methods, achieved strong results on Atari games. NGU [5] constructs an episodic-memory-based intrinsic reward using k-nearest neighbors over the agent’s recent experience to train the directed exploratory policies. Agent57 [4] parameterizes a family of policies ranging from very exploratory to purely exploitative and proposes an adaptive mechanism to choose which policy to prioritize throughout the training process. While these methods require a large number of interactions, ours performs competitively on the hard-exploration Atari games with less than one-tenth of the samples. Model-based reinforcement learning [24, 47, 26] generally improves the efficiency of policy learning. However, in long-horizon, sparse-reward tasks it is rare to collect the precious transitions with non-zero rewards, and thus it is difficult to learn a model that correctly predicts the dynamics of obtaining positive rewards. We instead perform efficient policy learning in hard-exploration tasks thanks to efficient exploration with the trajectory-conditioned policy.
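
Below is a minimal, hedged sketch of what such an imitation-based exploration bonus could look like: the agent earns a small bonus whenever it reaches a state whose embedding matches the next not-yet-matched state of the sampled demonstration, on top of the environment reward. The bonus value, the exact-match rule, and the `embed_fn` argument are illustrative assumptions rather than the paper's precise reward design.

```python
# Hypothetical imitation-based exploration bonus (a sketch, not the authors'
# exact reward). `demo_embeddings` is the sequence of state embeddings along
# the sampled demonstration; `embed_fn` maps a raw state to its embedding.

def imitation_bonus(state, demo_embeddings, next_idx, embed_fn, bonus=0.1):
    """Return (bonus_reward, updated_next_idx).

    The agent is rewarded when its current state matches the next
    unmatched demonstration state, so following the demonstration
    step by step accumulates small positive rewards that lead toward
    the novel states at the end of the trajectory.
    """
    if next_idx < len(demo_embeddings) and embed_fn(state) == demo_embeddings[next_idx]:
        return bonus, next_idx + 1
    return 0.0, next_idx


# Usage inside a rollout (illustrative): mix the bonus with the environment
# reward before the policy update.
#   r_env, done = ...                                    # from env.step(action)
#   r_imit, next_idx = imitation_bonus(obs, demo_embeddings, next_idx, embed)
#   r_total = r_env + r_imit
```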
Funding
  • Acknowledgments and Disclosure of Funding: This work was supported in part by NSF grant IIS-1526059 and the Korea Foundation for Advanced Studies.
References
  • M. Andrychowicz, F. Wolski, A. Ray, J. Schneider, R. Fong, P. Welinder, B. McGrew, J. Tobin, O. P. Abbeel, and W. Zaremba. Hindsight experience replay. In Advances in Neural Information Processing Systems, pages 5048–5058, 2017.
  • P. Auer. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3(Nov):397–422, 2002.
  • Y. Aytar, T. Pfaff, D. Budden, T. Paine, Z. Wang, and N. de Freitas. Playing hard exploration games by watching youtube. In Advances in Neural Information Processing Systems, pages 2930–2941, 2018.
  • A. P. Badia, B. Piot, S. Kapturowski, P. Sprechmann, A. Vitvitskyi, D. Guo, and C. Blundell. Agent57: Outperforming the atari human benchmark. arXiv preprint arXiv:2003.13350, 2020.
  • A. P. Badia, P. Sprechmann, A. Vitvitskyi, D. Guo, B. Piot, S. Kapturowski, O. Tieleman, M. Arjovsky, A. Pritzel, A. Bolt, et al. Never give up: Learning directed exploration strategies. arXiv preprint arXiv:2002.06038, 2020.
  • D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
  • M. Bellemare, S. Srinivasan, G. Ostrovski, T. Schaul, D. Saxton, and R. Munos. Unifying count-based exploration and intrinsic motivation. In Advances in Neural Information Processing Systems, pages 1471–1479, 2016.
  • M. G. Bellemare, Y. Naddaf, J. Veness, and M. Bowling. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, 2013.
  • J. Bornschein, A. Mnih, D. Zoran, and D. J. Rezende. Variational memory addressing in generative models. In Advances in Neural Information Processing Systems, pages 3920–3929, 2017.
  • Y. Burda, H. Edwards, D. Pathak, A. Storkey, T. Darrell, and A. A. Efros. Large-scale study of curiosity-driven learning. arXiv preprint arXiv:1808.04355, 2018.
  • Y. Burda, H. Edwards, A. Storkey, and O. Klimov. Exploration by random network distillation. arXiv preprint arXiv:1810.12894, 2018.
  • N. Chentanez, A. G. Barto, and S. P. Singh. Intrinsically motivated reinforcement learning. In Advances in neural information processing systems, pages 1281–1288, 2005.
  • J. Choi, Y. Guo, M. Moczulski, J. Oh, N. Wu, M. Norouzi, and H. Lee. Contingency-aware exploration in reinforcement learning. arXiv preprint arXiv:1811.01483, 2018.
  • C. Devin, A. Gupta, T. Darrell, P. Abbeel, and S. Levine. Learning modular neural network policies for multi-task and multi-robot transfer. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pages 2169–2176. IEEE, 2017.
  • Y. Duan, X. Chen, R. Houthooft, J. Schulman, and P. Abbeel. Benchmarking deep reinforcement learning for continuous control. In International Conference on Machine Learning, pages 1329–1338, 2016.
  • A. Ecoffet, J. Huizinga, J. Lehman, K. O. Stanley, and J. Clune. Go-explore: a new approach for hard-exploration problems. arXiv preprint arXiv:1901.10995, 2019.
  • B. Eysenbach, A. Gupta, J. Ibarz, and S. Levine. Diversity is all you need: Learning skills without a reward function. arXiv preprint arXiv:1802.06070, 2018.
  • L. Fan, Y. Zhu, J. Zhu, Z. Liu, O. Zeng, A. Gupta, J. Creus-Costa, S. Savarese, and L. Fei-Fei. Surreal: Open-source reinforcement learning framework and robot manipulation benchmark. In Conference on Robot Learning, 2018.
  • T. Gangwani, Q. Liu, and J. Peng. Learning self-imitating diverse policies. arXiv preprint arXiv:1805.10309, 2018.
  • K. Gregor, D. J. Rezende, and D. Wierstra. Variational intrinsic control. arXiv preprint arXiv:1611.07507, 2016.
  • K. Guu, T. B. Hashimoto, Y. Oren, and P. Liang. Generating sentences by editing prototypes. Transactions of the Association for Computational Linguistics, 6:437–450, 2018.
  • S. Hansen, A. Pritzel, P. Sprechmann, A. Barreto, and C. Blundell. Fast deep reinforcement learning using online adjustments from the past. In Advances in Neural Information Processing Systems, pages 10567–10577, 2018.
  • T. Hester, M. Vecerik, O. Pietquin, M. Lanctot, T. Schaul, B. Piot, D. Horgan, J. Quan, A. Sendonaris, I. Osband, et al. Deep q-learning from demonstrations. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
  • L. Kaiser, M. Babaeizadeh, P. Milos, B. Osinski, R. H. Campbell, K. Czechowski, D. Erhan, C. Finn, P. Kozakowski, S. Levine, et al. Model-based reinforcement learning for atari. arXiv preprint arXiv:1903.00374, 2019.
  • S. Kapturowski, G. Ostrovski, J. Quan, R. Munos, and W. Dabney. Recurrent experience replay in distributed reinforcement learning. In International conference on learning representations, 2018.
  • T. D. Kulkarni, A. Gupta, C. Ionescu, S. Borgeaud, M. Reynolds, A. Zisserman, and V. Mnih. Unsupervised learning of object keypoints for perception and control. In Advances in neural information processing systems, pages 10724–10734, 2019.
  • C. Liang, M. Norouzi, J. Berant, Q. V. Le, and N. Lao. Memory augmented policy optimization for program synthesis and semantic parsing. In Advances in Neural Information Processing Systems, pages 9994–10006, 2018.
  • Z. Lin, T. Zhao, G. Yang, and L. Zhang. Episodic memory deep q-networks. arXiv preprint arXiv:1805.07603, 2018.
  • E. Z. Liu, R. Keramati, S. Seshadri, K. Guu, P. Pasupat, E. Brunskill, and P. Liang. Learning abstract models for long-horizon exploration, 2019. URL https://openreview.net/forum?id=ryxLG2RcYX.
  • M. C. Machado, M. G. Bellemare, E. Talvitie, J. Veness, M. Hausknecht, and M. Bowling. Revisiting the arcade learning environment: Evaluation protocols and open problems for general agents. Journal of Artificial Intelligence Research, 61:523–562, 2017.
  • P. Mirowski, R. Pascanu, F. Viola, H. Soyer, A. J. Ballard, A. Banino, M. Denil, R. Goroshin, L. Sifre, K. Kavukcuoglu, et al. Learning to navigate in complex environments. arXiv preprint arXiv:1611.03673, 2016.
  • V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis. Human-level control through deep reinforcement learning. Nature, 2015.
  • O. Nachum, S. S. Gu, H. Lee, and S. Levine. Data-efficient hierarchical reinforcement learning. In Advances in Neural Information Processing Systems, pages 3303–3313, 2018.
  • A. Nair, D. Chen, P. Agrawal, P. Isola, P. Abbeel, J. Malik, and S. Levine. Combining self-supervised learning and imitation for vision-based rope manipulation. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pages 2146–2153. IEEE, 2017.
  • J. Oh, S. Singh, H. Lee, and P. Kohli. Zero-shot task generalization with multi-task deep reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 2661–2670. JMLR. org, 2017.
  • J. Oh, Y. Guo, S. Singh, and H. Lee. Self-imitation learning. arXiv preprint arXiv:1806.05635, 2018.
  • I. Osband, Y. Doron, M. Hessel, J. Aslanides, E. Sezener, A. Saraiva, K. McKinney, T. Lattimore, C. Szepesvári, S. Singh, B. Van Roy, R. Sutton, D. Silver, and H. van Hasselt. Behaviour suite for reinforcement learning. 2019.
  • G. Ostrovski, M. G. Bellemare, A. van den Oord, and R. Munos. Count-based exploration with neural density models. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 2721–2730. JMLR. org, 2017.
  • D. Pathak, P. Agrawal, A. A. Efros, and T. Darrell. Curiosity-driven exploration by self-supervised prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 16–17, 2017.
  • D. Pathak, P. Mahmoudieh, G. Luo, P. Agrawal, D. Chen, Y. Shentu, E. Shelhamer, J. Malik, A. A. Efros, and T. Darrell. Zero-shot visual imitation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 2050–2053, 2018.
  • T. Pohlen, B. Piot, T. Hester, M. G. Azar, D. Horgan, D. Budden, G. Barth-Maron, H. van Hasselt, J. Quan, M. Vecerík, et al. Observe and look further: Achieving consistent performance on atari. arXiv preprint arXiv:1805.11593, 2018.
  • V. H. Pong, M. Dalal, S. Lin, A. Nair, S. Bahl, and S. Levine. Skew-fit: State-covering self-supervised reinforcement learning. arXiv preprint arXiv:1903.03698, 2019.
  • A. Pritzel, B. Uria, S. Srinivasan, A. P. Badia, O. Vinyals, D. Hassabis, D. Wierstra, and C. Blundell. Neural episodic control. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 2827–2836. JMLR. org, 2017.
  • T. Schaul, D. Horgan, K. Gregor, and D. Silver. Universal value function approximators. In International conference on machine learning, pages 1312–1320, 2015.
  • T. Schaul, J. Quan, I. Antonoglou, and D. Silver. Prioritized experience replay. arXiv preprint arXiv:1511.05952, 2015.
  • J. Schmidhuber. Adaptive confidence and adaptive curiosity. Technical report, Citeseer, 1991.
  • J. Schrittwieser, I. Antonoglou, T. Hubert, K. Simonyan, L. Sifre, S. Schmitt, A. Guez, E. Lockhart, D. Hassabis, T. Graepel, et al. Mastering atari, go, chess and shogi by planning with a learned model. arXiv preprint arXiv:1911.08265, 2019.
  • J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
  • D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, et al. Mastering the game of go with deep neural networks and tree search. Nature, 529(7587):484, 2016.
  • A. L. Strehl and M. L. Littman. An analysis of model-based interval estimation for markov decision processes. Journal of Computer and System Sciences, 74(8):1309–1331, 2008.
  • I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to sequence learning with neural networks. In Advances in neural information processing systems, pages 3104–3112, 2014.
  • R. S. Sutton, D. A. McAllester, S. P. Singh, and Y. Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in neural information processing systems, pages 1057–1063, 2000.
  • H. Tang, R. Houthooft, D. Foote, A. Stooke, O. X. Chen, Y. Duan, J. Schulman, F. DeTurck, and P. Abbeel. # exploration: A study of count-based exploration for deep reinforcement learning. In Advances in neural information processing systems, pages 2753–2762, 2017.
  • D. Warde-Farley, T. Van de Wiele, T. Kulkarni, C. Ionescu, S. Hansen, and V. Mnih. Unsupervised control through non-parametric discriminative rewards. arXiv preprint arXiv:1811.11359, 2018.
  • F. Xia, A. R. Zamir, Z. He, A. Sax, J. Malik, and S. Savarese. Gibson env: Real-world perception for embodied agents. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9068–9079, 2018.
  • Y. Zhang, W. Yu, and G. Turk. Learning novel policies for tasks. arXiv preprint arXiv:1905.05252, 2019.