Memory Based Trajectory-conditioned Policies for Learning from Sparse Rewards
NeurIPS 2020
Reinforcement learning with sparse rewards is challenging because an agent can rarely obtain non-zero rewards, and hence gradient-based optimization of parameterized policies can be incremental and slow. Recent work demonstrated that using a memory buffer of previous successful trajectories can result in more effective policies. However, ...
- Deep reinforcement learning (DRL) algorithms with parameterized policy and value function have achieved remarkable success in various complex domains [32, 49, 48].
- Tasks that require reasoning over long horizons with sparse rewards remain exceedingly challenging for the parametric approaches.
- In these tasks, a positive reward could only be received after a long sequence of appropriate actions.
- Many parametric approaches rely on recent samples and do not explore the state space systematically.
- They might forget the positive-reward trajectories unless such trajectories are frequently re-collected.
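One remedy, central to the memory-based methods this paper builds on, is to keep good trajectories in an explicit buffer rather than only implicitly in the policy's weights. A minimal sketch of such a buffer follows; the embedding key, the tie-breaking rule, and all names are illustrative assumptions, not the paper's exact implementation:

```python
# buffer maps a discretized end-state embedding -> (trajectory, return, visit count)
def update_buffer(buffer, embedding, trajectory, ret):
    if embedding not in buffer:
        buffer[embedding] = (trajectory, ret, 1)
        return
    best_traj, best_ret, count = buffer[embedding]
    # keep the better trajectory to the same end state: higher return,
    # or equal return in fewer steps
    if ret > best_ret or (ret == best_ret and len(trajectory) < len(best_traj)):
        best_traj, best_ret = trajectory, ret
    buffer[embedding] = (best_traj, best_ret, count + 1)

buf = {}
update_buffer(buf, (2, 3), ["a", "b", "c"], 0.0)
update_buffer(buf, (2, 3), ["a", "c"], 0.0)  # shorter path, same return, replaces
```

Because trajectories are kept per end state rather than per return, a zero-reward trajectory that reaches a rarely visited region survives in the buffer and cannot be "forgotten" by gradient updates.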
- We show that in a variety of stochastic environments with local optima, our method significantly improves over count-based exploration methods and self-imitation learning
- Diverse Trajectory-conditioned Self-Imitation Learning (DTSIL) is likely to be useful in real-world RL applications, such as robotics-related tasks
- We believe RL researchers and practitioners can benefit from DTSIL to solve RL application problems requiring efficient exploration
- DTSIL helps avoid the cost of collecting human demonstration and the manual engineering burden of designing complicated reward functions
- As we discussed in Sec. 5, when deployed on more problems in the future, DTSIL has good potential to perform robustly and avoid local optima in various stochastic environments when combined with other state representation learning approaches
- 2.1 Background and Notation for DTSIL
In the standard RL setting, at each time step t, an agent observes a state st, selects an action at ∈ A, and receives a reward rt when transitioning to a state st+1 ∈ S, where S and A are the state and action spaces, respectively.
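The sparse-reward version of this setting can be made concrete with a toy interaction loop, where a non-zero reward appears only at the end of a long, correct action sequence. The chain environment and all names below are illustrative, not from the paper:

```python
class SparseChainEnv:
    """Toy environment: states 0..n-1; reward 1 only upon reaching state n-1."""
    def __init__(self, n=10):
        self.n = n
        self.s = 0

    def reset(self):
        self.s = 0
        return self.s

    def step(self, a):
        # a = +1 moves right, a = -1 moves left; position is clipped to [0, n-1]
        self.s = max(0, min(self.n - 1, self.s + a))
        done = self.s == self.n - 1
        return self.s, (1.0 if done else 0.0), done

def run_episode(env, policy, max_steps=100):
    s, total = env.reset(), 0.0
    trajectory = [s]
    for _ in range(max_steps):
        s, r, done = env.step(policy(s))
        trajectory.append(s)
        total += r
        if done:
            break
    return trajectory, total

env = SparseChainEnv(n=5)
traj, ret = run_episode(env, policy=lambda s: 1)  # always move right
```

Any policy that ever steps left earns zero return here, which is why gradient signal is so scarce: almost all sampled trajectories carry no reward at all.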
- It is worth noting that even with the ground-truth location of the agent, on the two infamously difficult games Montezuma’s Revenge and Pitfall, it is highly non-trivial to explore efficiently and avoid local optima without relying on expert demonstrations or being able to reset to arbitrary states
- Many complicated elements, such as moving entities, traps, and the agent’s inventory, must be considered in the decision-making process.
- As summarized in Tab. 1, even the previous SOTA baselines that use the agent’s ground-truth location information fail to achieve high scores
- The authors empirically show that the approach significantly outperforms count-based exploration methods and self-imitation learning on various complex tasks with local optima.
- (3) The authors achieve performance superior to the state-of-the-art within 5 billion frames on the hard-exploration Atari games Montezuma’s Revenge and Pitfall, without using expert demonstrations or resetting to arbitrary states.
- The authors show that in a variety of stochastic environments with local optima, the method significantly improves over count-based exploration methods and self-imitation learning
- This paper proposes to learn diverse policies by imitating diverse trajectory-level demonstrations through count-based exploration over these trajectories.
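The count-based selection over stored trajectories can be sketched as follows: trajectories ending in rarely visited states are sampled more often as demonstrations to imitate. The buffer layout, the 1/√count weighting, and all names are illustrative assumptions rather than the paper's exact formulation:

```python
import math
import random

# buffer entries: (embedding_of_end_state, trajectory, visit_count)
buffer = [
    (("roomA",), ["s0", "s1"], 9),  # frequently revisited end state
    (("roomB",), ["s0", "s2"], 1),  # rarely reached end state
]

def sampling_weights(buffer):
    # rarely visited end states get higher weight, steering imitation
    # toward trajectories that lead to novel regions
    return [1.0 / math.sqrt(count) for _, _, count in buffer]

weights = sampling_weights(buffer)
random.seed(0)
demo = random.choices([t for _, t, _ in buffer], weights=weights, k=1)[0]
```

With these counts, the rarely reached end state is three times as likely to be chosen as the frequently revisited one, which is the mechanism that keeps the learned policies diverse.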
- Table 1: Comparison with the state-of-the-art results. The top-2 scores for each game are in bold. Abstract-HRL [29] and NGU* (i.e., NGU with hand-crafted controllable states) [5] assume more high-level state information, including the agent’s location, inventory, etc. DTSIL, PPO+EXP [13], and SmartHash [53] only make use of the agent’s location information from RAM. IDF [10], A2C+SIL [36], PPO+CoEX [13], RND [11], NGU [5], and Agent57 [4] (a contemporaneous work) do not use RAM information. The score is averaged over multiple runs, gathered from each paper, except PPO+EXP from our implementation
- Imitation Learning. The goal of imitation learning is to train a policy to mimic a given demonstration. Many previous works achieve good results on hard-exploration Atari games by imitating human demonstrations [23, 41]. Aytar et al. learn embeddings from a variety of demonstration videos and propose a one-shot imitation learning reward, which inspires the design of rewards in our method. All these successful attempts rely on the availability of human demonstrations. In contrast, our method treats the agent’s past trajectories as demonstrations.
- Memory Based RL. An external memory buffer enables the storage and usage of past experiences to improve RL algorithms. Episodic reinforcement learning methods [43, 22, 28] typically store and update a look-up table to memorize the best episodic experiences and retrieve the episodic memory in the agent’s decision-making process. Oh et al. and Gangwani et al. train a parameterized policy to imitate only the high-reward trajectories with the SIL or GAIL objective. Unlike the previous work focusing on high-reward trajectories, we store past trajectories ending with diverse states in the buffer, because trajectories with low reward in the short term might lead to high reward in the long term. Badia et al. train a range of directed exploratory policies based on episodic memory. Gangwani et al. propose to learn multiple diverse policies in a SIL framework, but their exploration can be limited by the number of policies learned simultaneously and by the exploration performance of every single policy, as shown in the supplementary material.
- Learning Diverse Policies. Previous works [20, 17, 42] seek a diversity of policies by maximizing state coverage, the entropy of mixture skill policies, or the entropy of the goal state distribution. Zhang et al. learn a variety of policies, each performing novel action sequences, where novelty is measured by a learned autoencoder.
However, these methods focus more on tasks with relatively simple state spaces and dense rewards, while DTSIL shows strong experimental results on long-horizon, sparse-reward environments with a rich observation space like Atari games.
- Exploration. Many exploration methods [46, 2, 12, 50] in RL award a bonus to encourage an agent to visit novel states. Recently this idea was scaled up to large state spaces [53, 7, 38, 11, 39, 10]. Intrinsic curiosity methods use the prediction error or a pseudo-count as intrinsic reward signals to incentivize visiting novel states. We propose that instead of directly taking a quantification of novelty as an intrinsic reward, one can encourage exploration by rewarding the agent when it successfully imitates demonstrations that would lead to novel states. Ecoffet et al. also show the benefit of exploration by returning to promising states. Our method can be viewed in general as an extension of this idea, though we do not need to rely on the assumption that the environment can be reset to arbitrary states. Similar to previous off-policy methods, we use experience replay to enhance exploration. Many off-policy methods [25, 36, 1] tend to discard old experiences with low rewards and hence may prematurely converge to sub-optimal behaviors, but DTSIL, by using these diverse experiences, has a better chance of finding higher rewards in the long term. Contemporaneous off-policy methods [5, 4] also achieved strong results on Atari games. NGU constructs an episodic memory-based intrinsic reward using k-nearest neighbors over the agent’s recent experience to train directed exploratory policies. Agent57 parameterizes a family of policies ranging from very exploratory to purely exploitative and proposes an adaptive mechanism to choose which policy to prioritize throughout the training process.
While these methods require a large number of interactions, ours performs competitively on the hard-exploration Atari games with less than one-tenth of the samples. Model-based reinforcement learning [24, 47, 26] generally improves the efficiency of policy learning. However, in long-horizon, sparse-reward tasks, the precious transitions with non-zero rewards are rarely collected, so it is difficult to learn a model that correctly predicts the dynamics of obtaining positive rewards. We instead perform efficient policy learning in these hard-exploration tasks thanks to efficient exploration with the trajectory-conditioned policy.
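The count-based bonus referenced in the exploration discussion above typically has the form β/√N(s), decaying with the visit count of a state. A minimal sketch, where the discrete state key and the β value are illustrative assumptions:

```python
import math
from collections import Counter

visit_counts = Counter()

def count_based_bonus(state, beta=0.1):
    # the bonus shrinks as a state is visited more often,
    # pushing the agent toward rarely seen states
    visit_counts[state] += 1
    return beta / math.sqrt(visit_counts[state])

first = count_based_bonus("room1")   # first visit: full bonus beta
second = count_based_bonus("room1")  # second visit: beta / sqrt(2)
```

DTSIL's contrast with this scheme is that novelty is not paid out directly as reward; instead the agent is rewarded for faithfully following a stored trajectory whose endpoint has a low count.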
- Acknowledgments and Disclosure of Funding. This work was supported in part by NSF grant IIS-1526059 and the Korea Foundation for Advanced Studies.
- M. Andrychowicz, F. Wolski, A. Ray, J. Schneider, R. Fong, P. Welinder, B. McGrew, J. Tobin, O. P. Abbeel, and W. Zaremba. Hindsight experience replay. In Advances in Neural Information Processing Systems, pages 5048–5058, 2017.
- P. Auer. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3(Nov):397–422, 2002.
- Y. Aytar, T. Pfaff, D. Budden, T. Paine, Z. Wang, and N. de Freitas. Playing hard exploration games by watching YouTube. In Advances in Neural Information Processing Systems, pages 2930–2941, 2018.
- A. P. Badia, B. Piot, S. Kapturowski, P. Sprechmann, A. Vitvitskyi, D. Guo, and C. Blundell. Agent57: Outperforming the atari human benchmark. arXiv preprint arXiv:2003.13350, 2020.
- A. P. Badia, P. Sprechmann, A. Vitvitskyi, D. Guo, B. Piot, S. Kapturowski, O. Tieleman, M. Arjovsky, A. Pritzel, A. Bolt, et al. Never give up: Learning directed exploration strategies. arXiv preprint arXiv:2002.06038, 2020.
- D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
- M. Bellemare, S. Srinivasan, G. Ostrovski, T. Schaul, D. Saxton, and R. Munos. Unifying count-based exploration and intrinsic motivation. In Advances in Neural Information Processing Systems, pages 1471–1479, 2016.
- M. G. Bellemare, Y. Naddaf, J. Veness, and M. Bowling. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, 2013.
- J. Bornschein, A. Mnih, D. Zoran, and D. J. Rezende. Variational memory addressing in generative models. In Advances in Neural Information Processing Systems, pages 3920–3929, 2017.
- Y. Burda, H. Edwards, D. Pathak, A. Storkey, T. Darrell, and A. A. Efros. Large-scale study of curiosity-driven learning. arXiv preprint arXiv:1808.04355, 2018.
- Y. Burda, H. Edwards, A. Storkey, and O. Klimov. Exploration by random network distillation. arXiv preprint arXiv:1810.12894, 2018.
- N. Chentanez, A. G. Barto, and S. P. Singh. Intrinsically motivated reinforcement learning. In Advances in Neural Information Processing Systems, pages 1281–1288, 2005.
- J. Choi, Y. Guo, M. Moczulski, J. Oh, N. Wu, M. Norouzi, and H. Lee. Contingency-aware exploration in reinforcement learning. arXiv preprint arXiv:1811.01483, 2018.
- C. Devin, A. Gupta, T. Darrell, P. Abbeel, and S. Levine. Learning modular neural network policies for multi-task and multi-robot transfer. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pages 2169–2176. IEEE, 2017.
- Y. Duan, X. Chen, R. Houthooft, J. Schulman, and P. Abbeel. Benchmarking deep reinforcement learning for continuous control. In International Conference on Machine Learning, pages 1329–1338, 2016.
- A. Ecoffet, J. Huizinga, J. Lehman, K. O. Stanley, and J. Clune. Go-explore: a new approach for hard-exploration problems. arXiv preprint arXiv:1901.10995, 2019.
- B. Eysenbach, A. Gupta, J. Ibarz, and S. Levine. Diversity is all you need: Learning skills without a reward function. arXiv preprint arXiv:1802.06070, 2018.
- L. Fan, Y. Zhu, J. Zhu, Z. Liu, O. Zeng, A. Gupta, J. Creus-Costa, S. Savarese, and L. Fei-Fei. Surreal: Open-source reinforcement learning framework and robot manipulation benchmark. In Conference on Robot Learning, 2018.
- T. Gangwani, Q. Liu, and J. Peng. Learning self-imitating diverse policies. arXiv preprint arXiv:1805.10309, 2018.
- K. Gregor, D. J. Rezende, and D. Wierstra. Variational intrinsic control. arXiv preprint arXiv:1611.07507, 2016.
- K. Guu, T. B. Hashimoto, Y. Oren, and P. Liang. Generating sentences by editing prototypes. Transactions of the Association for Computational Linguistics, 6:437–450, 2018.
- S. Hansen, A. Pritzel, P. Sprechmann, A. Barreto, and C. Blundell. Fast deep reinforcement learning using online adjustments from the past. In Advances in Neural Information Processing Systems, pages 10567–10577, 2018.
- T. Hester, M. Vecerik, O. Pietquin, M. Lanctot, T. Schaul, B. Piot, D. Horgan, J. Quan, A. Sendonaris, I. Osband, et al. Deep q-learning from demonstrations. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
- L. Kaiser, M. Babaeizadeh, P. Milos, B. Osinski, R. H. Campbell, K. Czechowski, D. Erhan, C. Finn, P. Kozakowski, S. Levine, et al. Model-based reinforcement learning for atari. arXiv preprint arXiv:1903.00374, 2019.
- S. Kapturowski, G. Ostrovski, J. Quan, R. Munos, and W. Dabney. Recurrent experience replay in distributed reinforcement learning. In International Conference on Learning Representations, 2018.
- T. D. Kulkarni, A. Gupta, C. Ionescu, S. Borgeaud, M. Reynolds, A. Zisserman, and V. Mnih. Unsupervised learning of object keypoints for perception and control. In Advances in Neural Information Processing Systems, pages 10724–10734, 2019.
- C. Liang, M. Norouzi, J. Berant, Q. V. Le, and N. Lao. Memory augmented policy optimization for program synthesis and semantic parsing. In Advances in Neural Information Processing Systems, pages 9994–10006, 2018.
- Z. Lin, T. Zhao, G. Yang, and L. Zhang. Episodic memory deep q-networks. arXiv preprint arXiv:1805.07603, 2018.
- E. Z. Liu, R. Keramati, S. Seshadri, K. Guu, P. Pasupat, E. Brunskill, and P. Liang. Learning abstract models for long-horizon exploration, 2019. URL https://openreview.net/forum?id=ryxLG2RcYX.
- M. C. Machado, M. G. Bellemare, E. Talvitie, J. Veness, M. Hausknecht, and M. Bowling. Revisiting the arcade learning environment: Evaluation protocols and open problems for general agents. Journal of Artificial Intelligence Research, 61:523–562, 2017.
- P. Mirowski, R. Pascanu, F. Viola, H. Soyer, A. J. Ballard, A. Banino, M. Denil, R. Goroshin, L. Sifre, K. Kavukcuoglu, et al. Learning to navigate in complex environments. arXiv preprint arXiv:1611.03673, 2016.
- V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis. Human-level control through deep reinforcement learning. Nature, 2015.
- O. Nachum, S. S. Gu, H. Lee, and S. Levine. Data-efficient hierarchical reinforcement learning. In Advances in Neural Information Processing Systems, pages 3303–3313, 2018.
- A. Nair, D. Chen, P. Agrawal, P. Isola, P. Abbeel, J. Malik, and S. Levine. Combining self-supervised learning and imitation for vision-based rope manipulation. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pages 2146–2153. IEEE, 2017.
- J. Oh, S. Singh, H. Lee, and P. Kohli. Zero-shot task generalization with multi-task deep reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 2661–2670. JMLR.org, 2017.
- J. Oh, Y. Guo, S. Singh, and H. Lee. Self-imitation learning. arXiv preprint arXiv:1806.05635, 2018.
- I. Osband, Y. Doron, M. Hessel, J. Aslanides, E. Sezener, A. Saraiva, K. McKinney, T. Lattimore, C. Szepesvári, S. Singh, B. Van Roy, R. Sutton, D. Silver, and H. van Hasselt. Behaviour suite for reinforcement learning. 2019.
- G. Ostrovski, M. G. Bellemare, A. van den Oord, and R. Munos. Count-based exploration with neural density models. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 2721–2730. JMLR.org, 2017.
- D. Pathak, P. Agrawal, A. A. Efros, and T. Darrell. Curiosity-driven exploration by self-supervised prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 16–17, 2017.
- D. Pathak, P. Mahmoudieh, G. Luo, P. Agrawal, D. Chen, Y. Shentu, E. Shelhamer, J. Malik, A. A. Efros, and T. Darrell. Zero-shot visual imitation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 2050–2053, 2018.
- T. Pohlen, B. Piot, T. Hester, M. G. Azar, D. Horgan, D. Budden, G. Barth-Maron, H. van Hasselt, J. Quan, M. Vecerík, et al. Observe and look further: Achieving consistent performance on atari. arXiv preprint arXiv:1805.11593, 2018.
- V. H. Pong, M. Dalal, S. Lin, A. Nair, S. Bahl, and S. Levine. Skew-fit: State-covering self-supervised reinforcement learning. arXiv preprint arXiv:1903.03698, 2019.
- A. Pritzel, B. Uria, S. Srinivasan, A. P. Badia, O. Vinyals, D. Hassabis, D. Wierstra, and C. Blundell. Neural episodic control. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 2827–2836. JMLR.org, 2017.
- T. Schaul, D. Horgan, K. Gregor, and D. Silver. Universal value function approximators. In International conference on machine learning, pages 1312–1320, 2015.
- T. Schaul, J. Quan, I. Antonoglou, and D. Silver. Prioritized experience replay. arXiv preprint arXiv:1511.05952, 2015.
- J. Schmidhuber. Adaptive confidence and adaptive curiosity. Technical report, Citeseer, 1991.
- J. Schrittwieser, I. Antonoglou, T. Hubert, K. Simonyan, L. Sifre, S. Schmitt, A. Guez, E. Lockhart, D. Hassabis, T. Graepel, et al. Mastering atari, go, chess and shogi by planning with a learned model. arXiv preprint arXiv:1911.08265, 2019.
- J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
- D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, et al. Mastering the game of go with deep neural networks and tree search. Nature, 529(7587):484, 2016.
- A. L. Strehl and M. L. Littman. An analysis of model-based interval estimation for markov decision processes. Journal of Computer and System Sciences, 74(8):1309–1331, 2008.
- I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112, 2014.
- R. S. Sutton, D. A. McAllester, S. P. Singh, and Y. Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems, pages 1057–1063, 2000.
- H. Tang, R. Houthooft, D. Foote, A. Stooke, O. X. Chen, Y. Duan, J. Schulman, F. De Turck, and P. Abbeel. #Exploration: A study of count-based exploration for deep reinforcement learning. In Advances in Neural Information Processing Systems, pages 2753–2762, 2017.
- D. Warde-Farley, T. Van de Wiele, T. Kulkarni, C. Ionescu, S. Hansen, and V. Mnih. Unsupervised control through non-parametric discriminative rewards. arXiv preprint arXiv:1811.11359, 2018.
- F. Xia, A. R. Zamir, Z. He, A. Sax, J. Malik, and S. Savarese. Gibson env: Real-world perception for embodied agents. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9068–9079, 2018.
- Y. Zhang, W. Yu, and G. Turk. Learning novel policies for tasks. arXiv preprint arXiv:1905.05252, 2019.