#Exploration: A Study of Count-Based Exploration for Deep Reinforcement Learning

ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 30 (NIPS 2017), 2017.

We describe a surprising finding: a simple generalization of the classic count-based approach can reach near state-of-the-art performance on various high-dimensional and/or continuous deep RL benchmarks

Abstract:

Count-based exploration algorithms are known to perform near-optimally when used in conjunction with tabular reinforcement learning (RL) methods for solving small discrete Markov decision processes (MDPs). It is generally thought that count-based methods cannot be applied in high-dimensional state spaces, since most states will only occur...

Introduction
  • Reinforcement learning (RL) studies an agent acting in an initially unknown environment, learning through trial and error to maximize rewards.
  • Most of the recent state-of-the-art RL results have been obtained using simple exploration strategies such as uniform sampling [21] and i.i.d./correlated Gaussian noise [19, 30].
  • While these heuristics are sufficient in tasks with well-shaped rewards, the sample complexity can grow exponentially in tasks with sparse rewards [25].
  • However, the authors have not seen a very simple and fast method that can work across different domains.
Highlights
  • Reinforcement learning (RL) studies an agent acting in an initially unknown environment, learning through trial and error to maximize rewards
  • Most of the recent state-of-the-art RL results have been obtained using simple exploration strategies such as uniform sampling [21] and i.i.d./correlated Gaussian noise [19, 30]. While these heuristics are sufficient in tasks with well-shaped rewards, the sample complexity can grow exponentially in tasks with sparse rewards [25]
  • Recently developed exploration strategies for deep RL have led to significantly improved performance on environments with sparse rewards
  • This paper presents a simple approach for exploration, which extends classic counting-based methods to high-dimensional, continuous state spaces
  • This paper demonstrates that a generalization of classical counting techniques through hashing is able to provide an appropriate signal for exploration, even in continuous and/or high-dimensional MDPs using function approximators, resulting in near state-of-the-art performance across benchmarks
  • 5.2 Hyperparameter sensitivity: To study the performance sensitivity to hyperparameter changes, we focus on evaluating TRPO-RAM-SimHash on the Atari 2600 game Frostbite, where the method has a clear advantage over the baseline
Methods
  • This paper assumes a finite-horizon discounted Markov decision process (MDP), defined by (S, A, P, r, ρ0, γ, T ), in which S is the state space, A the action space, P a transition probability distribution, r : S × A → R a reward function, ρ0 an initial state distribution, γ ∈ (0, 1] a discount factor, and T the horizon.
  • 2.2 Count-Based Exploration via Static Hashing.
  • The authors' approach discretizes the state space with a hash function φ : S → Z.
  • An exploration bonus r+ : S → R is added to the reward function, defined as r+(s) = β / √n(φ(s)) (Eq. 1), where n(φ(s)) is the number of times the hash code φ(s) has been visited and β is the bonus coefficient (see the sketch below).
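To make the scheme concrete, the following is a minimal sketch of count-based exploration via static hashing, assuming SimHash codes φ(s) = sign(A·s) with A a fixed Gaussian projection. The class name, default key length k, bonus coefficient β, and per-call count update are illustrative choices, not the authors' exact implementation (in the paper, counts are updated from each batch of collected rollouts before the policy update).

    import numpy as np
    from collections import defaultdict

    class SimHashCounter:
        """Sketch of a SimHash count table with bonus r+(s) = beta / sqrt(n(phi(s)))."""

        def __init__(self, state_dim, k=32, beta=0.01, seed=0):
            rng = np.random.default_rng(seed)
            # Fixed random projection A (k x state_dim) with i.i.d. Gaussian entries;
            # phi(s) = sign(A s) maps similar states to the same k-bit code.
            self.A = rng.standard_normal((k, state_dim))
            self.beta = beta
            self.counts = defaultdict(int)  # n(phi(s)), keyed by the binary code

        def _hash(self, state):
            bits = self.A @ np.asarray(state, dtype=np.float64) > 0
            return tuple(bits.astype(np.uint8).tolist())  # hashable k-bit key

        def bonus(self, state):
            key = self._hash(state)
            self.counts[key] += 1                         # update n(phi(s))
            return self.beta / np.sqrt(self.counts[key])  # Eq. (1)

In use, the bonus is simply added to the environment reward before the policy update, e.g. shaped_reward = reward + counter.bonus(obs), leaving the rest of the RL algorithm (TRPO in the paper) unchanged.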
Results
  • A reason why the proposed method does not achieve state-of-the-art performance on all games is that TRPO does not reuse off-policy experience, in contrast to DQN-based algorithms [4, 23, 38].
Conclusion
  • This paper demonstrates that a generalization of classical counting techniques through hashing is able to provide an appropriate signal for exploration, even in continuous and/or high-dimensional MDPs using function approximators, resulting in near state-of-the-art performance across benchmarks.
  • It provides a simple yet powerful baseline for solving MDPs that require informed exploration
Tables
  • Table1: Atari 2600: average total reward after training for 50 M time steps. Boldface numbers indicate the best results; italic numbers are the best among our methods.
  • TRPO hyperparameters for rllab experiments
  • Table2: TRPO hyperparameters for Atari experiments with image input
  • Table3: TRPO hyperparameters for Atari experiments with RAM input
  • Table4: Granularity parameters of various hash functions
  • Table5: Average score at 50 M time steps achieved by TRPO-pixel-SimHash for different SimHash key lengths k
  • Table6: TRPO-RAM-SimHash performance robustness to hyperparameter changes on Frostbite, across different bonus coefficients β
  • Table7: Average score at 50 M time steps achieved by TRPO-SmartHash on Montezuma’s Revenge (RAM observations)
  • Table8: Interpretation of particular RAM entries in Montezuma’s Revenge
  • Table9: Performance comparison between state counting (left of the slash) and state-action counting (right of the slash) using TRPO-RAM-SimHash on Frostbite, across different bonus coefficients β (state-action variant sketched below)
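For the comparison in Table 9, the counts are keyed on state-action pairs rather than states alone. A minimal sketch of that variant, reusing the hypothetical SimHashCounter above and assuming discrete (Atari) actions:

    def bonus_state_action(counter, state, action):
        # Key the count table on (phi(s), a) instead of phi(s), so the bonus
        # decays as beta / sqrt(n(phi(s), a)).
        key = (counter._hash(state), int(action))
        counter.counts[key] += 1
        return counter.beta / np.sqrt(counter.counts[key])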
Related work
  • Classic count-based methods such as MBIE [33], MBIE-EB, and [16] solve an approximate Bellman equation as an inner loop before the agent takes an action [34]. As such, bonus rewards are propagated immediately throughout the state-action space. In contrast, contemporary deep RL algorithms propagate the bonus signal, at limited speed, based on rollouts collected from interacting with the environment, using value-based [21] or policy gradient-based [22, 30] methods. In addition, because our proposed method is intended to work with contemporary deep RL algorithms, it differs from classical count-based methods in that it relies on visiting unseen states first, before the bonus reward can be assigned, so uninformed exploration strategies remain a necessity at the beginning. Filling the gap between our method and the classical theory is an important direction for future research.

    A related line of classical exploration methods is based on the idea of optimism in the face of uncertainty [5] but not restricted to using counting to implement “optimism”, e.g., R-Max [5], UCRL [14], and E3 [15]. These methods, similar to MBIE and MBIE-EB, have theoretical guarantees in tabular settings.
Funding
  • This research was funded in part by ONR through a PECASE award
  • Yan Duan was also supported by a Berkeley AI Research lab Fellowship and a Huawei Fellowship
  • Xi Chen was also supported by a Berkeley AI Research lab Fellowship
  • We gratefully acknowledge the support of the NSF through grant IIS-1619362 and of the ARC through a Laureate Fellowship (FL110100281) and through the ARC Centre of Excellence for Mathematical and Statistical Frontiers
  • Adam Stooke gratefully acknowledges funding from a Fannie and John Hertz Foundation fellowship
  • Rein Houthooft was supported by a Ph.D. fellowship.
Reference
  • Abel, David, Agarwal, Alekh, Diaz, Fernando, Krishnamurthy, Akshay, and Schapire, Robert E. Exploratory gradient boosting for reinforcement learning in complex domains. arXiv preprint arXiv:1603.04119, 2016.
  • Andoni, Alexandr and Indyk, Piotr. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. In Proceedings of the 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS), pp. 459–468, 2006.
  • Bellemare, Marc G, Naddaf, Yavar, Veness, Joel, and Bowling, Michael. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, 06 2013.
  • Bellemare, Marc G, Srinivasan, Sriram, Ostrovski, Georg, Schaul, Tom, Saxton, David, and Munos, Remi. Unifying count-based exploration and intrinsic motivation. In Advances in Neural Information Processing Systems 29 (NIPS), pp. 1471–1479, 2016.
  • Brafman, Ronen I and Tennenholtz, Moshe. R-max-a general polynomial time algorithm for near-optimal reinforcement learning. Journal of Machine Learning Research, 3:213–231, 2002.
  • Charikar, Moses S. Similarity estimation techniques from rounding algorithms. In Proceedings of the 34th Annual ACM Symposium on Theory of Computing (STOC), pp. 380–388, 2002.
  • Dalal, Navneet and Triggs, Bill. Histograms of oriented gradients for human detection. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), pp. 886–893, 2005.
  • Duan, Yan, Chen, Xi, Houthooft, Rein, Schulman, John, and Abbeel, Pieter. Benchmarking deep reinforcement learning for continuous control. In Proceedings of the 33rd International Conference on Machine Learning (ICML), pp. 1329–1338, 2016.
  • Ghavamzadeh, Mohammad, Mannor, Shie, Pineau, Joelle, and Tamar, Aviv. Bayesian reinforcement learning: A survey. Foundations and Trends in Machine Learning, 8(5-6):359–483, 2015.
  • Gregor, Karol, Besse, Frederic, Jimenez Rezende, Danilo, Danihelka, Ivo, and Wierstra, Daan. Towards conceptual compression. In Advances in Neural Information Processing Systems 29 (NIPS), pp. 3549–3557. 2016.
  • Guez, Arthur, Heess, Nicolas, Silver, David, and Dayan, Peter. Bayes-adaptive simulation-based search with value function approximation. In Advances in Neural Information Processing Systems (NIPS), pp. 451–459, 2014.
  • He, Kaiming, Zhang, Xiangyu, Ren, Shaoqing, and Sun, Jian. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385, 2015.
  • Houthooft, Rein, Chen, Xi, Duan, Yan, Schulman, John, De Turck, Filip, and Abbeel, Pieter. VIME: Variational information maximizing exploration. In Advances in Neural Information Processing Systems 29 (NIPS), pp. 1109–1117, 2016.
  • Jaksch, Thomas, Ortner, Ronald, and Auer, Peter. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11:1563–1600, 2010.
  • Kearns, Michael and Singh, Satinder. Near-optimal reinforcement learning in polynomial time. Machine Learning, 49(2-3):209–232, 2002.
  • Kolter, J Zico and Ng, Andrew Y. Near-bayesian exploration in polynomial time. In Proceedings of the 26th International Conference on Machine Learning (ICML), pp. 513–520, 2009.
  • Krizhevsky, Alex, Sutskever, Ilya, and Hinton, Geoffrey E. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25 (NIPS), pp. 1097–1105, 2012.
  • Lai, Tze Leung and Robbins, Herbert. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4–22, 1985.
  • Lillicrap, Timothy P, Hunt, Jonathan J, Pritzel, Alexander, Heess, Nicolas, Erez, Tom, Tassa, Yuval, Silver, David, and Wierstra, Daan. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
  • Lowe, David G. Object recognition from local scale-invariant features. In Proceedings of the 7th IEEE International Conference on Computer Vision (ICCV), pp. 1150–1157, 1999.
  • Mnih, Volodymyr, Kavukcuoglu, Koray, Silver, David, Rusu, Andrei A, Veness, Joel, Bellemare, Marc G, Graves, Alex, Riedmiller, Martin, Fidjeland, Andreas K, Ostrovski, Georg, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
  • Mnih, Volodymyr, Badia, Adria Puigdomenech, Mirza, Mehdi, Graves, Alex, Lillicrap, Timothy P, Harley, Tim, Silver, David, and Kavukcuoglu, Koray. Asynchronous methods for deep reinforcement learning. arXiv preprint arXiv:1602.01783, 2016.
  • Nair, Arun, Srinivasan, Praveen, Blackwell, Sam, Alcicek, Cagdas, Fearon, Rory, De Maria, Alessandro, Panneershelvam, Vedavyas, Suleyman, Mustafa, Beattie, Charles, Petersen, Stig, et al. Massively parallel methods for deep reinforcement learning. arXiv preprint arXiv:1507.04296, 2015.
  • Osband, Ian, Blundell, Charles, Pritzel, Alexander, and Van Roy, Benjamin. Deep exploration via bootstrapped DQN. In Advances in Neural Information Processing Systems 29 (NIPS), pp. 4026–4034, 2016.
  • Osband, Ian, Van Roy, Benjamin, and Wen, Zheng. Generalization and exploration via randomized value functions. In Proceedings of the 33rd International Conference on Machine Learning (ICML), pp. 2377–2386, 2016.
  • Oudeyer, Pierre-Yves and Kaplan, Frederic. What is intrinsic motivation? A typology of computational approaches. Frontiers in Neurorobotics, 1:6, 2007.
  • Pazis, Jason and Parr, Ronald. PAC optimal exploration in continuous space Markov decision processes. In Proceedings of the 27th AAAI Conference on Artificial Intelligence (AAAI), 2013.
  • Salakhutdinov, Ruslan and Hinton, Geoffrey. Semantic hashing. International Journal of Approximate Reasoning, 50(7):969 – 978, 2009.
  • Schmidhuber, Jürgen. Formal theory of creativity, fun, and intrinsic motivation (1990–2010). IEEE Transactions on Autonomous Mental Development, 2(3):230–247, 2010.
  • Schulman, John, Levine, Sergey, Moritz, Philipp, Jordan, Michael I, and Abbeel, Pieter. Trust region policy optimization. In Proceedings of the 32nd International Conference on Machine Learning (ICML), 2015.
  • Simonyan, Karen and Zisserman, Andrew. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
  • Stadie, Bradly C, Levine, Sergey, and Abbeel, Pieter. Incentivizing exploration in reinforcement learning with deep predictive models. arXiv preprint arXiv:1507.00814, 2015.
  • Strehl, Alexander L and Littman, Michael L. A theoretical analysis of model-based interval estimation. In Proceedings of the 21st International Conference on Machine Learning (ICML), pp. 856–863, 2005.
  • Strehl, Alexander L and Littman, Michael L. An analysis of model-based interval estimation for Markov decision processes. Journal of Computer and System Sciences, 74(8):1309–1331, 2008.
  • Sun, Yi, Gomez, Faustino, and Schmidhuber, Jürgen. Planning to be surprised: Optimal Bayesian exploration in dynamic environments. In Proceedings of the 4th International Conference on Artificial General Intelligence (AGI), pp. 41–51. 2011.
  • Tola, Engin, Lepetit, Vincent, and Fua, Pascal. DAISY: An efficient dense descriptor applied to wide-baseline stereo. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(5): 815–830, 2010.
  • van den Oord, Aaron, Kalchbrenner, Nal, and Kavukcuoglu, Koray. Pixel recurrent neural networks. In Proceedings of the 33rd International Conference on Machine Learning (ICML), pp. 1747–1756, 2016.
  • van Hasselt, Hado, Guez, Arthur, Hessel, Matteo, and Silver, David. Learning functions across many orders of magnitudes. arXiv preprint arXiv:1602.07714, 2016.
  • van Hasselt, Hado, Guez, Arthur, and Silver, David. Deep reinforcement learning with double Q-learning. In Proceedings of the 30th AAAI Conference on Artificial Intelligence (AAAI), 2016.
  • Wang, Ziyu, de Freitas, Nando, and Lanctot, Marc. Dueling network architectures for deep reinforcement learning. In Proceedings of the 33rd International Conference on Machine Learning (ICML), pp. 1995–2003, 2016.
  • [1] Bellemare, Marc G, Srinivasan, Sriram, Ostrovski, Georg, Schaul, Tom, Saxton, David, and Munos, Remi. Unifying count-based exploration and intrinsic motivation. In Advances in Neural Information Processing Systems 29 (NIPS), pp. 1471–1479, 2016.
  • [2] Bloom, Burton H. Space/time trade-offs in hash coding with allowable errors. Communications of the ACM, 13(7):422–426, 1970.
  • [3] Cormode, Graham and Muthukrishnan, S. An improved data stream summary: the count-min sketch and its applications. Journal of Algorithms, 55(1):58–75, 2005.
  • [4] Duan, Yan, Chen, Xi, Houthooft, Rein, Schulman, John, and Abbeel, Pieter. Benchmarking deep reinforcement learning for continuous control. In Proceedings of the 33rd International Conference on Machine Learning (ICML), pp. 1329–1338, 2016.
  • [5] Fan, Li, Cao, Pei, Almeida, Jussara, and Broder, Andrei Z. Summary cache: A scalable wide-area web cache sharing protocol. IEEE/ACM Transactions on Networking, 8(3):281–293, 2000.
  • [6] Houthooft, Rein, Chen, Xi, Duan, Yan, Schulman, John, De Turck, Filip, and Abbeel, Pieter. VIME: Variational information maximizing exploration. In Advances in Neural Information Processing Systems 29 (NIPS), pp. 1109–1117, 2016.
  • [7] Ioffe, Sergey and Szegedy, Christian. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning (ICML), pp. 448–456, 2015.
  • [8] Kingma, Diederik and Ba, Jimmy. Adam: A method for stochastic optimization. In Proceedings of the International Conference on Learning Representations (ICLR), 2015.