Unifying Count-Based Exploration and Intrinsic Motivation.

ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 29 (NIPS 2016), (2016): 1479-1487

Abstract

We consider an agent's uncertainty about its environment and the problem of generalizing this uncertainty across states. Specifically, we focus on the problem of exploration in non-tabular reinforcement learning. Drawing inspiration from the intrinsic motivation literature, we use density models to measure uncertainty, and propose a novel...

Introduction
  • Exploration algorithms for Markov Decision Processes (MDPs) are typically concerned with reducing the agent’s uncertainty over the environment’s reward and transition functions.
  • Defined as the Kullback-Leibler divergence of a prior distribution from its posterior, information gain can be related to the confidence intervals used in count-based exploration.
  • The authors' contribution is to propose a new quantity, the pseudo-count, which connects information-gain-as-learning-progress and count-based exploration.
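Restating the construction the bullets above refer to: write ρn(x) for the probability the density model assigns to x after observing the first n states, and ρ′n(x) for the recoding probability, i.e. the probability it would assign to x immediately after one additional observation of x. The pseudo-count is then defined by asking the model to behave like an empirical estimator before and after that extra observation.

```latex
% Pseudo-count \hat{N}_n(x) and pseudo-count total \hat{n}, chosen so that the
% density model matches an empirical estimator before and after one more
% observation of x:
\rho_n(x)  = \frac{\hat{N}_n(x)}{\hat{n}}, \qquad
\rho'_n(x) = \frac{\hat{N}_n(x) + 1}{\hat{n} + 1}
% Solving this linear system for \hat{N}_n(x):
\hat{N}_n(x) = \frac{\rho_n(x)\,\bigl(1 - \rho'_n(x)\bigr)}{\rho'_n(x) - \rho_n(x)}
```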
Highlights
  • Exploration algorithms for Markov Decision Processes (MDPs) are typically concerned with reducing the agent's uncertainty over the environment's reward and transition functions. This uncertainty can be quantified using confidence intervals derived from Chernoff bounds, or inferred from a posterior over the environment parameters.
  • In this paper we provide formal evidence that intrinsic motivation and count-based exploration are but two sides of the same coin.
  • Defined as the Kullback-Leibler divergence of a prior distribution from its posterior, information gain can be related to the confidence intervals used in count-based exploration.
  • Unlike many intrinsic motivation algorithms, pseudo-counts do not rely on learning a forward model. This point is especially important because a number of powerful density models for images exist (Van den Oord et al., 2016), and because optimality guarantees cannot in general exist for intrinsic motivation algorithms based on forward models.
  • We analyze the limiting behaviour of the ratio N̂n/Nn of pseudo-counts to empirical counts. We use this analysis to assert the consistency of pseudo-counts derived from tabular density models, i.e. models which maintain per-state visit counts.
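The quantity tying the information-gain view to pseudo-counts is the prediction gain, the increase in the log-probability the density model assigns to x after seeing it once more. Substituting it into the pseudo-count formula above gives the relation below; the approximation assumes ρ′n(x) is small, as it typically is for a model over a large state space.

```latex
% Prediction gain of the density model at x:
\mathrm{PG}_n(x) = \log \rho'_n(x) - \log \rho_n(x)
% Substituting \rho'_n(x) = e^{\mathrm{PG}_n(x)} \rho_n(x) into the pseudo-count formula:
\hat{N}_n(x) = \frac{1 - \rho'_n(x)}{e^{\mathrm{PG}_n(x)} - 1}
             \approx \bigl(e^{\mathrm{PG}_n(x)} - 1\bigr)^{-1}
```

A surprising state (large prediction gain) therefore has a small pseudo-count and receives a large exploration bonus.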
Results
  • The authors derive the pseudo-count from a density model over the state space.
  • If the model generalizes across states so do pseudo-counts.
  • Considering the tabular setting and combining their result with Theorem 1, the authors conclude that bonuses proportional to immediate information gain are insufficient for theoretically near-optimal exploration: to paraphrase Kolter and Ng, these methods explore too little in comparison to pseudo-count bonuses.
  • Unlike many intrinsic motivation algorithms, pseudo-counts do not rely on learning a forward model.
  • This point is especially important because a number of powerful density models for images exist (Van den Oord et al., 2016), and because optimality guarantees cannot in general exist for intrinsic motivation algorithms based on forward models.
  • The authors analyze the limiting behaviour of the ratio N̂n/Nn of pseudo-counts to empirical counts. The authors use this analysis to assert the consistency of pseudo-counts derived from tabular density models, i.e. models which maintain per-state visit counts.
  • In the appendix the authors use the same result to bound the approximation error of pseudo-counts derived from directed graphical models, of which the CTS model is a special case.
  • Consider a sequence generated i.i.d. from a distribution μ over a finite state space, and a density model defined from a sequence of nonincreasing step-sizes: ρn(x) = (1 − αn)ρn−1(x) + αn I{xn = x}, with initial condition ρ0(x) = |X|^{-1} (a numerical sketch of this model and the pseudo-counts it induces follows this list).
  • A density model that does not satisfy Assumption 1(b) may still yield useful pseudo-counts.
  • The count-based exploration bonus enables the agent to make quick progress on a number of games, most dramatically in MONTEZUMA'S REVENGE and VENTURE.
  • The authors believe the success of the method in this game is a strong indicator of the usefulness of pseudo-counts for exploration.
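As referenced in the list above, the following is a minimal numerical sketch of the step-size density model and the pseudo-counts it induces, assuming the pseudo-count formula restated earlier. It uses the step-sizes αn = 1/(n+1), for which the model reduces to a simple count-based estimator; class and function names are illustrative, not from the paper.

```python
import random

def pseudo_count(rho, rho_prime):
    """Pseudo-count from the model's probability of x before (rho) and after
    (rho_prime) one additional observation of x."""
    return rho * (1.0 - rho_prime) / (rho_prime - rho)

class StepSizeDensityModel:
    """rho_n(x) = (1 - a_n) * rho_{n-1}(x) + a_n * 1{x_n = x}, rho_0(x) = 1/|X|."""

    def __init__(self, num_states):
        self.p = [1.0 / num_states] * num_states
        self.n = 0

    def prob(self, x):
        return self.p[x]

    def update(self, x):
        self.n += 1
        a = 1.0 / (self.n + 1)              # step-size alpha_n = 1/(n+1)
        self.p = [(1 - a) * q for q in self.p]
        self.p[x] += a

    def recoding_prob(self, x):
        """Probability the model would assign to x after one more observation of x."""
        a = 1.0 / (self.n + 2)              # next step-size, alpha_{n+1}
        return (1 - a) * self.p[x] + a

random.seed(0)
num_states = 5
model = StepSizeDensityModel(num_states)
counts = [0] * num_states                   # empirical visit counts N_n(x)
for _ in range(1000):
    x = random.randrange(num_states)
    model.update(x)
    counts[x] += 1

for x in range(num_states):
    n_hat = pseudo_count(model.prob(x), model.recoding_prob(x))
    # With these step-sizes the pseudo-count works out to N_n(x) + 1/|X| exactly,
    # so its ratio to the empirical count tends to 1 as n grows.
    print(x, counts[x], round(n_hat, 2))
```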
Conclusion
  • Lopes et al. (2012) showed the relationship between time-averaged prediction gain and visit counts in a tabular setting; their result is a special case of Theorem 2.
  • Combining the work with more ideas from deep learning and better density models seems a plausible avenue for quick progress in practical, efficient exploration.
  • While the authors focused here on countable state spaces, a pseudo-count can just as easily be defined in terms of probability density functions.
Tables
  • Table 1: A rough taxonomy of Atari 2600 games according to their exploration difficulty
  • Table 2: Average score after 200 million training frames for A3C and A3C+ (with an N̂n^{-1/2} exploration bonus), with a DQN baseline for comparison
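The A3C+ agent in Table 2 simply adds a pseudo-count bonus to the environment reward at every step. A hedged sketch of that shaping step is below; the bonus scale `beta` and the small additive constant `eps` are illustrative placeholders, and only the N̂^{-1/2} form is taken from the caption above.

```python
def augmented_reward(reward, n_hat, beta=0.01, eps=0.01):
    """Reward shaping with a pseudo-count exploration bonus of order n_hat ** -0.5.

    beta and eps are illustrative constants, not values quoted in this summary.
    """
    return reward + beta * (n_hat + eps) ** -0.5
```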
Related Work
  • Information-theoretic quantities have been repeatedly used to describe intrinsically motivated behaviour. Closely related to prediction gain is Schmidhuber (1991)'s notion of compression progress, which equates novelty with an agent's improvement in its ability to compress its past. More recently, Lopes et al. (2012) showed the relationship between time-averaged prediction gain and visit counts in a tabular setting; their result is a special case of Theorem 2. Orseau et al. (2013) demonstrated that maximizing the sum of future information gains does lead to optimal behaviour, even though maximizing immediate information gain does not (Section 4). Finally, there may be a connection between sequential normalized maximum likelihood estimators and our pseudo-count derivation (see e.g. Ollivier, 2015).

    Intrinsic motivation has also been studied in reinforcement learning proper, in particular in the context of discovering skills (Singh et al., 2004; Barto, 2013). Recently, Stadie et al. (2015) used a squared prediction error bonus for exploring in Atari 2600 games. Closest to our work is Houthooft et al. (2016)'s variational approach to intrinsic motivation, which is equivalent to a second-order Taylor approximation to prediction gain. Mohamed and Rezende (2015) also considered a variational approach to the different problem of maximizing an agent's ability to influence its environment. Aside from Orseau et al.'s above-cited work, it is only recently that theoretical guarantees for exploration have emerged for non-tabular, stateful settings. We note Pazis and Parr (2016)'s PAC-MDP result for metric spaces and Leike et al. (2016)'s asymptotic analysis of Thompson sampling in general environments.

    (Figure: A3C+ performance across games, shown relative to the baseline score.)
References
  • Barto, A. G. (2013). Intrinsic motivation and reinforcement learning. In Intrinsically Motivated Learning in Natural and Artificial Systems, pages 17–47. Springer.
  • Bellemare, M., Veness, J., and Talvitie, E. (2014). Skip context tree switching. In Proceedings of the 31st International Conference on Machine Learning, pages 1458–1466.
  • Bellemare, M. G., Naddaf, Y., Veness, J., and Bowling, M. (2013). The Arcade Learning Environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279.
  • Bellemare, M. G., Ostrovski, G., Guez, A., Thomas, P. S., and Munos, R. (2016). Increasing the action gap: New operators for reinforcement learning. In Proceedings of the 30th AAAI Conference on Artificial Intelligence.
  • Bertsekas, D. P. and Tsitsiklis, J. N. (1996). Neuro-Dynamic Programming. Athena Scientific.
  • Cover, T. M. and Thomas, J. A. (1991). Elements of information theory. John Wiley & Sons.
  • Houthooft, R., Chen, X., Duan, Y., Schulman, J., De Turck, F., and Abbeel, P. (2016). Variational information maximizing exploration.
  • Hutter, M. (2013). Sparse adaptive dirichlet-multinomial-like processes. In Proceedings of the Conference on Online Learning Theory.
  • Kolter, Z. J. and Ng, A. Y. (2009). Near-bayesian exploration in polynomial time. In Proceedings of the 26th International Conference on Machine Learning.
  • Leike, J., Lattimore, T., Orseau, L., and Hutter, M. (2016). Thompson sampling is asymptotically optimal in general environments. In Proceedings of the Conference on Uncertainty in Artificial Intelligence.
  • Lopes, M., Lang, T., Toussaint, M., and Oudeyer, P.-Y. (2012). Exploration in model-based reinforcement learning by empirically estimating learning progress. In Advances in Neural Information Processing Systems 25.
  • Machado, M. C., Srinivasan, S., and Bowling, M. (2015). Domain-independent optimistic initialization for reinforcement learning. AAAI Workshop on Learning for General Competency in Video Games.
  • Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T. P., Harley, T., Silver, D., and Kavukcuoglu, K. (2016). Asynchronous methods for deep reinforcement learning. In Proceedings of the International Conference on Machine Learning.
  • Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540):529–533.
  • Mohamed, S. and Rezende, D. J. (2015). Variational information maximisation for intrinsically motivated reinforcement learning. In Advances in Neural Information Processing Systems 28.
  • Ollivier, Y. (2015). Laplace’s rule of succession in information geometry. arXiv preprint arXiv:1503.04304.
  • Orseau, L., Lattimore, T., and Hutter, M. (2013). Universal knowledge-seeking agents for stochastic environments. In Proceedings of the Conference on Algorithmic Learning Theory.
  • Oudeyer, P., Kaplan, F., and Hafner, V. (2007). Intrinsic motivation systems for autonomous mental development. IEEE Transactions on Evolutionary Computation, 11(2):265–286.
  • Pazis, J. and Parr, R. (2016). Efficient PAC-optimal exploration in concurrent, continuous state MDPs with delayed updates. In Proceedings of the 30th AAAI Conference on Artificial Intelligence.
  • Schmidhuber, J. (1991). A possibility for implementing curiosity and boredom in model-building neural controllers. In From animals to animats: proceedings of the first international conference on simulation of adaptive behavior.
  • Schmidhuber, J. (2008). Driven by compression progress. In Knowledge-Based Intelligent Information and Engineering Systems. Springer.
  • Singh, S., Barto, A. G., and Chentanez, N. (2004). Intrinsically motivated reinforcement learning. In Advances in Neural Information Processing Systems 16.
  • Stadie, B. C., Levine, S., and Abbeel, P. (2015). Incentivizing exploration in reinforcement learning with deep predictive models. arXiv preprint arXiv:1507.00814.
  • Strehl, A. L. and Littman, M. L. (2008). An analysis of model-based interval estimation for Markov decision processes. Journal of Computer and System Sciences, 74(8):1309 – 1331.
  • Van den Oord, A., Kalchbrenner, N., and Kavukcuoglu, K. (2016). Pixel recurrent neural networks. In Proceedings of the 33rd International Conference on Machine Learning.
  • van Hasselt, H., Guez, A., and Silver, D. (2016). Deep reinforcement learning with double Q-learning. In Proceedings of the 30th AAAI Conference on Artificial Intelligence.
  • Veness, J., Bellemare, M. G., Hutter, M., Chua, A., and Desjardins, G. (2015). Compress and control. In Proceedings of the 29th AAAI Conference on Artificial Intelligence.
  • Wainwright, M. J. and Jordan, M. I. (2008). Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1-2):1–305.
  • We will show that directed graphical models (Wainwright and Jordan, 2008) satisfy Assumption 1. A directed graphical model describes a probability distribution over a factored state space. To the ith factor xi is associated a parent set π(i) ⊆ {1, ..., i − 1}. Let xπ(i) denote the value of the factors in the parent set. The ith factor model is ρin(xi; xπ(i)) := ρi(xi; x1:n, xπ(i)), with the understanding that ρi is allowed to make a different prediction for each value of xπ(i). The state x is assigned the joint probability ρGM(x; x1:n) := ∏_{i=1}^{k} ρin(xi; xπ(i)) over the k factors (a toy sketch of this construction follows below).
  • […] 1. Then for all x with μ(x) > 0, the density model ρGM satisfies Assumption 1 with r(x) = […]
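The directed graphical model construction above can be made concrete with a small toy sketch: each factor gets a Laplace-smoothed conditional estimator given its parents, and the joint probability is the product over factors. The class names and the choice of per-factor estimator are illustrative only; the construction requires nothing more than a conditional density model ρi per factor.

```python
from collections import defaultdict

class FactorModel:
    """Laplace-smoothed estimate of P(x_i | x_parents) for one factor."""

    def __init__(self, num_values):
        self.num_values = num_values
        self.counts = defaultdict(lambda: [0] * num_values)

    def prob(self, value, parent_values):
        c = self.counts[parent_values]
        return (c[value] + 1.0) / (sum(c) + self.num_values)

    def update(self, value, parent_values):
        self.counts[parent_values][value] += 1

class GraphicalModelDensity:
    """rho_GM(x) = prod_i rho^i(x_i | x_{pi(i)}) over a factored state x."""

    def __init__(self, num_values, parents):
        # parents[i] is a tuple of factor indices j < i feeding factor i.
        self.parents = parents
        self.factors = [FactorModel(num_values) for _ in parents]

    def prob(self, x):
        p = 1.0
        for i, f in enumerate(self.factors):
            p *= f.prob(x[i], tuple(x[j] for j in self.parents[i]))
        return p

    def update(self, x):
        for i, f in enumerate(self.factors):
            f.update(x[i], tuple(x[j] for j in self.parents[i]))

# Example: a 3-factor binary state where factor 2 depends on factors 0 and 1.
model = GraphicalModelDensity(num_values=2, parents=[(), (0,), (0, 1)])
for x in [(0, 1, 1), (0, 1, 1), (1, 0, 0)]:
    model.update(x)
print(model.prob((0, 1, 1)))
```

Pseudo-counts for a factored state x then follow exactly as before, from the model's probability of x immediately before and after updating on x.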