Non-Crossing Quantile Regression for Distributional Reinforcement Learning

NeurIPS 2020 (2020)


Abstract

Distributional reinforcement learning (DRL) estimates the distribution over future returns instead of the mean to more efficiently capture the intrinsic uncertainty of MDPs. However, batch-based DRL algorithms cannot guarantee the non-decreasing property of learned quantile curves, especially at the early training stage, leading to abnormal ...

Introduction
  • Different from value-based reinforcement learning algorithms [16, 21, 22], which focus entirely on the expected future return, distributional reinforcement learning (DRL) [12, 20, 24, 17, 1] accounts for the intrinsic randomness within a Markov Decision Process [5, 4, 19] by modelling the total return as a random variable.
  • Existing DRL algorithms fall into two broad categories, one of which learns quantile values at a set of pre-defined locations, such as C51 [1], Rainbow [10], and QR-DQN [5] (see the sketch after this list).
  • With sufficient network capacity and an infinite number of quantiles, IQN can theoretically approximate the full distribution.
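For concreteness, here is a minimal sketch of the quantile-based representation used by QR-DQN-style methods: the network predicts N quantile values per action at fixed fractions, and the usual Q-value is recovered as their mean. The class and parameter names (QuantileHead, num_quantiles) are illustrative, not taken from the paper.

```python
import torch
import torch.nn as nn

class QuantileHead(nn.Module):
    """Illustrative QR-DQN-style output head: for each action it predicts
    N quantile values of the return distribution at the fixed fractions
    tau_i = (2i - 1) / (2N), i.e. the quantile midpoints."""

    def __init__(self, feature_dim: int, num_actions: int, num_quantiles: int = 200):
        super().__init__()
        self.num_actions = num_actions
        self.num_quantiles = num_quantiles
        self.fc = nn.Linear(feature_dim, num_actions * num_quantiles)
        # Fixed quantile fractions (midpoints of N equal-probability bins).
        taus = (2 * torch.arange(num_quantiles) + 1) / (2.0 * num_quantiles)
        self.register_buffer("taus", taus)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # Quantile estimates theta, shape (batch, num_actions, num_quantiles).
        return self.fc(features).view(-1, self.num_actions, self.num_quantiles)

    def q_values(self, features: torch.Tensor) -> torch.Tensor:
        # The expected return is the uniform average of the quantile values.
        return self.forward(features).mean(dim=-1)
```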
Highlights
  • Different from value-based reinforcement learning algorithms [16, 21, 22], which focus entirely on the expected future return, distributional reinforcement learning (DRL) [12, 20, 24, 17, 1] accounts for the intrinsic randomness within a Markov Decision Process [5, 4, 19] by modelling the total return as a random variable.
  • We introduce a novel space contraction method that makes use of global information to ensure the batch-based monotonicity of the learned quantile function by modifying the network architecture of some state-of-the-art DRL algorithms (a sketch of one such modification follows this list).
  • Our method is theoretically valid for any DRL algorithm based on quantile approximation, but the implementation approach in this paper cannot be directly applied to some distribution-based methods, such as IQN, since the quantile fractions τ are not fixed and are re-sampled each time.
  • The proposed method for distributional reinforcement learning can more precisely capture the intrinsic uncertainty of MDPs by ensuring the non-crossing of quantile estimates, which helps AI better understand some complicated real-world decision-making problems.
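The paper's exact architecture is not reproduced on this page, so the following is only a rough sketch of one way to realize the described idea: build each action's quantile estimates as a cumulative sum of non-negative increments (so they cannot cross for any state in a batch) and contract them into a state-dependent range predicted from the same features ("global information"). All names and the specific scale/shift parameterization are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NonCrossingQuantileHead(nn.Module):
    """Sketch of a monotone (non-crossing) quantile head. Quantile estimates are
    a cumulative sum of softmax weights, so theta_1 <= ... <= theta_N holds by
    construction for every state in the batch; a second branch predicts a
    non-negative width alpha and an offset beta that contract the [0, 1] ramp
    into a state-dependent range. This is a reconstruction under assumptions,
    not the paper's exact NC-QR-DQN architecture."""

    def __init__(self, feature_dim: int, num_actions: int, num_quantiles: int = 200):
        super().__init__()
        self.num_actions = num_actions
        self.num_quantiles = num_quantiles
        self.logits = nn.Linear(feature_dim, num_actions * num_quantiles)
        self.scale_shift = nn.Linear(feature_dim, 2 * num_actions)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        b = features.size(0)
        logits = self.logits(features).view(b, self.num_actions, self.num_quantiles)
        # Softmax gives positive weights summing to 1; cumsum turns them into a
        # non-decreasing ramp in (0, 1].
        ramp = torch.cumsum(F.softmax(logits, dim=-1), dim=-1)
        alpha, beta = self.scale_shift(features).view(b, self.num_actions, 2).unbind(-1)
        alpha = F.softplus(alpha)  # non-negative width of the quantile range
        # Non-crossing quantile estimates: beta + alpha * ramp, monotone in the quantile index.
        return beta.unsqueeze(-1) + alpha.unsqueeze(-1) * ramp
```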
Methods
  • The authors test the method on the full Atari-57 benchmark. They select QR-DQN as the baseline and compare it with NC-QR-DQN, which accounts for the non-crossing issue by using the implementation approach described in Section 3.3.
  • For the exploration set-up, the authors set the bonus rate c_t in (25) to 50 log t/t, which decays with the training step t.
  • For both algorithms, the authors set κ = 1 for the Huber quantile loss in (22) due to its smoothness (see the sketch after this list).
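Equation (22) is not shown on this page; presumably it is the standard Huber quantile loss from QR-DQN. A minimal sketch with κ = 1 follows; the function name and tensor shapes are illustrative assumptions.

```python
import torch

def quantile_huber_loss(pred_quantiles: torch.Tensor,
                        target_samples: torch.Tensor,
                        taus: torch.Tensor,
                        kappa: float = 1.0) -> torch.Tensor:
    """Standard quantile Huber loss (as in QR-DQN).

    pred_quantiles: (batch, N) quantile estimates theta_i for the chosen action
    target_samples: (batch, M) target quantile values of the TD target
    taus:           (N,)       fixed quantile fractions tau_i
    """
    # Pairwise TD errors u_ij = target_j - pred_i, shape (batch, N, M).
    u = target_samples.unsqueeze(1) - pred_quantiles.unsqueeze(2)
    # Huber loss L_kappa(u): quadratic for |u| <= kappa, linear beyond.
    huber = torch.where(u.abs() <= kappa,
                        0.5 * u.pow(2),
                        kappa * (u.abs() - 0.5 * kappa))
    # Asymmetric quantile weight |tau - 1{u < 0}|.
    weight = (taus.view(1, -1, 1) - (u.detach() < 0).float()).abs()
    # rho_tau^kappa(u) = |tau - 1{u<0}| * L_kappa(u) / kappa,
    # averaged over targets, summed over quantiles, averaged over the batch.
    return (weight * huber / kappa).mean(dim=2).sum(dim=1).mean()
```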
Conclusion
  • The authors introduce a novel space contraction method that makes use of global information to ensure the batch-based monotonicity of the learned quantile function by modifying the network architecture of some state-of-the-art DRL algorithms.
  • This work has broad social impact because reinforcement learning is useful in many applied areas, including autonomous driving and industrial robotics.
  • The proposed method for distributional reinforcement learning can more precisely capture the intrinsic uncertainty of MDPs by ensuring the non-crossing of quantile estimates, which helps AI better understand some complicated real-world decision-making problems.
  • On the other hand, allowing the agent to explore more of the environment's uncertainty may change the way robots think and lead to some negative outcomes in real life.
Tables
  • Table 1: Mean and median of scores across 57 Atari 2600 games, measured as percentages of the human baseline. Scores are averaged over random seeds (see the sketch below for the assumed normalization).
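The table itself is not reproduced here. Under the usual Atari convention, which this caption appears to follow (an assumption, since the exact definition is not shown), "percentage of human baseline" means 100 × (agent − random) / (human − random):

```python
def human_normalized_score(agent: float, random: float, human: float) -> float:
    """Percentage of the human baseline under the common Atari convention:
    0 corresponds to random play, 100 to human-level play.
    The convention is assumed; it is not stated explicitly on this page."""
    return 100.0 * (agent - random) / (human - random)

# Hypothetical numbers, purely for illustration:
# human_normalized_score(agent=4000.0, random=250.0, human=7500.0)  # ~51.7
```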
Funding
  • This research was supported by the National Natural Science Foundation of China (12001356, 11971292, 11690012), the Shanghai Sailing Program (20YF1412300), the Fundamental Research Funds for the Central Universities, and the Program for Innovative Research Team of SUFE.
Study subjects and analysis
Sample size: N = 100
With more clearly separated quantile curves of the two actions, the agent can make consistent decisions when perceiving this state during training after only a few iterations, which to some extent increases training efficiency. As demonstrated in Figure 1 of the supplements, the crossing issue is more severe with a smaller sample size N (N = 100) at the early stage, where the advantage of the proposal is more significant. To further show how the ranking of the Q-function changes on a variety of states, we randomly pick ...

Cases: 57 Atari games
The per-game comparison in Figure 5(a) normalizes by agent2 − random, where agent1, agent2, and random denote the per-game raw scores of NC-QR-DQN + exploration, QR-DQN + exploration, and a random-agent baseline, respectively (an illustrative sketch follows). As Figure 5(a) shows, NC-QR-DQN + exploration either significantly outperforms its counterpart or achieves a very close result in most of the 57 cases, which verifies our assumption that NC-QR-DQN can more precisely learn the quantile functions and greatly increase exploration efficiency by accounting for the non-crossing restriction. Figure 4 shows the training curves of 9 Atari games averaged over seeds, and we can see that NC-QR-DQN with exploration learns much faster than QR-DQN with exploration by addressing the crossing issue, especially for the three hard-exploration games presented in the first row.
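The exact normalization used for Figure 5(a) was lost in extraction; only the term agent2 − random survives. Purely as an illustration, the sketch below assumes the score difference between the two agents is scaled by the baseline's margin over the random agent; the formula itself is a guess, and only the roles of agent1, agent2, and random come from the text.

```python
def relative_improvement(agent1: float, agent2: float, random: float) -> float:
    """Illustrative per-game comparison: gain of agent1 (NC-QR-DQN + exploration)
    over agent2 (QR-DQN + exploration), scaled by agent2's margin over a random
    agent. The normalization is an assumption, not the paper's stated formula."""
    return (agent1 - agent2) / max(abs(agent2 - random), 1e-8)
```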

Reference
  • Marc G Bellemare, Will Dabney, and Rémi Munos. A distributional perspective on reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 449–458. JMLR.org, 2017.
  • Howard D Bondell, Brian J Reich, and Huixia Wang. Noncrossing quantile regression curve estimation. Biometrika, 97(4):825–838, 2010.
  • Victor Chernozhukov, Ivan Fernandez-Val, and Alfred Galichon. Improving point and interval estimators of monotone functions by rearrangement. Biometrika, 96(3):559–575, 2009.
  • Will Dabney, Georg Ostrovski, David Silver, and Rémi Munos. Implicit quantile networks for distributional reinforcement learning. arXiv preprint arXiv:1806.06923, 2018.
  • Will Dabney, Mark Rowland, Marc G Bellemare, and Rémi Munos. Distributional reinforcement learning with quantile regression. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
  • Holger Dette and Stanislav Volgushev. Non-crossing non-parametric estimates of quantile curves. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70(3):609–627, 2008.
  • Chen Dong, Shujie Ma, Liping Zhu, and Xingdong Feng. Non-crossing multiple-index quantile regression. SCIENTIA SINICA Mathematica, 50:1–28, 2020.
  • Peter Hall, Rodney CL Wolff, and Qiwei Yao. Methods for estimating a conditional distribution function. Journal of the American Statistical Association, 94(445):154–163, 1999.
  • Xuming He. Quantile curves without crossing. The American Statistician, 51(2):186–192, 1997.
  • Matteo Hessel, Joseph Modayil, Hado Van Hasselt, Tom Schaul, Georg Ostrovski, Will Dabney, Dan Horgan, Bilal Piot, Mohammad Azar, and David Silver. Rainbow: Combining improvements in deep reinforcement learning. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
  • Peter J Huber. Robust estimation of a location parameter. In Breakthroughs in Statistics, pages 492–518.
  • Stratton C Jaquette. Markov decision processes with a new optimality criterion: Discrete time. The Annals of Statistics, pages 496–505, 1973.
  • Roger Koenker, Pin Ng, and Stephen Portnoy. Quantile smoothing splines. Biometrika, 81(4):673–680, 1994.
  • Yufeng Liu and Yichao Wu. Stepwise multiple quantile regression estimation using non-crossing constraints. Statistics and Its Interface, 2(3):299–310, 2009.
  • Borislav Mavrin, Shangtong Zhang, Hengshuai Yao, Linglong Kong, Kaiwen Wu, and Yaoliang Yu. Distributional reinforcement learning for efficient exploration. arXiv preprint arXiv:1905.06125, 2019.
  • Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
  • Tetsuro Morimura, Masashi Sugiyama, Hisashi Kashima, Hirotaka Hachiya, and Toshiyuki Tanaka. Nonparametric return distribution approximation for reinforcement learning. 2010.
  • Alfred Müller. Integral probability metrics and their generating classes of functions. Advances in Applied Probability, 29(2):429–443, 1997.
  • Mark Rowland, Marc G Bellemare, Will Dabney, Rémi Munos, and Yee Whye Teh. An analysis of categorical distributional reinforcement learning. arXiv preprint arXiv:1802.08163, 2018.
  • Matthew J Sobel. The variance of discounted Markov decision processes. Journal of Applied Probability, 19(4):794–802, 1982.
  • Hado Van Hasselt, Arthur Guez, and David Silver. Deep reinforcement learning with double Q-learning. In Thirtieth AAAI Conference on Artificial Intelligence, 2016.
  • Ziyu Wang, Tom Schaul, Matteo Hessel, Hado Van Hasselt, Marc Lanctot, and Nando De Freitas. Dueling network architectures for deep reinforcement learning. arXiv preprint arXiv:1511.06581, 2015.
  • Christopher JCH Watkins and Peter Dayan. Q-learning. Machine Learning, 8(3-4):279–292, 1992.
  • DJ White. Mean, variance, and probabilistic criteria in finite Markov decision processes: A review. Journal of Optimization Theory and Applications, 56(1):1–29, 1988.
  • Derek Yang, Li Zhao, Zichuan Lin, Tao Qin, Jiang Bian, and Tie-Yan Liu. Fully parameterized quantile function for distributional reinforcement learning. In Advances in Neural Information Processing Systems, pages 6190–6199, 2019.
Author
Fan Zhou
Jianing Wang
Xingdong Feng