## AI Insight

An AI-extracted summary of this paper.

# Non-Crossing Quantile Regression for Distributional Reinforcement Learning

NeurIPS 2020

Abstract

Distributional reinforcement learning (DRL) estimates the distribution over future returns instead of the mean, to more efficiently capture the intrinsic uncertainty of MDPs. However, batch-based DRL algorithms cannot guarantee the non-decreasing property of learned quantile curves, especially at the early training stage, leading to abnormal…

Introduction

- Different from value-based reinforcement learning algorithms [16, 21, 22] which entirely focus on the expected future return, distributional reinforcement learning (DRL) [12, 20, 24, 17, 1] accounts for the intrinsic randomness within a Markov Decision Process [5, 4, 19] by modelling the total return as a random variable.
- Existing DRL algorithms fall into two broad categories, one of which learns quantile values at a set of pre-defined locations, such as C51 [1], Rainbow [10], and QR-DQN [5].
- With sufficient network capacity and an infinite number of quantiles, IQN can theoretically approximate the full return distribution.
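The quantile representation behind QR-DQN can be sketched in a few lines. This is an illustrative example, not the paper's implementation: the function names are ours and the quantile values are made up. The return distribution Z(s, a) is approximated by N quantile values at fixed midpoint fractions, and the ordinary Q-value is recovered as their mean.

```python
import numpy as np

def midpoint_fractions(n_quantiles: int) -> np.ndarray:
    """Fixed quantile midpoints tau_i = (2i - 1) / (2N), i = 1..N, as in QR-DQN."""
    i = np.arange(1, n_quantiles + 1)
    return (2 * i - 1) / (2 * n_quantiles)

def q_value(quantile_estimates: np.ndarray) -> float:
    """Expected return implied by a set of quantile estimates (their mean)."""
    return float(np.mean(quantile_estimates))

# Example: 4 made-up quantile estimates for one (state, action) pair.
theta = np.array([-1.0, 0.5, 2.0, 3.5])
print(midpoint_fractions(4))  # [0.125 0.375 0.625 0.875]
print(q_value(theta))         # 1.25
```

Acting greedily with respect to `q_value` recovers standard value-based control, while the individual quantiles retain the distributional information.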

Highlights

- Different from value-based reinforcement learning algorithms [16, 21, 22] which entirely focus on the expected future return, distributional reinforcement learning (DRL) [12, 20, 24, 17, 1] accounts for the intrinsic randomness within a Markov Decision Process [5, 4, 19] by modelling the total return as a random variable
- We introduce a novel space contraction method that makes use of global information to ensure the batch-based monotonicity of the learned quantile function; it is implemented by modifying the network architecture of several state-of-the-art DRL algorithms
- Our method is theoretically valid for any DRL algorithm based on quantile approximation, but the implementation approach in this paper cannot be directly applied to some distribution-based methods, such as IQN, since the quantile fractions τ are not fixed and are re-sampled each time
- The proposed method for distributional reinforcement learning can more precisely capture the intrinsic uncertainty of MDPs by ensuring the non-crossing of quantile estimates, which helps AI better understand some complicated real-world decision-making problems
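The paper's exact architecture is described in its Section 3.3; as a hedged illustration of the general idea only, one standard way to make a quantile head non-crossing *by construction* is to emit non-negative increments via a softmax, accumulate them, and scale the result into an interval given by learned global statistics. The names `alpha` and `beta` below are illustrative placeholders, not the paper's notation.

```python
import numpy as np

def non_crossing_quantiles(logits: np.ndarray, alpha: float, beta: float) -> np.ndarray:
    """Map raw network logits to a non-decreasing quantile vector.

    `alpha` (shift) and `beta` (non-negative scale) stand in for globally
    learned bounds of the support; the output lies in [alpha, alpha + beta].
    """
    z = logits - logits.max()                 # subtract max for numerical stability
    increments = np.exp(z) / np.exp(z).sum()  # softmax: strictly positive, sums to 1
    cdf = np.cumsum(increments)               # non-decreasing, ends at 1
    return alpha + beta * cdf                 # monotone quantile estimates

theta = non_crossing_quantiles(np.array([0.3, -1.2, 2.0, 0.1]), alpha=-5.0, beta=10.0)
assert np.all(np.diff(theta) >= 0)  # quantile curves can no longer cross
```

Because monotonicity holds for any logits, the non-crossing property is guaranteed even at the earliest training stage, which is exactly the regime where the paper reports crossing is most severe.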

Methods

- The authors test the method on the full Atari-57 benchmark. The authors select QR-DQN as the baseline, and compare it with NC-QR-DQN which accounts for the non-crossing issue by using the implementation approach described in Section 3.3.
- For the exploration set-up, the authors set the bonus rate c_t in (25) to 50√(log t / t), which decays with the training step t.
- For both algorithms, the authors set κ = 1 for the Huber quantile loss in (22) due to its smoothness
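The Huber quantile loss referenced above (with κ = 1) can be sketched as follows. This follows the standard QR-DQN formulation; the helper names are ours. The loss weights the Huber loss of the TD error u asymmetrically by |τ − 1{u < 0}|, so a low quantile (small τ) is penalised far more for overestimation than for underestimation.

```python
import numpy as np

def huber(u: np.ndarray, kappa: float = 1.0) -> np.ndarray:
    """Huber loss: quadratic inside [-kappa, kappa], linear outside."""
    abs_u = np.abs(u)
    return np.where(abs_u <= kappa,
                    0.5 * u ** 2,
                    kappa * (abs_u - 0.5 * kappa))

def quantile_huber_loss(u: np.ndarray, tau: float, kappa: float = 1.0) -> np.ndarray:
    """Asymmetric (quantile) Huber loss: |tau - 1{u < 0}| * L_kappa(u) / kappa."""
    weight = np.abs(tau - (u < 0).astype(float))
    return weight * huber(u, kappa) / kappa

# For tau = 0.1, a negative TD error (overshoot) gets weight 0.9,
# a positive one (undershoot) only 0.1:
print(quantile_huber_loss(np.array([-0.5]), tau=0.1))  # [0.1125]
print(quantile_huber_loss(np.array([0.5]), tau=0.1))   # [0.0125]
```

Setting κ = 1 keeps the loss smooth around u = 0 (quadratic), which is the smoothness property the authors cite.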

Conclusion

- The authors introduce a novel space contraction method that makes use of global information to ensure the batch-based monotonicity of the learned quantile function; it is implemented by modifying the network architecture of several state-of-the-art DRL algorithms.
- This work has broad social impact because reinforcement learning is useful in many applied areas, including autonomous driving and industrial robotics.
- The proposed method for distributional reinforcement learning can more precisely capture the intrinsic uncertainty of MDPs by ensuring the non-crossing of quantile estimates, which helps AI better understand some complicated real-world decision-making problems.
- On the other hand, allowing the agent to explore more of the environment's uncertainty may change the way robots behave and lead to negative outcomes in real life


- Table 1: Mean and median of scores across 57 Atari 2600 games, measured as percentages of the human baseline. Scores are averaged over seeds

Funding

- This research was supported by the National Natural Science Foundation of China (12001356, 11971292, 11690012), the Shanghai Sailing Program (20YF1412300), the Fundamental Research Funds for the Central Universities, and the Program for Innovative Research Team of SUFE

Study subjects and analysis

data: 100

With more clearly separated quantile curves for the two actions, the agent can make consistent decisions when perceiving this state after a few training iterations, which to some extent increases training efficiency. As demonstrated in Figure 1 of the supplements, the crossing issue is more severe with a smaller sample size (N = 100) at the early stage, where the advantage of the proposal is more significant. To further show how the ranking of the Q-function changes on a variety of states, we randomly pick …

cases: 57

The relative score is computed as (agent1 − random) / (agent2 − random), where agent1, agent2, and random are the per-game raw scores of NC-QR-DQN + exploration, QR-DQN + exploration, and the random-agent baseline, respectively. As Figure 5(a) shows, NC-QR-DQN + exploration either significantly outperforms its counterpart or achieves a very close result in most of the 57 cases, which verifies the assumption that NC-QR-DQN can more precisely learn the quantile functions and greatly increase exploration efficiency by enforcing the non-crossing restriction. Figure 4 shows the training curves of 9 Atari games averaged over seeds; NC-QR-DQN with exploration learns much faster than QR-DQN with exploration by addressing the crossing issue, especially for the three hard-exploration games presented in the first row
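The per-game comparison described above can be sketched as follows, assuming the relative score takes the ratio form (agent1 − random) / (agent2 − random). The game names and scores below are made-up illustrative numbers, not results from the paper.

```python
def relative_score(agent1: float, agent2: float, random: float) -> float:
    """Ratio of baseline-adjusted scores; > 1 means agent1 beats agent2
    after removing the random-agent floor from both."""
    return (agent1 - random) / (agent2 - random)

# Hypothetical per-game raw scores: (NC-QR-DQN + expl., QR-DQN + expl., random)
games = {
    "GameA": (1100.0, 400.0, 0.0),
    "GameB": (350.0, 340.0, 1.7),
}
for name, (a1, a2, rnd) in games.items():
    print(f"{name}: {relative_score(a1, a2, rnd):.2f}")
```

Subtracting the random baseline before taking the ratio prevents games with a high score floor from inflating the comparison.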

Reference

- Marc G Bellemare, Will Dabney, and Rémi Munos. A distributional perspective on reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 449–458. JMLR.org, 2017.
- Howard D Bondell, Brian J Reich, and Huixia Wang. Noncrossing quantile regression curve estimation. Biometrika, 97(4):825–838, 2010.
- Victor Chernozhukov, Ivan Fernandez-Val, and Alfred Galichon. Improving point and interval estimators of monotone functions by rearrangement. Biometrika, 96(3):559–575, 2009.
- Will Dabney, Georg Ostrovski, David Silver, and Rémi Munos. Implicit quantile networks for distributional reinforcement learning. arXiv preprint arXiv:1806.06923, 2018.
- Will Dabney, Mark Rowland, Marc G Bellemare, and Rémi Munos. Distributional reinforcement learning with quantile regression. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
- Holger Dette and Stanislav Volgushev. Non-crossing non-parametric estimates of quantile curves. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70(3):609–627, 2008.
- Chen Dong, Shujie Ma, Liping Zhu, and Xingdong Feng. Non-crossing multiple-index quantile regression. SCIENTIA SINICA Mathematica, 50:1–28, 2020.
- Peter Hall, Rodney CL Wolff, and Qiwei Yao. Methods for estimating a conditional distribution function. Journal of the American Statistical Association, 94(445):154–163, 1999.
- Xuming He. Quantile curves without crossing. The American Statistician, 51(2):186–192, 1997.
- Matteo Hessel, Joseph Modayil, Hado Van Hasselt, Tom Schaul, Georg Ostrovski, Will Dabney, Dan Horgan, Bilal Piot, Mohammad Azar, and David Silver. Rainbow: Combining improvements in deep reinforcement learning. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
- Peter J Huber. Robust estimation of a location parameter. In Breakthroughs in Statistics, pages 492–518. Springer, 1992.
- Stratton C Jaquette. Markov decision processes with a new optimality criterion: Discrete time. The Annals of Statistics, pages 496–505, 1973.
- Roger Koenker, Pin Ng, and Stephen Portnoy. Quantile smoothing splines. Biometrika, 81(4):673–680, 1994.
- Yufeng Liu and Yichao Wu. Stepwise multiple quantile regression estimation using non-crossing constraints. Statistics and its Interface, 2(3):299–310, 2009.
- Borislav Mavrin, Shangtong Zhang, Hengshuai Yao, Linglong Kong, Kaiwen Wu, and Yaoliang Yu. Distributional reinforcement learning for efficient exploration. arXiv preprint arXiv:1905.06125, 2019.
- Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
- Tetsuro Morimura, Masashi Sugiyama, Hisashi Kashima, Hirotaka Hachiya, and Toshiyuki Tanaka. Nonparametric return distribution approximation for reinforcement learning. 2010.
- Alfred Müller. Integral probability metrics and their generating classes of functions. Advances in Applied Probability, 29(2):429–443, 1997.
- Mark Rowland, Marc G Bellemare, Will Dabney, Rémi Munos, and Yee Whye Teh. An analysis of categorical distributional reinforcement learning. arXiv preprint arXiv:1802.08163, 2018.
- Matthew J Sobel. The variance of discounted Markov decision processes. Journal of Applied Probability, 19(4):794–802, 1982.
- Hado Van Hasselt, Arthur Guez, and David Silver. Deep reinforcement learning with double Q-learning. In Thirtieth AAAI Conference on Artificial Intelligence, 2016.
- Ziyu Wang, Tom Schaul, Matteo Hessel, Hado Van Hasselt, Marc Lanctot, and Nando De Freitas. Dueling network architectures for deep reinforcement learning. arXiv preprint arXiv:1511.06581, 2015.
- Christopher JCH Watkins and Peter Dayan. Q-learning. Machine learning, 8(3-4):279–292, 1992.
- DJ White. Mean, variance, and probabilistic criteria in finite markov decision processes: A review. Journal of Optimization Theory and Applications, 56(1):1–29, 1988.
- Derek Yang, Li Zhao, Zichuan Lin, Tao Qin, Jiang Bian, and Tie-Yan Liu. Fully parameterized quantile function for distributional reinforcement learning. In Advances in Neural Information Processing Systems, pages 6190–6199, 2019.
