# Distributional Reward Decomposition for Reinforcement Learning

NeurIPS 2019, pp. 6212–6221.

Abstract:

Many reinforcement learning (RL) tasks have specific properties that can be leveraged to modify existing RL algorithms to adapt to those tasks and further improve performance, and a general class of such properties is the multiple reward channel. In those environments the full reward can be decomposed into sub-rewards obtained from different …


Introduction

- Reinforcement learning has achieved great success in decision-making problems since Deep Q-learning was proposed by Mnih et al. [2015].
- Reward decomposition has been proposed to investigate the multiple-reward-channel structure of such tasks.
- The sub-rewards may further be leveraged to learn better policies.
- The authors consider a general reinforcement learning setting, in which the interaction of the agent and the environment can be viewed as a Markov Decision Process (MDP).
- Given a fixed policy π, reinforcement learning estimates the action-value function of π, defined by Qπ(x, a) = E[Σ_{t≥0} γ^t r(x_t, a_t) | x_0 = x, a_0 = a].
- The Bellman equation characterizes the action-value function by temporal equivalence: Qπ(x, a) = E[r(x, a)] + γ E_{x′,a′}[Qπ(x′, a′)].
- To maximize the total return E_{x_0,a_0}[Qπ(x_0, a_0)], one common approach is to find the fixed point of the Bellman optimality operator.
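As a concrete illustration of the last point, the sketch below runs value iteration, i.e., repeated application of the Bellman optimality operator, on a tiny hypothetical MDP. The states, transitions, and rewards here are invented for illustration; they are not from the paper.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP (illustrative only).
# P[s, a, s'] = transition probability, R[s, a] = expected reward.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.9

# Value iteration: repeatedly apply the Bellman optimality operator
# (T Q)(s, a) = R(s, a) + gamma * sum_s' P(s' | s, a) * max_a' Q(s', a')
Q = np.zeros((2, 2))
for _ in range(500):
    Q = R + gamma * (P @ Q.max(axis=1))
```

Because the operator is a γ-contraction, the iterates converge to its unique fixed point Q*, and acting greedily with respect to Q* maximizes the expected return.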

Highlights

- Reinforcement learning has achieved great success in decision-making problems since Deep Q-learning was proposed by Mnih et al. [2015]
- We propose Distributional Reward Decomposition for Reinforcement Learning (DRDRL), a novel reward decomposition algorithm that captures the latent multiple-channel structure of the reward under the distributional reinforcement learning setting
- The algorithm estimates distributions of the sub-returns and combines them to obtain the distribution of the total return
- Our algorithm significantly outperforms the state-of-the-art method Rainbow on Atari games with multiple reward channels
- Future work may develop reward decomposition methods based on quantile networks (Dabney et al. [2018a,b])

Results

- The authors tested the algorithm on games from the Arcade Learning Environment (ALE; Bellemare et al. [2013]).
- They conduct experiments on six Atari games, some with complicated rules, including Seaquest.
- The baseline is Rainbow (Hessel et al. [2018]), an advanced variant of C51 (Bellemare et al. [2017]) that achieved state-of-the-art results in the Atari domain.
- In Rainbow, the Q-value is bounded by [Vmin, Vmax], where Vmax = −Vmin = 10.
- The authors bound the categorical distribution of each sub-return Zi accordingly, so that the sub-returns sum to a value within [Vmin, Vmax].
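One way to realize such a bound, sketched below under the assumption that the range [Vmin, Vmax] is split equally across N sub-return channels (the paper's exact scheme may differ), is to shrink each sub-return's C51-style categorical support so that the N supports sum back to the original range:

```python
import numpy as np

Vmin, Vmax = -10.0, 10.0   # Rainbow's bounds on the total return
N = 2                      # number of sub-return channels (assumed)
n_atoms = 51               # C51-style atom count

# Equal-share assumption: each sub-return Zi lives on [Vmin/N, Vmax/N],
# so the sum of N sub-returns stays within [Vmin, Vmax].
sub_support = np.linspace(Vmin / N, Vmax / N, n_atoms)
```

With N = 2 this gives each sub-return a support on [−5, 5]; any combination of atoms from the two channels then lands inside the original [−10, 10] range.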

Conclusion

- The authors propose Distributional Reward Decomposition for Reinforcement Learning (DRDRL), a novel reward decomposition algorithm that captures the multiple-reward-channel structure under the distributional setting.
- The algorithm significantly outperforms the state-of-the-art RL method Rainbow on Atari games with multiple reward channels.
- The authors provide experimental analysis that gives insight into the algorithm.
- Future work may develop reward decomposition methods based on quantile networks (Dabney et al. [2018a,b]).

Related work

Our method is closely related to previous work on reward decomposition. Reward function decomposition has been studied by Russell and Zimdars [2003] and Sprague and Ballard [2003], among others. While these earlier works mainly focus on how to achieve an optimal policy given decomposed reward functions, several recent works attempt to learn latent decomposed rewards. Van Seijen et al. [2017] construct an easy-to-learn value function by decomposing the reward function of the environment into n different reward functions. To ensure the learned decomposition is non-trivial, they split a state into different pieces following domain knowledge and feed each piece into a separate reward-function branch. While this method can accelerate the learning process, it requires many pre-defined preprocessing steps.

Other work explores learning a reward decomposition network end-to-end. Grimm and Singh [2019] investigate how to learn independently-obtainable reward functions. While their method learns an interesting reward decomposition, it requires that the environment be resettable to specific states, since computing their objective needs multiple trajectories from the same starting state. Moreover, their method aims at learning a different optimal policy for each decomposed reward function. In contrast, our method learns a meaningful implicit reward decomposition without any prior knowledge, and it can leverage the decomposed sub-rewards to find better behaviour for a single agent.

Our work also relates to Horde (Sutton et al. [2011]). The Horde architecture consists of a large number of ‘sub-agents’ (demons) that learn in parallel via off-policy learning. Each demon trains a separate general value function (GVF) based on its own policy and pseudo-reward function, where a pseudo-reward can be any feature-based signal that encodes useful information. Horde focuses on building up general knowledge about the world, encoded via a large number of GVFs. UVFA (Schaul et al. [2015]) extends Horde in a different direction, enabling a value function to generalize across goals. Our method instead focuses on learning an implicit reward decomposition in order to learn a control policy more efficiently.

Funding

- This work was supported in part by the National Key Research & Development Plan of China (grants No. 2016YFA0602200 and 2017YFA0604500), and by the Center for High Performance Computing and System Simulation, Pilot National Laboratory for Marine Science and Technology (Qingdao).

Reference

- Marc G Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47: 253–279, 2013.
- Marc G Bellemare, Will Dabney, and Rémi Munos. A distributional perspective on reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 449–458. JMLR.org, 2017.
- Pablo Samuel Castro, Subhodeep Moitra, Carles Gelada, Saurabh Kumar, and Marc G. Bellemare. Dopamine: A Research Framework for Deep Reinforcement Learning. 2018. URL http://arxiv.org/abs/1812.06110.
- Will Dabney, Georg Ostrovski, David Silver, and Remi Munos. Implicit quantile networks for distributional reinforcement learning. In International Conference on Machine Learning, pages 1104–1113, 2018a.
- Will Dabney, Mark Rowland, Marc G Bellemare, and Rémi Munos. Distributional reinforcement learning with quantile regression. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018b.
- Christopher Grimm and Satinder Singh. Learning independently-obtainable reward functions. arXiv preprint arXiv:1901.08649, 2019.
- Matteo Hessel, Joseph Modayil, Hado Van Hasselt, Tom Schaul, Georg Ostrovski, Will Dabney, Dan Horgan, Bilal Piot, Mohammad Azar, and David Silver. Rainbow: Combining improvements in deep reinforcement learning. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
- Adrien Laversanne-Finot, Alexandre Péré, and Pierre-Yves Oudeyer. Curiosity driven exploration of learned disentangled goal spaces. arXiv preprint arXiv:1807.01521, 2018.
- Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.
- Stuart J Russell and Andrew Zimdars. Q-decomposition for reinforcement learning agents. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), pages 656– 663, 2003.
- Tom Schaul, Daniel Horgan, Karol Gregor, and David Silver. Universal value function approximators. In International Conference on Machine Learning, pages 1312–1320, 2015.
- Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034, 2013.
- Nathan Sprague and Dana Ballard. Multiple-goal reinforcement learning with modular Sarsa(0). 2003.
- Richard S Sutton, Joseph Modayil, Michael Delp, Thomas Degris, Patrick M Pilarski, Adam White, and Doina Precup. Horde: A scalable real-time architecture for learning knowledge from unsupervised sensorimotor interaction. In The 10th International Conference on Autonomous Agents and Multiagent Systems-Volume 2, pages 761–768. International Foundation for Autonomous Agents and Multiagent Systems, 2011.
- Valentin Thomas, Jules Pondard, Emmanuel Bengio, Marc Sarfati, Philippe Beaudoin, Marie-Jean Meurs, Joelle Pineau, Doina Precup, and Yoshua Bengio. Independently controllable features. arXiv preprint arXiv:1708.01289, 2017.
- Hado Van Hasselt, Arthur Guez, and David Silver. Deep reinforcement learning with double Q-learning. In Thirtieth AAAI Conference on Artificial Intelligence, 2016.
- Harm Van Seijen, Mehdi Fatemi, Joshua Romoff, Romain Laroche, Tavian Barnes, and Jeffrey Tsang. Hybrid reward architecture for reinforcement learning. In Advances in Neural Information Processing Systems, pages 5392–5402, 2017.
