# RD²: Reward Decomposition with Representation Decomposition

NeurIPS 2020.

Abstract:

Reward decomposition, which aims to decompose the full reward into multiple sub-rewards, has been proven beneficial for improving sample efficiency in reinforcement learning. Existing works on discovering reward decomposition are mostly policy dependent, which constrains diversified or disentangled behavior between different policies induced by different sub-rewards. […]


Introduction

- Since deep Q-learning was proposed by Mnih et al. [2015], reinforcement learning (RL) has achieved great success in decision-making problems.
- The authors propose a set of novel principles for reward decomposition by exploring the relation between sub-rewards and their relevant features.
- The authors relax the above principles and integrate them into deep learning settings, yielding the algorithm Reward Decomposition with Representation Disentanglement (RD2).

Highlights

- Since deep Q-learning was proposed by Mnih et al. [2015], reinforcement learning (RL) has achieved great success in decision-making problems.
- While general RL algorithms have been extensively studied, here we focus on RL tasks with multiple reward channels.
- We propose a set of novel reward decomposition principles, termed RD2, which encourage sub-rewards to have compact and non-trivial representations.
- RD2 can decompose rewards for arbitrary state-action pairs under general RL settings and does not rely on particular policies.
- Experiments demonstrate that RD2 greatly improves sample efficiency over existing reward decomposition methods.
- RD2 naturally connects to learning compact representations for sub-values, which speeds up RL algorithms.

Results

- Different from those works, the authors focus on reward decomposition in RL and learn a compact representation for each sub-reward.
- The authors introduce principles for finding a minimal supporting reward decomposition under factored MDPs (fMDPs).
- The authors further define the K-minimal supporting reward decomposition, which directly leads to the second principle: each sub-reward should be unique, in that its relevant features contain exclusive information.
- The intuition behind a minimal sufficient supporting sub-state is that it contains all and only the information required to compute a sub-reward.
- One trivial decomposition is r1 = r and r2 = 0, with corresponding minimal sufficient supporting sub-states s1 = s and s2 = ∅; notice that the second channel contains no information at all.
- A more general trivial decomposition is r1 = r + f and r2 = −f, with corresponding minimal sufficient supporting sub-states s1 = s and s2 = {sagent}, where f is an arbitrary function of sagent.
- A concrete instance is r1 = r − ½·rtreasure and r2 = ½·rtreasure, where the corresponding minimal sufficient supporting sub-states are s1 = s and s2 = {sagent, streasure}.
- The ideal decomposition for the Monster-Treasure environment decomposes the reward r into rmonster and rtreasure, because each sub-reward then has a compact minimal sufficient supporting sub-state.
- The minimal supporting principles define the ideal reward decomposition under a factored MDP, where selecting factors amounts to optimizing a Boolean mask over the factors.
- The authors would first find the minimal sufficient supporting sub-state si of a given sub-reward ri and then evaluate its entropy H(si).
- This objective cannot backpropagate to ri, since the operation of finding the minimal sufficient supporting sub-state is not differentiable.
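To make the trivial-versus-ideal contrast above concrete, here is a minimal Python sketch of a Monster-Treasure-style setting. The state factors, reward values, and the function f are invented for illustration; they are not taken from the paper.

```python
# Toy Monster-Treasure setting: the full reward is the sum of a monster
# penalty and a treasure bonus. Values (-10, +5) are illustrative only.

def full_reward(state):
    """Full reward r: penalty when the agent meets the monster,
    bonus when it reaches the treasure."""
    r_monster = -10.0 if state["agent"] == state["monster"] else 0.0
    r_treasure = 5.0 if state["agent"] == state["treasure"] else 0.0
    return r_monster + r_treasure

# Ideal decomposition: each sub-reward is computable from a compact
# sub-state ({agent, monster} and {agent, treasure} respectively).
def r1_ideal(state):
    return -10.0 if state["agent"] == state["monster"] else 0.0

def r2_ideal(state):
    return 5.0 if state["agent"] == state["treasure"] else 0.0

# Trivial decomposition r1 = r + f, r2 = -f: it still sums to r, but r1
# now needs the whole state s as its supporting sub-state, so the
# decomposition is not compact.
def f(state):
    return float(state["agent"])  # hypothetical f depending only on s_agent

def r1_trivial(state):
    return full_reward(state) + f(state)

def r2_trivial(state):
    return -f(state)

state = {"agent": 3, "monster": 3, "treasure": 7}
assert r1_ideal(state) + r2_ideal(state) == full_reward(state)
assert r1_trivial(state) + r2_trivial(state) == full_reward(state)
```

Both decompositions sum to the full reward; what distinguishes them is only the size of each sub-reward's supporting sub-state, which is exactly what the minimal supporting principles penalize.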
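Because the mask-selection step is not differentiable, one standard workaround is to relax the Boolean mask to continuous gates in [0, 1] and penalize the mask's mass as a differentiable surrogate for the compactness term H(si). The numpy sketch below illustrates this relaxation on a toy factored state; the linear predictor, penalty weight, and data are assumptions for illustration, not the paper's exact objective.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy factored state with 4 factors; the sub-reward r1 depends only on
# factor 0, so the compact supporting sub-state is {factor 0}.
X = rng.normal(size=(256, 4))
r1 = 2.0 * X[:, 0]

def masked_loss(mask, w, lam=0.1):
    """Prediction error of a linear predictor on mask-gated factors,
    plus an L1 penalty on the mask as a differentiable surrogate for
    the 'compact sub-state' objective."""
    pred = (X * mask) @ w
    return np.mean((pred - r1) ** 2) + lam * np.sum(mask)

w = np.array([2.0, 0.0, 0.0, 0.0])        # predictor weights, fixed for the demo
compact = np.array([1.0, 0.0, 0.0, 0.0])  # gates only the relevant factor
full = np.ones(4)                         # gates every factor

# Both masks predict r1 perfectly, but the compact one pays a smaller
# penalty, so the relaxed objective prefers it.
assert masked_loss(compact, w) < masked_loss(full, w)
```

In a learned version the mask entries would be produced by a sigmoid and trained by gradient descent jointly with the predictor; the point here is only that the relaxed penalty makes "prefer the smaller supporting sub-state" differentiable.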

Conclusion

- The authors first present the results of reward decomposition and visualize the trained masks using saliency maps on several Atari games, and show that the decomposed rewards can accelerate the training process of existing RL algorithms.
- The authors propose a set of novel reward decomposition principles, termed RD2, which encourage sub-rewards to have compact and non-trivial representations.
- RD2 is capable of decomposing rewards for arbitrary state-action pairs under general RL settings and does not rely on policies.


- Table 1: Example of reward decomposition on the Monster-Treasure environment

References

- Drew Bagnell and Andrew Y. Ng. On local rewards and scaling distributed reinforcement learning. In Y. Weiss, B. Schölkopf, and J. C. Platt, editors, Advances in Neural Information Processing Systems 18, pages 91–98. MIT Press, 2006. URL http://papers.nips.cc/paper/2951-on-local-rewards-and-scaling-distributed-reinforcement-learning.pdf.
- Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1798–1828, 2013. URL http://arxiv.org/abs/1206.5538.
- C. Boutilier, T. Dean, and S. Hanks. Decision-Theoretic Planning: Structural Assumptions and Computational Leverage. JAIR, 11:1–94, 1999.
- Craig Boutilier, Richard Dearden, and Moisés Goldszmidt. Exploiting structure in policy construction. In IJCAI, pages 1104–1113. Morgan Kaufmann, 1995. URL http://dblp.uni-trier.de/db/conf/ijcai/ijcai95.html#BoutilierDG95.
- Pablo Samuel Castro, Subhodeep Moitra, Carles Gelada, Saurabh Kumar, and Marc G. Bellemare. Dopamine: A Research Framework for Deep Reinforcement Learning. 2018. URL http://arxiv.org/abs/1812.06110.
- Tian Qi Chen, Xuechen Li, Roger B. Grosse, and David Duvenaud. Isolating sources of disentanglement in variational autoencoders. In Samy Bengio, Hanna M. Wallach, Hugo Larochelle, Kristen Grauman, Nicolò Cesa-Bianchi, and Roman Garnett, editors, NeurIPS, pages 2615–2625, 2018. URL http://dblp.uni-trier.de/db/conf/nips/nips2018.html#ChenLGD18.
- Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. Infogan: Interpretable representation learning by information maximizing generative adversarial nets. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 2172–2180. Curran Associates, Inc., 2016. URL http://papers.nips.cc/paper/
- Maxime Chevalier-Boisvert, Lucas Willems, and Suman Pal. Minimalistic gridworld environment for openai gym. https://github.com/maximecb/gym-minigrid, 2018.
- Christopher Grimm and Satinder Singh. Learning independently-obtainable reward functions. CoRR, abs/1901.08649, 2019. URL http://dblp.uni-trier.de/db/journals/corr/corr1901.html#abs-1901-08649.
- Matteo Hessel, Joseph Modayil, Hado Van Hasselt, Tom Schaul, Georg Ostrovski, Will Dabney, Dan Horgan, Bilal Piot, Mohammad Azar, and David Silver. Rainbow: Combining improvements in deep reinforcement learning. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
- Irina Higgins, Loïc Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. beta-vae: Learning basic visual concepts with a constrained variational framework. In ICLR (Poster). OpenReview.net, 2017. URL http://dblp.uni-trier.de/db/conf/iclr/iclr2017.html#HigginsMPBGBML17.
- Wei-Ning Hsu, Yu Zhang, and James R. Glass. Unsupervised learning of disentangled and interpretable representations from sequential data. In Isabelle Guyon, Ulrike von Luxburg, Samy Bengio, Hanna M. Wallach, Rob Fergus, S. V. N. Vishwanathan, and Roman Garnett, editors, NIPS, pages 1878–1889, 2017. URL http://dblp.uni-trier.de/db/conf/nips/nips2017.html#HsuZG17.
- Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- Yuanpeng Li, Liang Zhao, Jianyu Wang, and Joel Hestness. Compositional generalization for primitive substitutions. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4284–4293, 2019.
- Zichuan Lin, Li Zhao, Derek Yang, Tao Qin, Tie-Yan Liu, and Guangwen Yang. Distributional reward decomposition for reinforcement learning. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32, pages 6212–6221. Curran Associates, Inc., 2019. URL http://papers.nips.cc/paper/8852-distributional-reward-decomposition-for-reinforcement-learning.pdf.
- Michael Littman and Justin Boyan. A distributed reinforcement learning scheme for network routing, 1993.
- Bhaskara Marthi. Automatic shaping and decomposition of reward functions, 2007.
- Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, February 2015. ISSN 00280836. URL http://dx.doi.org/10.1038/nature14236.
- OpenAI: Christopher Berner, Greg Brockman, Brooke Chan, Vicki Cheung, Przemysław Debiak, Christy Dennison, David Farhi, Quirin Fischer, Shariq Hashme, Chris Hesse, Rafal Józefowicz, Scott Gray, Catherine Olsson, Jakub Pachocki, Michael Petrov, Henrique Pondé de Oliveira Pinto, Jonathan Raiman, Tim Salimans, Jeremy Schlatter, Jonas Schneider, Szymon Sidor, Ilya Sutskever, Jie Tang, Filip Wolski, and Susan Zhang. Dota 2 with large scale deep reinforcement learning, 2019.
- Jonas Peters, Dominik Janzing, and Bernhard Schölkopf. Elements of Causal Inference: Foundations and Learning Algorithms. Adaptive Computation and Machine Learning. MIT Press, Cambridge, MA, 2017. ISBN 978-0-262-03731-0. URL https://mitpress.mit.edu/books/elements-causal-inference.
- Martin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc., New York, NY, USA, 1st edition, 1994. ISBN 0471619779.
- Stuart Russell and Andrew Zimdars. Q-decomposition for reinforcement learning agents. In Proceedings of the Twentieth International Conference on Machine Learning (ICML), pages 656–663, 2003.
- Jeff Schneider, Weng-Keen Wong, Andrew Moore, and Martin Riedmiller. Distributed value functions. In Proceedings of the Sixteenth International Conference on Machine Learning, pages 371–378. Morgan Kaufmann, 1999.
- Julian Schrittwieser, Ioannis Antonoglou, Thomas Hubert, Karen Simonyan, Laurent Sifre, Simon Schmitt, Arthur Guez, Edward Lockhart, Demis Hassabis, Thore Graepel, et al. Mastering atari, go, chess and shogi by planning with a learned model. arXiv preprint arXiv:1911.08265, 2019.
- Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034, 2013.
- Harm Van Seijen, Mehdi Fatemi, Joshua Romoff, Romain Laroche, Tavian Barnes, and Jeffrey Tsang. Hybrid reward architecture for reinforcement learning. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 5392–5402. Curran Associates, Inc., 2017. URL http://papers.nips.cc/paper/7123-hybrid-reward-architecture-for-reinforcement-learning.pdf.
