I4R: Promoting Deep Reinforcement Learning by the Indicator for Expressive Representations

IJCAI 2020, pp. 2669-2675, 2020.

Abstract:

Learning expressive representations is always crucial for well-performed policies in deep reinforcement learning (DRL). Different from supervised learning, in DRL, accurate targets are not always available, and some inputs with different actions only have tiny differences, which stimulates the demand for learning expressive representation...

Introduction
  • Along with the development of deep learning techniques, deep reinforcement learning (DRL) models have been more widely used in decision-making tasks and automatic control tasks [Mnih et al., 2015; Silver et al., 2016; Schulman et al., 2017].
  • Convolutional neural networks (CNNs) and recurrent neural networks (RNNs) are two typical kinds of state extractors (SEs), for image-based games [Mnih et al., 2015] and natural language-based games [Zhao and Eskenazi, 2016], respectively (a generic sketch of an SE with policy/value heads follows this list).
  • The other is to take actions according to the representations and the currently learned policies; the whole DRL model can then be optimized by gradient/value-based DRL algorithms
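
To make the SE-plus-policy structure above concrete, here is a generic PyTorch sketch (not the authors' exact architecture): a convolutional SE maps stacked Atari frames to a last-hidden-layer representation, and small policy/value heads act on that representation. The layer sizes follow the common Atari setup of [Mnih et al., 2015; Mnih et al., 2016] and are assumptions rather than the paper's specification.

```python
# A generic sketch (not the authors' exact architecture) of a CNN state
# extractor (SE) feeding policy/value heads. Layer sizes follow the common
# Atari setup of [Mnih et al., 2015; Mnih et al., 2016] and are assumptions.
import torch
import torch.nn as nn

class CNNStateExtractor(nn.Module):
    """Maps a stack of 84x84 frames to a last-hidden-layer representation."""
    def __init__(self, in_channels: int = 4, hidden: int = 512):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
        )
        self.fc = nn.Sequential(nn.Flatten(), nn.Linear(64 * 7 * 7, hidden), nn.ReLU())

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        return self.fc(self.conv(frames))  # (batch, hidden) representation

class ActorCritic(nn.Module):
    """Representation -> action logits (policy) and state value."""
    def __init__(self, n_actions: int, hidden: int = 512):
        super().__init__()
        self.se = CNNStateExtractor(hidden=hidden)
        self.policy = nn.Linear(hidden, n_actions)
        self.value = nn.Linear(hidden, 1)

    def forward(self, frames: torch.Tensor):
        rep = self.se(frames)
        return self.policy(rep), self.value(rep), rep

# Example: a batch of 8 stacks of four 84x84 grayscale frames.
logits, value, rep = ActorCritic(n_actions=18)(torch.zeros(8, 4, 84, 84))
```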
Highlights
  • Along with the development of deep learning techniques, deep reinforcement learning (DRL) models have been more widely used in decision-making tasks and automatic control tasks [Mnih et al., 2015; Silver et al., 2016; Schulman et al., 2017]
  • Actions are taken according to the representations and the currently learned policies, and the whole deep reinforcement learning model can be optimized by gradient/value-based deep reinforcement learning algorithms
  • All exploratory experiments are implemented on a popular decision-making task, Atari games, with high-dimensional video frame inputs, and we aim to find out the relationship between the representations extracted by the state extractor and the performance of deep reinforcement learning algorithms
  • We find that our proposed method outperforms the baseline on 20 games and, compared with the baseline, obtains more than twice the human normalized score for the median over 30 games (i.e., 15% vs. 38%; a sketch of this score follows the list), which demonstrates the superiority of I4R over DQN and supports our claim that encouraging a highly expressive state extractor promotes the performance of various deep reinforcement learning algorithms
  • We mainly study the relationship between representations and performance of the deep reinforcement learning agents
  • We conduct experiments for promoting A3C and DQN, on 55 and 30 Atari games, respectively, which demonstrates that I4R can promote the deep reinforcement learning performance significantly
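
For reference, the human normalized score quoted above is commonly computed per game from the agent's, a random policy's, and a human player's raw scores, and then aggregated by the median across games. The sketch below follows that common convention (as in [Mnih et al., 2015]) with made-up per-game numbers, not the paper's actual data.

```python
# A minimal sketch of the human-normalized-score aggregation referenced above.
# The per-game agent/random/human scores below are made-up placeholders, not
# the paper's data; the formula follows the convention of [Mnih et al., 2015].
import numpy as np

def human_normalized_score(agent: float, random: float, human: float) -> float:
    """(agent - random) / (human - random), expressed as a fraction of human level."""
    return (agent - random) / (human - random)

# Hypothetical per-game raw scores for illustration only.
games = {
    #           agent    random   human
    "GameA":   (1200.0,   150.0, 7000.0),
    "GameB":   ( 300.0,    25.0,  800.0),
    "GameC":   (  90.0,    10.0,  120.0),
}

per_game = [human_normalized_score(a, r, h) for (a, r, h) in games.values()]
# The paper reports the median of these per-game scores over all evaluated games
# (e.g., 30 games in the DQN experiments).
print(f"median human normalized score: {np.median(per_game):.1%}")
```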
Methods
  • The authors present the proposed method I4R based on exploratory experiments, in three parts: the observations, the proposed indicator NSSV, and the novel algorithm I4R.

    3.1 Observations on Exploratory Experiments

    All exploratory experiments are implemented on a popular decision-making task, Atari games, with high-dimensional video frame inputs, and the authors aim to find out the relationship between the representations extracted by the SE and the performance of DRL algorithms.

    The authors select two DRL models with the same number of last-hidden-layer units but different model sizes, and have each play the Atari game Gravitar for 200M frames.
  • The authors use the raw frames in the trajectory played by the large trained model as the input to three SEs (the large trained, small trained, and large random models; cf. Table 1), and plot the two-dimensional embeddings of the last-hidden-layer representation matrices produced by these three models in Fig. 2.
  • These plots are generated by applying SVD dimension reduction to the representation matrices (a minimal sketch follows this list).
  • From Fig. 2, the authors observe that the embedding of representations generated by the model with
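
A minimal sketch of that SVD-based visualization, assuming each SE yields an (n_frames × n_hidden) representation matrix over the trajectory; the random matrices below are placeholders for the actual features, and this is not the authors' plotting code.

```python
# A minimal sketch (not the authors' plotting code) of the SVD-based 2-D
# embedding of last-hidden-layer representation matrices. The random matrices
# stand in for the real features of the three SEs compared in Table 1.
import numpy as np
import matplotlib.pyplot as plt

def svd_embed_2d(reps: np.ndarray) -> np.ndarray:
    """Project an (n_frames, n_hidden) representation matrix onto its top two
    right singular directions (an uncentered PCA-style reduction)."""
    _, _, vt = np.linalg.svd(reps, full_matrices=False)
    return reps @ vt[:2].T  # shape: (n_frames, 2)

rng = np.random.default_rng(0)
reps_by_model = {
    "Large Trained": rng.normal(size=(500, 512)),  # placeholder features
    "Small Trained": rng.normal(size=(500, 512)),
    "Large Random":  rng.normal(size=(500, 512)),
}

for name, reps in reps_by_model.items():
    xy = svd_embed_2d(reps)
    plt.scatter(xy[:, 0], xy[:, 1], s=4, label=name)
plt.legend()
plt.title("2-D SVD embedding of last-hidden-layer representations")
plt.show()
```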
Conclusion
  • The authors mainly study the relationship between representations and performance of the DRL agents.
  • The authors define the NSSV indicator, i.e., the smallest number of significant singular values, as a measurement of the learned representations; they verify the positive correlation between NSSV and the rewards, and further propose a novel method called I4R, which improves DRL algorithms by adding a corresponding regularization term to enhance NSSV (a sketch follows this list).
  • In future work, the authors will further study whether I4R also has an influence on the supervised/unsupervised learning process
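
A minimal sketch of the NSSV indicator and an NSSV-encouraging regularizer; both the threshold-based definition of NSSV and the entropy-style surrogate below are assumptions for illustration, not the paper's exact formulas (the real regularization term is given by Eq. 6, with the weight α ablated in Table 4).

```python
# A minimal sketch of the NSSV indicator and an NSSV-encouraging regularizer.
# Both the threshold-based definition of NSSV and the entropy-style surrogate
# are assumptions for illustration; the paper's exact formulas (e.g., Eq. 6)
# should be consulted for the real method.
import torch

def nssv(reps: torch.Tensor, threshold: float = 0.9) -> int:
    """Smallest number of singular values whose cumulative sum reaches
    `threshold` of the total singular-value mass of a representation matrix."""
    s = torch.linalg.svdvals(reps)              # singular values, descending
    cum = torch.cumsum(s, dim=0) / s.sum()
    return int((cum < threshold).sum().item()) + 1

def nssv_regularizer(reps: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Differentiable surrogate that grows when the singular values are spread
    more evenly, and hence tends to push NSSV up."""
    s = torch.linalg.svdvals(reps)
    p = s / (s.sum() + eps)                     # normalize to a distribution
    return -(p * torch.log(p + eps)).sum()      # entropy over singular values

# Schematic usage inside a training step, with alpha as the regularization
# weight (cf. the ablation on alpha in Table 4):
#   loss = rl_loss - alpha * nssv_regularizer(last_hidden_batch)
```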
Tables
  • Table 1: Relationship between model performance and NSSV (LT = Large Trained; ST = Small Trained; LR = Large Random)
  • Table 2: Results on Atari games with different hyper-parameters
  • Table 3: Comparison between two regularization terms via human normalized scores (Eq. 6)
  • Table 4: Ablation study on α
Related work
  • In recent years, there have been many studies on learning representations. In the DRL area alone, there are, generally speaking, two main categories of methods for learning representations. One is based on auxiliary models, such as auto-encoders [Mattner et al., 2012; Higgins et al., 2017], generative adversarial networks [Donahue et al., 2017; Shelhamer et al., 2016], and other models [Oh et al., 2017; Racanière et al., 2017]. Such auxiliary models can help to improve the learned representations by completing certain tasks, such as reconstructing the current observation or state [Watter et al., 2015], predicting future observations or states [François-Lavet et al., 2019], and recovering the actions given transitions [Zhang et al., 2018]; essential information for taking actions can be retained by completing these auxiliary tasks. The other is based on prior task-specific knowledge/information. For example, in [Goel et al., 2018], the authors proposed to use the detection of moving objects to learn to play video games better. In [Jonschkowski and Brock, 2015], the authors proposed to use robotic prior knowledge for robot learning. In [Zhao and Eskenazi, 2016], task-related information was utilized in dialog systems. Different from these methods, in this paper we simply add a regularization term without any task-specific knowledge/information, and we do not need to build or train any extra models.
Funding
  • This work was supported by the National Natural Science Foundation of China (No. 61421003)
Reference
  • [Bellemare et al., 2013] Marc G. Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The arcade learning environment: An evaluation platform for general agents. JAIR, 47:253–279, 2013.
  • [Brockman et al., 2016] Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym, 2016.
  • [Donahue et al., 2017] Jeff Donahue, Philipp Krähenbühl, and Trevor Darrell. Adversarial feature learning. In ICLR, 2017.
  • [François-Lavet et al., 2019] Vincent François-Lavet, Yoshua Bengio, Doina Precup, and Joelle Pineau. Combined reinforcement learning via abstract representations. In AAAI, 2019.
  • [Goel et al., 2018] Vikash Goel, Jameson Weng, and Pascal Poupart. Unsupervised video object segmentation for deep reinforcement learning. In NeurIPS, pages 5683–5694, 2018.
  • [Higgins et al., 2017] Irina Higgins, Arka Pal, Andrei A. Rusu, Loic Matthey, Christopher P. Burgess, Alexander Pritzel, Matthew Botvinick, Charles Blundell, and Alexander Lerchner. DARLA: Improving zero-shot transfer in reinforcement learning. arXiv, 2017.
  • [Hjelm et al., 2019] R. Devon Hjelm, Alex Fedorov, Samuel Lavoie-Marchildon, Karan Grewal, Phil Bachman, Adam Trischler, and Yoshua Bengio. Learning deep representations by mutual information estimation and maximization. In ICLR, 2019.
  • [Houle, 2017] Michael E. Houle. Local intrinsic dimensionality I: An extreme-value-theoretic foundation for similarity applications. In SISAP, pages 64–79, 2017.
  • [Jonschkowski and Brock, 2015] Rico Jonschkowski and Oliver Brock. Learning state representations with robotic priors. Autonomous Robots, 39(3):407–428, 2015.
  • [Kostrikov, 2018] Ilya Kostrikov. PyTorch implementations of A3C. https://github.com/ikostrikov/pytorch-a3c, 2018. Accessed on Feb 28, 2018.
  • [Lerer and Peysakhovich, 2018] Adam Lerer and Alexander Peysakhovich. Learning social conventions in Markov games. arXiv, 2018.
  • [Mattner et al., 2012] Jan Mattner, Sascha Lange, and Martin Riedmiller. Learn to swing up and balance a real pole based on raw visual input data. In ICONIP, pages 126–133, 2012.
  • [Mnih et al., 2015] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
  • [Mnih et al., 2016] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In ICML, pages 1928–1937, 2016.
  • [Munk et al., 2016] Jelle Munk, Jens Kober, and Robert Babuška. Learning state representation for deep actor-critic control. In CDC, pages 4667–4673, 2016.
  • [Oh et al., 2017] Junhyuk Oh, Satinder Singh, and Honglak Lee. Value prediction network. In NeurIPS, pages 6118–6128, 2017.
  • [Peysakhovich and Lerer, 2018] Alexander Peysakhovich and Adam Lerer. Consequentialist conditional cooperation in social dilemmas with imperfect information. In ICLR, 2018.
  • [Racanière et al., 2017] Sébastien Racanière, Théophane Weber, David P. Reichert, Lars Buesing, Arthur Guez, Danilo Rezende, Adria Puigdomènech Badia, Oriol Vinyals, Nicolas Heess, Yujia Li, et al. Imagination-augmented agents for deep reinforcement learning. In NeurIPS, pages 5690–5701, 2017.
  • [Schulman et al., 2017] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv, 2017.
  • [Shelhamer et al., 2016] Evan Shelhamer, Parsa Mahmoudieh, Max Argus, and Trevor Darrell. Loss is its own reward: Self-supervision for reinforcement learning. arXiv, 2016.
  • [Silver et al., 2016] David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484, 2016.
  • [Tran et al., 2017] Luan Tran, Xi Yin, and Xiaoming Liu. Disentangled representation learning GAN for pose-invariant face recognition. In CVPR, pages 1415–1424, 2017.
  • [Van Hasselt et al., 2016] Hado van Hasselt, Arthur Guez, and David Silver. Deep reinforcement learning with double Q-learning. In AAAI, 2016.
  • [Vincent et al., 2010] Pascal Vincent, Hugo Larochelle, Isabelle Lajoie, Yoshua Bengio, and Pierre-Antoine Manzagol. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. JMLR, 11(Dec.):3371–3408, 2010.
  • [Wang et al., 2016] Ziyu Wang, Tom Schaul, Matteo Hessel, Hado van Hasselt, Marc Lanctot, and Nando de Freitas. Dueling network architectures for deep reinforcement learning. In ICML, pages 1995–2003, 2016.
  • [Watter et al., 2015] Manuel Watter, Jost Springenberg, Joschka Boedecker, and Martin Riedmiller. Embed to control: A locally linear latent dynamics model for control from raw images. In NeurIPS, pages 2746–2754, 2015.
  • [Zhang et al., 2018] Amy Zhang, Harsh Satija, and Joelle Pineau. Decoupling dynamics and reward for transfer learning. arXiv, 2018.
  • [Zhao and Eskenazi, 2016] Tiancheng Zhao and Maxine Eskenazi. Towards end-to-end learning for dialog state tracking and management using deep reinforcement learning. arXiv, 2016.