Policy Distillation

CoRR, Volume abs/1511.06295, 2015.

Cited by: 288
Keywords:
deep Q-network, policy distillation, Q-network, reinforcement learning

Abstract:

Policies for complex visual tasks have been successfully learned with deep reinforcement learning, using an approach called deep Q-networks (DQN), but relatively large (task-specific) networks and extensive training are needed to achieve good performance. In this work, we present a novel method called policy distillation that can be used…

Introduction
  • Advances in deep reinforcement learning have shown that policies can be encoded through end-to-end learning from reward signals, and that these pixel-to-action policies can deliver superhuman performance on many challenging tasks (Mnih et al., 2015).
  • The method has multiple advantages: network size can be compressed by up to 15 times without degradation in performance; multiple expert policies can be combined into a single multi-task policy that can outperform the original experts; and it can be applied as a real-time, online learning process by continually distilling the best policy to a target network, efficiently tracking the evolving Q-learning policy.
  • The contribution of this work is to describe and discuss the policy distillation approach and to demonstrate results on (a) single-game distillation, (b) single-game distillation with highly compressed models, (c) multi-game distillation, and (d) online distillation (a minimal sketch of the basic distillation loop follows this list).
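At its core, the procedure pairs a trained DQN teacher, which generates data, with a student trained by supervised learning on the teacher's outputs. The following is a minimal sketch under stated assumptions, not the authors' code: `env` is assumed to expose a classic Gym-style `reset`/`step`/`action_space.sample` interface, `teacher_q` returns the teacher's Q-values for a state, and `student_update` performs one supervised gradient step with whichever distillation loss is chosen.

```python
import random
from collections import deque

import numpy as np


def collect_and_distill(env, teacher_q, student_update,
                        steps=10_000, batch_size=32,
                        buffer_size=50_000, eps=0.05):
    """Single-task policy distillation sketch (illustrative, not the authors' code).

    The teacher controls the environment, and every visited state is stored
    together with the teacher's full Q-value vector; the student is trained
    purely by supervised updates on minibatches drawn from this buffer.
    """
    buffer = deque(maxlen=buffer_size)   # replay memory of (state, teacher Q-values)
    state = env.reset()
    for t in range(steps):
        q = np.asarray(teacher_q(state))                 # teacher Q-values for all actions
        buffer.append((state, q))
        # Teacher's behaviour policy: greedy with a small amount of exploration.
        if random.random() < eps:
            action = env.action_space.sample()
        else:
            action = int(q.argmax())
        state, _reward, done, _info = env.step(action)   # classic Gym-style 4-tuple
        if done:
            state = env.reset()
        # Periodic supervised step for the student on a random minibatch.
        if len(buffer) >= batch_size and t % 4 == 0:
            states, targets = zip(*random.sample(buffer, batch_size))
            student_update(np.stack(states), np.stack(targets))
```

The `student_update` callback is where the training criteria compared in Table 1 (MSE, NLL, or KL) would plug in.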
Highlights
  • Advances in deep reinforcement learning have shown that policies can be encoded through end-to-end learning from reward signals, and that these pixel-to-action policies can deliver superhuman performance on many challenging tasks (Mnih et al., 2015)
  • We show that distillation can be used in the context of reinforcement learning (RL), a significant discovery that belies the commonly held belief that supervised learning cannot generalize to sequential prediction tasks (Barto and Dietterich, 2004)
  • Single-task policy distillation is a process of data generation by the teacher network and supervised training of the student network, as illustrated in Figure 2(a)
  • This procedure has been used for three distinct purposes: (1) to compress policies learnt on single games in smaller models, (2) to build agents that are capable of playing multiple games, (3) to improve the stability of the deep Q-network algorithm by distilling online the policy of the best performing agent
  • We have shown that in the reinforcement learning setting, special care must be taken to choose the correct loss function for distillation, and have observed that the best results are obtained by weighting action classification by a softmax of the action gap, similarly to what is suggested by the classification-based policy iteration framework of Farahmand et al. (2012) (see the loss sketch after this list)
  • Our results show that distillation can be applied to reinforcement learning, even without using an iterative approach and without allowing the student network to control the data distribution it is trained on
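To make the loss mentioned in the highlights concrete, here is a small NumPy sketch of the KL criterion on temperature-scaled teacher Q-values. The function names and the temperature value are illustrative assumptions; the point is that a low temperature sharpens the teacher's distribution towards its greedy action, which is what weighting action classification by a softmax of the action gap amounts to.

```python
import numpy as np

def softmax(x, axis=-1):
    z = x - x.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def kl_distillation_loss(teacher_q, student_logits, tau=0.01):
    """Mean KL( softmax(teacher_q / tau) || softmax(student_logits) ) over a batch.

    A small tau sharpens the teacher's distribution towards its greedy action,
    so the loss weights action classification by a softmax of the action gap.
    """
    p = softmax(np.asarray(teacher_q, dtype=float) / tau)    # sharpened teacher targets
    q = softmax(np.asarray(student_logits, dtype=float))     # student action distribution
    return float(np.mean(np.sum(p * (np.log(p + 1e-12) - np.log(q + 1e-12)), axis=-1)))

# Example: one 5-action state where the teacher slightly prefers action 2.
print(kl_distillation_loss(np.array([[1.0, 1.2, 1.5, 0.3, 0.9]]),
                           np.array([[0.1, 0.2, 0.4, 0.0, 0.1]])))
```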
Results
  • A brief overview of the training and evaluation setup is given below; complete details are in Appendix A.
  • Single-task policy distillation is a process of data generation by the teacher network and supervised training of the student network, as illustrated in Figure 2(a).
  • The authors employed a similar training procedure for multi-task policy distillation, as shown in Figure 2(b).
  • A larger network was used for multi-task distillation across 10 games.
  • A visual interpretation of the representation learned by the multi-task distillation network is given in Figure D2 (Appendix D), where t-SNE embeddings of convolutional-layer activations from 10 different games are plotted with distinct colors (a plotting sketch follows this list).
  • While activations are still game-specific, the authors observe higher within-game variance of representations, which probably reflects output statistics.
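A figure along the lines of the t-SNE visualization described above can be produced with scikit-learn; the arrays below are random placeholders standing in for recorded network activations and their game labels, so only the plotting recipe, not the data, reflects the paper.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Random placeholders: 10 games x 200 states, 256-dimensional activations each.
rng = np.random.default_rng(0)
activations = rng.normal(size=(10 * 200, 256))
game_ids = np.repeat(np.arange(10), 200)

# Embed the activations in 2-D and colour each point by the game it came from.
embedding = TSNE(n_components=2, init="pca", random_state=0).fit_transform(activations)
plt.scatter(embedding[:, 0], embedding[:, 1], c=game_ids, cmap="tab10", s=4)
plt.colorbar(label="game id")
plt.title("t-SNE of multi-task student activations (illustrative data)")
plt.show()
```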
Conclusion
  • In this work the authors have applied distillation to policies learnt by deep Q-networks.
  • This procedure has been used for three distinct purposes: (1) to compress policies learnt on single games in smaller models, (2) to build agents that are capable of playing multiple games, (3) to improve the stability of the DQN algorithm by distilling online the policy of the best performing agent.
  • The fact that the distilled policy can yield better results than its teacher adds to the growing body of evidence that distillation is a general principle for model regularization.
Tables
  • Table1: Comparison of learning criteria used for policy distillation from DQN teachers to students with identical network architectures: MSE (mean squared error), NLL (negative log likelihood), and KL (Kullback-Leibler divergence). Best relative scores are outlined in bold
  • Table2: Performance of a distilled multi-task agent on 10 Atari games. The agent is a single network that achieves 89.3% of the generalization score of 10 single-task DQN teachers, computed as a geometric mean over games (a scoring sketch follows this list)
  • Table3: Network architectures and parameter counts of models used for single-task compression experiments
  • Table4: Network architectures and parameter counts of models used for multi-task distillation experiments
  • Table5: Performance of single-task compressed networks on 10 Atari games. Best relative scores are outlined in bold
  • Table6: Performance of multi-task distilled agents on 3 Atari games. Best relative scores are outlined in bold
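Table 2's 89.3% figure is an aggregate of per-game student/teacher score ratios, and a geometric mean (rather than an arithmetic one) is the recommended way to summarize such normalized ratios (Fleming and Wallace, 1986). A minimal sketch with hypothetical scores:

```python
import numpy as np

def relative_score(student_scores, teacher_scores):
    """Geometric mean of per-game student/teacher score ratios (ratios must be positive)."""
    ratios = np.asarray(student_scores, dtype=float) / np.asarray(teacher_scores, dtype=float)
    return float(np.exp(np.log(ratios).mean()))

# Hypothetical per-game scores for a 3-game example (not the paper's numbers).
students = [4500.0, 20.0, 300.0]
teachers = [5000.0, 18.0, 400.0]
print(f"{100 * relative_score(students, teachers):.1f}% of teacher performance")
```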
Reference
  • Jimmy Ba and Rich Caruana. Do deep nets really need to be deep? In Advances in Neural Information Processing Systems (NIPS), pages 2654–2662. Curran Associates, Inc., 2014.
  • A. G. Barto and T. G. Dietterich. Handbook of learning and approximate dynamic programming. Wiley-IEEE Press, 2004.
  • Cristian Bucila, Rich Caruana, and Alexandru Niculescu-Mizil. Model compression. In KDD, pages 535–541. ACM, 2006.
  • Rich Caruana. Multitask learning. Mach. Learn., 28(1):41–75, July 1997.
  • William Chan, Nan Rosemary Ke, and Ian Lane. Transferring knowledge from a rnn to a dnn. arXiv preprint arXiv:1504.01483, 2015.
  • Amir-massoud Farahmand, Doina Precup, and Mohammad Ghavamzadeh. Generalized classification-based approximate policy iteration. In Tenth European Workshop on Reinforcement Learning (EWRL), volume 2, 2012.
  • Philip J. Fleming and John J. Wallace. How not to lie with statistics: The correct way to summarize benchmark results. Commun. ACM, 29(3):218–221, March 1986.
  • Evan Greensmith, Peter L. Bartlett, and Jonathan Baxter. Variance reduction techniques for gradient estimates in reinforcement learning. Journal of Machine Learning Research (JMLR), pages 1471–1530, 2004.
  • Xiaoxiao Guo, Satinder P. Singh, Honglak Lee, Richard L. Lewis, and Xiaoshi Wang. Deep learning for realtime atari game play using offline monte-carlo tree search planning. In Advances in Neural Information Processing Systems (NIPS), pages 3338–3346, 2014.
  • G. Hinton, O. Vinyals, and J. Dean. Distilling the Knowledge in a Neural Network. Deep Learning and Representation Learning Workshop, NIPS, 2014.
  • Levente Kocsis and Csaba Szepesvari. Bandit based Monte-Carlo planning. In Machine Learning: ECML 2006, pages 282–293. Omnipress, 2010.
  • Jinyu Li, Rui Zhao, Jui-Ting Huang, and Yifan Gong. Learning small-size dnn with output-distribution-based criteria. In Proc. Interspeech, 2014.
  • Percy Liang, Hal Daumé III, and Dan Klein. Structure compilation: trading structure for features. In Proceedings of International Conference on Machine Learning (ICML), 2008.
  • Joshua Menke and Tony Martinez. Improving supervised learning by adapting the problem to the learner. International Journal of Neural Systems, 19(01):1–9, 2009.
  • Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin A. Riedmiller. Playing atari with deep reinforcement learning. Deep Learning Workshop, NIPS, 2013.
  • Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 02 2015.
  • Arun Nair, Praveen Srinivasan, Sam Blackwell, Cagdas Alcicek, Rory Fearon, Alessandro De Maria, Vedavyas Panneershelvam, Mustafa Suleyman, Charles Beattie, Stig Petersen, Shane Legg, Volodymyr Mnih, Koray Kavukcuoglu, and David Silver. Massively parallel methods for deep reinforcement learning. CoRR, abs/1507.04296, 2015.
  • Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, and Yoshua Bengio. Fitnets: Hints for thin deep nets. arXiv preprint arXiv:1412.6550, 2014.
  • Stephane Ross, Geoffrey J Gordon, and J Andrew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. arXiv preprint arXiv:1011.0686, 2010.
  • S. Shalev-Shwartz. SelfieBoost: A Boosting Algorithm for Deep Learning. ArXiv e-prints, November 2014.
  • Richard S. Sutton and Andrew G. Barto. Introduction to Reinforcement Learning. MIT Press, Cambridge, MA, USA, 1st edition, 1998.
  • Zhiyuan Tang, Dong Wang, Yiqiao Pan, and Zhiyong Zhang. Knowledge transfer pre-training. arXiv preprint arXiv:1506.02256, 2015.
  • T. Tieleman and G. Hinton. Lecture 6.5—RmsProp: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012.
  • L.J.P. van der Maaten and G.E. Hinton. Visualizing high-dimensional data using t-SNE. Journal of Machine Learning Research (JMLR), 2008.
  • Hado van Hasselt, Arthur Guez, and David Silver. Deep reinforcement learning with double Q-learning. In Proceedings of the AAAI Conference on Artificial Intelligence, 2016.
  • Dong Wang, Chao Liu, Zhiyuan Tang, Zhiyong Zhang, and Mengyuan Zhao. Recurrent neural network training with dark knowledge transfer. arXiv preprint arXiv:1505.04630, 2015.
Appendix: Policy Distillation Training Procedure
Online data collection during policy distillation was performed under similar conditions to agent evaluation in Mnih et al. (2015). The DQN agent plays a random number of null-ops (up to 30) to initialize the episode, then acts greedily with respect to its Q-function, except for 5% of actions, which are chosen uniformly at random. Episodes can last up to 30 minutes of real-time play, or 108,000 frames. The small percentage of random actions leads to diverse game trajectories, which improves coverage of a game's state space.
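The protocol above maps onto a short control loop. The sketch below follows the stated constants (up to 30 initial null-ops, greedy actions with 5% uniform exploration, a 108,000-frame cap), but everything else, including the environment interface and the null-op action index, is an illustrative assumption.

```python
import random
import numpy as np

NOOP_ACTION = 0          # assumed index of the Atari null-op action
MAX_FRAMES = 108_000     # 30 minutes of real-time play at 60 Hz
EPSILON = 0.05           # fraction of actions chosen uniformly at random
MAX_NOOPS = 30

def play_episode(env, q_values):
    """One data-collection/evaluation episode under the stated protocol (illustrative)."""
    state = env.reset()
    # Random number of null-ops to randomize the starting state of the episode.
    for _ in range(random.randint(1, MAX_NOOPS)):
        state, _r, done, _info = env.step(NOOP_ACTION)
        if done:
            state = env.reset()
    total_reward, frames, done = 0.0, 0, False
    while not done and frames < MAX_FRAMES:
        if random.random() < EPSILON:
            action = env.action_space.sample()           # occasional random action
        else:
            action = int(np.argmax(q_values(state)))     # greedy w.r.t. the Q-function
        state, reward, done, _info = env.step(action)
        total_reward += reward
        frames += 1  # note: with frame skipping, one env step would consume several frames
    return total_reward
```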