Large scale distributed neural network training through online distillation
Techniques such as ensembling and distillation promise model quality improvements when paired with almost any base model. However, due to increased test-time cost (for ensembles) and increased complexity of the training pipeline (for distillation), these techniques are challenging to use in industrial settings. In this paper we explore a variant of distillation which is relatively straightforward to use, as it does not require a complicated multi-stage setup or many new hyperparameters. Our first claim is that online distillation enabl...
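The online-distillation (codistillation) idea the abstract describes can be illustrated with a minimal sketch: two models train on the same data in parallel, and each one's targets mix the true labels with the other model's current predictions. This is an illustrative toy with two linear softmax models and a hypothetical mixing weight `lam`, not the paper's actual implementation.

```python
import numpy as np

# Toy codistillation sketch: two linear softmax classifiers, each trained
# toward a mixture of the true labels and the *other* model's predictions.
# All names (grad_step, lam) are illustrative assumptions, not from the paper.

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def grad_step(W, X, targets, lr=0.1):
    # One gradient step on cross-entropy between softmax(X @ W) and targets.
    p = softmax(X @ W)
    return W - lr * X.T @ (p - targets) / len(X)

# Linearly separable 2-class toy data.
X = rng.normal(size=(64, 5))
y = (X[:, 0] > 0).astype(int)
onehot = np.eye(2)[y]

W1 = rng.normal(scale=0.01, size=(5, 2))
W2 = rng.normal(scale=0.01, size=(5, 2))
lam = 0.5  # weight on the distillation term (assumed value)

for step in range(200):
    p1, p2 = softmax(X @ W1), softmax(X @ W2)
    # Each model fits the labels plus the peer's (possibly stale) predictions.
    t1 = (onehot + lam * p2) / (1 + lam)
    t2 = (onehot + lam * p1) / (1 + lam)
    W1 = grad_step(W1, X, t1)
    W2 = grad_step(W2, X, t2)

acc = (softmax(X @ W1).argmax(1) == y).mean()
```

In a distributed setting, the appeal is that exchanging predictions (or occasionally refreshed peer checkpoints) tolerates staleness far better than exchanging gradients every step.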
- 6. Ioannis Mitliagkas, Ce Zhang, Stefan Hadjis, Christopher Ré. Asynchrony begets Momentum, with an Application to Deep Learning. Allerton, pp. 997-1004, 2016.
- 7. Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Lukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado, Macduff Hughes, Jeffrey Dean. Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation. CoRR, 2016.
- 8. Sam Gross, Marc'Aurelio Ranzato, Arthur Szlam. Hard Mixtures of Experts for Large Scale Weakly Supervised Vision. CVPR, pp. 5085-5093, 2017.
- 9. Jimmy Ba, Roger B. Grosse, James Martens. Distributed Second-Order Optimization using Kronecker-Factored Approximations. ICLR, 2017.