# BackPACK: Packing more into Backprop

ICLR, 2020.

Abstract:

Automatic differentiation frameworks are optimized for exactly one thing: computing the average mini-batch gradient. Yet, other quantities such as the variance of the mini-batch gradients or many approximations to the Hessian can, in theory, be computed efficiently, and at the same time as the gradient. While these quantities are of great […]

Introduction

- The success of deep learning and the applications it fuels can be traced to the popularization of automatic differentiation frameworks.
- Packages like TENSORFLOW (Abadi et al, 2016), CHAINER (Tokui et al, 2015), MXNET (Chen et al, 2015), and PYTORCH (Paszke et al, 2019) provide efficient implementations of parallel, GPU-based gradient computations to a wide range of users, with elegant syntactic sugar.
- Other quantities can be computed with automatic differentiation at a comparable cost or minimal overhead to the gradient backpropagation pass; for example, approximate second-order information or the variance of gradients within the batch.
- Researchers who want to investigate their use face a chicken-and-egg problem: the automatic differentiation tools required to go beyond standard gradient methods are not available, but there is no incentive to implement them in existing deep-learning software as long as only a small portion of users needs them.
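One of the quantities mentioned above, the within-batch gradient variance, follows directly from the first and second moments of the individual (per-sample) gradients. The following pure-Python sketch illustrates the arithmetic; the function name and the hard-coded gradients are hypothetical, and a real implementation would obtain the per-sample gradients from a framework such as BACKPACK rather than as Python lists.

```python
def batch_gradient_variance(individual_grads):
    """Elementwise variance of per-sample gradients across a mini-batch.

    individual_grads: list of N gradient vectors (lists of floats).
    Returns the biased (1/N) variance per coordinate:
        Var_j = (1/N) sum_n g_{n,j}^2 - [(1/N) sum_n g_{n,j}]^2
    """
    n = len(individual_grads)
    d = len(individual_grads[0])
    mean = [sum(g[j] for g in individual_grads) / n for j in range(d)]
    second_moment = [sum(g[j] ** 2 for g in individual_grads) / n
                     for j in range(d)]
    return [second_moment[j] - mean[j] ** 2 for j in range(d)]

# Example: three per-sample gradients of a 2-parameter model.
grads = [[1.0, 2.0], [3.0, 2.0], [2.0, 2.0]]
print(batch_gradient_variance(grads))  # second coordinate agrees across samples -> variance 0
```

The key point of the paper is that the second moment can be accumulated during the same backward pass that computes the mean gradient, so the variance comes at minimal extra cost.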

Highlights

- The success of deep learning and the applications it fuels can be traced to the popularization of automatic differentiation frameworks
- Other quantities can be computed with automatic differentiation at a comparable cost or minimal overhead to the gradient backpropagation pass; for example, approximate second-order information or the variance of gradients within the batch. These quantities are valuable to understand the geometry of deep neural networks, for the identification of free parameters, and to push the development of more efficient optimization algorithms
- To address this need for a specialized framework focused on machine learning, the authors propose a framework for generalized backpropagation that computes additional quantities
- Our results show that the curvature approximations based on Monte-Carlo (MC) estimates of the generalized Gauss-Newton, the approach used by KFAC, give similar progress per iteration to their more accurate counterparts, while being much cheaper to compute
- Regarding second-order extensions, the computation of the generalized Gauss-Newton can be expensive for networks with high-dimensional outputs, such as classifiers for CIFAR-100, regardless of whether the approximation is diagonal or Kronecker-factored
- To support research and development in optimization for deep learning, we have introduced BACKPACK, an efficient implementation in PYTORCH of recent conceptual advances and extensions to backpropagation (Tab. 1 lists all features)

Methods

- To illustrate the utility of BACKPACK, the authors implement preconditioned gradient descent optimizers using diagonal and Kronecker approximations of the GGN.
- The update rule the authors implement uses a curvature matrix G(θ_t^{(i)}), which can be a diagonal or Kronecker factorization of the GGN blocks, and a damping parameter λ to precondition the gradient: θ_{t+1}^{(i)} = θ_t^{(i)} − α (G(θ_t^{(i)}) + λI)^{−1} ∇L(θ_t^{(i)}), for i = 1, …, L
- For the Kronecker-factored quantities, the authors use the approximation introduced by Martens & Grosse (2015)
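For the diagonal case, the update rule above reduces to an elementwise division, since inverting a diagonal (G + λI) is trivial. The sketch below is a minimal pure-Python illustration of that step; the function name and the numbers are hypothetical, and the diagonal curvature stands in for the DiagGGN quantities that BACKPACK extracts during the backward pass.

```python
def diag_preconditioned_step(theta, grad, curv_diag, lr=0.1, damping=1e-2):
    """One step of theta <- theta - lr * (G + damping*I)^{-1} grad,
    where G is diagonal, so the inverse is an elementwise division."""
    return [t - lr * g / (c + damping)
            for t, g, c in zip(theta, grad, curv_diag)]

theta = [1.0, -2.0]
grad = [0.5, 0.5]
curv = [4.0, 0.0]  # zero curvature in the 2nd coordinate: damping takes over
print(diag_preconditioned_step(theta, grad, curv, lr=0.1, damping=1.0))
```

The damping parameter λ keeps the update bounded in directions of near-zero curvature, which is why it appears inside the inverse rather than as a separate step-size adjustment.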

Results

**EVALUATION AND BENCHMARKS**

The authors benchmark the overhead of BACKPACK on the CIFAR-10 and CIFAR-100 datasets, using the 3C3D network provided by DEEPOBS (Schneider et al, 2019) and the ALL-CNN-C network of Springenberg et al (2015).
- The MC approximation used by KFAC, which the authors also implement for a diagonal approximation, can be computed at minimal overhead, much less than the cost of two backward passes.
- This last point is encouraging, as the optimization experiments in Section 4 suggest that this approximation is reasonably accurate
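Why Monte-Carlo curvature estimates can be both cheap and reasonably accurate can be illustrated with a generic Hutchinson-style MC diagonal estimator, diag(A) = E[v * (A v)] for random sign vectors v. Note this is only a stand-in chosen for illustration: the MC scheme used by KFAC and BACKPACK instead samples labels from the model's output distribution, but the cost/accuracy trade-off is analogous. All names and the toy matrix below are hypothetical.

```python
import random

def mc_diagonal(matvec, dim, num_samples, rng):
    """Estimate the diagonal of a matrix given only matrix-vector products,
    by averaging v * (A v) over random Rademacher (+/-1) vectors v."""
    est = [0.0] * dim
    for _ in range(num_samples):
        v = [rng.choice((-1.0, 1.0)) for _ in range(dim)]
        av = matvec(v)
        for j in range(dim):
            est[j] += v[j] * av[j]
    return [e / num_samples for e in est]

# Tiny symmetric "curvature" matrix and its matrix-vector product.
A = [[2.0, 0.3, 0.1],
     [0.3, 1.0, 0.2],
     [0.1, 0.2, 3.0]]
matvec = lambda v: [sum(A[i][j] * v[j] for j in range(3)) for i in range(3)]

rng = random.Random(0)
estimate = mc_diagonal(matvec, dim=3, num_samples=20000, rng=rng)
# The estimate approaches the true diagonal [2.0, 1.0, 3.0] as samples grow.
print(estimate)
```

Each sample costs only one matrix-vector product, mirroring the benchmark finding that the MC approximation costs much less than an extra exact backward pass.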

Conclusion

- Machine learning’s coming-of-age has been accompanied, and in part driven, by a maturing of the software ecosystem
- This has drastically simplified the lives of developers and researchers alike, but has crystallized parts of the algorithmic landscape.
- This has dampened research in cutting-edge areas that are far from mature, like second-order optimization for deep neural networks.
- The authors hope that BACKPACK, and studies like this one, will help the ML software ecosystem mature further


- Table 1: Overview of the features supported in the first release of BACKPACK
- Table 2: Test problems considered from the DEEPOBS library (Schneider et al, 2019)
- Table 3: Best hyperparameter settings for optimizers and baselines shown in this work. In the Momentum baselines, the momentum was fixed to 0.9. Parameters for the running averages in Adam use the default values (β1, β2) = (0.9, 0.999). Symbols denote whether the hyperparameter setting is an interior point of the grid or not
- Table 4: Overview of the features supported in the first release of BACKPACK. The quantities are computed separately for all module parameters, i.e. i = 1, . . . , L

Funding

- The authors would like to thank Aaron Bahde, Ludwig Bald, and Frank Schneider for their help with DEEPOBS, and Lukas Balles, Simon Bartels, Filip de Roos, Tim Fischer, Nicolas Krämer, Agustinus Kristiadi, Frank Schneider, Jonathan Wenger, and Matthias Werner for constructive feedback. The authors gratefully acknowledge financial support by the European Research Council through ERC StG Action 757275 / PANAMA, and the DFG Cluster of Excellence “Machine Learning – New Perspectives for Science”.

Reference

- Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek Gordon Murray, Benoit Steiner, Paul A. Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. Tensorflow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation, 2016.
- Shun-ichi Amari. Natural gradient works efficiently in learning. Neural Computation, 10(2), 1998.
- Lukas Balles and Philipp Hennig. Dissecting Adam: The sign, magnitude and variance of stochastic gradients. In Proceedings of the 35th International Conference on Machine Learning, 2018.
- Lukas Balles, Javier Romero, and Philipp Hennig. Coupling adaptive batch sizes with learning rates. In Proceedings of the 33rd Conference on Uncertainty in Artificial Intelligence, 2017.
- Atilim Gunes Baydin, Barak A. Pearlmutter, Alexey Andreyevich Radul, and Jeffrey Mark Siskind. Automatic differentiation in machine learning: A survey. Journal of Machine Learning Research, 18(153), 2018.
- Sue Becker and Yann Le Cun. Improving the convergence of back-propagation learning with second order methods. In Proceedings of the 1988 Connectionist Models Summer School, 1989.
- Antoine Bordes, Léon Bottou, and Patrick Gallinari. SGD-QN: careful quasi-Newton stochastic gradient descent. Journal of Machine Learning Research, 10, 2009.
- Aleksandar Botev, Hippolyt Ritter, and David Barber. Practical Gauss-Newton optimisation for deep learning. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, 2017.
- James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclaurin, and Skye Wanderman-Milne. JAX: Composable transformations of Python+NumPy programs, 2018.
- Tianqi Chen, Mu Li, Yutian Li, Min Lin, Naiyan Wang, Minjie Wang, Tianjun Xiao, Bing Xu, Chiyuan Zhang, and Zheng Zhang. MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems. In 31st Conference on Neural Information Processing Systems, Workshop on Machine Learning Systems, 2015.
- Felix Dangel, Stefan Harmeling, and Philipp Hennig. A modular approach to block-diagonal Hessian approximations for second-order optimization methods. CoRR, abs/1902.01813, 2019.
- Thomas George, César Laurent, Xavier Bouthillier, Nicolas Ballas, and Pascal Vincent. Fast approximate natural gradient descent in a Kronecker-factored eigenbasis. In Advances in Neural Information Processing Systems 31, 2018.
- Ian J. Goodfellow. Efficient per-example gradient computations. CoRR, abs/1510.01799, 2015.
- Roger B. Grosse and James Martens. A Kronecker-factored approximate Fisher matrix for convolution layers. In Proceedings of the 33rd International Conference on Machine Learning, volume 48 of JMLR Workshop and Conference Proceedings, 2016.
- Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, 2016.
- Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8), 1997.
- Michael Innes. Flux: Elegant machine learning with Julia. Journal of Open Source Software, 3(25), 2018a.
- Michael Innes. Don’t unroll adjoint: Differentiating SSA-form programs. CoRR, abs/1810.07951, 2018b.
- Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning, volume 37 of JMLR Workshop and Conference Proceedings, 2015.
- Angelos Katharopoulos and François Fleuret. Not all samples are created equal: Deep learning with importance sampling. In Proceedings of the 35th International Conference on Machine Learning, 2018.
- Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, 2015.
- Frederik Kunstner, Lukas Balles, and Philipp Hennig. Limitations of the empirical Fisher approximation. In Advances in Neural Information Processing Systems 32, 2019.
- Nicolas Le Roux, Pierre-Antoine Manzagol, and Yoshua Bengio. Topmoumoute online natural gradient algorithm. In Advances in Neural Information Processing Systems 20, 2007.
- Maren Mahsereci and Philipp Hennig. Probabilistic line searches for stochastic optimization. Journal of Machine Learning Research, 18, 2017.
- James Martens. New perspectives on the natural gradient method. CoRR, abs/1412.1193, 2014.
- James Martens and Roger B. Grosse. Optimizing neural networks with Kronecker-factored approximate curvature. In Proceedings of the 32nd International Conference on Machine Learning, volume 37 of JMLR Workshop and Conference Proceedings, 2015.
- James Martens, Jimmy Ba, and Matt Johnson. Kronecker-factored curvature approximations for recurrent neural networks. In 6th International Conference on Learning Representations, 2018.
- Eiji Mizutani and Stuart E. Dreyfus. Second-order stagewise backpropagation for Hessian-matrix analyses and investigation of negative curvature. Neural Networks, 21(2-3), 2008.
- Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, 2019.
- Frank Schneider, Lukas Balles, and Philipp Hennig. DeepOBS: A deep learning optimizer benchmark suite. In 7th International Conference on Learning Representations, 2019.
- Nicol N. Schraudolph. Fast curvature matrix-vector products for second-order gradient descent. Neural Computation, 14(7), 2002.
- Jost Tobias Springenberg, Alexey Dosovitskiy, Thomas Brox, and Martin A. Riedmiller. Striving for simplicity: The all convolutional net. In 3rd International Conference on Learning Representations, 2015.
- Seiya Tokui, Kenta Oono, Shohei Hido, and Justin Clayton. Chainer: A next-generation open source framework for deep learning. In 29th Conference on Neural Information Processing Systems, Workshop on Machine Learning Systems, 2015.
- Yohei Tsuji, Kazuki Osawa, Yuichiro Ueno, Akira Naruse, Rio Yokota, and Satoshi Matsuoka. Performance optimizations and analysis of distributed deep learning with approximated second-order optimization method. In 48th International Conference on Parallel Processing, Workshop Proceedings, 2019.
