# Improving Neural Network Training in Low Dimensional Random Bases

NeurIPS 2020

Abstract

Stochastic Gradient Descent (SGD) has proven to be remarkably effective in optimizing deep neural networks that employ ever-larger numbers of parameters. Yet, improving the efficiency of large-scale optimization remains a vital and highly active area of research. Recent work has shown that deep neural networks can be optimized in random…

Introduction

- Despite significant growth in the number of parameters used in deep learning networks, Stochastic Gradient Descent (SGD) continues to be remarkably effective at finding minima of the highly overparameterized weight space [9].
- Li et al. [26] utilized random projections to reduce the dimensionality of neural networks, aiming to quantify the difficulty of different tasks.
- They constrained network optimization to a fixed low-dimensional, randomly oriented hyperplane to investigate how many dimensions are needed to reach 90% of the SGD baseline accuracy on a given task (Fixed Projection Descent, FPD).
- The optimization progress is only defined with respect to the particular projection matrix and constrained to the subspace sampled at initialization.
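The fixed-projection scheme can be sketched in a few lines of NumPy. This is an illustrative toy, not the paper's implementation: the quadratic loss, dimensions, and learning rate are placeholder assumptions; only the structure θ = θ₀ + Pz with a fixed random P follows the FPD description above.

```python
import numpy as np

rng = np.random.default_rng(0)
D, d = 100, 5                       # full and subspace dimensionality (toy sizes)

P = rng.standard_normal((D, d))     # random projection, drawn once and then fixed
P /= np.linalg.norm(P, axis=0)      # normalize the basis columns

theta0 = rng.standard_normal(D)     # frozen initialization
target = rng.standard_normal(D)     # optimum of the stand-in quadratic loss

def loss_grad(theta):
    # gradient of the toy loss 0.5 * ||theta - target||^2
    return theta - target

z = np.zeros(d)                     # the only trainable variables (d of them)
for _ in range(200):
    theta = theta0 + P @ z          # weights stay on the random hyperplane
    z -= 0.1 * (P.T @ loss_grad(theta))  # chain rule: grad_z = P^T grad_theta

theta = theta0 + P @ z              # best weights reachable within the subspace
```

Only `z` is ever updated; the full-dimensional gradient is projected down through `P.T`, which is why progress is confined to the subspace sampled at initialization.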

Highlights

- We show that while random subspace projections have computational benefits, such as easy distribution across many workers, they become less efficient with growing projection dimensionality, or if the subspace projection is fixed throughout training
- For CIFAR-10 convolutional neural network (CNN) training, we find that using more random basis directions improves both the correlation of the Random Bases Descent (RBD) gradient with the SGD gradient and the final accuracy achieved
- We find that a switch between low-dimensional RBD and standard SGD is possible at any point during training without divergence; the two update schemes remain compatible throughout (Figure 4.5, further plots in Supplementary Material, Section B.5)
- We find benefits in both achieved accuracy and training time from this compartmentalization scheme
- We introduced an optimization scheme that restricts gradient descent to a few random directions, re-drawn at every step

Methods

- The authors formally propose the method for low-dimensional optimization in random bases.
- SGD optimizes the model using a stochastic gradient g_t^SGD = ∇_θ L(f(x_j; θ_t), y_j), where (x_j, y_j) are randomly drawn samples from the dataset D at timestep t.
- The weights θ_t are adjusted iteratively following the update equation θ_{t+1} := θ_t − η_SGD · g_t^SGD with learning rate η_SGD > 0.
- In the commonly used mini-batch version of SGD, the update gradient is formed as the average g_t^{SGD,B} = (1/B) Σ_{b=1}^{B} g_t^{SGD,(b)} over a mini-batch sample of size B.
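The mini-batch update rule can be made concrete with a minimal NumPy sketch. A toy least-squares model stands in for the network f; the batch, learning rate, and step count are illustrative assumptions, not the paper's settings:

```python
import numpy as np

def sgd_step(theta, grad_fn, batch, lr):
    """One mini-batch SGD step: average the per-sample gradients, then update."""
    grads = np.stack([grad_fn(theta, x, y) for x, y in batch])  # g_t^(b), b = 1..B
    g = grads.mean(axis=0)             # g_t = (1/B) * sum_b g_t^(b)
    return theta - lr * g              # theta_{t+1} = theta_t - eta * g_t

# Toy model: f(x) = theta . x with loss 0.5 * (f(x) - y)^2, so grad = (f(x) - y) * x
grad_fn = lambda theta, x, y: (theta @ x - y) * x

theta = np.zeros(2)
batch = [(np.array([1.0, 0.0]), 1.0),   # (x, y) pairs; B = 2
         (np.array([0.0, 1.0]), -1.0)]
for _ in range(100):
    theta = sgd_step(theta, grad_fn, batch, lr=0.5)
# theta converges to [1, -1], the least-squares solution of the toy problem
```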

Results

- The architectures the authors use in the experiments are: a fully-connected (FC) network, a convolutional neural network (CNN), and a residual network (ResNet) [19].
- Since random vectors of the full network size have to be generated at every training step, the authors choose architectures with a moderate (≈ 10⁵) number of parameters for the initial evaluation.
- All networks use ReLU nonlinearities and are trained with a softmax cross-entropy loss on the image classification tasks MNIST, Fashion-MNIST (FMNIST), and CIFAR-10.

Conclusion

- The authors introduced an optimization scheme that restricts gradient descent to a few random directions, re-drawn at every step.
- This provides further evidence that viable solutions in the neural network loss landscape can be found even if only a small fraction of directions in the weight space is explored.
- It is likely that nonlinear subspace constructions could increase the expressiveness of the random approximation to enable more effective descent.
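The scheme's core idea, descending only along a handful of random directions that are re-drawn every step, can be sketched in a toy NumPy loop. A simple quadratic substitutes for the network loss; the dimensions, learning rate, and step count are illustrative assumptions, not the paper's settings:

```python
import numpy as np

rng = np.random.default_rng(42)
D, d = 200, 10                       # parameter count and per-step basis size (toy)
target = rng.standard_normal(D)      # optimum of the stand-in quadratic loss

def loss(theta):
    return 0.5 * np.sum((theta - target) ** 2)

def full_grad(theta):
    return theta - target            # stands in for the backprop gradient

theta = np.zeros(D)
for _ in range(2000):
    # Re-draw a fresh random basis at every step -- the key difference from a
    # fixed projection, which would confine training to one d-dim hyperplane.
    Phi = rng.standard_normal((D, d)) / np.sqrt(D)
    c = Phi.T @ full_grad(theta)     # d directional derivatives
    theta -= Phi @ c                 # step only within the sampled subspace

final_loss = loss(theta)
```

Because a new subspace is sampled each iteration, the accessible directions change over time and the toy loss is driven to its minimum, whereas a single fixed subspace could only ever reach the best point on its hyperplane.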

Tables

- Table 1: Validation accuracy after 100 epochs of random subspace training for dimensionality d = 250 compared with the unrestricted SGD baseline (mean ± standard deviation of 3 independent runs using data augmentation). To ease comparison with [26], who reported relative accuracies, the achieved accuracy is additionally denoted as a fraction of the SGD baseline accuracy in parentheses. Re-drawing the random subspace at every step (RBD) leads to better convergence than taking steps in a fixed randomly projected space of the same dimensionality (FPD). While training in the 400× smaller subspace can almost match full-dimensional SGD on MNIST, it only reaches 78% of the SGD baseline on the harder CIFAR-10 classification task. Black-box optimization using evolution strategies at the same dimensionality leads to far inferior optimization outcomes (NES). While NES's performance could be improved significantly with more samples (i.e., higher d), the discrepancy demonstrates an advantage of gradient-based subspace optimization in low dimensions.
- Table 2: Validation accuracy after 100 epochs of training with different directional distributions: Uniform in the range [−1, 1], unit Gaussian, and zero-mean Bernoulli with probability p = 0.5 (denoted Bernoulli-0.5). Compared to the Gaussian baseline, the optimization suffers under the Uniform and Bernoulli distributions, whose sampled directions concentrate in smaller fractions of the high-dimensional space.
- Table 3: Accuracies and correlation with the full-dimensional SGD gradient for CIFAR-10 ResNet-8 training with data augmentation under varying numbers of trainable parameters.

Key findings

- By changing the basis at each step, RBD decreases the FPD-SGD gap by up to 20%.
- RBD reaches 84% of the SGD baseline with a 10× reduction in trainable parameters and outperforms FPD at all compression factors; even at 75× reduction, its relative improvement is over 11%.
Study subjects and analysis

- Distributed training across 16 workers: in Figure 5, the authors investigate whether training performance is affected when distributing RBD. They observe constant training performance and almost linear wall-clock scaling for 16 workers.
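One plausible reading of why RBD distributes cheaply (a hypothetical sketch, not the paper's implementation): if every worker re-generates the step's random basis from a shared seed, only the d low-dimensional coefficients ever need to be communicated, never a full D-dimensional gradient. All names, sizes, and the toy loss below are illustrative assumptions.

```python
import numpy as np

D, d, workers = 100, 4, 16           # toy sizes; 16 workers as in the experiment

def basis(seed):
    # Every worker can rebuild the identical basis from the shared seed.
    return np.random.default_rng(seed).standard_normal((D, d)) / np.sqrt(D)

def worker_coeffs(seed, theta, grad_fn):
    # A worker returns only d scalars instead of a D-dimensional gradient.
    return basis(seed).T @ grad_fn(theta)

grad_fn = lambda theta: theta        # toy quadratic loss 0.5 * ||theta||^2
theta = np.ones(D)

for step in range(300):
    seed = 1234 + step               # shared per-step seed
    # "All-reduce": average the d coefficients from all workers. Here they are
    # identical because grad_fn is deterministic; with sharded data they differ.
    coeffs = np.mean([worker_coeffs(seed, theta, grad_fn)
                      for _ in range(workers)], axis=0)
    theta -= basis(seed) @ coeffs    # every worker applies the same update
```

The communication volume per step is d scalars per worker rather than D, which is where the near-linear wall-clock scaling plausibly comes from.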

Reference

- Martín Abadi et al. “Tensorflow: A System for Large-Scale Machine Learning”. In: 12th Symposium on Operating Systems Design and Implementation (OSDI 16). 2016, pp. 265–283.
- Dan Alistarh et al. “QSGD: Communication-Efficient SGD via Gradient Quantization and Encoding”. In: arXiv:1610.02132 [cs] (Oct. 2016). arXiv: 1610.02132 [cs].
- David Barber. Evolutionary Optimization as a Variational Method. 2017. URL: https://davidbarber.github.io/blog/2017/04/03/variational-optimisation/ (visited on 09/27/2018).
- Albert S. Berahas et al. “A Theoretical and Empirical Comparison of Gradient Approximations in Derivative-Free Optimization”. In: arXiv preprint arXiv:1905.01332 (2019).
- Jeremy Bernstein et al. “signSGD: Compressed Optimisation for Non-Convex Problems”. In: arXiv preprint arXiv:1802.04434 [cs] (2018). arXiv: 1802.04434 [cs].
- Tom B Brown et al. “Language Models are Few-Shot Learners”. In: arXiv preprint arXiv:2005.14165 [cs] (2020). arXiv: 2005.14165 [cs].
- Krzysztof Choromanski et al. “Structured Evolution with Compact Architectures for Scalable Policy Optimization”. In: arXiv:1804.02395 [cs, stat] (Apr. 2018). arXiv: 1804.02395 [cs, stat].
- Sanjoy Dasgupta and Anupam Gupta. “An Elementary Proof of a Theorem of Johnson and Lindenstrauss”. In: Random Structures & Algorithms 22.1 (2003), pp. 60–65.
- Misha Denil et al. “Predicting Parameters in Deep Learning”. In: Advances in Neural Information Processing Systems 26. Ed. by C. J. C. Burges et al. Curran Associates, Inc., 2013, pp. 2148–2156.
- Stanislav Fort and Surya Ganguli. “Emergent Properties of the Local Geometry of Neural Loss Landscapes”. In: arXiv:1910.05929 [cs, stat] (Oct. 2019). arXiv: 1910.05929 [cs, stat].
- Stanislav Fort and Adam Scherlis. “The Goldilocks Zone: Towards Better Understanding of Neural Network Loss Landscapes”. In: Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 33. 2019, pp. 3574–3581.
- Eva García-Martín et al. “Estimation of Energy Consumption in Machine Learning”. In: Journal of Parallel and Distributed Computing 134 (2019), pp. 75–88.
- Alexander N. Gorban et al. “Approximation with Random Bases: Pro et Contra”. In: Information Sciences 364 (2016), pp. 129–145.
- Guy Gur-Ari, Daniel A. Roberts, and Ethan Dyer. “Gradient Descent Happens in a Tiny Subspace”. In: arXiv:1812.04754 [cs, stat] (Dec. 2018). arXiv: 1812.04754 [cs, stat].
- Song Han, Huizi Mao, and William J. Dally. “Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding”. In: arXiv preprint arXiv:1510.00149 (2015).
- Nikolaus Hansen. “The CMA Evolution Strategy: A Tutorial”. In: arXiv:1604.00772 [cs, stat] (Apr. 2016). arXiv: 1604.00772 [cs, stat].
- Babak Hassibi and David G. Stork. “Second Order Derivatives for Network Pruning: Optimal Brain Surgeon”. In: Advances in Neural Information Processing Systems. 1993, pp. 164–171.
- Haowei He, Gao Huang, and Yang Yuan. “Asymmetric Valleys: Beyond Sharp and Flat Local Minima”. In: arXiv:1902.00744 [cs, stat] (Apr. 2019). arXiv: 1902.00744 [cs, stat].
- Kaiming He et al. “Deep Residual Learning for Image Recognition”. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016, pp. 770–778.
- Boris Igelnik and Yoh-Han Pao. “Stochastic Choice of Basis Functions in Adaptive Function Approximation and the Functional-Link Net”. In: IEEE Transactions on Neural Networks 6.6 (1995), pp. 1320–1329.
- Max Jaderberg, Andrea Vedaldi, and Andrew Zisserman. “Speeding up Convolutional Neural Networks with Low Rank Expansions”. In: arXiv preprint arXiv:1405.3866 [cs] (2014). arXiv: 1405.3866 [cs].
- Zhe Jia et al. “Dissecting the Graphcore IPU Architecture via Microbenchmarking”. In: arXiv:1912.03413 [cs] (Dec. 2019). arXiv: 1912.03413 [cs].
- David Kozak et al. “A Stochastic Subspace Approach to Gradient-Free Optimization in High Dimensions”. In: arXiv preprint arXiv:2003.02684 (2020).
- Quoc Le, Tamás Sarlós, and Alex Smola. “Fastfood-Approximating Kernel Expansions in Loglinear Time”. In: Proceedings of the International Conference on Machine Learning. Vol. 85. 2013.
- Karel Lenc et al. “Non-Differentiable Supervised Learning with Evolution Strategies and Hybrid Methods”. In: arXiv:1906.03139 [cs, stat] (June 2019). arXiv: 1906.03139 [cs, stat].
- Chunyuan Li et al. “Measuring the Intrinsic Dimension of Objective Landscapes”. In: International Conference on Learning Representations. 2018.
- Hao Li et al. “Visualizing the Loss Landscape of Neural Nets”. In: Advances in Neural Information Processing Systems 31. Ed. by S. Bengio et al. Curran Associates, Inc., 2018, pp. 6389–6399.
- Ping Li, Trevor J. Hastie, and Kenneth W. Church. “Very Sparse Random Projections”. In: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2006, pp. 287–296.
- Yujun Lin et al. “Deep Gradient Compression: Reducing the Communication Bandwidth for Distributed Training”. In: arXiv preprint arXiv:1712.01887 [cs] (2017). arXiv: 1712.01887 [cs].
- Yurii Nesterov and Vladimir Spokoiny. Random Gradient-Free Minimization of Convex Functions. Tech. rep. Université catholique de Louvain, Center for Operations Research and Econometrics (CORE), 2011.
- Ali Rahimi and Benjamin Recht. “Uniform Approximation of Functions with Random Bases”. In: 2008 46th Annual Allerton Conference on Communication, Control, and Computing. IEEE, 2008, pp. 555–561.
- Tara N. Sainath et al. “Low-Rank Matrix Factorization for Deep Neural Network Training with High-Dimensional Output Targets”. In: 2013 IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 2013, pp. 6655–6659.
- Tim Salimans et al. “Evolution Strategies as a Scalable Alternative to Reinforcement Learning”. In: arXiv:1703.03864 [cs, stat] (Mar. 2017). arXiv: 1703.03864 [cs, stat].
- Nikko Strom. “Scalable Distributed DNN Training Using Commodity GPU Cloud Computing”. In: Sixteenth Annual Conference of the International Speech Communication Association. 2015.
- Emma Strubell, Ananya Ganesh, and Andrew McCallum. “Energy and policy considerations for deep learning in NLP”. In: arXiv preprint arXiv:1906.02243 (2019).
- Thijs Vogels, Sai Praneeth Karimireddy, and Martin Jaggi. “PowerSGD: Practical Low-Rank Gradient Compression for Distributed Optimization”. In: Advances in Neural Information Processing Systems 32. Ed. by H. Wallach et al. Curran Associates, Inc., 2019, pp. 14236–14245.
- Daan Wierstra et al. “Natural Evolution Strategies.” In: Journal of Machine Learning Research 15.1 (2014), pp. 949–980.
- Xingwen Zhang, Jeff Clune, and Kenneth O. Stanley. “On the Relationship Between the OpenAI Evolution Strategy and Stochastic Gradient Descent”. In: arXiv:1712.06564 [cs] (Dec. 2017). arXiv: 1712.06564 [cs].
