
Asynchronous Optimization Methods for Efficient Training of Deep Neural Networks with Guarantees

AAAI, pp. 8209–8216, 2021



Asynchronous distributed algorithms are a popular way to reduce synchronization costs in large-scale optimization, and in particular for neural network training. However, for nonsmooth and nonconvex objectives, few convergence guarantees exist beyond cases where closed-form proximal operator solutions are available. As training most popul…



  • Training deep neural networks (DNNs) is a difficult problem in several respects (Goodfellow et al 2016).
  • Due to multiple layers of nonlinear activation functions, the resulting optimization problems are nonconvex.
  • ReLU activation functions and max-pooling in convolutional networks induce nonsmoothness, i.e., the objective is not differentiable everywhere.
  • In applications it is often unreasonable to store entire data sets in memory in order to compute the objective or subgradients.
  • Machine learning applications, including training deep neural networks, have motivated a range of optimization algorithms.
  • Consider three variations of stochastic subgradient methods discussed in the introduction: 1) standard sequential stochastic gradient descent (SGD) with momentum, 2) Partitioned ASSM (PASSM), where each core updates only a block subset i of x and is lock-free, and 3) Asynchronous Stochastic Subgradient Method (ASSM), which defines the standard parallel asynchronous implementation in which every core updates the entire vector x, taking a lock to ensure consistency
  • SGD asymptotically converges to stationary points for general nonconvex nonsmooth objectives; this corresponds to the case of no delays and i = [n], and the theorem matches the state of the art
  • The theorem points to their comparative advantages and disadvantages: 1) for Theorem 3.1 to apply to ASSM, write locks are necessary, limiting the potential time-to-epoch speedup; 2) when i is the entire vector, the limit point of ASSM is a stationary point, i.e., a point at which zero is in the Clarke subdifferential of the objective, whereas in the case of PASSM the limit point is only coordinate-wise stationary: zero is only in the i-component subdifferential of f
  • We proved the first asymptotic convergence results for asynchronous parallel stochastic subgradient descent methods with momentum, for general nonconvex nonsmooth objectives
  • We showed that a variant of our method can be efficiently implemented on GPUs, demonstrating speed-up versus state-of-the-art methods, without losing generalization accuracy
  • The authors' implementations of image classification tasks are based on Pytorch 1.5 (Paszke et al 2017) and Python multi-processing.
  • Having generated a computation graph during the forward pass, the authors can specify the leaf tensors with respect to which subgradients should be computed in a call to torch.autograd.grad()
  • The authors use this functionality in order to implement “restricted” backpropagation in PASSM.
  • For bigger networks and/or datasets, the on-device memory and compute resources of a single GPU become insufficient
  • For those cases the authors used a multi-GPU machine – referred to as S2 below – with four Nvidia GeForce RTX 2080 Ti GPUs and two Intel(R) Xeon(R) CPU E5-2640 v4 @ 2.40GHz CPUs, totaling 40 logical CPU threads.
  • Along with warm-up, having periodic ASSM iterations spread across roughly 50% of all iterations is sufficient to match the baseline accuracy without running for additional time.
  • Bounded invariant sets of the differential inclusion correspond to points x such that 0 ∈ ∂_i f(x).
  • PASSM, relative to ASSM, can exhibit better speedup, allowing better use of the hardware, but may converge to a weaker notion of stationarity and, in practice, a higher value of the objective.
  • This is the first such characterization covering DNNs, matching theory to common practice, closing an important gap in the literature.
  • The experimental results provided a thorough exploration of both the potential and limitations of speedup for a comprehensive set of variants of shared memory asynchronous multi-processing
  • Table 1: Resnet20 training for 300 epochs on a Nvidia GeForce RTX 2080 Ti, with a batch size of 128. For large-batch training, we follow (Goyal et al 2017). Asynchronous training uses 4 concurrent processes. Standard hyperparameter values (He et al 2016) were applied
  • Table 2: Resnet20 with 272,474 parameters contained in 65 trainable tensors, training over CIFAR-10 for 300 epochs on the setting S1. Momentum and weight-decay are identical across the methods. Schedulers – MS: Multi-step, Cos: Cosine
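The PASSM/ASSM contrast summarized above can be illustrated with a minimal sketch in plain Python: each "worker" owns a disjoint block of coordinates and applies a momentum subgradient step only to its block. The separable nonsmooth objective f(x) = Σ_j (|x_j| + x_j²/2), the block partition, and the step sizes below are illustrative assumptions, not the paper's experimental setup.

```python
# PASSM-style sketch: each worker updates only its block of coordinates,
# using a stochastic-subgradient step with momentum. The objective,
# partition, and hyperparameters are illustrative assumptions.

def subgrad(x, i):
    # Subgradient of f(x) = sum_j (|x_j| + 0.5 * x_j^2) at coordinate i:
    # x_i plus an element of the subdifferential of |x_i|.
    return x[i] + (1.0 if x[i] > 0 else -1.0 if x[i] < 0 else 0.0)

def passm_block_step(x, v, block, lr=0.05, beta=0.9):
    """Update only the coordinates in `block`; all others stay untouched."""
    for i in block:
        v[i] = beta * v[i] + subgrad(x, i)
        x[i] -= lr * v[i]

x = [2.0, -3.0, 4.0, -5.0]
v = [0.0] * 4
blocks = [[0, 1], [2, 3]]      # two "workers" with disjoint coordinate blocks
for _ in range(400):
    for b in blocks:           # sequential stand-in for concurrent workers
        passm_block_step(x, v, b)
print(x)  # iterates oscillate around and shrink toward the minimizer at 0
```

The key structural point is that a block step leaves all other coordinates unchanged, which is exactly why PASSM's limit points are only coordinate-wise stationary, while a full-vector (ASSM-style) step would touch every coordinate.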
Related work
  • HogWild! (HW!) (Recht et al 2011) has become the classic reference for shared-memory based asynchronous SGD. However, despite significant interest in asynchronous methods, neither HW! nor similar methods have theoretical guarantees in the general stochastic nonsmooth nonconvex setting considered here. The convergence of HW! under assumptions of Lipschitz smoothness and nonconvexity was derived in, e.g., (Nadiradze et al 2020). We note that HW! is well known to scale on convex optimization tasks, e.g. (Recht et al 2011); however, its performance for large-scale CNN training has not been thoroughly studied.

    The basic structure of HW! was extended by HogWild++ (Zhang, Hsieh, and Akella 2016), which focused on multi-CPU (multi-socket) machines with non-uniform memory access (NUMA). HogWild++ showed a limited throughput advantage over HW! for convex regression problems, but does not provide any convergence analysis. Similarly, Buckwild! (Sa et al 2015) proposed speeding up HW! by using restricted bit-precision updates of the model, and has an adapted convergence analysis for convex Lipschitz-smooth models, or restricted nonconvex smooth objectives.
  • Support for Vyacheslav Kungurtsev was provided by the OP VVV project CZ.02.1.01/0.0/0.0/16 019/0000765 “Research Center for Informatics.” Bapi Chatterjee was supported by the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No 754411 (ISTPlus)
  • Dan Alistarh has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement No 805223 ScaleML)
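The HogWild! access pattern discussed in the related work — several workers updating a shared model without taking locks — can be sketched with Python threads. The least-squares objective, toy dataset, and hyperparameters below are illustrative assumptions; note also that Python's GIL serializes bytecode execution, so this sketch only demonstrates the lock-free update pattern, not true hardware-level concurrency.

```python
import random
import threading

# HogWild!-style sketch: workers apply SGD updates to a shared parameter
# vector with no locks. Dataset, step size, and thread count are illustrative.
random.seed(0)
DIM, N = 4, 64
W_TRUE = [1.0, -2.0, 3.0, 0.5]
DATA = []
for _ in range(N):
    a = [random.uniform(-1, 1) for _ in range(DIM)]
    b = sum(wt * ai for wt, ai in zip(W_TRUE, a))  # noiseless linear targets
    DATA.append((a, b))

w = [0.0] * DIM  # shared model, read and written lock-free by all workers

def worker(steps=500, lr=0.05):
    for _ in range(steps):
        a, b = random.choice(DATA)
        err = sum(wi * ai for wi, ai in zip(w, a)) - b  # possibly stale read
        for j in range(DIM):
            w[j] -= lr * 2 * err * a[j]                 # in-place, no lock

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

loss = sum((sum(wi * ai for wi, ai in zip(w, a)) - b) ** 2
           for a, b in DATA) / N
print([round(wi, 2) for wi in w], round(loss, 4))
```

Despite reads of stale parameters and potentially lost updates, the shared model still fits the realizable linear data well — the behavior the HW! analysis formalizes under smoothness assumptions.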
Study subjects and analysis
training samples: 73257
In the multi-GPU setting S2, we train two larger architectures: (a) DN: DenseNet121 (Huang, Liu, and Weinberger 2017) with 6,956,298 parameters in 362 trainable tensors, and (b) RN: ResNext50 (Xie et al 2016) with 14,788,772 parameters in 161 trainable tensors, over the datasets CIFAR-10/CIFAR-100 (C10/C100). The initial LR, weight decay, and momentum are identical across methods. Here SGD is distributed over 4 GPUs via the state-of-the-art DistributedDataParallel framework in Pytorch, while PASSM+ spawns 4 concurrent processes running over individual GPUs. SGD is a large-batch implementation, which computes subgradients at BS=128 and updates the model at BS=512 on aggregation. Compared to the SOTA implementation, PASSM+ provides an average speed-up of 1.4x while improving validation accuracy. We explain the speed-up in terms of (a) reduced flops during backpropagation, (b) reduced communication cost for partitioned subgradients across GPUs, and (c) reduced synchronization cost.

This set of results covers two contrasting cases: (a) in the setting S1, ResNet32 with 466,906 parameters in 101 tensors is trained over SVHN (Netzer et al 2011) images of small cropped digits, with 73,257 training samples and 26,032 test samples, and (b) on a machine with 4 Nvidia GeForce GTX 1080 Ti GPUs and otherwise the same system specifications as S2, ResNet18 with 11,181,642 parameters in 62 tensors is trained over the ImageNet dataset (Russakovsky et al 2015), whose training set contains 1.3 million images of 1000 classes, with 50,000 test samples. Both tasks are trained for 90 epochs. SGD follows a multi-step LR scheme with initial LR warm-up for 5 epochs, whereas PASSM+ follows the cosine LR rule. We observe that in case (a) PASSM+ provides up to 1.32x speed-up compared to the baseline, whereas for the ImageNet training task the speed-up is around 1.08x. The reduced speed-up in case (b) can be explained by the high resource requirement, which increases contention and creates a bottleneck on the shared GPU.
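The large-batch scheme described above (subgradients computed at BS=128, model updated at BS=512 on aggregation) is gradient accumulation: average several micro-batch gradients, then apply a single optimizer step. A minimal sketch with a toy one-dimensional least-squares objective; the objective, targets, and batch sizes are illustrative assumptions:

```python
# Gradient accumulation sketch: compute gradients on micro-batches and apply
# one update once enough micro-batches are aggregated (here 4 x 128 -> 512).
# The objective and numbers are illustrative, not the paper's setup.

def micro_grad(w, batch):
    # Gradient of f(w) = mean (w - y)^2 over one micro-batch.
    return sum(2 * (w - y) for y in batch) / len(batch)

def accumulated_step(w, micro_batches, lr=0.1):
    # Average the micro-batch gradients, then take a single optimizer step.
    g = sum(micro_grad(w, mb) for mb in micro_batches) / len(micro_batches)
    return w - lr * g

data = [float(i % 7) for i in range(512)]             # toy scalar targets
micro_batches = [data[k:k + 128] for k in range(0, 512, 128)]

w = 0.0
for _ in range(50):
    w = accumulated_step(w, micro_batches)
print(round(w, 3))  # prints 2.994, the mean of the targets
```

Because the micro-batches are equal-sized, the averaged gradient equals the full-batch gradient, so the update is exactly the BS=512 step while each backward pass only touches 128 samples' worth of work.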

  • Bagirov, A.; Karmitsa, N.; and Makela, M. M. 2014. Introduction to Nonsmooth Optimization: theory, practice and software. Springer.
  • Borkar, V. S. 2009. Stochastic approximation: a dynamical systems viewpoint. Springer.
  • Bottou, L.; Curtis, F.; and Nocedal, J. 2018. Optimization methods for large-scale machine learning. SIAM Review 60(2): 223–311.
  • Cannelli, L.; Facchinei, F.; Kungurtsev, V.; and Scutari, G. 2019. Asynchronous parallel algorithms for nonconvex optimization. Mathematical Programming 1–34.
  • Davis, D.; Drusvyatskiy, D.; Kakade, S.; and Lee, J. D. 2018. Stochastic subgradient method converges on tame functions. arXiv preprint arXiv:1804.07795.
  • Dupuis, P.; and Kushner, H. J. 1989. Stochastic approximation and large deviations: Upper bounds and wp 1 convergence. SIAM Journal on Control and Optimization 27(5): 1108–1135.
  • Ermol’ev, Y. M.; and Norkin, V. 1998. Stochastic generalized gradient method for nonconvex nonsmooth stochastic optimization. Cybernetics and Systems Analysis 34(2): 196– 215.
  • Goodfellow, I.; Bengio, Y.; Courville, A.; and Bengio, Y. 2016. Deep learning, volume 1. MIT press Cambridge.
  • Goyal, P.; Dollar, P.; Girshick, R.; Noordhuis, P.; Wesolowski, L.; Kyrola, A.; Tulloch, A.; Jia, Y.; and He, K. 2017. Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour. arXiv preprint arXiv:1706.02677.
  • Harlap, A.; Narayanan, D.; Phanishayee, A.; Seshadri, V.; Devanur, N. R.; Ganger, G. R.; and Gibbons, P. B. 2018. PipeDream: Fast and Efficient Pipeline Parallel DNN Training. CoRR abs/1806.03377.
  • He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, 770–778.
  • Huang, G.; Liu, Z.; and Weinberger, K. Q. 2017. Densely Connected Convolutional Networks. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2261– 2269.
  • Kushner, H.; and Yin, G. G. 2003. Stochastic approximation and recursive algorithms and applications, volume 35. Springer Science & Business Media.
  • Li, Z.; and Li, J. 2018. A Simple Proximal Stochastic Gradient Method for Nonsmooth Nonconvex Optimization. In Advances in Neural Information Processing Systems, 5564– 5574.
  • Lian, X.; Huang, Y.; Li, Y.; and Liu, J. 2015. Asynchronous parallel stochastic gradient for nonconvex optimization. In Advances in Neural Information Processing Systems, 2737–2745.
  • Liu, J.; and Wright, S. J. 2015. Asynchronous stochastic coordinate descent: Parallelism and convergence properties. SIAM Journal on Optimization 25(1): 351–376.
  • Majewski, S.; Miasojedow, B.; and Moulines, E. 2018. Analysis of nonsmooth stochastic approximation: the differential inclusion approach. arXiv preprint arXiv:1805.01916.
  • Mitliagkas, I.; Zhang, C.; Hadjis, S.; and Re, C. 2016. Asynchrony begets momentum, with an application to deep learning. In 2016 54th Annual Allerton Conference on Communication, Control, and Computing (Allerton), 997–1004.
  • Nadiradze, G.; Markov, I.; Chatterjee, B.; Kungurtsev, V.; and Alistarh, D. 2020. Elastic Consistency: A General Consistency Model for Distributed Stochastic Gradient Descent. ArXiv abs/2001.05918.
  • Netzer, Y.; Wang, T.; Coates, A.; Bissacco, A.; Wu, B.; and Ng, A. 2011. Reading Digits in Natural Images with Unsupervised Feature Learning.
  • Nvidia. 2020.
  • Paszke, A.; Gross, S.; Chintala, S.; Chanan, G.; Yang, E.; DeVito, Z.; Lin, Z.; Desmaison, A.; Antiga, L.; and Lerer, A. 2017. Automatic differentiation in PyTorch.
  • Pontes, F. J.; da F. de Amorim, G.; Balestrassi, P.; Paiva, A. P.; and Ferreira, J. R. 2016. Design of experiments and focused grid search for neural network parameter optimization. Neurocomputing 186: 22–34.
  • Recht, B.; Re, C.; Wright, S.; and Niu, F. 2011. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In Advances in neural information processing systems, 693–701.
  • Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M.; Berg, A.; and Fei-Fei, L. 2015. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision 115: 211–252.
  • Ruszczynski, A. 1987. A linearization method for nonsmooth stochastic programming problems. Mathematics of Operations Research 12(1): 32–49.
  • Sa, C. D.; Zhang, C.; Olukotun, K.; and Re, C. 2015. Taming the Wild: A Unified Analysis of Hogwild-Style Algorithms. Advances in neural information processing systems 28: 2656–2664.
  • Shamir, O. 2020. Can We Find Near-ApproximatelyStationary Points of Nonsmooth Nonconvex Functions? ArXiv abs/2002.11962.
  • Sun, T.; Hannah, R.; and Yin, W. 2017. Asynchronous Coordinate Descent under More Realistic Assumptions. In NIPS.
  • Xie, S.; Girshick, R.; Dollar, P.; Tu, Z.; and He, K. 2016. Aggregated Residual Transformations for Deep Neural Networks. arXiv preprint arXiv:1611.05431.
  • Zhang, H.; Hsieh, C.-J.; and Akella, V. 2016. HogWild++: A New Mechanism for Decentralized Asynchronous Stochastic Gradient Descent. 2016 IEEE 16th International Conference on Data Mining (ICDM) 629–638.
  • Zhang, J.; Lin, H.; Sra, S.; and Jadbabaie, A. 2020. On Complexity of Finding Stationary Points of Nonsmooth Nonconvex Functions. ArXiv abs/2002.04130.
  • Zhang, J.; Mitliagkas, I.; and Re, C. 2017. YellowFin and the Art of Momentum Tuning. CoRR abs/1706.03471.
  • Zhang, X.; Zhou, X.; Lin, M.; and Sun, J. 2018. ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices. In CVPR, 6848–6856.
  • Zhu, R.; Niu, D.; and Li, Z. 2018. Asynchronous Stochastic Proximal Methods for Nonconvex Nonsmooth Optimization. arXiv preprint arXiv:1802.08880.
Vyacheslav Kungurtsev
Bapi Chatterjee