Adaptive Gradient Quantization for Data-Parallel SGD
NeurIPS 2020
Many communication-efficient variants of SGD use gradient quantization schemes. These schemes are often heuristic and fixed over the course of training. We empirically observe that the statistics of gradients of deep models change during the training. Motivated by this observation, we introduce two adaptive quantization schemes, ALQ and AMQ.
- Many communication-efficient variants of SGD use gradient quantization schemes. These schemes are often heuristic and fixed over the course of training.
- The authors empirically observe that the statistics of gradients of deep models change during the training.
- Motivated by this observation, the authors introduce two adaptive quantization schemes, ALQ and AMQ.
- In both schemes, processors update their compression schemes in parallel by efficiently computing sufficient statistics of a parametric distribution.
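A minimal sketch of how such sufficient statistics could be aggregated across processors. This assumes, purely for illustration, a log-normal model of gradient magnitudes; `local_stats` and `merge_stats` are hypothetical names, not the authors' code:

```python
import numpy as np

def local_stats(grad, eps=1e-12):
    """Sufficient statistics of log|coordinate| on one worker.

    Under an (assumed) log-normal model of gradient magnitudes,
    the triple (count, sum, sum of squares) of log|g| suffices
    to fit the distribution's parameters.
    """
    x = np.log(np.abs(grad) + eps)
    return np.array([x.size, x.sum(), (x ** 2).sum()])

def merge_stats(per_worker):
    """All-reduce step: sum the per-worker statistics,
    then derive the pooled mean and variance."""
    n, s, s2 = np.sum(per_worker, axis=0)
    mean = s / n
    var = s2 / n - mean ** 2
    return mean, var

# Each worker contributes only three scalars, so adapting the
# model adds negligible communication on top of the gradients.
workers = [np.random.randn(1000) * 0.1 for _ in range(4)]
mu, var = merge_stats([local_stats(g) for g in workers])
```

Because each worker contributes a constant number of scalars regardless of model size, adapting the distribution on the fly adds negligible communication overhead.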
- We want distributed optimization methods that match the performance of SGD on a single hypothetical super machine, while paying a negligible communication cost.
- Figure 1: Changes in the average variance of normalized gradient coordinates in a ResNet.
- Many communication-efficient variants of stochastic gradient descent (SGD) use gradient quantization schemes.
- We propose two adaptive methods for quantizing the gradients in data-parallel SGD.
- We improve the validation accuracy by almost 2% on CIFAR-10 and 1% on ImageNet in challenging low-cost communication setups.
- We provide theoretical guarantees for the adaptively quantized SGD (AQSGD) algorithm: variance and code-length bounds, and convergence guarantees for convex, nonconvex, and momentum-based variants of AQSGD.
- To reduce the communication costs of data-parallel SGD, we introduce two adaptive quantization methods, Adaptive Level Quantization (ALQ) and Adaptive Multiplier Quantization (AMQ), which learn and adapt the gradient quantization scheme on the fly.
- We demonstrate the superiority of ALQ and AMQ over nonadaptive methods empirically on deep models and large datasets
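The common building block of both methods is unbiased stochastic quantization onto a set of levels. A minimal sketch, not the authors' implementation (`quantize` and the example level set are illustrative):

```python
import numpy as np

def quantize(v, levels, rng=np.random.default_rng(0)):
    """Unbiased stochastic quantization of v onto a sorted level set.

    Each coordinate is normalized by the max magnitude, then rounded
    to one of its two neighboring levels with probabilities chosen so
    that E[Q(v)] = v.
    """
    levels = np.asarray(levels, dtype=float)   # e.g. [0, ..., 1]
    scale = np.abs(v).max()
    if scale == 0:
        return np.zeros_like(v)
    r = np.abs(v) / scale                      # normalized into [0, 1]
    i = np.clip(np.searchsorted(levels, r, side="right") - 1,
                0, len(levels) - 2)
    lo, hi = levels[i], levels[i + 1]
    p_up = (r - lo) / (hi - lo)                # P(round up) => unbiased
    q = np.where(rng.random(r.shape) < p_up, hi, lo)
    return np.sign(v) * scale * q
```

ALQ adapts the positions of the levels themselves, while AMQ restricts them to an exponential grid controlled by a single multiplier; both plug into this quantization step.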
- Figure: Training loss versus training iteration for (a) ResNet-32 on CIFAR-10, (b) ResNet-110 on CIFAR-10, and ResNet-18 on ImageNet across bucket sizes, comparing SuperSGD, NUQSGD [21, 22], QSGDinf (Qinf), TRN, ALQ, ALQ-N, AMQ, and AMQ-N.
- For bucket size 100 and 3 bits, NUQSGD performs nearly as well as the adaptive methods but quickly loses accuracy as the bucket size grows or shrinks.
- QSGDinf stays competitive for a wider range of bucket sizes but still loses accuracy faster than other methods.
- This highlights the impact of bucketing, an understudied trick in evaluating quantization methods.
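Bucketing can be sketched as follows: the flat gradient is split into fixed-size buckets, each normalized by its own L∞ norm before quantization. The `bucketize` helper below is hypothetical, not the paper's code:

```python
import numpy as np

def bucketize(v, bucket_size):
    """Split a flat gradient into buckets, each normalized by its own
    L-inf norm. Smaller buckets track the local scale of the gradient
    more tightly, but each bucket costs one extra float (its norm)."""
    pad = (-len(v)) % bucket_size               # zero-pad to a multiple
    buckets = np.pad(v, (0, pad)).reshape(-1, bucket_size)
    norms = np.abs(buckets).max(axis=1, keepdims=True)
    normalized = np.divide(buckets, norms, out=np.zeros_like(buckets),
                           where=norms > 0)
    return normalized, norms  # quantize `normalized`, send `norms` raw
```

Since the per-bucket norms are transmitted at full precision, varying the bucket size trades quantization error against communication cost, which is exactly the axis along which the fixed schemes above degrade.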
- The authors improve the validation accuracy by almost 2% on CIFAR-10 and 1% on ImageNet in challenging low-cost communication setups.
- ALQ achieves the best overall performance on ImageNet, and on CIFAR-10 its gap to ALQ-N is less than 0.3%.
- To reduce the communication costs of data-parallel SGD, the authors introduce two adaptive quantization methods, ALQ and AMQ, which learn and adapt the gradient quantization scheme on the fly.
- In both methods, in addition to the quantization scheme itself, processors learn and adapt their coding methods in parallel by efficiently computing sufficient statistics of a parametric distribution.
- The authors establish a number of convergence guarantees for the adaptive methods.
- Table 1: Validation accuracy on CIFAR-10 and ImageNet using 3 bits (except for SuperSGD and TRN) with 4 GPUs.
- Table 2: Validation accuracy of ResNet-32 on CIFAR-10 using 3 quantization bits (except for SuperSGD and TRN) and bucket size 16384.
- Table 3: Training hyper-parameters for CIFAR-10 and ImageNet.
- Table 4: Validation accuracy on a full ImageNet run.
- Table 5: Training ResNet-50 on ImageNet with mini-batch size 512. Time per step for training with 32-bit full precision is 1.2s and with 16-bit full precision is 0.61s.
- Table 6: Training ResNet-18 on ImageNet with mini-batch size 512. Time per step for training with 32-bit full precision is 0.57s and with 16-bit full precision is 0.28s.
- Table 7: Additional overhead of the proposed methods for training ResNet-18 on ImageNet (Table 6). We also show the cost of performing 3 updates relative to the total cost of training for 60 epochs. Full-precision training for 60 epochs takes 95 hours with 32 bits and 46 hours with 16 bits.
- Adaptive quantization has been used for speech communication and storage. In machine learning, several biased and unbiased schemes have been proposed to compress networks and gradients. Recently, lattice-based quantization has been studied for distributed mean estimation and variance reduction. In this work, we focus on unbiased and coordinate-wise schemes to compress gradients.
Alistarh et al. proposed Quantized SGD (QSGD), focusing on the uniform quantization of stochastic gradients normalized to have unit Euclidean norm. Their experiments illustrate that a similar quantization method, in which gradients are normalized to have unit L∞ norm, achieves better performance. We refer to this method as QSGDinf, or Qinf in short. Wen et al. proposed TernGrad, which can be viewed as a special case of QSGDinf with three quantization levels.
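TernGrad's three-level scheme can be sketched as follows (an illustrative re-implementation, not the original code): each coordinate keeps its sign with probability proportional to its magnitude, which keeps the estimate unbiased.

```python
import numpy as np

def terngrad(v, rng=np.random.default_rng(0)):
    """TernGrad-style ternary quantization (sketch only).

    Normalize by the L-inf norm and keep each coordinate's sign
    with probability |v_i| / max|v|, so E[Q(v)] = v and every
    output coordinate lies in {-scale, 0, +scale}.
    """
    scale = np.abs(v).max()
    if scale == 0:
        return np.zeros_like(v)
    keep = rng.random(v.shape) < np.abs(v) / scale
    return scale * np.sign(v) * keep
```

Viewed through the earlier framing, this is uniform quantization with the fixed level set {0, 1} on L∞-normalized magnitudes, which is precisely the kind of fixed scheme the adaptive methods generalize.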
- FF was supported by an OGS Scholarship.
- DA and IM were supported by the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement No 805223 ScaleML).
- DMR was supported by an NSERC Discovery Grant.
- ARK was supported by an NSERC Postdoctoral Fellowship.
- Resources used in preparing this research were provided, in part, by the Province of Ontario, the Government of Canada through CIFAR, and companies sponsoring the Vector Institute.
- M. Zinkevich, M. Weimer, L. Li, and A. J. Smola. Parallelized stochastic gradient descent. In Proc. Advances in Neural Information Processing Systems (NIPS), 2010.
- R. Bekkerman, M. Bilenko, and J. Langford. Scaling up machine learning: Parallel and distributed approaches. Cambridge University Press, 2011.
- B. Recht, C. Ré, S. Wright, and F. Niu. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In Proc. Advances in Neural Information Processing Systems (NIPS), 2011.
- J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, M. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, Q. V. Le, and A. Y. Ng. Large scale distributed deep networks. In Proc. Advances in Neural Information Processing Systems (NIPS), 2012.
- A. Coates, B. Huval, T. Wang, D. Wu, B. Catanzaro, and A. Ng. Deep learning with COTS HPC systems. In Proc. International Conference on Machine Learning (ICML), 2013.
- T. Chilimbi, Y. Suzue, J. Apacible, and K. Kalyanaraman. Project Adam: Building an efficient and scalable deep learning training system. In Proc. USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2014.
- M. Li, D. G. Andersen, J. W. Park, A. J. Smola, A. Ahmed, V. Josifovski, J. Long, E. J. Shekita, and B.-Y. Su. Scaling distributed machine learning with the parameter server. In Proc. USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2014.
- J. C. Duchi, S. Chaturapruek, and C. Ré. Asynchronous stochastic convex optimization. In Proc. Advances in Neural Information Processing Systems (NIPS), 2015.
- E. P. Xing, Q. Ho, W. Dai, J. K. Kim, J. Wei, S. Lee, X. Zheng, P. Xie, A. Kumar, and Y. Yu. Petuum: A new platform for distributed machine learning on big data. IEEE Transactions on Big Data, 1(2):49–67, 2015.
- S. Zhang, A. E. Choromanska, and Y. LeCun. Deep learning with elastic averaging SGD. In Proc. Advances in Neural Information Processing Systems (NIPS), 2015.
- F. Seide, H. Fu, J. Droppo, G. Li, and D. Yu. 1-bit stochastic gradient descent and its application to data-parallel distributed training of speech DNNs. In Proc. INTERSPEECH, 2014.
- S. Gupta, A. Agrawal, K. Gopalakrishnan, and P. Narayanan. Deep learning with limited numerical precision. In Proc. International Conference on Machine Learning (ICML), 2015.
- M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, and M. Devin. Tensorflow: Large-scale machine learning on heterogeneous distributed systems. arXiv:1603.04467, 2016.
- S. Zhou, Y. Wu, Z. Ni, X. Zhou, H. Wen, and Y. Zou. DoReFa-Net: Training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv:1606.06160, 2016.
- W. Wen, C. Xu, F. Yan, C. Wu, Y. Wang, Y. Chen, and H. Li. TernGrad: Ternary gradients to reduce communication in distributed deep learning. In Proc. Advances in Neural Information Processing Systems (NIPS), 2017.
- J. Bernstein, Y.-X. Wang, K. Azizzadenesheli, and A. Anandkumar. signSGD: Compressed optimisation for non-convex problems. In Proc. International Conference on Machine Learning (ICML), 2018.
- S. Bubeck. Convex optimization: Algorithms and complexity. Foundations and Trends® in Machine Learning, 8(3-4):231–358, 2015.
- P. Cummiskey, N. S. Jayant, and J. L. Flanagan. Adaptive quantization in differential PCM coding of speech. Bell System Technical Journal, 52(7):1105–1118, 1973.
- D. Alistarh, S. Ashkboos, and P. Davies. Distributed mean estimation with optimal error bounds. arXiv:2002.09268v2, 2020.
- D. Alistarh, D. Grubic, J. Z. Li, R. Tomioka, and M. Vojnovic. QSGD: Communication-efficient SGD via gradient quantization and encoding. In Proc. Advances in Neural Information Processing Systems (NIPS), 2017.
- A. Ramezani-Kebrya, F. Faghri, and D. M. Roy. NUQSGD: Improved communication efficiency for data-parallel SGD via nonuniform quantization. arXiv:1908.06077v1, 2019.
- S. Horváth, C.-Y Ho, L. Horváth, A. N. Sahu, M. Canini, and P. Richtárik. Natural compression for distributed deep learning. arXiv:1905.10988v1, 2019.
- H. Zhang, J. Li, K. Kara, D. Alistarh, J. Liu, and C. Zhang. ZipML: Training linear models with end-to-end low precision, and a little bit of deep learning. In Proc. International Conference on Machine Learning (ICML), 2017.
- D. Zhang, J. Yang, D. Ye, and G. Hua. LQ-Nets: Learned quantization for highly accurate and compact deep neural networks. In Proc. European Conference on Computer Vision (ECCV), 2018.
- F. Fu, Y. Hu, Y. He, J. Jiang, Y. Shao, C. Zhang, and B. Cui. Don’t waste your bits! squeeze activations and gradients for deep neural networks via TINYSCRIPT. In Proc. International Conference on Machine Learning (ICML), 2020.
- M. H. Protter and C. B. Morrey. Intermediate Calculus. Springer, 1985.
- K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
- A. Krizhevsky. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.
- J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009.
- D. Choi, C. J. Shallue, Z. Nado, J. Lee, C. J. Maddison, and G. E. Dahl. On empirical comparisons of optimizers for deep learning. arXiv:1910.05446, 2019.
- T. M. Cover and J. A. Thomas. Elements of Information Theory. Wiley, 2006.
- S. Ghadimi and G. Lan. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341–2368, 2013.
- T. Yang, Q. Lin, and Z. Li. Unified convergence analysis of stochastic momentum methods for convex and non-convex optimization. arXiv:1604.03257v2, 2016.
- B. T. Polyak. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4(5):1–17, 1964.
- Y. Nesterov. A method of solving a convex programming problem with convergence rate O(1/k^2). Soviet Mathematics Doklady, 27(2):372–376, 1983.