Unifying Data, Model and Hybrid Parallelism in Deep Learning via Tensor Tiling
arXiv preprint arXiv:1805.04170 [cs.DC], 2018.
Abstract:
Deep learning systems have become vital tools across many fields, but the increasing model sizes mean that training must be accelerated to maintain such systems' utility. Current systems like TensorFlow and MXNet focus on one specific parallelization strategy, data parallelism, which requires large training batch sizes in order to scale…
Introduction
- Deep neural networks (DNNs) have delivered tremendous improvements across many machine learning tasks, ranging from computer vision [33, 40] and speech recognition [18, 26] to natural language processing [15, 30].
- The popularity of DNNs has ushered in the development of several machine learning systems focused on DNNs [2, 4, 13, 22]
- These systems allow users to program a DNN model with ease in an array-language frontend, and each enables training on a GPU for performance.
- Training is performed by iterating over the training data many times (i.e., “epochs”), doing SGD in mini-batches, as illustrated by the sketch below
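The paper's original pseudocode is not reproduced on this page. As a stand-in, here is a minimal, self-contained sketch of such a mini-batch SGD loop, using a toy linear model on synthetic data (purely illustrative; not the paper's listing):

```python
import numpy as np

# Minimal sketch of a mini-batch SGD training loop (illustrative only;
# not the paper's pseudocode). A one-layer linear model on synthetic data.
rng = np.random.default_rng(0)
X = rng.normal(size=(1024, 32))                       # training inputs
true_w = rng.normal(size=(32, 1))
y = X @ true_w + 0.01 * rng.normal(size=(1024, 1))    # training targets

w = np.zeros((32, 1))                                 # model weights
lr, batch_size, num_epochs = 0.1, 64, 5

for epoch in range(num_epochs):                       # one pass over the data = one epoch
    perm = rng.permutation(len(X))                    # reshuffle each epoch
    for start in range(0, len(X), batch_size):        # iterate in mini-batches
        idx = perm[start:start + batch_size]
        xb, yb = X[idx], y[idx]
        grad = xb.T @ (xb @ w - yb) / len(xb)         # gradient of the mean squared error
        w -= lr * grad                                # SGD update
    print(f"epoch {epoch}: loss {np.mean((X @ w - y) ** 2):.4f}")
```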
Highlights
- Deep neural networks (DNNs) have delivered tremendous improvements across many machine learning tasks, ranging from computer vision [33, 40] and speech recognition [18, 26] to natural language processing [15, 30]
- The popularity of DNNs has ushered in the development of several machine learning systems focused on DNNs [2, 4, 13, 22]
- Compounding the problem, users must completely retrain the model after any change in the training parameters while fine-tuning a DNN
- We evaluated SOYBEAN on an 8-GPU machine and compared its performance against data parallelism under varying configurations
- Deep learning systems have become indispensable in many fields, but as their complexity grows, DNN training time is becoming increasingly intractable
- We demonstrate effective tiling in SOYBEAN that can achieve performance as good as or better than the best of data parallelism and model parallelism for many types of DNNs on multiple GPUs (the sketch below illustrates the two baseline strategies)
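As background for the comparison above, the two baseline strategies tile the same computation in different ways. The following numpy sketch (an illustration of the general idea, not SOYBEAN code) shows that both partitionings of one fully connected layer Y = X · W reproduce the unpartitioned result:

```python
import numpy as np

# Illustration (not SOYBEAN code): two ways to parallelize one fully
# connected layer Y = X @ W, with two "devices" simulated by array slices.
rng = np.random.default_rng(1)
X = rng.normal(size=(8, 6))   # mini-batch of 8 examples
W = rng.normal(size=(6, 4))   # layer weights

# Data parallelism: split the batch rows, replicate W on every device.
Y_data = np.concatenate([xb @ W for xb in np.split(X, 2, axis=0)], axis=0)

# Model parallelism: split W by columns, replicate the batch on every device.
Y_model = np.concatenate([X @ wp for wp in np.split(W, 2, axis=1)], axis=1)

assert np.allclose(Y_data, X @ W) and np.allclose(Y_model, X @ W)
```

Under data parallelism each device must synchronize a full copy of the weight gradients, whereas under model parallelism devices exchange activations instead; which strategy is cheaper therefore depends on the relative sizes of the batch and the weights.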
Methods
- The authors evaluated SOYBEAN on Amazon's EC2 cluster, using a single p2.8xlarge instance with 480GB of memory, 32 virtual CPUs, and 8 NVIDIA GK210 GPUs.
- Each of the GPUs has 12GB of memory; they are connected by PCI-e, with a maximum peer-to-peer bi-directional bandwidth of 20GB/s.
- In this evaluation, the authors want to know whether communication overhead accounts for a large fraction of the overall runtime when the batch size is relatively small or when the weight size is large (a rough estimate is sketched after this list).
- The communication overhead is strictly smaller than the communication time, since part of the communication can be overlapped with computation
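To make the question concrete, here is a rough back-of-the-envelope estimate (not from the paper) for a single data-parallel fully connected layer, using the 20 GB/s bandwidth stated above and an assumed per-GPU throughput of about 4 TFLOP/s:

```python
# Rough estimate (not from the paper) of communication vs. computation time
# for one data-parallel fully connected layer. Assumptions: 20 GB/s
# peer-to-peer bandwidth (stated above) and ~4 TFLOP/s per GPU (assumed).
def comm_to_compute_ratio(rows, cols, batch_size,
                          bandwidth_gbps=20.0, gpu_tflops=4.0):
    weights = rows * cols
    comm_time = 4 * weights / (bandwidth_gbps * 1e9)     # one fp32 gradient exchange
    flops = 2 * 3 * weights * batch_size                 # fwd matmul + two bwd matmuls
    compute_time = flops / (gpu_tflops * 1e12)
    return comm_time / compute_time

# Large 8K x 8K weights with a small per-GPU batch: communication dominates.
print(comm_to_compute_ratio(8192, 8192, batch_size=16))    # roughly 8x the compute time
# The same layer with a much larger batch: compute amortizes the exchange.
print(comm_to_compute_ratio(8192, 8192, batch_size=2048))  # well below 1
```

Under these assumptions the ratio scales with weight size and inversely with batch size, matching the observation that data parallelism needs large batches (or small weights) to scale.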
Results
- The authors examine SOYBEAN's performance and seek to answer several questions.
Conclusion
- Deep learning systems have become indispensable in many fields, but as their complexity grows, DNN training time is becoming increasingly intractable.
- With the speed and simplicity provided by SOYBEAN's backend, users can quickly train networks using frontends like TENSORFLOW and MXNET, making DNNs useful for a larger audience
Tables
- Table 1: Runtime per batch for a 4-layer MLP, comparing a single GPU with a single GPU using SOYBEAN partitions. The weight size is fixed at 8K×8K. A scaled-down illustration of this comparison is sketched below.
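For context, here is a scaled-down, CPU-only sketch of what the table compares: the same fully connected layer computed once as a single matmul and once as column partitions of the weight (sizes reduced from 8K×8K; illustrative only, not the paper's benchmark):

```python
import time
import numpy as np

# Scaled-down illustration of Table 1's comparison: one matmul vs. the same
# layer split into weight-column partitions (not the paper's benchmark).
rng = np.random.default_rng(2)
batch, dim, parts = 64, 2048, 4
X = rng.normal(size=(batch, dim)).astype(np.float32)
W = rng.normal(size=(dim, dim)).astype(np.float32)

t0 = time.perf_counter()
Y_full = X @ W                                            # unpartitioned layer
t1 = time.perf_counter()
Y_tiled = np.concatenate(                                 # partitioned layer
    [X @ wp for wp in np.split(W, parts, axis=1)], axis=1)
t2 = time.perf_counter()

assert np.allclose(Y_full, Y_tiled, rtol=1e-3, atol=1e-3)
print(f"full: {t1 - t0:.5f}s   {parts} partitions: {t2 - t1:.5f}s")
```

The partitioned version performs the same total work, so on a single device the difference mainly reflects partitioning overhead, which is presumably what the table quantifies.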
Related work
- How to let users distribute arbitrary data easily while achieving high-performance computation has been a popular research topic. However, it is difficult to design systems that automatically optimize locality without knowledge of the underlying data and processing. Relatedly, distributed array programs are increasingly important due to the emergence of machine learning and deep learning. As a result, significant effort has been expended in optimizing distributed array frameworks.
[Figure: throughput speedup of SoyBean and data parallelism versus batch size, for (a) AlexNet and (b) VGG.]
- Deep learning systems. Many frameworks, such as TensorFlow [4], MXNet [13], PyTorch [2], Theano [10], and Caffe2 [22], have been proposed to facilitate developing new neural network models. These distributed array frameworks emphasize deep learning and machine learning applications. Besides common array operations, they also provide many functionalities for neural networks, such as automatic differentiation (backpropagation).
References
- Julia language. http://julialang.org.
- PyTorch. http://pytorch.org.
- Sparkr: R frontend for spark. http://amplab-extras.github.io/SparkR-pkg.
- M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, M. Kudlur, J. Levenberg, R. Monga, S. Moore, D. G. Murray, B. Steiner, P. Tucker, V. Vasudevan, P. Warden, M. Wicke, Y. Yu, and X. Zheng. Tensorflow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), 2016.
- E. Anderson, Z. Bai, J. Dongarra, A. Greenbaum, A. McKenney, J. Du Croz, S. Hammerling, J. Demmel, C. Bischof, and D. Sorensen. LAPACK: A portable linear algebra library for high-performance computers. In Proceedings of the 1990 ACM/IEEE conference on Supercomputing, pages 2–11. IEEE Computer Society Press, 1990.
- P. Anderson. The use and limitations of static-analysis tools to improve software quality. CrossTalk: The Journal of Defense Software Engineering, 21(6):18–21, 2008.
- S. Balay, W. D. Gropp, L. C. McInnes, and B. F. Smith. Efficient management of parallelism in object oriented numerical software libraries. In E. Arge, A. M. Bruaset, and H. P. Langtangen, editors, Modern Software Tools in Scientific Computing, pages 163–202. Birkhauser Press, 1997.
- S. Balay, S. Abhyankar, M. F. Adams, J. Brown, P. Brune, K. Buschelman, V. Eijkhout, W. D. Gropp, D. Kaushik, M. G. Knepley, L. C. McInnes, K. Rupp, B. F. Smith, and H. Zhang. PETSc users manual. Technical Report ANL-95/11 - Revision 3.5, Argonne National Laboratory, 2014.
- D. Bau, I. Kodukula, V. Kotlyar, K. Pingali, and P. Stodghill. Solving alignment using elementary linear algebra. In Languages and Compilers for Parallel Computing, pages 46–60.
- J. Bergstra, O. Breuleux, F. Bastien, P. Lamblin, R. Pascanu, G. Desjardins, J. Turian, D. Warde-Farley, and Y. Bengio. Theano: a CPU and GPU math expression compiler. In Proceedings of the Python for Scientific Computing Conference (SciPy), 2010.
- K. J. Brown, H. Lee, T. Rompf, A. K. Sujeeth, C. De Sa, C. Aberger, and K. Olukotun. Have abstraction and eat performance, too: Optimized heterogeneous computing with parallel patterns. In Proceedings of the 2016 International Symposium on Code Generation and Optimization, CGO ’16, 2016.
- J. B. Buck, N. Watkins, J. LeFevre, K. Ioannidou, C. Maltzahn, N. Polyzotis, and S. Brandt. Scihadoop: array-based query processing in hadoop. In Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, 2011.
- T. Chen, M. Li, Y. Li, M. Lin, N. Wang, M. Wang, T. Xiao, B. Xu, C. Zhang, and Z. Zhang. MXNet: A Flexible and Efficient Machine Learning Library for Heterogeneous Distributed Systems. ArXiv e-prints, Dec. 2015.
- T. Chilimbi, Y. Suzue, J. Apacible, and K. Kalyanaraman. Project adam: Building an efficient and scalable deep learning training system. In Proceedings of the 11th USENIX Conference on Operating Systems Design and Implementation, OSDI’14, 2014.
- K. Cho, B. van Merrienboer, C. Gulcehre, F. Bougares, H. Schwenk, and Y. Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. 2014.
- J. Choi, J. J. Dongarra, R. Pozo, and D. W. Walker. Scalapack: A scalable linear algebra library for distributed memory concurrent computers. In Frontiers of Massively Parallel Computation, 1992., Fourth Symposium on the, pages 120–127. IEEE, 1992.
- M. Chu, R. Ravindran, and S. Mahlke. Data access partitioning for fine-grain parallelism on multicore architectures. In Microarchitecture, 2007. MICRO 2007. 40th Annual IEEE/ACM International Symposium on, pages 369–380. IEEE, 2007.
- G. Dahl, D. Yu, L. Deng, and A. Acero. Context-dependent pre-trained deep neural networks for large vocabulary speech recognition. In IEEE Transactions on Audio, Speech, and Language Processing, 2011.
- J. Dean and S. Ghemawat. Mapreduce: Simplified data processing on large clusters. In Symposium on Operating System Design and Implementation (OSDI), 2004.
- J. Dean, G. S. Corrado, R. Monga, K. Chen, M. Devin, Q. V. Le, M. Z. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, and A. Y. Ng. Large scale distributed deep networks. In Neural Information Processing Systems (NIPS), 2012.
- E. D'Hollander. Partitioning and labeling of index sets in do loops with constant dependence vectors. In 1989 International Conference on Parallel Processing, University Park, PA, 1989.
- Facebook. Caffe2: a lightweight, modular, and scalable deep learning framework, 2017. URL https://github.com/caffe2/caffe2.
- J. He, A. E. Snavely, R. F. Van der Wijngaart, and M. A. Frumkin. Automatic recognition of performance idioms in scientific applications. In Parallel & Distributed Processing Symposium (IPDPS), 2011 IEEE International, pages 118– 127. IEEE, 2011.
- T. Henretty, K. Stock, L.-N. Pouchet, F. Franchetti, J. Ramanujam, and P. Sadayappan. Data layout transformation for stencil computations on short-vector simd architectures. In Compiler Construction, pages 225–245.
- C. K. O. Hernandez. Open64-based regular stencil shape recognition in hercules. 2013.
- G. Hinton, L. Deng, D. Yu, G. Dahl, A. rahman Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. Sainath, and B. Kingsbury. Deep neural networks for acoustic modeling in speech recognition. Signal Processing Magazine, 2012.
- C.-C. Huang, Q. Chen, Z. Wang, R. Power, J. Ortiz, J. Li, and Z. Xiao. Spartan: A distributed array framework with smart tiling. In USENIX Annual Technical Conference, 2015.
- C.-H. Huang and P. Sadayappan. Communication-free hyperplane partitioning of nested loops. Journal of Parallel and Distributed Computing, 19(2):90–102, 1993.
- D. E. Hudak and S. G. Abraham. Compiler techniques for data partitioning of sequentially iterated parallel loops. In ACM SIGARCH Computer Architecture News, volume 18, pages 187–200. ACM, 1990.
- I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems (NIPS), 2014.
- M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly. Dryad: Distributed data-parallel programs from sequential building blocks. In European Conference on Computer Systems (EuroSys), 2007.
- Y.-J. Ju and H. Dietz. Reduction of cache coherence overhead by compiler data layout and loop transformation. In Languages and Compilers for Parallel Computing, pages 344–358.
- K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations, 2015.
- K. Kennedy and U. Kremer. Automatic data layout for distributed-memory machines. ACM Transactions on Programming Languages and Systems (TOPLAS), 20(4):869–916, 1998.
- N. S. Keskar, D. Mudigere, J. Nocedal, M. Smelyanskiy, and P. T. P. Tang. On large-batch training for deep learning: Generalization gap and sharp minima. In arXiv:1609.04836, 2016.
- C. W. Kessler. Pattern-driven automatic parallelization. Scientific Programming, 5(3):251–274, 1996.
- K. Knobe, J. D. Lukas, and G. L. Steele Jr. Data optimization: Allocation of arrays to reduce communication on simd machines. Journal of Parallel and Distributed Computing, 8 (2):102–118, 1990.
- U. Kremer. Np-completeness of dynamic remapping. In Proceedings of the Fourth Workshop on Compilers for Parallel Computers, Delft, The Netherlands, 1993.
- A. Krizhevsky. One weird trick for parallelizing convolutional neural networks. In arXiv:1404.5997, 2014.
- A. Krizhevsky, I. Sutskever, and G. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25, pages 1106– 1114, 2012.
- J. Li and M. Chen. Index domain alignment: Minimizing cost of cross-referencing between distributed arrays. In Frontiers of Massively Parallel Computation, 1990. Proceedings., 3rd Symposium on the, pages 424–433. IEEE, 1990.
- J. Li and M. Chen. The data alignment phase in compiling programs for distributed-memory machines. Journal of parallel and distributed computing, 13(2):213–221, 1991.
- M. Li, D. G. Andersen, J. W. Park, A. J. Smola, A. Ahmed, V. Josifovski, J. Long, E. J. Shekita, and B.-Y. Su. Scaling distributed machine learning with the parameter server. In USENIX OSDI, 2014.
- Y. Low, J. Gonzalez, A. Kyrola, D. Bickson, C. Guestrin, and J. Hellerstein. Graphlab: A new parallel framework for machine learning. In Conference on Uncertainty in Artificial Intelligence (UAI), 2012.
- Q. Lu, C. Alias, U. Bondhugula, T. Henretty, S. Krishnamoorthy, J. Ramanujam, A. Rountev, P. Sadayappan, Y. Chen, H. Lin, et al. Data layout transformation for enhancing data locality on nuca chip multiprocessors. In Parallel Architectures and Compilation Techniques, 2009. PACT’09. 18th International Conference on, pages 348–357. IEEE, 2009.
- A. L. Maas, A. Y. Hannun, C. T. Lengerich, P. Qi, D. Jurafsky, and A. Y. Ng. Increasing deep neural network acoustic model size for large vocabulary continuous speech recognition. CoRR, abs/1406.7806.
- G. Malewicz, M. H. Austern, A. J. Bik, J. C. Dehnert, I. Horn, N. Leiser, and G. Czajkowski. Pregel: a system for large-scale graph processing. In SIGMOD ’10: Proceedings of the 2010 international conference on Management of data.
- S. Melnik, A. Gubarev, J. J. Long, G. Romer, S. Shivakumar, M. Tolton, and T. Vassilakis. Dremel: Interactive analysis of web-scale datasets. In VLDB, 2010.
- I. Z. Milosavljevic and M. A. Jabri. Automatic array alignment in parallel matlab scripts. In Parallel Processing, 1999. 13th International and 10th Symposium on Parallel and Distributed Processing, 1999. 1999 IPPS/SPDP. Proceedings, pages 285– 289. IEEE, 1999.
- D. G. Murray, M. Schwarzkopf, C. Smowton, S. Smith, A. Madhavapeddy, and S. Hand. Ciel: a universal execution engine for distributed data-flow computing. NSDI, 2011.
- D. G. Murray, F. McSherry, R. Isaacs, M. Isard, P. Barham, and M. Abadi. Naiad: a timely dataflow system. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles, pages 439–455. ACM, 2013.
- J. Nickolls, I. Buck, M. Garland, and K. Skadron. Scalable parallel programming with CUDA. Queue, 6(2):40–53, Mar. 2008.
- J. Nieplocha, R. J. Harrison, and R. J. Littlefield. Global arrays: A nonuniform memory access programming model for highperformance computers. The Journal of Supercomputing, 10 (2):169–189, 1996.
- E. Park, C. Kartsaklis, T. Janjusic, and J. Cavazos. Trace-driven memory access pattern recognition in computational kernels. In Proceedings of the Second Workshop on Optimizing Stencil Computations, pages 25–32. ACM, 2014.
- M. Philippsen. Automatic alignment of array data and processes to reduce communication time on DMPPs, volume 30. ACM, 1995.
- J. Poulson, B. Marker, R. A. van de Geijn, J. R. Hammond, and N. A. Romero. Elemental: A new framework for distributed memory dense matrix computations. ACM Trans. Math. Softw., 39(2):13:1–13:24, Feb. 2013.
- R. Power and J. Li. Piccolo: Building fast, distributed programs with partitioned tables. In Symposium on Operating System Design and Implementation (OSDI), pages 293–306, 2010.
- Z. Qian, X. Chen, N. Kang, M. Chen, Y. Yu, T. Moscibroda, and Z. Zhang. MadLINQ: large-scale distributed matrix computation for the cloud. In Proceedings of the 7th ACM european conference on Computer Systems, EuroSys ’12, 2012.
- J. Ramanujam and P. Sadayappan. A methodology for parallelizing programs for multicomputers and complex memory multiprocessors. In Proceedings of the 1989 ACM/IEEE conference on Supercomputing, pages 637–646. ACM, 1989.
- J. Ramanujam and P. Sadayappan. Compile-time techniques for data distribution in distributed memory machines. Parallel and Distributed Systems, IEEE Transactions on, 2(4):472–482, 1991.
- C. J. Rossbach, Y. Yu, J. Currey, J.-P. Martin, and D. Fetterly. Dandelion: a compiler and runtime for heterogeneous systems. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles, pages 49–68. ACM, 2013.
- M. Stokely, F. Rohani, and E. Tassone. Large-scale parallel statistical forecasting computations in r. In JSM Proceedings, Section on Physical and Engineering Sciences, Alexandria, VA, 2011.
- S. Venkataraman, E. Bodzsar, I. Roy, A. AuYoung, and R. S. Schreiber. Presto: distributed machine learning and graph processing with sparse matrices. In Proceedings of the 8th ACM European Conference on Computer Systems (Eurosys), 2013.
- R. S. Xin, J. Rosen, M. Zaharia, M. J. Franklin, S. Shenker, and I. Stoica. Shark: Sql and rich analytics at scale. In SIGMOD, 2013.
- Y. Yu, M. Isard, D. Fetterly, M. Budiu, U. Erlingsson, P. K. Gunda, and J. Currey. DryadLINQ: A system for generalpurpose distributed data-parallel computing using a high-level language. In Symposium on Operating System Design and Implementation (OSDI), 2008.
- M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica. Spark: cluster computing with working sets. In Proceedings of the 2nd USENIX conference on Hot topics in cloud computing, pages 10–10, 2010.
- M. Zhang, Y. Wu, K. Chen, T. Ma, and W. Zheng. Measuring and optimizing distributed array programs. Proc. VLDB Endow., 9(12):912–923, Aug. 2016.