Unifying Data, Model and Hybrid Parallelism in Deep Learning via Tensor Tiling

Chien-chin Huang

arXiv:1805.04170 [cs.DC] (Distributed, Parallel, and Cluster Computing), 2018.

Abstract:

Deep learning systems have become vital tools across many fields, but the increasing model sizes mean that training must be accelerated to maintain such systems’ utility. Current systems like Tensorflow and MXNet focus on one specific parallelization strategy, data parallelism, which requires large training batch sizes in order to scale…

Introduction
  • Deep neural networks (DNNs) have delivered tremendous improvements across many machine learning tasks, ranging from computer vision [33, 40] and speech recognition [18, 26] to natural language processing [15, 30].
  • The popularity of DNNs has ushered in the development of several machine learning systems focused on DNNs [2, 4, 13, 22]
  • These systems allow users to program a DNN model with ease in an array-language frontend, and each enables training on a GPU for performance.
  • Training is performed by iterating over the training data many times (i.e., “epochs”), doing SGD in mini-batches, as illustrated in the following pseudocode
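    A minimal Python sketch of such a mini-batch SGD loop (illustrative only, not the paper’s exact pseudocode):

      import numpy as np

      # Toy mini-batch SGD for a linear model y = X @ w; all names are illustrative.
      def train(X, y, num_epochs=10, batch_size=32, lr=0.01):
          n, d = X.shape
          w = np.zeros(d)
          for epoch in range(num_epochs):                 # one epoch = one full pass over the data
              order = np.random.permutation(n)            # reshuffle samples every epoch
              for start in range(0, n, batch_size):
                  idx = order[start:start + batch_size]   # one mini-batch
                  Xb, yb = X[idx], y[idx]
                  grad = Xb.T @ (Xb @ w - yb) / len(idx)  # mean-squared-error gradient
                  w -= lr * grad                          # SGD update
          return w

      # Example usage: w = train(np.random.randn(1000, 8), np.random.randn(1000))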
Highlights
  • Deep neural networks (DNNs) have delivered tremendous improvements across many machine learning tasks, ranging from computer vision [33, 40] and speech recognition [18, 26] to natural language processing [15, 30]
  • The popularity of DNNs has ushered in the development of several machine learning systems focused on DNNs [2, 4, 13, 22]
  • Compounding the problem, DNN users must completely retrain the model after any change in the training parameters while fine-tuning a DNN
  • We evaluated SOYBEAN on an 8-GPU machine and compared its performance against data parallelism with varying configurations
  • Deep learning systems have become indispensable in many fields, but as their complexities grow, DNN training time is becoming increasingly intractable
  • We demonstrate effective tiling in SOYBEAN that can achieve performance as good as or better than the best of data parallelism and model parallelism for many types of DNNs on multiple GPUs; a toy illustration of the tiling idea follows this list
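    A toy NumPy illustration (not SOYBEAN’s API) of how data parallelism and model parallelism amount to two different tilings of the same tensor computation, here a fully connected layer Y = X @ W split across two workers:

      import numpy as np

      X = np.random.randn(8, 4)   # activations: batch x input features
      W = np.random.randn(4, 6)   # weights: input features x output features
      Y = X @ W                   # reference result computed on one device

      # Data parallelism: tile X along the batch dimension, replicate W on both workers.
      Y_data = np.concatenate([X[:4] @ W, X[4:] @ W], axis=0)

      # Model parallelism: tile W along the output dimension, replicate X on both workers.
      Y_model = np.concatenate([X @ W[:, :3], X @ W[:, 3:]], axis=1)

      assert np.allclose(Y, Y_data) and np.allclose(Y, Y_model)

    Which tiling is cheaper depends on how large the batch is relative to the weights, which is the per-layer choice a hybrid scheme has to make.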
Methods
  • The authors evaluated SOYBEAN on Amazon’s EC2 cluster, using a p2.8xlarge instance with 480GB of memory, 32 virtual CPUs, and 8 NVIDIA GK210 GPUs.
  • Each of the GPUs has 12GB of memory; they are connected by PCI-e, with a maximum peer-to-peer bi-directional bandwidth of 20GB/s.
  • In this evaluation, the authors want to know if communication overhead accounts for a large percentage of the overall runtime when the batch size is relatively small or when the weight size is large.
  • The communication overhead (the portion of communication not hidden by overlapping it with computation) is strictly smaller than the total communication time; the sketch below gives a rough sense of when communication dominates
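    A back-of-envelope sketch of this trade-off, using assumed and purely illustrative constants (not measurements from the paper): per batch, a data-parallel GPU’s compute time grows with the batch size, while the gradient traffic it must exchange depends only on the weight size.

      # Rough cost model for one data-parallel GPU; every constant here is an assumption.
      def step_costs(batch, weight_elems, flops=4e12, bytes_per_sec=10e9):
          compute = 3 * 2 * batch * weight_elems / flops  # forward matmul + two backward matmuls
          comm = 4 * weight_elems / bytes_per_sec         # exchange fp32 weight gradients once
          return compute, comm

      for batch in (16, 256, 4096):
          compute, comm = step_costs(batch, weight_elems=8192 * 8192)
          print(f"batch={batch:5d}: compute={compute*1e3:7.1f} ms, comm={comm*1e3:6.1f} ms")

    Under these assumed constants, communication stays fixed at roughly 27 ms per step while compute grows from about 2 ms at batch 16 to over 400 ms at batch 4096, so communication dominates only at the smaller batch sizes.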
Results
  • The authors examine SOYBEAN’s performance; they want to answer the following questions: 1. …
Conclusion
  • Deep learning systems have become indispensable in many fields, but as their complexities grow, DNN training time is becoming increasingly intractable.
  • With the speed and simplicity provided by SOYBEAN’s backend, users can train networks using frontends like TENSORFLOW and MXNET quickly, making DNNs useful to a larger audience
Tables
  • Table 1: Per-batch runtime comparison of a 4-layer MLP network on a single GPU versus a single GPU with SOYBEAN partitions. The weight size is fixed to 8K×8K
Related work
  • How to distribute arbitrary data easily for users while achieving high-performance computation at the same time has been a popular research topic. However, it is difficult to design systems that automatically optimize locality without knowledge of the underlying data and processing. Relatedly, distributed array programs are increasingly important due to the emergence of machine learning and deep learning. As a result, significant effort has been expended in optimizing distributed array frameworks.

    [Figure: throughput speedup of SOYBEAN versus data parallelism as a function of batch size, for (a) AlexNet and (b) VGG.]

  • Deep learning systems. Many frameworks, such as Tensorflow [4], MXNet [13], PyTorch [2], Theano [10] and Caffe2 [22], have been proposed to facilitate developing new neural network models. These distributed array frameworks emphasize deep learning and machine learning applications. Besides the common array operations, they also provide many functionalities for neural networks, such as automatic differentiation (backpropagation).
Reference
  • Julia language. http://julialang.org.
  • PyTorch. http://pytorch.org.
  • Sparkr: R frontend for spark. http://amplab-extras.github.io/SparkR-pkg.
  • M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, M. Kudlur, J. Levenberg, R. Monga, S. Moore, D. G. Murray, B. Steiner, P. Tucker, V. Vasudevan, P. Warden, M. Wicke, Y. Yu, and X. Zheng. Tensorflow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), 2016.
  • E. Anderson, Z. Bai, J. Dongarra, A. Greenbaum, A. McKenney, J. Du Croz, S. Hammerling, J. Demmel, C. Bischof, and D. Sorensen. LAPACK: A portable linear algebra library for high-performance computers. In Proceedings of the 1990 ACM/IEEE conference on Supercomputing, pages 2–11. IEEE Computer Society Press, 1990.
  • P. Anderson. The use and limitations of static-analysis tools to improve software quality. CrossTalk: The Journal of Defense Software Engineering, 21(6):18–21, 2008.
  • S. Balay, W. D. Gropp, L. C. McInnes, and B. F. Smith. Efficient management of parallelism in object oriented numerical software libraries. In E. Arge, A. M. Bruaset, and H. P. Langtangen, editors, Modern Software Tools in Scientific Computing, pages 163–202. Birkhauser Press, 1997.
  • S. Balay, S. Abhyankar, M. F. Adams, J. Brown, P. Brune, K. Buschelman, V. Eijkhout, W. D. Gropp, D. Kaushik, M. G. Knepley, L. C. McInnes, K. Rupp, B. F. Smith, and H. Zhang. PETSc users manual. Technical Report ANL-95/11 - Revision 3.5, Argonne National Laboratory, 2014.
  • D. Bau, I. Kodukula, V. Kotlyar, K. Pingali, and P. Stodghill. Solving alignment using elementary linear algebra. In Languages and Compilers for Parallel Computing, pages 46–60.
  • J. Bergstra, O. Breuleux, F. Bastien, P. Lamblin, R. Pascanu, G. Desjardins, J. Turian, D. Warde-Farley, and Y. Bengio. Theano: a CPU and GPU math expression compiler. In Proceedings of the Python for Scientific Computing Conference (SciPy), 2010.
  • K. J. Brown, H. Lee, T. Rompf, A. K. Sujeeth, C. De Sa, C. Aberger, and K. Olukotun. Have abstraction and eat performance, too: Optimized heterogeneous computing with parallel patterns. In Proceedings of the 2016 International Symposium on Code Generation and Optimization, CGO ’16, 2016.
  • J. B. Buck, N. Watkins, J. LeFevre, K. Ioannidou, C. Maltzahn, N. Polyzotis, and S. Brandt. Scihadoop: array-based query processing in hadoop. In Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, 2011.
  • T. Chen, M. Li, Y. Li, M. Lin, N. Wang, M. Wang, T. Xiao, B. Xu, C. Zhang, and Z. Zhang. MXNet: A Flexible and Efficient Machine Learning Library for Heterogeneous Distributed Systems. ArXiv e-prints, Dec. 2015.
  • T. Chilimbi, Y. Suzue, J. Apacible, and K. Kalyanaraman. Project adam: Building an efficient and scalable deep learning training system. In Proceedings of the 11th USENIX Conference on Operating Systems Design and Implementation, OSDI’14, 2014.
  • K. Cho, B. van Merrienboer, C. Gulcehre, F. Bougares, H. Schwenk, and Y. Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. 2014.
  • J. Choi, J. J. Dongarra, R. Pozo, and D. W. Walker. Scalapack: A scalable linear algebra library for distributed memory concurrent computers. In Frontiers of Massively Parallel Computation, 1992., Fourth Symposium on the, pages 120–127. IEEE, 1992.
  • M. Chu, R. Ravindran, and S. Mahlke. Data access partitioning for fine-grain parallelism on multicore architectures. In Microarchitecture, 2007. MICRO 2007. 40th Annual IEEE/ACM International Symposium on, pages 369–380. IEEE, 2007.
  • G. Dahl, D. Yu, L. Deng, and A. Acero. Context-dependent pre-trained deep neural networks for large vocabulary speech recognition. In IEEE Transactions on Audio, Speech, and Language Processing, 2011.
  • J. Dean and S. Ghemawat. Mapreduce: Simplified data processing on large clusters. In Symposium on Operating System Design and Implementation (OSDI), 2004.
  • J. Dean, G. S. Corrado, R. Monga, K. Chen, M. Devin, Q. V. Le, M. Z. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, and A. Y. Ng. Large scale distributed deep networks. In Neural Information Processing Systems (NIPS), 2012.
  • E. D’Hollander. Partitioning and labeling of index sets in do loops with constant dependence vectors. In 1989 International Conference on Parallel Processing, University Park, PA, 1989.
  • Facebook. Caffe2: a lightweight, modular, and scalable deep learning framework, 2017. URL https://github.com/caffe2/caffe2.
  • J. He, A. E. Snavely, R. F. Van der Wijngaart, and M. A. Frumkin. Automatic recognition of performance idioms in scientific applications. In Parallel & Distributed Processing Symposium (IPDPS), 2011 IEEE International, pages 118– 127. IEEE, 2011.
  • T. Henretty, K. Stock, L.-N. Pouchet, F. Franchetti, J. Ramanujam, and P. Sadayappan. Data layout transformation for stencil computations on short-vector simd architectures. In Compiler Construction, pages 225–245.
  • C. K. O. Hernandez. Open64-based regular stencil shape recognition in hercules. 2013.
  • G. Hinton, L. Deng, D. Yu, G. Dahl, A. rahman Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. Sainath, and B. Kingsbury. Deep neural networks for acoustic modeling in speech recognition. Signal Processing Magazine, 2012.
  • C.-C. Huang, Q. Chen, Z. Wang, R. Power, J. Ortiz, J. Li, and Z. Xiao. Spartan: A distributed array framework with smart tiling. In USENIX Annual Technical Conference, 2015.
  • C.-H. Huang and P. Sadayappan. Communication-free hyperplane partitioning of nested loops. Journal of Parallel and Distributed Computing, 19(2):90–102, 1993.
  • D. E. Hudak and S. G. Abraham. Compiler techniques for data partitioning of sequentially iterated parallel loops. In ACM SIGARCH Computer Architecture News, volume 18, pages 187–200. ACM, 1990.
  • I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems (NIPS), 2014.
  • M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly. Dryad: Distributed data-parallel programs from sequential building blocks. In European Conference on Computer Systems (EuroSys), 2007.
  • Y.-J. Ju and H. Dietz. Reduction of cache coherence overhead by compiler data layout and loop transformation. In Languages and Compilers for Parallel Computing, pages 344–358.
  • K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations, 2015.
  • K. Kennedy and U. Kremer. Automatic data layout for distributed-memory machines. ACM Transactions on Programming Languages and Systems (TOPLAS), 20(4):869–916, 1998.
  • N. S. Keskar, D. Mudigere, J. Nocedal, M. Smelyanskiy, and P. T. P. Tang. On large-batch training for deep learning: Generalization gap and sharp minima. In arXiv:1609.04836, 2016.
  • C. W. Kessler. Pattern-driven automatic parallelization. Scientific Programming, 5(3):251–274, 1996.
  • K. Knobe, J. D. Lukas, and G. L. Steele Jr. Data optimization: Allocation of arrays to reduce communication on simd machines. Journal of Parallel and Distributed Computing, 8 (2):102–118, 1990.
  • U. Kremer. Np-completeness of dynamic remapping. In Proceedings of the Fourth Workshop on Compilers for Parallel Computers, Delft, The Netherlands, 1993.
  • A. Krizhevsky. One weird trick for parallelizing convolutional neural networks. In arXiv:1404.5997, 2014.
  • A. Krizhevsky, I. Sutskever, and G. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25, pages 1106– 1114, 2012.
  • J. Li and M. Chen. Index domain alignment: Minimizing cost of cross-referencing between distributed arrays. In Frontiers of Massively Parallel Computation, 1990. Proceedings., 3rd Symposium on the, pages 424–433. IEEE, 1990.
  • J. Li and M. Chen. The data alignment phase in compiling programs for distributed-memory machines. Journal of parallel and distributed computing, 13(2):213–221, 1991.
  • M. Li, D. G. Andersen, J. W. Park, A. J. Smola, A. Ahmed, V. Josifovski, J. Long, E. J. Shekita, and B.-Y. Su. Scaling distributed machine learning with the parameter server. In USENIX OSDI, 2014.
  • Y. Low, J. Gonzalez, A. Kyrola, D. Bickson, C. Guestrin, and J. Hellerstein. Graphlab: A new parallel framework for machine learning. In Conference on Uncertainty in Artificial Intelligence (UAI), 2012.
  • Q. Lu, C. Alias, U. Bondhugula, T. Henretty, S. Krishnamoorthy, J. Ramanujam, A. Rountev, P. Sadayappan, Y. Chen, H. Lin, et al. Data layout transformation for enhancing data locality on nuca chip multiprocessors. In Parallel Architectures and Compilation Techniques, 2009. PACT’09. 18th International Conference on, pages 348–357. IEEE, 2009.
  • A. L. Maas, A. Y. Hannun, C. T. Lengerich, P. Qi, D. Jurafsky, and A. Y. Ng. Increasing deep neural network acoustic model size for large vocabulary continuous speech recognition. CoRR, abs/1406.7806.
  • G. Malewicz, M. H. Austern, A. J. Bik, J. C. Dehnert, I. Horn, N. Leiser, and G. Czajkowski. Pregel: a system for large-scale graph processing. In SIGMOD ’10: Proceedings of the 2010 international conference on Management of data.
  • S. Melnik, A. Gubarev, J. J. Long, G. Romer, S. Shivakumar, M. Tolton, and T. Vassilakis. Dremel: Interactive analysis of web-scale datasets. In VLDB, 2010.
  • I. Z. Milosavljevic and M. A. Jabri. Automatic array alignment in parallel matlab scripts. In Parallel Processing, 1999. 13th International and 10th Symposium on Parallel and Distributed Processing, 1999. 1999 IPPS/SPDP. Proceedings, pages 285– 289. IEEE, 1999.
  • D. G. Murray, M. Schwarzkopf, C. Smowton, S. Smith, A. Madhavapeddy, and S. Hand. Ciel: a universal execution engine for distributed data-flow computing. NSDI, 2011.
  • D. G. Murray, F. McSherry, R. Isaacs, M. Isard, P. Barham, and M. Abadi. Naiad: a timely dataflow system. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles, pages 439–455. ACM, 2013.
  • J. Nickolls, I. Buck, M. Garland, and K. Skadron. Scalable parallel programming with CUDA. Queue, 6(2):40–53, Mar. 2008.
  • J. Nieplocha, R. J. Harrison, and R. J. Littlefield. Global arrays: A nonuniform memory access programming model for highperformance computers. The Journal of Supercomputing, 10 (2):169–189, 1996.
  • E. Park, C. Kartsaklis, T. Janjusic, and J. Cavazos. Trace-driven memory access pattern recognition in computational kernels. In Proceedings of the Second Workshop on Optimizing Stencil Computations, pages 25–32. ACM, 2014.
  • M. Philippsen. Automatic alignment of array data and processes to reduce communication time on DMPPs, volume 30. ACM, 1995.
  • J. Poulson, B. Marker, R. A. van de Geijn, J. R. Hammond, and N. A. Romero. Elemental: A new framework for distributed memory dense matrix computations. ACM Trans. Math. Softw., 39(2):13:1–13:24, Feb. 2013.
  • R. Power and J. Li. Piccolo: Building fast, distributed programs with partitioned tables. In Symposium on Operating System Design and Implementation (OSDI), pages 293–306, 2010.
  • Z. Qian, X. Chen, N. Kang, M. Chen, Y. Yu, T. Moscibroda, and Z. Zhang. MadLINQ: large-scale distributed matrix computation for the cloud. In Proceedings of the 7th ACM european conference on Computer Systems, EuroSys ’12, 2012.
  • J. Ramanujam and P. Sadayappan. A methodology for parallelizing programs for multicomputers and complex memory multiprocessors. In Proceedings of the 1989 ACM/IEEE conference on Supercomputing, pages 637–646. ACM, 1989.
  • J. Ramanujam and P. Sadayappan. Compile-time techniques for data distribution in distributed memory machines. Parallel and Distributed Systems, IEEE Transactions on, 2(4):472–482, 1991.
  • C. J. Rossbach, Y. Yu, J. Currey, J.-P. Martin, and D. Fetterly. Dandelion: a compiler and runtime for heterogeneous systems. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles, pages 49–68. ACM, 2013.
  • M. Stokely, F. Rohani, and E. Tassone. Large-scale parallel statistical forecasting computations in r. In JSM Proceedings, Section on Physical and Engineering Sciences, Alexandria, VA, 2011.
  • S. Venkataraman, E. Bodzsar, I. Roy, A. AuYoung, and R. S. Schreiber. Presto: distributed machine learning and graph processing with sparse matrices. In Proceedings of the 8th ACM European Conference on Computer Systems (Eurosys), 2013.
  • R. S. Xin, J. Rosen, M. Zaharia, M. J. Franklin, S. Shenker, and I. Stoica. Shark: Sql and rich analytics at scale. In SIGMOD, 2013.
  • Y. Yu, M. Isard, D. Fetterly, M. Budiu, U. Erlingsson, P. K. Gunda, and J. Currey. DryadLINQ: A system for generalpurpose distributed data-parallel computing using a high-level language. In Symposium on Operating System Design and Implementation (OSDI), 2008.
  • M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica. Spark: cluster computing with working sets. In Proceedings of the 2nd USENIX conference on Hot topics in cloud computing, pages 10–10, 2010.
  • M. Zhang, Y. Wu, K. Chen, T. Ma, and W. Zheng. Measuring and optimizing distributed array programs. Proc. VLDB Endow., 9(12):912–923, Aug. 2016.