Supporting Very Large Models using Automatic Dataflow Graph Partitioning

Chien-chin Huang

Proceedings of the Fourteenth EuroSys Conference 2019 (EuroSys '19). Preprint: arXiv:1807.08887.

DOI: https://doi.org/10.1145/3302424.3303953
We present the Tofu system, which enables the training of very large deep neural network models by partitioning a dataflow graph of tensors across multiple GPU devices

Abstract:

This paper presents Tofu, a system that partitions very large DNN models across multiple GPU devices to reduce per-GPU memory footprint. Tofu is designed to partition a dataflow graph of fine-grained tensor operators used by platforms like MXNet and TensorFlow. In order to automatically partition each operator, we propose to describe the semantics of each operator in a simple description language (TDL), from which Tofu infers the operator's valid partition strategies.

Introduction
  • The deep learning community has been using larger deep neural network (DNN) models to achieve higher accuracy on more complex tasks over the past few years [1, 2].
  • Some proposals try to fit larger models into a single GPU, e.g., by using the much larger CPU memory as a swap area for the GPU [4] or by discarding intermediate results to save memory at the cost of re-computation [5,6,7]
  • Another promising solution is to partition a DNN model across multiple GPU devices.
  • Partitioning each tensor in the DNN computation across multiple devices can lower per-GPU memory footprint, thereby allowing very large models to be trained.
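The arithmetic behind this claim is simple enough to sketch. The tensor shape, GPU count, and byte sizes below are illustrative assumptions, not figures from the paper:

```python
# Illustrative arithmetic (not Tofu's code): partitioning a weight
# tensor across GPUs divides its per-GPU memory footprint.

shape = (8192, 8192)          # a hypothetical fp32 weight tensor
bytes_per_elem = 4
total = shape[0] * shape[1] * bytes_per_elem          # 256 MiB

num_gpus = 4
# Split along the first axis: each GPU holds an (8192/4, 8192) shard.
shard_shape = (shape[0] // num_gpus, shape[1])
per_gpu = shard_shape[0] * shard_shape[1] * bytes_per_elem

print(total >> 20, "MiB total;", per_gpu >> 20, "MiB per GPU")
```

With four devices, the per-GPU footprint of this tensor drops from 256 MiB to 64 MiB, which is why partitioning (unlike replication) lets aggregate GPU memory bound the model size.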
Highlights
  • The deep learning community has been using larger deep neural network (DNN) models to achieve higher accuracy on more complex tasks over the past few years [1, 2]
  • The size of the deep neural network models that can be trained today is constrained by the limited device memory of a single GPU
  • DNN benchmarks: we evaluate the WResNet [1] convolutional neural network and a recurrent neural network (RNN)
  • We present the Tofu system, which enables the training of very large deep neural network models by partitioning a dataflow graph of tensors across multiple GPU devices
  • Tofu uses a recursive search algorithm based on dynamic programming and deep neural network-specific heuristics to find the best partition plan that minimizes communication for the entire dataflow graph
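The dynamic-programming search highlighted above can be illustrated with a toy sketch over a chain-shaped dataflow graph. Everything below is invented for illustration (the strategy names, per-operator costs, and conversion costs); Tofu's actual algorithm is recursive and adds DNN-specific heuristics this sketch omits:

```python
# Toy dynamic program in the spirit of Tofu's search: choose one
# partition strategy per operator in a chain-shaped dataflow graph so
# that the summed (hypothetical) execution and conversion costs are
# minimal.

STRATEGIES = ("row", "col", "replicate")

# Hypothetical per-operator cost of running under each strategy.
OP_COST = [
    {"row": 0, "col": 3, "replicate": 1},
    {"row": 2, "col": 0, "replicate": 1},
    {"row": 0, "col": 3, "replicate": 1},
]

def conv_cost(prev, cur):
    """Hypothetical cost of re-partitioning between adjacent operators."""
    if prev == cur:
        return 0
    return 1 if "replicate" in (prev, cur) else 2

def best_plan(op_cost):
    dp = dict(op_cost[0])     # dp[s]: cheapest plan so far ending in s
    back = []                 # back-pointers for plan recovery
    for costs in op_cost[1:]:
        choice = {s: min(dp, key=lambda p: dp[p] + conv_cost(p, s))
                  for s in STRATEGIES}
        dp = {s: dp[choice[s]] + conv_cost(choice[s], s) + costs[s]
              for s in STRATEGIES}
        back.append(choice)
    end = min(dp, key=dp.get)
    plan = [end]
    for choice in reversed(back):
        plan.append(choice[plan[-1]])
    return dp[end], plan[::-1]

cost, plan = best_plan(OP_COST)
print(cost, plan)             # -> 2 ['row', 'row', 'row']
```

With these costs the DP keeps every operator on the "row" strategy: paying the middle operator's suboptimal execution cost (2) is cheaper than two conversions (2 + 2) to use its preferred strategy.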
Results
  • As the weight tensors are larger in the higher layers, Tofu switches to partition strategies that fetch the relatively smaller activation tensors
Conclusion
  • The authors present the Tofu system, which enables the training of very large DNN models by partitioning a dataflow graph of tensors across multiple GPU devices.
  • To automate this process, Tofu infers each operator’s valid partition strategies by analyzing its semantics written in a simple description language (TDL).
  • Tofu uses a recursive search algorithm based on dynamic programming and DNN-specific heuristics to find the best partition plan that minimizes communication for the entire dataflow graph
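As a rough illustration of the analysis described above (the snippet below is not real TDL, whose syntax this summary does not show), one can infer candidate partition strategies for a matrix multiply from how each output or reduction index maps onto the inputs' dimensions:

```python
# Sketch of inferring partition strategies from an operator
# description, loosely in the spirit of Tofu's TDL analysis (the real
# language and analysis are richer). Matmul is described by how output
# indices map to inputs:  C[i, j] = sum_k A[i, k] * B[k, j]

matmul = {
    "output": ("i", "j"),
    "inputs": {"A": ("i", "k"), "B": ("k", "j")},
    "reduce": ("k",),
}

def strategies(op):
    """For each partitionable index, report what each input must do."""
    plans = {}
    for idx in op["output"] + op["reduce"]:
        plan = {}
        for name, dims in op["inputs"].items():
            if idx in dims:
                plan[name] = f"split axis {dims.index(idx)}"
            else:
                plan[name] = "replicate"
        kind = "needs reduction" if idx in op["reduce"] else "no reduction"
        plans[idx] = (plan, kind)
    return plans

for idx, (plan, kind) in strategies(matmul).items():
    print(idx, plan, kind)
```

Partitioning along `i` or `j` splits one input and replicates the other; partitioning along the reduction index `k` splits both inputs but requires an extra aggregation step across devices.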
Tables
  • Table 1: Time to search for the best partition for 8 workers. WResNet-152 and RNN-10 are two large DNN benchmarks
  • Table 2: Total weight tensor sizes (GB) of our benchmarks
  • Table 3: Comparison of throughput (samples/second)
Related work
  • Parallel DNN training. Many parallel strategies have been developed to speed up DNN training. Some, such as the popular data parallelism [47,48,49,50], cannot be used for training very large models because the parameters are replicated on each device. Model parallelism spreads the model parameters across multiple GPUs and is thus suitable for training very large models. Early work [8, 9, 46] parallelizes specific classes of DNN models and is limited in flexibility and generality. Minerva [51] and Strads [52] require users to implement extra interfaces to partition model parameters, while Tofu requires no change to the user program. Another approach is to assign different layers/operators to different devices via heuristics [45] or stochastic search [28, 44]. However, operator placement works well only when there are sufficiently many concurrent operators, and is thus not suitable for DNN models with a deep stack of layers.
  • Out-of-core DNN training. This includes recomputation on demand [5,6,7] and swapping and prefetching from host memory [4, 42, 43]. Recomputation is not viable for large weight tensors. Swapping with host memory reduces the opportunity to co-locate computation and data, and scales poorly when there are multiple GPUs. None of these techniques can efficiently utilize the aggregate memory capacity of multiple cards as Tofu does; moreover, Tofu can be combined with them.
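The contrast drawn above between data parallelism and parameter partitioning comes down to memory accounting; the sketch below uses made-up sizes (24 GB of weights, 8 GPUs with 16 GB each), not numbers from the paper:

```python
# Back-of-the-envelope contrast between data parallelism (parameters
# replicated on every GPU) and Tofu-style partitioning (parameters
# split across GPUs). All numbers are illustrative assumptions.

params_gb = 24.0      # hypothetical total weight size
gpus = 8
gpu_mem_gb = 16.0

data_parallel_per_gpu = params_gb          # full replica per device
partitioned_per_gpu = params_gb / gpus     # each device holds one shard

print(f"data parallel: {data_parallel_per_gpu:.1f} GB/GPU "
      f"({'fits' if data_parallel_per_gpu <= gpu_mem_gb else 'does not fit'})")
print(f"partitioned  : {partitioned_per_gpu:.1f} GB/GPU "
      f"({'fits' if partitioned_per_gpu <= gpu_mem_gb else 'does not fit'})")
```

In this hypothetical setup a full replica (24 GB) exceeds a 16 GB device, so data parallelism cannot train the model at all, while an eighth of the parameters (3 GB) fits easily, which is the argument for partitioning in the related-work comparison above.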
Funding
  • This work is supported in part by the National Science Foundation under award CNS-1816717, NVIDIA AI Lab (NVAIL) at NYU, and AWS cloud credits for research
References
  • Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. arXiv:1605.07146, 2016.
  • Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, and Mohammad Norouzi. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv:1609.08144, 2016.
  • Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016. http://www.deeplearningbook.org.
  • Chen Meng, Minmin Sun, Jun Yang, Minghui Qiu, and Yang Gu. Training deeper models by GPU memory optimization on TensorFlow. In Proc. of ML Systems Workshop in NIPS, 2017.
  • Audrunas Gruslys, Rémi Munos, Ivo Danihelka, Marc Lanctot, and Alex Graves. Memory-efficient backpropagation through time. In Advances in Neural Information Processing Systems, pages 4125–4133, 2016.
  • James Martens and Ilya Sutskever. Training deep and recurrent networks with Hessian-free optimization. In Neural Networks: Tricks of the Trade, pages 479–535.
  • Tianqi Chen, Bing Xu, Chiyuan Zhang, and Carlos Guestrin. Training deep nets with sublinear memory cost. arXiv preprint arXiv:1604.06174, 2016.
  • Jeffrey Dean, Greg S. Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Quoc V. Le, Mark Z. Mao, Marc'Aurelio Ranzato, Andrew Senior, Paul Tucker, Ke Yang, and Andrew Y. Ng. Large scale distributed deep networks. In Neural Information Processing Systems (NIPS), 2012.
  • Adam Coates, Brody Huval, Tao Wang, David Wu, Bryan Catanzaro, and Andrew Ng. Deep learning with COTS HPC systems. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), pages 1337–1345, 2013.
  • Trishul Chilimbi, Yutaka Suzue, Johnson Apacible, and Karthik Kalyanaraman. Project Adam: Building an efficient and scalable deep learning training system. In Proceedings of the 11th USENIX Conference on Operating Systems Design and Implementation, OSDI '14, 2014.
  • Xuan Yang, Jing Pu, Blaine Burton Rister, Nikhil Bhagdikar, Stephen Richardson, Shahar Kvatinsky, Jonathan Ragan-Kelley, Ardavan Pedram, and Mark Horowitz. A systematic approach to blocking convolutional neural networks. arXiv preprint arXiv:1606.04209, 2016.
  • Duckhwan Kim, Jaeha Kung, Sek Chai, Sudhakar Yalamanchili, and Saibal Mukhopadhyay. Neurocube: A programmable digital neuromorphic architecture with high-density 3D memory. In Computer Architecture (ISCA), 2016 ACM/IEEE 43rd Annual International Symposium on, pages 380–392. IEEE, 2016.
  • Mingyu Gao, Jing Pu, Xuan Yang, Mark Horowitz, and Christos Kozyrakis. TETRIS: Scalable and efficient neural network acceleration with 3D memory. ACM SIGOPS Operating Systems Review, 51(2):751–764, 2017.
  • Zhihao Jia, Sina Lin, Charles R. Qi, and Alex Aiken. Exploring hidden dimensions in parallelizing convolutional neural networks. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholm, Sweden, July 10–15, 2018, pages 2279–2288, 2018.
  • Zhihao Jia, Matei Zaharia, and Alex Aiken. Beyond data and model parallelism for deep neural networks. arXiv preprint arXiv:1807.05358, 2018.
  • Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek G. Murray, Benoit Steiner, Paul Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), 2016.
  • Tianqi Chen, Mu Li, Yutian Li, Min Lin, Naiyan Wang, Minjie Wang, Tianjun Xiao, Bing Xu, Chiyuan Zhang, and Zheng Zhang. MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems. arXiv preprint arXiv:1512.01274, 2015.
  • PyTorch. http://pytorch.org.
  • Jonathan Ragan-Kelley, Connelly Barnes, Andrew Adams, Sylvain Paris, Frédo Durand, and Saman Amarasinghe. Halide: A language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines. ACM SIGPLAN Notices, 48(6):519–530, 2013.
  • Rafal Józefowicz, Oriol Vinyals, Mike Schuster, Noam Shazeer, and Yonghui Wu. Exploring the limits of language modeling. CoRR, abs/1602.02410, 2016.
  • Google Cloud. TPU: System architecture.
  • Zhengping Qian, Xiuwei Chen, Nanxi Kang, Mingcheng Chen, Yuan Yu, Thomas Moscibroda, and Zheng Zhang. MadLINQ: Large-scale distributed matrix computation for the cloud. In Proceedings of the 7th ACM European Conference on Computer Systems, EuroSys '12, 2012.
  • L. E. Cannon. A cellular computer to implement the Kalman Filter Algorithm. PhD thesis, Montana State University, 1969.
  • Ken Kennedy and Ulrich Kremer. Automatic data layout for distributed-memory machines. ACM Transactions on Programming Languages and Systems (TOPLAS), 20(4):869–916, 1998.
  • Ulrich Kremer. NP-completeness of dynamic remapping. In Proceedings of the Fourth Workshop on Compilers for Parallel Computers, Delft, The Netherlands, 1993.
  • Jingke Li and Marina Chen. Index domain alignment: Minimizing cost of cross-referencing between distributed arrays. In Frontiers of Massively Parallel Computation, 1990, 3rd Symposium on the, pages 424–433. IEEE, 1990.
  • Jingke Li and Marina Chen. The data alignment phase in compiling programs for distributed-memory machines. Journal of Parallel and Distributed Computing, 13(2):213–221, 1991.
  • Azalia Mirhoseini, Hieu Pham, Quoc V. Le, Benoit Steiner, Rasmus Larsen, Yuefeng Zhou, Naveen Kumar, Mohammad Norouzi, Samy Bengio, and Jeff Dean. Device placement optimization with reinforcement learning. arXiv preprint arXiv:1706.04972, 2017.
  • Azalia Mirhoseini, Anna Goldie, Hieu Pham, Benoit Steiner, Quoc V. Le, and Jeff Dean. A hierarchical model for device placement. In ICLR, 2018.
  • Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Haichen Shen, Meghan Cowan, Leyuan Wang, Yuwei Hu, Luis Ceze, Carlos Guestrin, and Arvind Krishnamurthy. TVM: An automated end-to-end optimizing compiler for deep learning. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), Carlsbad, CA, 2018. USENIX Association.
  • Nicolas Vasilache, Oleksandr Zinenko, Theodoros Theodoridis, Priya Goyal, Zachary DeVito, William S. Moses, Sven Verdoolaege, Andrew Adams, and Albert Cohen. Tensor Comprehensions: Framework-agnostic high-performance machine learning abstractions. arXiv:1802.04730v2, 2018.
  • Arnaud J. Venet. The gauge domain: Scalable analysis of linear inequality invariants. In International Conference on Computer Aided Verification, pages 139–154.
  • Radu Rugina and Martin Rinard. Symbolic bounds analysis of pointers, array indices, and accessed memory regions. In ACM SIGPLAN Notices, volume 35, pages 182–195. ACM, 2000.
  • Xueguang Wu, Liqian Chen, and Ji Wang. An abstract domain to infer symbolic ranges over nonnegative parameters. Electronic Notes in Theoretical Computer Science, 307:33–45, 2014.
  • Chien-Chin Huang, Qi Chen, Zhaoguo Wang, Russell Power, Jorge Ortiz, Jinyang Li, and Zhen Xiao. Spartan: A distributed array framework with smart tiling. In USENIX Annual Technical Conference, 2015.
  • J. A. Bondy and U. S. R. Murty. Graph Theory with Applications. Elsevier Science Publishing, 1976.
  • Minjie Wang, Chien-chin Huang, and Jinyang Li. Supporting very large models using automatic dataflow graph partitioning. arXiv preprint arXiv:1807.08887, 2018.
  • Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
  • Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
  • Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014.
  • John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159, 2011.
  • Taro Sekiyama, Takashi Imamichi, Haruki Imai, and Rudy Raymond. Profile-guided memory optimization for deep neural networks. arXiv preprint arXiv:1804.10001, 2018.
  • Minsoo Rhu, Natalia Gimelshein, Jason Clemons, Arslan Zulfiqar, and Stephen W. Keckler. vDNN: Virtualized deep neural networks for scalable, memory-efficient neural network design. In Microarchitecture (MICRO), 2016 49th Annual IEEE/ACM International Symposium on, pages 1–13. IEEE, 2016.
  • Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112, 2014.
  • Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538, 2017.
  • Alex Krizhevsky. One weird trick for parallelizing convolutional neural networks. arXiv:1404.5997, 2014.
  • Mu Li, David G. Andersen, Jun Woo Park, Alexander J. Smola, Amr Ahmed, Vanja Josifovski, James Long, Eugene J. Shekita, and Bor-Yiing Su. Scaling distributed machine learning with the parameter server. In USENIX OSDI, 2014.
  • H. Cui, J. Cipar, Q. Ho, J. K. Kim, S. Lee, A. Kumar, J. Wei, W. Dai, G. R. Ganger, P. B. Gibbons, G. A. Gibson, and E. P. Xing. Exploiting bounded staleness to speed up big data analytics. In USENIX Annual Technical Conference, 2014.
  • J. Wei, W. Dai, A. Qiao, H. Cui, Q. Ho, G. R. Ganger, P. B. Gibbons, G. A. Gibson, and E. P. Xing. Managed communication and consistency for fast data-parallel iterative analytics. In ACM Symposium on Cloud Computing (SoCC), 2015.
  • Henggang Cui, Hao Zhang, Gregory R. Ganger, Phillip B. Gibbons, and Eric P. Xing. GeePS: Scalable deep learning on distributed GPUs with a GPU-specialized parameter server. In EuroSys, 2016.
  • Minjie Wang, Tianjun Xiao, Jianpeng Li, Jiaxing Zhang, Chuntao Hong, and Zheng Zhang. Minerva: A scalable and highly efficient training platform for deep learning. In NIPS Workshop, Distributed Machine Learning and Matrix Computations, 2014.
  • Jin Kyu Kim, Qirong Ho, Seunghak Lee, Xun Zheng, Wei Dai, Garth Gibson, and Eric Xing. STRADS: A distributed framework for scheduled model parallel machine learning. In EuroSys, 2016.
  • Song Han, Jeff Pool, John Tran, and William Dally. Learning both weights and connections for efficient neural network. In Advances in Neural Information Processing Systems, pages 1135–1143, 2015.
  • Song Han, Huizi Mao, and William J. Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. arXiv preprint arXiv:1510.00149, 2015.
  • Yunchao Gong, Liu Liu, Ming Yang, and Lubomir Bourdev. Compressing deep convolutional networks using vector quantization. arXiv preprint arXiv:1412.6115, 2014.
  • Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Binarized neural networks. In Advances in Neural Information Processing Systems, pages 4107–4115, 2016.
  • Edward Anderson, Zhaojun Bai, J. Dongarra, A. Greenbaum, A. McKenney, Jeremy Du Croz, S. Hammerling, J. Demmel, C. Bischof, and Danny Sorensen. LAPACK: A portable linear algebra library for high-performance computers. In Proceedings of the 1990 ACM/IEEE Conference on Supercomputing, pages 2–11. IEEE Computer Society Press, 1990.
  • Jaeyoung Choi, Jack J. Dongarra, Roldan Pozo, and David W. Walker. ScaLAPACK: A scalable linear algebra library for distributed memory concurrent computers. In Frontiers of Massively Parallel Computation, 1992, Fourth Symposium on the, pages 120–127. IEEE, 1992.
  • Jack Poulson, Bryan Marker, Robert A. van de Geijn, Jeff R. Hammond, and Nichols A. Romero. Elemental: A new framework for distributed memory dense matrix computations. ACM Trans. Math. Softw., 39(2):13:1–13:24, February 2013.
  • Jaroslaw Nieplocha, Robert J. Harrison, and Richard J. Littlefield. Global Arrays: A nonuniform memory access programming model for high-performance computers. The Journal of Supercomputing, 10(2):169–189, 1996.
  • Satish Balay, William D. Gropp, Lois Curfman McInnes, and Barry F. Smith. Efficient management of parallelism in object oriented numerical software libraries. In E. Arge, A. M. Bruaset, and H. P. Langtangen, editors, Modern Software Tools in Scientific Computing, pages 163–202. Birkhäuser Press, 1997.
  • Robert A. van de Geijn and Jerrell Watts. SUMMA: Scalable universal matrix multiplication algorithm. Technical report, Austin, TX, USA, 1995.
  • Edgar Solomonik, Devin Matthews, Jeff R. Hammond, John F. Stanton, and James Demmel. A massively parallel tensor contraction framework for coupled-cluster computations. Journal of Parallel and Distributed Computing, 74(12):3176–3190, 2014.
  • Calvin Lin and Lawrence Snyder. ZPL: An array sublanguage. In Languages and Compilers for Parallel Computing, pages 96–114.
  • B. L. Chamberlain, D. Callahan, and H. P. Zima. Parallel programmability and the Chapel language. International Journal of High Performance Computing Applications, 2007.
  • UPC Consortium. UPC language specifications, v1.2. Technical report, Lawrence Berkeley National Lab, 2005.
  • Joe B. Buck, Noah Watkins, Jeff LeFevre, Kleoni Ioannidou, Carlos Maltzahn, Neoklis Polyzotis, and Scott Brandt. SciHadoop: Array-based query processing in Hadoop. In Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, 2011.
  • Murray Stokely, Farzan Rohani, and Eric Tassone. Large-scale parallel statistical forecasting computations in R. In JSM Proceedings, Section on Physical and Engineering Sciences, Alexandria, VA, 2011.
  • SparkR: R frontend for Spark. http://amplab-extras.github.io/SparkR-pkg.
  • Mingxing Zhang, Yongwei Wu, Kang Chen, Teng Ma, and Weimin Zheng. Measuring and optimizing distributed array programs. Proc. VLDB Endow., 9(12):912–923, August 2016.
  • Kevin J. Brown, HyoukJoong Lee, Tiark Rompf, Arvind K. Sujeeth, Christopher De Sa, Christopher Aberger, and Kunle Olukotun. Have abstraction and eat performance, too: Optimized heterogeneous computing with parallel patterns. In Proceedings of the 2016 International Symposium on Code Generation and Optimization, CGO '16, 2016.
  • Shivaram Venkataraman, Erik Bodzsar, Indrajit Roy, Alvin AuYoung, and Robert S. Schreiber. Presto: Distributed machine learning and graph processing with sparse matrices. In Proceedings of the 8th ACM European Conference on Computer Systems (EuroSys), 2013.
  • Tyler Denniston, Shoaib Kamil, and Saman Amarasinghe. Distributed Halide. In Principles and Practice of Parallel Programming (PPoPP), 2016.
  • Edgar Solomonik, Devin Matthews, Jeff Hammond, and James Demmel. Cyclops Tensor Framework: Reducing communication and eliminating load imbalance in massively parallel contractions. In Parallel & Distributed Processing (IPDPS), 2013 IEEE 27th International Symposium on, pages 813–824. IEEE, 2013.
  • So Hirata. Tensor contraction engine: Abstraction and automated parallel implementation of configuration-interaction, coupled-cluster, and many-body perturbation theories. The Journal of Physical Chemistry A, 107(46):9887–9897, 2003.
  • Stefan C. Müller, Gustavo Alonso, Adam Amara, and André Csillaghy. Pydron: Semi-automatic parallelization for multi-core and the cloud. In 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14), pages 645–659, Broomfield, CO, October 2014. USENIX Association.
  • David E. Hudak and Santosh G. Abraham. Compiler techniques for data partitioning of sequentially iterated parallel loops. In ACM SIGARCH Computer Architecture News, volume 18, pages 187–200. ACM, 1990.
  • Kathleen Knobe, Joan D. Lukas, and Guy L. Steele Jr. Data optimization: Allocation of arrays to reduce communication on SIMD machines. Journal of Parallel and Distributed Computing, 8(2):102–118, 1990.
  • Michael Philippsen. Automatic alignment of array data and processes to reduce communication time on DMPPs, volume 30. ACM, 1995.
  • Igor Z. Milosavljevic and Marwan A. Jabri. Automatic array alignment in parallel MATLAB scripts. In Parallel Processing, 1999, 13th International and 10th Symposium on Parallel and Distributed Processing, 1999 IPPS/SPDP Proceedings, pages 285–289. IEEE, 1999.
  • J. Ramanujam and P. Sadayappan. Compile-time techniques for data distribution in distributed memory machines. Parallel and Distributed Systems, IEEE Transactions on, 2(4):472–482, 1991.
  • J. Ramanujam and P. Sadayappan. A methodology for parallelizing programs for multicomputers and complex memory multiprocessors. In Proceedings of the 1989 ACM/IEEE Conference on Supercomputing, pages 637–646. ACM, 1989.
  • David Bau, Induprakas Kodukula, Vladimir Kotlyar, Keshav Pingali, and Paul Stodghill. Solving alignment using elementary linear algebra. In Languages and Compilers for Parallel Computing, pages 46–60.
  • Erik H. D'Hollander. Partitioning and labeling of index sets in do loops with constant dependence vectors. In 1989 International Conference on Parallel Processing, University Park, PA, 1989.
  • Chua-Huang Huang and Ponnuswamy Sadayappan. Communication-free hyperplane partitioning of nested loops. Journal of Parallel and Distributed Computing, 19(2):90–102, 1993.
  • Y.-J. Ju and H. Dietz. Reduction of cache coherence overhead by compiler data layout and loop transformation. In Languages and Compilers for Parallel Computing, pages 344–358.
  • Qingda Lu, Christophe Alias, Uday Bondhugula, Thomas Henretty, Sriram Krishnamoorthy, Jagannathan Ramanujam, Atanas Rountev, Ponnuswamy Sadayappan, Yongjian Chen, Haibo Lin, et al. Data layout transformation for enhancing data locality on NUCA chip multiprocessors. In Parallel Architectures and Compilation Techniques, 2009, PACT '09, 18th International Conference on, pages 348–357. IEEE, 2009.
  • Joseph E. Gonzalez, Yucheng Low, Haijie Gu, Danny Bickson, and Carlos Guestrin. PowerGraph: Distributed graph-parallel computation on natural graphs. In OSDI, 2012.
  • Jeff Dean and Sanjay Ghemawat. MapReduce: Simplified data processing on large clusters. In Symposium on Operating System Design and Implementation (OSDI), 2004.