AdaTune: Adaptive Tensor Program Compilation Made Efficient

NeurIPS 2020

Abstract

Deep learning models are computationally intense, and implementations often have to be highly optimized by experts or hardware vendors to be usable in practice. The DL compiler, together with Learning-to-Compile, has proven to be a powerful technique for optimizing tensor programs. However, a limitation of this approach is that it still suffers from a long overall optimization time.

Introduction
  • The enormous computational intensity of Deep Neural Network (DNN) models has attracted great interest in optimizing their performance.
  • AutoTVM optimizes the code by generating many versions of a tensor program and choosing the best through a simulated-annealing search over a large space of code transformation choices
  • It employs a learned cost model, trained on actual hardware performance measurements, to predict the performance of diverse inference computations on real hardware targets.
  • Recent work such as AutoTVM [14] extends the compilation pipeline with an additional black-box, target-dependent pass that uses learning machinery to perform the optimizations (a minimal sketch of this workflow follows this list)
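To make the tuning workflow above concrete, here is a minimal sketch of an AutoTVM-style tuning run, following TVM's public autotvm tutorial API (assuming a recent TVM release; exact module paths can differ between versions, and this is not the AdaTune implementation). The template registers a tunable matrix-multiply schedule, the knobs define the transformation space, and the XGBoost-based tuner profiles candidate versions on hardware.

```python
import tvm
from tvm import te, autotvm

@autotvm.template("example/matmul")          # register a tunable schedule template
def matmul(N, L, M, dtype):
    A = te.placeholder((N, L), name="A", dtype=dtype)
    B = te.placeholder((L, M), name="B", dtype=dtype)
    k = te.reduce_axis((0, L), name="k")
    C = te.compute((N, M), lambda i, j: te.sum(A[i, k] * B[k, j], axis=k), name="C")

    s = te.create_schedule(C.op)
    y, x = s[C].op.axis
    k = s[C].op.reduce_axis[0]

    cfg = autotvm.get_config()               # knobs span the code-transformation space
    cfg.define_split("tile_y", y, num_outputs=2)
    cfg.define_split("tile_x", x, num_outputs=2)
    yo, yi = cfg["tile_y"].apply(s, C, y)
    xo, xi = cfg["tile_x"].apply(s, C, x)
    s[C].reorder(yo, xo, k, yi, xi)
    return s, [A, B, C]

task = autotvm.task.create("example/matmul", args=(512, 512, 512, "float32"), target="llvm")
measure_option = autotvm.measure_option(
    builder=autotvm.LocalBuilder(),
    runner=autotvm.LocalRunner(number=5, repeat=3),   # each candidate is profiled on hardware
)
tuner = autotvm.tuner.XGBTuner(task)                  # learned cost model + simulated-annealing proposer
tuner.tune(n_trial=200,
           measure_option=measure_option,
           callbacks=[autotvm.callback.log_to_file("matmul_tuning.log")])
```

Each trial proposes a knob configuration, compiles the resulting tensor program, and measures it on the target, which is exactly the profiling cost that AdaTune aims to reduce.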
Highlights
  • The enormous computational intensity of Deep Neural Network (DNN) models has attracted great interest in optimizing their performance
  • A prior study [14] claims that considering variance does not help; it uses a regression model (e.g., XGBoost) to learn and predict the performance of a transformation plan. In contrast, we find that in practice, depending on the scenario, a model might need to be optimized against different hardware, including x86 CPUs [43, 49], GPUs [21], ARM, and various ML accelerators [31, 25], all of which have very different architectures and exhibit very different variance behaviors
  • For AdaTune, we propose two further improvements: (1) we create a surrogate model with uncertainty quantification, which takes both mean and variance into consideration to adapt performance modeling and drives the exploration of the transformation space by continuously gathering feedback on the quality of the explored transformation plans; (2) we introduce a contextual simulated-annealing optimizer, which dynamically balances the trade-off between exploration and exploitation based on the expected improvement from the surrogate model (an illustrative sketch follows this list)
  • We include four tasks: one convolutional layer sampled from ResNet-18 [20] and one batched GEMM operator from the Transformer [41], on both a CPU (Intel Xeon E5-2690 v3 @ 2.60 GHz) and a GPU (Nvidia Tesla P100)
  • Although highly optimized code can be achieved through existing deep learning (DL) compilers, an obvious drawback is their long code optimization time, which is required to generate many versions of a tensor program and to profile these versions on hardware
  • The experimental results show that AdaTune obtains up to 115% higher GFLOPS than the baseline under the same optimization time budget
  • In this paper, we have introduced a method, called AdaTune, to make the code optimization process in DL compilers more adaptive to different hardware and models
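The surrogate-plus-optimizer idea in the bullets above can be illustrated with a small, self-contained sketch (not the authors' code; the random-forest ensemble and the helper names below are illustrative assumptions): the per-tree spread supplies a variance estimate, and expected improvement scores candidate transformation plans so that both high predicted throughput and high uncertainty attract the search.

```python
import numpy as np
from scipy.stats import norm
from sklearn.ensemble import RandomForestRegressor

class UncertaintySurrogate:
    """Illustrative uncertainty-aware surrogate over encoded transformation plans."""
    def __init__(self, n_trees=100):
        self.model = RandomForestRegressor(n_estimators=n_trees)

    def fit(self, X, y):
        # X: feature-encoded transformation plans, y: measured throughput (e.g., GFLOPS)
        self.model.fit(X, y)

    def predict(self, X):
        # Mean and std across the individual trees approximate predictive uncertainty.
        per_tree = np.stack([t.predict(X) for t in self.model.estimators_])
        return per_tree.mean(axis=0), per_tree.std(axis=0) + 1e-9

def expected_improvement(mu, sigma, best_so_far):
    # EI for maximization: how much a candidate is expected to beat the best plan so far.
    z = (mu - best_so_far) / sigma
    return (mu - best_so_far) * norm.cdf(z) + sigma * norm.pdf(z)

# Usage: rank unevaluated plans by EI; a high predicted mean or a high uncertainty
# both raise the score, balancing exploitation and exploration during the search.
# surrogate = UncertaintySurrogate(); surrogate.fit(X_seen, y_seen)
# mu, sigma = surrogate.predict(X_candidates)
# scores = expected_improvement(mu, sigma, y_seen.max())
```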
Results
  • The authors evaluate AdaTune experimentally, seeking answers to how AdaTune helps accelerate the optimization process.
  • The authors integrate AdaTune with TVM [13] and use AutoTVM [14] as the baseline for comparison.
  • Section 5.1 compares AutoTVM and AdaTune for searching the transformation space.
  • The authors compare the performance of AutoTVM and AdaTune in terms of how much optimization speedup is obtained as a function of the wall-clock time.
  • Note that the predicted performance is only used during the transformation-space search; the authors report real measured latency in the end-to-end evaluation results (a small sketch of how throughput is derived from measured latency follows this list)
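As referenced above, throughput numbers such as GFLOPS can be derived from the measured latency of an operator. The sketch below uses the batched GEMM task as an example; the shapes and batch size are illustrative placeholders, not the paper's exact settings.

```python
def gemm_gflops(batch, n, l, m, latency_s):
    # A batched (n x l) * (l x m) matmul performs 2*n*l*m floating-point ops per batch item.
    flops = 2.0 * batch * n * l * m
    return flops / latency_s / 1e9

# e.g., a hypothetical (16, 512, 512, 512) batched GEMM measured at 2.5 ms:
print(gemm_gflops(16, 512, 512, 512, 2.5e-3))  # ~1718 GFLOPS
```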
Conclusion
  • Although highly optimized code can be achieved through existing DL compilers, an obvious drawback is their long code optimization time, which is required to generate many versions of a tensor program and to profile these versions on hardware.
  • In this paper the authors have introduced a method, called AdaTune, to make the code optimization process in DL compilers more adaptive to different hardware and models.
  • The adaptive evaluator allows the hardware measurement cost to be cut significantly without losing much accuracy (see the sketch after this list).
  • The uncertainty-aware surrogate model and the contextual optimizer allow them to more efficiently explore the transformation space.
  • AdaTune achieves higher speedups in finding good transformation plans on different types of hardware and models, outperforming AutoTVM, a state-of-the-art approach.
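As referenced in the list above, the adaptive-evaluator idea can be sketched as follows (an illustration of the concept only, not the paper's implementation; `run_once`, the tolerances, and the confidence-interval rule are assumptions): keep taking hardware measurements for a candidate until the estimate of its mean latency is statistically stable, so low-variance candidates need far fewer runs.

```python
import statistics

def adaptive_measure(run_once, max_runs=20, min_runs=3, rel_tol=0.05):
    """run_once() executes the candidate kernel once and returns its latency in seconds."""
    samples = []
    for _ in range(max_runs):
        samples.append(run_once())
        if len(samples) >= min_runs:
            mean = statistics.fmean(samples)
            stderr = statistics.stdev(samples) / len(samples) ** 0.5
            # Roughly a 95% confidence half-width on the mean latency; stop
            # measuring once it is small relative to the mean itself.
            if 1.96 * stderr <= rel_tol * mean:
                break
    return statistics.fmean(samples), len(samples)
```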
Tables
  • Table1: Example of TVM knobs
Funding
  • None of the authors is funded by any other agency.
References
  • [1] Forest Confidence Interval. http://contrib.scikit-learn.org/
  • [2] Intel(R) Math Kernel Library for Deep Neural Networks. https://github.com/01org/mkl-dnn.
  • [3] Nvidia A100 Tensor Core GPU Architecture.
  • [4] The Accelerated Linear Algebra Compiler Framework. https://www.tensorflow.org/performance/xla/.
  • [5] Turing-NLG: A 17-billion-parameter language model by Microsoft. https://www.microsoft.com/en-us/research/blog/
  • [6] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek G. Murray, Benoit Steiner, Paul Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: A System for Large-scale Machine Learning. In Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation, OSDI '16, pages 265–283, 2016.
  • [7] Andrew Adams, Karima Ma, Luke Anderson, Riyadh Baghdadi, Tzu-Mao Li, Michaël Gharbi, Benoit Steiner, Steven Johnson, Kayvon Fatahalian, Frédo Durand, and Jonathan Ragan-Kelley. Learning to optimize Halide with tree search and random programs. ACM Trans. Graph., 38(4):121:1–121:12, 2019.
  • [8] Byung Hoon Ahn, Prannoy Pilligundla, Amir Yazdanbakhsh, and Hadi Esmaeilzadeh. Chameleon: Adaptive code optimization for expedited deep neural network compilation. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020, 2020.
  • [9] Jason Ansel, Shoaib Kamil, Kalyan Veeramachaneni, Jonathan Ragan-Kelley, Jeffrey Bosboom, Una-May O'Reilly, and Saman P. Amarasinghe. OpenTuner: an extensible framework for program autotuning. In José Nelson Amaral and Josep Torrellas, editors, International Conference on Parallel Architectures and Compilation, PACT '14, Edmonton, AB, Canada, August 24-27, 2014, pages 303–316. ACM, 2014.
  • [10] Amir H. Ashouri, William Killian, John Cavazos, Gianluca Palermo, and Cristina Silvano. A survey on compiler autotuning using machine learning. ACM Comput. Surv., 51(5):96:1–96:42, 2019.
  • [11] Eric Brochu, Vlad M. Cora, and Nando de Freitas. A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. CoRR, abs/1012.2599, 2010.
  • [12] Tianqi Chen, Mu Li, Yutian Li, Min Lin, Naiyan Wang, Minjie Wang, Tianjun Xiao, Bing Xu, Chiyuan Zhang, and Zheng Zhang. MXNet: A Flexible and Efficient Machine Learning Library for Heterogeneous Distributed Systems. arXiv preprint arXiv:1512.01274, 2015.
  • [13] Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Q. Yan, Haichen Shen, Meghan Cowan, Leyuan Wang, Yuwei Hu, Luis Ceze, Carlos Guestrin, and Arvind Krishnamurthy. TVM: an automated end-to-end optimizing compiler for deep learning. In 13th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2018, Carlsbad, CA, USA, October 8-10, 2018, pages 578–594, 2018.
  • [14] Tianqi Chen, Lianmin Zheng, Eddie Q. Yan, Ziheng Jiang, Thierry Moreau, Luis Ceze, Carlos Guestrin, and Arvind Krishnamurthy. Learning to optimize tensor programs. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, 3-8 December 2018, Montréal, Canada, pages 3393–3404, 2018.
  • [15] Sharan Chetlur, Cliff Woolley, Philippe Vandermersch, Jonathan Cohen, John Tran, Bryan Catanzaro, and Evan Shelhamer. cuDNN: Efficient Primitives for Deep Learning. arXiv preprint arXiv:1410.0759, 2014.
  • [16] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, pages 4171–4186, 2019.
  • [17] Stefan Falkner, Aaron Klein, and Frank Hutter. BOHB: robust and efficient hyperparameter optimization at scale. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, pages 1436–1445, 2018.
  • [18] Shanghua Gao, Ming-Ming Cheng, Kai Zhao, Xinyu Zhang, Ming-Hsuan Yang, and Philip H. S. Torr. Res2Net: A new multi-scale backbone architecture. CoRR, abs/1904.01169, 2019.
  • [19] Priya Goyal, Piotr Dollár, Ross B. Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch SGD: training ImageNet in 1 hour. CoRR, abs/1706.02677, 2017.
  • [20] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
  • [21] Connor Holmes, Daniel Mawhirter, Yuxiong He, Feng Yan, and Bo Wu. GRNN: low-latency and scalable RNN inference on GPUs. In Proceedings of the Fourteenth EuroSys Conference 2019, Dresden, Germany, March 25-28, 2019, pages 41:1–41:16, 2019.
  • [22] Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. MobileNets: Efficient convolutional neural networks for mobile vision applications. CoRR, abs/1704.04861, 2017.
  • [23] Frank Hutter, Holger H. Hoos, and Kevin Leyton-Brown. Sequential model-based optimization for general algorithm configuration. In Learning and Intelligent Optimization - 5th International Conference, LION 5, Rome, Italy, January 17-21, 2011, Selected Papers, pages 507–523, 2011.
  • [24] Forrest N. Iandola, Matthew W. Moskewicz, Khalid Ashraf, Song Han, William J. Dally, and Kurt Keutzer. SqueezeNet: AlexNet-level Accuracy with 50x Fewer Parameters and <1MB Model Size. arXiv preprint arXiv:1602.07360, 2016.
  • [26] Chris Lattner, Jacques Pienaar, Mehdi Amini, Uday Bondhugula, River Riddle, Albert Cohen, Tatiana Shpeisman, Andy Davis, Nicolas Vasilache, and Oleksandr Zinenko. MLIR: A compiler infrastructure for the end of Moore's law. arXiv preprint arXiv:2002.11054, 2020.
  • [27] Lisha Li, Kevin G. Jamieson, Giulia DeSalvo, Afshin Rostamizadeh, and Ameet Talwalkar. Hyperband: Bandit-based configuration evaluation for hyperparameter optimization. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings, 2017.
  • [28] Shuang Liang, Shouyi Yin, Leibo Liu, Wayne Luk, and Shaojun Wei. FP-BNN: binarized neural network on FPGA. Neurocomputing, 275:1072–1086, 2018.
  • [29] Ji Lin, Chuang Gan, and Song Han. Training kinetics in 15 minutes: Large-scale distributed training on videos. arXiv preprint arXiv:1910.00932, 2019.
  • [30] Changxi Liu, Hailong Yang, Rujun Sun, Zhongzhi Luan, Lin Gan, Guangwen Yang, and Depei Qian. swTVM: exploring the automated compilation for deep learning on Sunway architecture. arXiv preprint arXiv:1904.07404, 2019.
  • [31] Thierry Moreau, Tianqi Chen, Ziheng Jiang, Luis Ceze, Carlos Guestrin, and Arvind Krishnamurthy. VTA: an open hardware-software stack for deep learning. CoRR, abs/1807.04188, 2018.
  • [32] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, 8-14 December 2019, Vancouver, BC, Canada, pages 8024–8035, 2019.
  • [33] Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, et al. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12(Oct):2825–2830, 2011.
  • [34] Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. 2019.
  • [35] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. CoRR, abs/1910.10683, 2019.
  • [36] Jonathan Ragan-Kelley, Connelly Barnes, Andrew Adams, Sylvain Paris, Frédo Durand, and Saman Amarasinghe. Halide: A Language and Compiler for Optimizing Parallelism, Locality, and Recomputation in Image Processing Pipelines. In Proceedings of the 34th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI '13, pages 519–530, 2013.
  • [37] Nadav Rotem, Jordan Fix, Saleem Abdulrasool, Summer Deng, Roman Dzhabarov, James Hegeman, Roman Levenstein, Bert Maher, Nadathur Satish, Jakob Olesen, Jongsoo Park, Artem Rakhov, and Misha Smelyanskiy. Glow: Graph lowering compiler techniques for neural networks. CoRR, abs/1805.00907, 2018.
  • [38] Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-LM: Training multi-billion parameter language models using model parallelism. CoRR, abs/1909.08053, 2019.
  • [39] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015.
  • [40] Nicolas Vasilache, Oleksandr Zinenko, Theodoros Theodoridis, Priya Goyal, Zachary DeVito, William S. Moses, Sven Verdoolaege, Andrew Adams, and Albert Cohen. Tensor comprehensions: Framework-agnostic high-performance machine learning abstractions. CoRR, abs/1802.04730, 2018.
  • [41] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, pages 5998–6008, 2017.
  • [42] Zheng Wang and Michael F. P. O'Boyle. Machine learning in compiler optimization. Proc. IEEE, 106(11):1879–1901, 2018.
  • [43] Carole-Jean Wu, David Brooks, Kevin Chen, Douglas Chen, Sy Choudhury, Marat Dukhan, Kim Hazelwood, Eldad Isaac, Yangqing Jia, Bill Jia, et al. Machine learning at Facebook: Understanding inference at the edge. In 2019 IEEE International Symposium on High Performance Computer Architecture (HPCA), pages 331–344. IEEE, 2019.
  • [45] Zifeng Wu, Chunhua Shen, and Anton van den Hengel. Wider or deeper: Revisiting the ResNet model for visual recognition. Pattern Recognit., 90:119–133, 2019.
  • [46] Masafumi Yamazaki, Akihiko Kasagi, Akihiro Tabuchi, Takumi Honda, Masahiro Miwa, Naoto Fukumoto, Tsuguchika Tabaru, Atsushi Ike, and Kohta Nakashima. Yet another accelerated SGD: ResNet-50 training on ImageNet in 74.7 seconds. CoRR, abs/1903.12650, 2019.
  • [47] Yang You, Igor Gitman, and Boris Ginsburg. Large batch training of convolutional networks. arXiv preprint arXiv:1708.03888, 2017.
  • [48] Yang You, Jing Li, Jonathan Hseu, Xiaodan Song, James Demmel, and Cho-Jui Hsieh. Reducing BERT pre-training time from 3 days to 76 minutes. CoRR, abs/1904.00962, 2019.
  • [49] Minjia Zhang, Samyam Rajbhandari, Wenhan Wang, Elton Zheng, Olatunji Ruwase, Jeff Rasley, Jason Li, Junhua Wang, and Yuxiong He. Accelerating large scale deep learning inference through DeepCPU at Microsoft. In 2019 USENIX Conference on Operational Machine Learning, OpML 2019, Santa Clara, CA, USA, May 20, 2019, pages 5–7, 2019.
Authors
Menghao Li
Minjia Zhang
Chi Wang
Mingqin Li