Low latency RNN inference with cellular batching

EuroSys '18: Proceedings of the Thirteenth EuroSys Conference, Porto, Portugal, April 2018, pp. 1–15.

Cited by 24 · Viewed 44
Indexed in: EI
Other links: dl.acm.org
Abstract

Performing inference on pre-trained neural network models must meet low-latency requirements, which are often at odds with achieving high throughput. Existing deep learning systems use batching to improve throughput, but batching does not perform well when serving Recurrent Neural Networks (RNNs) with dynamic dataflow graphs. We propose the technique… [truncated]

Introduction
  • Deep learning methods have rapidly matured from experimental research to real-world deployments.
  • Let $f_\theta$ be a function parameterized by $\theta$. An RNN represents the recursive computation $h^{(t)} = f_\theta(h^{(t-1)}, x^{(t)})$, where $h^{(t)}$ is the value of the hidden unit after processing the input sequence up to the $t$-th position.
  • A basic Seq2Seq model contains two types of RNN cells, encoder and decoder, as depicted in Figure 12.
  • The encoder and decoder each convert a word to a vector via an embedding lookup and feed that vector to an RNN cell (a minimal sketch of the recurrence follows this list).
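To make the recurrence concrete, here is a minimal NumPy sketch of $h^{(t)} = f_\theta(h^{(t-1)}, x^{(t)})$. A plain tanh cell and the sizes below are chosen for illustration only; the paper's applications use LSTM cells, but the recursive structure is the same.

```python
import numpy as np

def rnn_cell(h_prev, x_t, W_h, W_x, b):
    """One step of the recurrence h(t) = f_theta(h(t-1), x(t)).

    A plain tanh cell for illustration; an LSTM cell would add gates
    and a memory cell, but the chain structure is identical."""
    return np.tanh(h_prev @ W_h + x_t @ W_x + b)

hidden, embed, seq_len = 1024, 256, 10        # illustrative sizes
rng = np.random.default_rng(0)
W_h = rng.normal(scale=0.01, size=(hidden, hidden))
W_x = rng.normal(scale=0.01, size=(embed, hidden))
b = np.zeros(hidden)

h = np.zeros(hidden)                          # h(0)
for x_t in rng.normal(size=(seq_len, embed)): # embedded input tokens
    h = rnn_cell(h, x_t, W_h, W_x, b)         # h now summarizes x(1..t)
```

Because each input unrolls this loop a different number of times, two requests of different lengths execute different numbers of cell steps, which is exactly what makes whole-request batching awkward.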
Highlights
  • In recent years, deep learning methods have rapidly matured from experimental research to real-world deployments.
  • We present a novel approach, called cellular batching, to achieve low-latency inference on Recurrent Neural Network (RNN) models. Cellular batching batches the execution of inference requests at the granularity of an RNN cell (see the scheduling sketch after this list).
  • Experiments on three popular RNN applications using real-world datasets show that BatchMaker reduces latency by 17.5–90.5% and improves throughput by 25–80% compared with state-of-the-art systems including TensorFlow, MXNet, TensorFlow Fold, and DyNet.
  • We note that cellular batching is only beneficial for RNN inference.
  • It does not improve the performance of training because, unlike inference, all training inputs are ready at the same time and the weight-update algorithm typically requires waiting for all inputs within a batch to finish.
  • Our evaluation shows that BatchMaker benefits workloads whose inputs vary in length or structure; we hypothesize that cellular batching would not improve inference for deep neural networks (DNNs) with fixed-shape inputs, such as Convolutional Neural Networks (CNNs) and Multilayer Perceptrons (MLPs).
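The following sketch illustrates the idea of batching at cell granularity: each loop iteration runs one batched cell step, so a newly arrived request joins at the very next cell boundary and a finished request returns immediately, with no padding to a common length. This is a simplified simulation of the concept under stated assumptions (the `Request` class, `MAX_BATCH` cap, and `run_cell` stub are illustrative), not BatchMaker's actual implementation.

```python
from collections import deque

MAX_BATCH = 32  # illustrative batch-size cap

class Request:
    def __init__(self, rid, num_cells):
        self.rid = rid              # request id
        self.num_cells = num_cells  # cell steps this input needs
        self.step = 0               # cell steps executed so far

def serve(pending, run_cell):
    """Event loop that batches at RNN-cell granularity (a sketch of
    cellular batching, not BatchMaker's real scheduler)."""
    active = []
    while pending or active:
        # Admit new arrivals at a cell boundary, up to the batch cap.
        while pending and len(active) < MAX_BATCH:
            active.append(pending.popleft())
        run_cell(active)            # one batched cell execution on the GPU
        still_running = []
        for req in active:
            req.step += 1
            if req.step < req.num_cells:
                still_running.append(req)
            else:                   # done: return without waiting for others
                print(f"request {req.rid} done after {req.num_cells} cells")
        active = still_running

# Three requests of different lengths; cell execution is stubbed out here.
serve(deque(Request(i, n) for i, n in enumerate([3, 7, 5])),
      run_cell=lambda batch: None)
```

The contrast with whole-request batching is that short requests here never wait for the longest request in their batch to finish, which is the source of the latency reduction the paper reports.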
Methods
  • The authors run the tests on a Linux server with four NVIDIA Tesla V100 GPUs connected by NVLink; each GPU has 16 GB of memory.
  • The operating system is Ubuntu 16.04.1 LTS with Linux kernel version 4.13.0.
  • The NVIDIA CUDA Toolkit version is 9.0.
  • Applications, datasets, and workloads: the authors choose three popular RNN applications, LSTM, Seq2Seq, and TreeLSTM.
  • All RNN cells used in these applications have a hidden state size of 1024 (see the sketch after this list).
  • LSTM and Seq2Seq are both chain-structured RNNs; the authors use the WMT-15 dataset [42] for them.
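For concreteness, a minimal PyTorch sketch of the cell configuration described above (hidden state size 1024). PyTorch is used here purely for illustration and is not the system under evaluation; setting the input size equal to the hidden size is an assumption of this sketch.

```python
import torch
import torch.nn as nn

HIDDEN = 1024  # hidden state size used by all RNN cells in the evaluation

# A chain-structured cell as in the LSTM and Seq2Seq benchmarks; a TreeLSTM
# cell would instead combine the hidden states of a node's children.
cell = nn.LSTMCell(input_size=HIDDEN, hidden_size=HIDDEN)

x = torch.zeros(1, HIDDEN)                        # one embedded token
h, c = torch.zeros(1, HIDDEN), torch.zeros(1, HIDDEN)
h, c = cell(x, (h, c))                            # one cell step
```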
Results
  • The authors evaluate BatchMaker on microbenchmarks and several popular RNN applications with real-world datasets.
  • The evaluation shows that BatchMaker provides significant performance advantages over existing systems.
  • BatchMaker achieves much lower latency than existing systems.
  • The authors reduce the 90th-percentile latency by 37.5–90.5% and 17.5–82.6% compared to TensorFlow and MXNet, respectively. For TreeLSTM, they reduce the 90th-percentile latency by 28% and 87% compared to DyNet and TensorFlow Fold, respectively.
  • BatchMaker provides good throughput improvements.
  • The throughput improvement over MXNet and TensorFlow is 25% and 60%, respectively.
  • For TreeLSTM, the throughput of BatchMaker is 1.8× that of DyNet and 4× that of TensorFlow Fold.
Conclusion
  • The authors present a novel approach, called cellular batching, to achieve low-latency inference on Recurrent Neural Network models. Cellular batching batches the execution of inference requests at the granularity of an RNN cell.
  • Experiments on three popular RNN applications using real-world datasets show that BatchMaker reduces latency by 17.5–90.5% and improves throughput by 25–80% compared with state-of-the-art systems including TensorFlow, MXNet, TensorFlow Fold, and DyNet.
  • The authors note that cellular batching is only beneficial for RNN inference.
  • It does not improve the performance of training because, unlike inference, all training inputs are ready at the same time and the weight-update algorithm typically requires waiting for all inputs within a batch to finish.
Related Work
  • Batching via padding. Theano [5], Caffe [25], TensorFlow [1], MXNet [7], Torch [8], PyTorch [34], and CNTK [13] are widely used deep learning frameworks. Theano, TensorFlow, MXNet, and CNTK require users to build a static dataflow graph before training or inference. PyTorch is more imperative and allows the computation graph to be built dynamically as execution happens [41]. Gluon [9] is a recent package for MXNet supporting dynamic computation graphs. When handling variable-sized inputs, all of these systems support batching via padding. CNTK [18] additionally introduces an optimization that tries to fill padded space with shorter requests, which can improve throughput by reducing the computation wasted on padding. As mentioned earlier, padding does not work for non-chain-structured RNNs such as TreeLSTM, so these systems do not natively support batching for it. A small sketch of padding-based batching follows.
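To make the padding overhead concrete, here is a small illustrative sketch (not taken from any of the cited frameworks) of batching variable-length token sequences by padding to the longest one:

```python
import numpy as np

def pad_batch(sequences, pad_id=0):
    """Batch variable-length sequences by padding to the longest one.

    Every padded position still flows through the RNN cell, so the
    wasted computation grows with the length spread of the batch."""
    max_len = max(len(s) for s in sequences)
    batch = np.full((len(sequences), max_len), pad_id, dtype=np.int64)
    for i, s in enumerate(sequences):
        batch[i, : len(s)] = s
    return batch

batch = pad_batch([[4, 9, 2], [7, 1], [3, 8, 6, 5, 2]])
# 15 cell steps are executed for 10 real tokens: a third of the work
# is padding. TreeLSTM inputs are trees, not sequences, so no padding
# scheme of this kind applies to them at all.
```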
Funding
  • This work is supported by the NVIDIA AI Lab (NVAIL) and GPU Center of Excellence, the National Key Research & Development Program of China (2016YFB1000504), the Natural Science Foundation of China (61433008, 61373145, 61572280, 61133004, 61502019, U1435216), and the National Basic Research (973) Program of China (2014CB340402).
  • Pin Gao's work is also supported by the China Scholarship Council.
References
  • [1] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. 2016. TensorFlow: A System for Large-Scale Machine Learning. In OSDI, Vol. 16. 265–283.
  • [2] Deepak Agarwal, Bo Long, Jonathan Traupman, Doris Xin, and Liang Zhang. 2014. LASER: A scalable response prediction platform for online advertising. In Proceedings of the 7th ACM International Conference on Web Search and Data Mining. ACM, 173–182.
  • [3] Dario Amodei, Sundaram Ananthanarayanan, Rishita Anubhai, Jingliang Bai, Eric Battenberg, Carl Case, Jared Casper, Bryan Catanzaro, Qiang Cheng, Guoliang Chen, et al. 2016. Deep Speech 2: End-to-end speech recognition in English and Mandarin. In International Conference on Machine Learning. 173–182.
  • [4] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014).
  • [5] James Bergstra, Olivier Breuleux, Frédéric Bastien, Pascal Lamblin, Razvan Pascanu, Guillaume Desjardins, Joseph Turian, David Warde-Farley, and Yoshua Bengio. 2010. Theano: A CPU and GPU math compiler in Python. In Proc. 9th Python in Science Conf. 1–7.
  • [6] George Candea, Neoklis Polyzotis, and Radek Vingralek. 2009. A scalable, predictable join operator for highly concurrent data warehouses. Proceedings of the VLDB Endowment 2, 1 (2009), 277–288.
  • [7] Tianqi Chen, Mu Li, Yutian Li, Min Lin, Naiyan Wang, Minjie Wang, Tianjun Xiao, Bing Xu, Chiyuan Zhang, and Zheng Zhang. 2015. MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems. arXiv preprint arXiv:1512.01274 (2015).
  • [8] Ronan Collobert, Koray Kavukcuoglu, and Clément Farabet. 2011. Torch7: A Matlab-like environment for machine learning. In BigLearn, NIPS Workshop.
  • [9] Distributed (Deep) Machine Learning Community. 2017. The Gluon Package. http://gluon.mxnet.io. (2017).
  • [10] Distributed (Deep) Machine Learning Community. 2017. NNVM: Open Compiler for AI Frameworks. https://github.com/dmlc/nnvm/. (2017).
  • [11] Tensorflow Community. 2017. TensorFlow XLA. https://www.tensorflow.org/performance/xla/. (2017).
  • [12] The Wikipedia Community. 2017. Argmax. https://en.wikipedia.org/wiki/Arg_max. (2017).
  • [13] Microsoft Corporation. 2015. Microsoft Cognitive Toolkit (CNTK). https://github.com/Microsoft/CNTK. (2015).
  • [14] Nvidia Corporation. 2017. TensorRT. https://developer.nvidia.com/tensorrt. (2017).
  • [15] Nvidia Corporation. 2018. CUDA Runtime API. http://docs.nvidia.com/cuda/cuda-runtime-api/index.html. (2018).
  • [16] Daniel Crankshaw, Peter Bailis, Joseph E Gonzalez, Haoyuan Li, Zhao Zhang, Michael J Franklin, Ali Ghodsi, and Michael I Jordan. 2014. The missing piece in complex analytics: Low latency, scalable model management and serving with Velox. arXiv preprint arXiv:1409.3809 (2014).
  • [17] Daniel Crankshaw, Xin Wang, Guilio Zhou, Michael J Franklin, Joseph E Gonzalez, and Ion Stoica. 2017. Clipper: A Low-Latency Online Prediction Serving System. In NSDI.
  • [18] William Darling. 2016. https://www.microsoft.com/en-us/cognitive-toolkit/blog/2016/08/. (2016).
  • [19] Greg Diamos, Shubho Sengupta, Bryan Catanzaro, Mike Chrzanowski, Adam Coates, Erich Elsen, Jesse Engel, Awni Hannun, and Sanjeev Satheesh. 2016. Persistent RNNs: Stashing recurrent weights on-chip. In International Conference on Machine Learning. 2024–2033.
  • [20] Jeffrey Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, and Trevor Darrell. 2015. Long-term recurrent convolutional networks for visual recognition and description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
  • [21] Georgios Giannikis, Gustavo Alonso, and Donald Kossmann. 2012. SharedDB: Killing one thousand queries with one stone. Proceedings of the VLDB Endowment 5, 6 (2012), 526–537.
  • [22] Awni Hannun, Carl Case, Jared Casper, Bryan Catanzaro, Greg Diamos, Erich Elsen, et al. 2014. Deep Speech: Scaling up end-to-end speech recognition. arXiv preprint arXiv:1412.5567 (2014).
  • [23] Stavros Harizopoulos, Vladislav Shkapenyuk, and Anastassia Ailamaki. 2005. QPipe: A simultaneously pipelined relational query engine. In Proceedings of the 2005 ACM SIGMOD International Conference on Management of Data. ACM, 383–394.
  • [24] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation 9, 8 (1997), 1735–1780.
  • [25] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. 2014. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the 22nd ACM International Conference on Multimedia. 675–678.
  • [26] Moshe Looks, Marcello Herreshoff, DeLesley Hutchins, and Peter Norvig. 2017. Deep learning with dynamic computation graphs. arXiv preprint arXiv:1702.02181 (2017).
  • [27] Darko Makreshanski, Georgios Giannikis, Gustavo Alonso, and Donald Kossmann. 2017. Many-query join: Efficient shared execution of relational joins on modern hardware. The VLDB Journal (2017), 1–24.
  • [28] Marco Marelli, Luisa Bentivogli, Marco Baroni, Raffaella Bernardi, Stefano Menini, and Roberto Zamparelli. 2014. SemEval-2014 Task 1: Evaluation of compositional distributional semantic models on full sentences through semantic relatedness and textual entailment. In Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014). 1–8.
  • [29] Xiangrui Meng, Joseph Bradley, Burak Yavuz, Evan Sparks, Shivaram Venkataraman, Davies Liu, Jeremy Freeman, DB Tsai, Manish Amde, Sean Owen, et al. 2016. MLlib: Machine learning in Apache Spark. The Journal of Machine Learning Research 17, 1 (2016), 1235–1241.
  • [30] Graham Neubig, Chris Dyer, Yoav Goldberg, Austin Matthews, Waleed Ammar, Antonios Anastasopoulos, Miguel Ballesteros, David Chiang, Daniel Clothiaux, Trevor Cohn, et al. 2017. DyNet: The Dynamic Neural Network Toolkit. arXiv preprint arXiv:1701.03980 (2017).
  • [31] Graham Neubig, Yoav Goldberg, and Chris Dyer. 2017. On-the-fly Operation Batching in Dynamic Computation Graphs. In Advances in Neural Information Processing Systems.
  • [32] Christopher Olston, Noah Fiedel, Kiril Gorovoy, Jeremiah Harmsen, Li Lao, Fangwei Li, Vinu Rajashekhar, Sukriti Ramesh, and Jordan Soyke. 2017. TensorFlow-Serving: Flexible, High-Performance ML Serving. arXiv preprint arXiv:1712.06139 (2017).
  • [33] Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan. 2002. Thumbs up? Sentiment classification using machine learning techniques. In Proceedings of the ACL-02 Conference on Empirical Methods in Natural Language Processing, Volume 10. Association for Computational Linguistics, 79–86.
  • [34] Adam Paszke, Sam Gross, Soumith Chintala, and Gregory Chanan. 2017. PyTorch. http://pytorch.org/. (2017).
  • [35] Lin Qiao, Vijayshankar Raman, Frederick Reiss, Peter J Haas, and Guy M Lohman. 2008. Main-memory scan sharing for multi-core CPUs. Proceedings of the VLDB Endowment 1, 1 (2008), 610–621.
  • [36] Richard Socher, Cliff C Lin, Chris Manning, and Andrew Y Ng. 2011. Parsing natural scenes and natural language with recursive neural networks. In Proceedings of the 28th International Conference on Machine Learning (ICML-11). 129–136.
  • [37] Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D Manning, Andrew Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing. 1631–1642.
  • [38] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems. 3104–3112.
  • [39] Kai Sheng Tai, Richard Socher, and Christopher D Manning. 2015. Improved semantic representations from tree-structured long short-term memory networks. arXiv preprint arXiv:1503.00075 (2015).
  • [40] Ming Tan, Cicero dos Santos, Bing Xiang, and Bowen Zhou. 2015. LSTM-based deep learning models for non-factoid answer selection. arXiv preprint arXiv:1511.04108 (2015).
  • [41] Seiya Tokui, Kenta Oono, Shohei Hido, and Justin Clayton. 2015. Chainer: A next-generation open source framework for deep learning. In Proceedings of the Workshop on Machine Learning Systems (LearningSys) at NIPS, Vol. 5.
  • [42] WMT. 2015. WMT15 Machine Translation Task. http://www.statmt.org/wmt15/translation-task.html. (2015).
  • [43] Philipp Unterbrunner, Georgios Giannikis, Gustavo Alonso, Dietmar Fauser, and Donald Kossmann. 2009. Predictable performance for unpredictable workloads. Proceedings of the VLDB Endowment 2, 1 (2009), 706–717.
  • [44] Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. 2015. Show and tell: A neural image caption generator. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 3156–3164.
  • [45] Matt Welsh, David Culler, and Eric Brewer. 2001. SEDA: An architecture for well-conditioned, scalable internet services. In ACM SIGOPS Operating Systems Review, Vol. 35. ACM, 230–243.
  • [46] Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144 (2016).
  • [47] Jun Yin, Xin Jiang, Zhengdong Lu, Lifeng Shang, Hang Li, and Xiaoming Li. 2015. Neural generative question answering. arXiv preprint arXiv:1512.01337 (2015).