TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems

arXiv preprint abs/1603.04467 (cs.DC: Distributed, Parallel, and Cluster Computing), 2016.


Abstract:

TensorFlow is an interface for expressing machine learning algorithms, and an implementation for executing such algorithms. A computation expressed using TensorFlow can be executed with little or no change on a wide variety of heterogeneous systems, ranging from mobile devices such as phones and tablets up to large-scale distributed systems of hundreds of machines and thousands of computational devices such as GPU cards.

Introduction
  • TensorFlow computations can be executed with little or no change on platforms ranging from mobile devices up to large-scale distributed systems of hundreds of machines with thousands of GPUs; having a single system that can span such a broad range of platforms significantly simplifies the real-world use of machine learning systems.
  • The Google Brain project started in 2011 to explore the use of very-large-scale deep neural networks, both for research and for use in Google’s products.
  • Dozens of the internal clients of DistBelief have already switched to TensorFlow
  • These clients rely on TensorFlow for research and production, with tasks ranging from running inference for computer vision models on mobile phones to large-scale training of deep neural networks with hundreds of billions of parameters on hundreds of billions of example records using many hundreds of machines [11, 47, 48, 18, 53, 41].
  • The authors have open-sourced the TensorFlow API and a reference implementation under the Apache 2.0 license in November 2015, available at www.tensorflow.org
Highlights
  • Much as in the dataflow-machine approach described by Arvind [3], we introduce a small set of primitive control flow operators into TensorFlow and generalize TensorFlow to handle cyclic dataflow graphs (see the sketch after this list)
  • We have a number of concrete directions to improve the performance of TensorFlow
  • We have described TensorFlow, a flexible dataflow-based programming model, as well as single-machine and distributed implementations of this programming model
  • The system is born of real-world experience in conducting research and deploying more than one hundred machine learning projects throughout a wide range of Google products and services
  • We have open-sourced a version of TensorFlow, and hope that a vibrant shared community develops around the use of TensorFlow
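    As a minimal illustration of the control-flow highlight above, the sketch below (assuming the TensorFlow 1.x graph API, reached via tf.compat.v1 in TensorFlow 2.x; the loop bound and variable names are illustrative, not from the paper) builds a loop with tf.while_loop, which graph mode expresses in terms of those primitive control-flow operators, yielding a cyclic dataflow graph:

        # Hedged sketch: a simple counted loop in graph mode.
        import tensorflow.compat.v1 as tf
        tf.disable_eager_execution()

        i = tf.constant(0)        # loop counter
        total = tf.constant(0)    # running sum

        def cond(i, total):
            return i < 10         # loop while the counter is below 10

        def body(i, total):
            return i + 1, total + i

        # while_loop adds control-flow nodes to the graph rather than running
        # the loop in Python; the result is a cyclic dataflow region.
        final_i, final_total = tf.while_loop(cond, body, [i, total])

        with tf.Session() as sess:
            print(sess.run([final_i, final_total]))  # [10, 45]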
Results
  • The authors describe some of the optimizations in the TensorFlow implementation that improve performance or resource usage of the system.
  • The authors have a number of concrete directions to improve the performance of TensorFlow
Conclusion
  • The authors have described TensorFlow, a flexible dataflow-based programming model, as well as single-machine and distributed implementations of this programming model.
  • The system is born of real-world experience in conducting research and deploying more than one hundred machine learning projects throughout a wide range of Google products and services.
  • The authors have open-sourced a version of TensorFlow, and hope that a vibrant shared community develops around the use of TensorFlow.
  • The authors are excited to see how others outside of Google make use of TensorFlow in their own work.
Tables
  • Table 1: Example TensorFlow operation types. The key operation supported by the session interface is Run, which takes a set of output names that need to be computed, as well as an optional set of tensors to be fed into the graph in place of certain outputs of nodes. Using the arguments to Run, the TensorFlow implementation can compute the transitive closure of all nodes that must be executed in order to compute the requested outputs, and can then arrange to execute the appropriate nodes in an order that respects their dependencies (as described in more detail in Section 3.1). Most uses of TensorFlow set up a Session with a graph once, and then execute the full graph or a few distinct subgraphs thousands or millions of times via Run calls (see the sketch below).
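    The sketch below (assuming the TensorFlow 1.x graph API via tf.compat.v1; the tensor names and shapes are illustrative, not from the paper) shows this build-once, Run-many pattern: the graph is set up once, and each Run call names the outputs to fetch and supplies tensors to feed:

        # Hedged sketch of the Session/Run usage pattern described above.
        import tensorflow.compat.v1 as tf
        tf.disable_eager_execution()

        # Build the graph once: a fed input and a variable parameter.
        x = tf.placeholder(tf.float32, shape=[None, 4], name="x")
        W = tf.Variable(tf.ones([4, 2]), name="W")
        y = tf.matmul(x, W)  # the output node we will fetch

        with tf.Session() as sess:
            sess.run(tf.global_variables_initializer())
            # Run executes only the subgraph needed to compute the fetched
            # output, feeding the supplied tensor in place of the placeholder.
            for _ in range(3):
                result = sess.run(y, feed_dict={x: [[1.0, 2.0, 3.0, 4.0]]})
            print(result)  # [[10. 10.]]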
Related work
  • There are many other systems that are comparable in various ways with TensorFlow. Theano [7], Torch [13], Caffe [26], Chainer [49] and the Computational Network Toolkit [54] are a few systems designed primarily for the training of neural networks. Each of these systems maps the computation onto a single machine, unlike the distributed TensorFlow implementation. Like Theano and Chainer, TensorFlow supports symbolic differentiation, making it easier to define and work with gradient-based optimization algorithms (see the sketch below). Like Caffe, TensorFlow has a core written in C++, simplifying the deployment of trained models in a wide variety of production settings, including memory- and computation-constrained environments such as mobile devices.

    The TensorFlow system shares some design characteristics with its predecessor system, DistBelief [14], and with later systems with similar designs like Project Adam [10] and the Parameter Server project [33]. Like DistBelief and Project Adam, TensorFlow allows computations to be spread out across many computational devices across many machines, and allows users to specify machine learning models using relatively high-level descriptions. Unlike DistBelief and Project Adam, though, the general-purpose dataflow graph model in TensorFlow is more flexible and more amenable to expressing a wider variety of machine learning models and optimization algorithms. It also permits a significant simplification by allowing the expression of stateful parameter nodes as variables, and variable update operations that are just additional nodes in the graph; in contrast, DistBelief, Project Adam and the Parameter Server systems all have whole separate parameter server subsystems devoted to communicating and updating parameter values.
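    A minimal sketch of these two points (assuming the TensorFlow 1.x graph API via tf.compat.v1; the parameter, loss, and learning rate are illustrative, not from the paper): the gradient is obtained by symbolic differentiation as additional graph nodes, and the parameter update is itself just another operation on a Variable node rather than a call into a separate parameter-server subsystem:

        # Hedged sketch: gradients and updates as ordinary graph nodes.
        import tensorflow.compat.v1 as tf
        tf.disable_eager_execution()

        w = tf.Variable(3.0, name="w")             # stateful parameter node
        loss = tf.square(w - 1.0)                  # simple quadratic loss
        grad = tf.gradients(loss, [w])[0]          # symbolic differentiation adds gradient nodes
        train_step = tf.assign_sub(w, 0.1 * grad)  # the update is just another node

        with tf.Session() as sess:
            sess.run(tf.global_variables_initializer())
            for _ in range(50):
                sess.run(train_step)               # each Run executes the update subgraph
            print(sess.run(w))                     # w converges toward 1.0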
References
  • Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoît Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.
  • Anelia Angelova, Alex Krizhevsky, and Vincent Vanhoucke. Pedestrian detection with a large-field-of-view deep network. In Robotics and Automation (ICRA), 2015 IEEE International Conference on, pages 704–711. IEEE, 2015. CalTech PDF.
  • Arvind and David E. Culler. Dataflow architectures. In Annual Review of Computer Science Vol. 1, pages 225–253, 1986. www.dtic.mil/cgi-bin/GetTRDoc?Location=U2&doc=GetTRDoc.pdf&AD=ADA166235.
  • Arvind and Rishiyur S. Nikhil. Executing a program on the MIT tagged-token dataflow architecture. IEEE Trans. Comput., 39(3):300–318, 1990. dl.acm.org/citation.cfm?id=78583.
  • Jimmy Ba, Volodymyr Mnih, and Koray Kavukcuoglu. Multiple object recognition with visual attention. arXiv preprint arXiv:1412.7755, 2014.
  • Françoise Beaufays. The neural networks behind Google Voice transcription, 2015.
  • James Bergstra, Olivier Breuleux, Frédéric Bastien, Pascal Lamblin, Razvan Pascanu, Guillaume Desjardins, Joseph Turian, David Warde-Farley, and Yoshua Bengio. Theano: A CPU and GPU math expression compiler. In Proceedings of the Python for Scientific Computing Conference (SciPy), volume 4, page 3. Austin, TX, 2010. UMontreal PDF.
  • Craig Chambers, Ashish Raniwala, Frances Perry, Stephen Adams, Robert R. Henry, Robert Bradshaw, and Nathan Weizenbaum. FlumeJava: easy, efficient data-parallel pipelines. In ACM SIGPLAN Notices, volume 45, pages 363–375. ACM, 2010. research.google.com/pubs/archive/35650.pdf.
  • Sharan Chetlur, Cliff Woolley, Philippe Vandermersch, Jonathan Cohen, John Tran, Bryan Catanzaro, and Evan Shelhamer. cuDNN: Efficient primitives for deep learning. arXiv preprint arXiv:1410.0759, 2014. arxiv.org/abs/1410.0759.
  • Trishul Chilimbi, Yutaka Suzue, Johnson Apacible, and Karthik Kalyanaraman. Project Adam: Building an efficient and scalable deep learning training system. In 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14), pages 571–582, 2014. www.usenix.org/system/files/conference/osdi14/osdi14-paper-chilimbi.pdf.
  • Jack Clark. Google turning its lucrative web search over to AI machines, 2015. www.bloomberg.com/news/articles/2015-10-26/google-turning-its-lucrative-web-search-over-to-ai-machines.
  • [12] Cliff Click. Global code motion/global value numbering. In ACM SIGPLAN Notices, volume 30, pages 246–257. ACM, 1995. courses.cs.washington.edu/courses/cse501/06wi/reading/click-pldi95.pdf.
  • [13] Ronan Collobert, Samy Bengio, and Johnny Mariéthoz. Torch: A modular machine learning software library. Technical report, IDIAP, 2002. infoscience.epfl.ch/record/82802/files/rr02-46.pdf.
  • [14] Jeffrey Dean, Gregory S. Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Quoc V. Le, Mark Z. Mao, Marc’Aurelio Ranzato, Andrew Senior, Paul Tucker, Ke Yang, and Andrew Y. Ng. Large scale distributed deep networks. In NIPS, 2012. Google Research PDF.
  • [15] Jack J. Dongarra, Jeremy Du Croz, Sven Hammarling, and Iain S. Duff. A set of level 3 basic linear algebra subprograms. ACM Transactions on Mathematical Software (TOMS), 16(1):1–17, 1990. www.maths.manchester.ac.uk/~sven/pubs/Level3BLAS1-TOMS16-90.pdf.
  • [16] Andrea Frome, Greg S. Corrado, Jonathon Shlens, Samy Bengio, Jeff Dean, Tomas Mikolov, et al. DeViSE: A deep visual-semantic embedding model. In Advances in Neural Information Processing Systems, pages 2121–2129, 2013. research.google.com/pubs/archive/41473.pdf.
  • [17] Javier Gonzalez-Dominguez, Ignacio Lopez-Moreno, Pedro J. Moreno, and Joaquin Gonzalez-Rodriguez. Frame-by-frame language identification in short utterances using deep neural networks. Neural Networks, 64:49–58, 2015.
  • [18] Otavio Good. How Google Translate squeezes deep learning onto a phone, 2015.
  • [19] Ian J. Goodfellow, Yaroslav Bulatov, Julian Ibarz, Sacha Arnoud, and Vinay Shet. Multi-digit number recognition from Street View imagery using deep convolutional neural networks. In International Conference on Learning Representations, 2014. arxiv.org/pdf/1312.6082.
  • [20] Georg Heigold, Vincent Vanhoucke, Alan Senior, Patrick Nguyen, Marc’Aurelio Ranzato, Matthieu Devin, and Jeffrey Dean. Multilingual acoustic models using distributed deep neural networks. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, pages 8619–8623. IEEE, 2013. research.google.com/pubs/archive/40807.pdf.
  • [21] Geoffrey E. Hinton, Li Deng, Dong Yu, George E. Dahl, et al. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, 29(6):82–97, 2012. www.cs.toronto.edu/~gdahl/papers/.
  • [22] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997. ftp.idsia.ch/pub/juergen/lstm.pdf.
  • [23] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. CoRR, abs/1502.03167, 2015. arxiv.org/abs/1502.03167.
  • [24] Michael Isard, Mihai Budiu, Yuan Yu, Andrew Birrell, and Dennis Fetterly. Dryad: distributed data-parallel programs from sequential building blocks. In ACM SIGOPS Operating Systems Review, volume 41, pages 59–72. ACM, 2007. www.michaelisard.com/pubs/eurosys07.pdf.
  • [25] Benoît Jacob, Gaël Guennebaud, et al. Eigen library for linear algebra. eigen.tuxfamily.org.
  • [26] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the ACM International Conference on Multimedia, pages 675–678. ACM, 2014. arxiv.org/pdf/1408.5093.
  • [27] Andrej Karpathy, George Toderici, Sachin Shetty, Tommy Leung, Rahul Sukthankar, and Li Fei-Fei. Large-scale video classification with convolutional neural networks. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on, pages 1725–1732. IEEE, 2014. research.google.com/pubs/archive/42455.pdf.
  • Alex Krizhevsky. One weird trick for parallelizing convolutional neural networks. arXiv preprint arXiv:1404.5997, 2014. arxiv.org/abs/1404.5997.
  • Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton. The CIFAR-10 dataset. www.cs.toronto.edu/~kriz/cifar.html.
  • Quoc Le, Marc’Aurelio Ranzato, Rajat Monga, Matthieu Devin, Greg Corrado, Kai Chen, Jeff Dean, and Andrew Ng. Building high-level features using large scale unsupervised learning. In ICML’2012, 2012. Google Research PDF.
  • Yann LeCun, Corinna Cortes, and Christopher J.C. Burges. The MNIST database of handwritten digits, 1998. yann.lecun.com/exdb/mnist/.
  • Mu Li, Dave Andersen, and Alex Smola. Parameter server. parameterserver.org.
  • Chris J. Maddison, Aja Huang, Ilya Sutskever, and David Silver. Move evaluation in Go using deep convolutional neural networks. arXiv preprint arXiv:1412.6564, 2014. arxiv.org/abs/1412.6564.
  • Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. In International Conference on Learning Representations: Workshops Track, 2013. arxiv.org/abs/1301.3781.
  • Derek G. Murray, Frank McSherry, Rebecca Isaacs, Michael Isard, Paul Barham, and Martín Abadi. Naiad: a timely dataflow system. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles, pages 439–455. ACM, 2013. Microsoft Research PDF.
  • Derek G. Murray, Malte Schwarzkopf, Christopher Smowton, Steven Smith, Anil Madhavapeddy, and Steven Hand. CIEL: a universal execution engine for distributed data-flow computing. In Proceedings of the Ninth USENIX Symposium on Networked Systems Design and Implementation, 2011. Usenix PDF.
  • Arun Nair, Praveen Srinivasan, Sam Blackwell, Cagdas Alcicek, Rory Fearon, Alessandro De Maria, Vedavyas Panneershelvam, Mustafa Suleyman, Charles Beattie, Stig Petersen, et al. Massively parallel methods for deep reinforcement learning. arXiv preprint arXiv:1507.04296, 2015. arxiv.org/abs/1507.04296.
  • CUDA Nvidia. CUBLAS library. NVIDIA Corporation, Santa Clara, California, 2008. developer.nvidia.com/cublas.
  • Jonathan Ragan-Kelley, Connelly Barnes, Andrew Adams, Sylvain Paris, Frédo Durand, and Saman Amarasinghe. Halide: A language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines. ACM SIGPLAN Notices, 48(6):519–530, 2013. people.csail.mit.edu/fredo/tmp/Halide5min.pdf.
  • Bharath Ramsundar, Steven Kearnes, Patrick Riley, Dale Webster, David Konerding, and Vijay Pande. Massively multitask networks for drug discovery. arXiv preprint arXiv:1502.02072, 2015. arxiv.org/abs/1502.02072.
  • Benjamin Recht, Christopher Ré, Stephen Wright, and Feng Niu. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In Advances in Neural Information Processing Systems, pages 693–701, 2011. papers.nips.cc/paper/4390-hogwild-a-lock-free-approach-to-parallelizing-stochastic-gradient-descent.
  • Chuck Rosenberg. Improving Photo Search: A step across the semantic gap, 2013.
  • Christopher J. Rossbach, Yuan Yu, Jon Currey, Jean-Philippe Martin, and Dennis Fetterly. Dandelion: a compiler and runtime for heterogeneous systems. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles, pages 49–68. ACM, 2013. research-srv.microsoft.com/pubs/201110/sosp13-dandelion-final.pdf.
  • David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. Learning representations by back-propagating errors. Cognitive Modeling, 5:3, 1988. www.cs.toronto.edu/~hinton/absps/naturebp.pdf.
  • [46] Haşim Sak, Andrew Senior, Kanishka Rao, Françoise Beaufays, and Johan Schalkwyk. Google Voice Search: faster and more accurate, 2015. googleresearch.blogspot.com/2015/09/google-voice-search-faster-and-more.html.
  • [47] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to sequence learning with neural networks. In NIPS, 2014. papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural.
  • [48] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In CVPR’2015, 2015. arxiv.org/abs/1409.4842.
  • [49] Seiya Tokui. Chainer: A powerful, flexible and intuitive framework of neural networks. chainer.org.
  • [50] Vincent Vanhoucke. Speech recognition and deep learning, 2015. googleresearch.blogspot.com/2012/08/speech-recognition-and-deep-learning.html.
  • [51] Abhishek Verma, Luis Pedrosa, Madhukar Korupolu, David Oppenheimer, Eric Tune, and John Wilkes. Large-scale cluster management at Google with Borg. In Proceedings of the Tenth European Conference on Computer Systems, page 18. ACM, 2015. research.google.com/pubs/archive/43438.pdf.
  • [52] O. Vinyals, L. Kaiser, T. Koo, S. Petrov, I. Sutskever, and G. Hinton. Grammar as a foreign language. Technical report, arXiv:1412.7449, 2014. arxiv.org/abs/1412.7449.
  • [53] Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly. Pointer networks. In NIPS, 2015.
  • [54] Dong Yu, Adam Eversole, Mike Seltzer, Kaisheng Yao, Zhiheng Huang, Brian Guenter, Oleksii Kuchaiev, Yu Zhang, Frank Seide, Huaming Wang, et al. An introduction to computational networks and the Computational Network Toolkit. Technical report, Microsoft Research, 2014. research.microsoft.com/apps/pubs/?id=226641.
  • [55] Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, and Ion Stoica. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation. USENIX Association, 2012. www.usenix.org/system/files/conference/nsdi12/nsdi12-final138.pdf.
  • [56] Matthew D. Zeiler, Marc’Aurelio Ranzato, Rajat Monga, Mark Mao, Ke Yang, Quoc Le, Patrick Nguyen, Andrew Senior, Vincent Vanhoucke, Jeff Dean, and Geoffrey E. Hinton. On rectified linear units for speech processing. In ICASSP, 2013. research.google.com/pubs/archive/40811.pdf.