Scaling distributed machine learning with the parameter server

OSDI, pp. 583-598, 2014.


Abstract:

We propose a parameter server framework for distributed machine learning problems. Both data and workloads are distributed over worker nodes, while the server nodes maintain globally shared parameters, represented as dense or sparse vectors and matrices. The framework manages asynchronous data communication between nodes, and supports flexible consistency models, elastic scalability, and continuous fault tolerance. To demonstrate its scalability, we show experimental results on real data with billions of examples and parameters, on problems ranging from sparse logistic regression to latent Dirichlet allocation and distributed sketching.
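The division of labor described in the abstract (workers hold data and compute updates, servers hold the globally shared parameters) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the `ParameterServerStub` class, the `pull`/`push` names, and the toy logistic-loss gradient are not the paper's actual C++ interface, only a rough picture of workers exchanging (key, value) pairs with server nodes.

```python
import numpy as np

class ParameterServerStub:
    """Toy in-process stand-in for the server nodes (illustrative only).

    Server nodes hold the globally shared parameters as a (key -> value) map;
    workers pull the entries they need and push back updates.
    """
    def __init__(self):
        self.store = {}                      # key -> parameter value

    def pull(self, keys):
        # Return current values for the requested keys (0.0 if unseen).
        return np.array([self.store.get(k, 0.0) for k in keys])

    def push(self, keys, grads, lr=0.1):
        # Apply a gradient update; a real server would aggregate pushes
        # from many workers, possibly asynchronously.
        for k, g in zip(keys, grads):
            self.store[k] = self.store.get(k, 0.0) - lr * g

def worker_iteration(server, examples):
    """One worker step: pull only the keys present in the local data shard."""
    keys = sorted({k for x, _ in examples for k in x})
    w = dict(zip(keys, server.pull(keys)))
    grads = {k: 0.0 for k in keys}
    for x, y in examples:                          # x: {feature_key: value}
        margin = sum(w[k] * v for k, v in x.items())
        err = 1.0 / (1.0 + np.exp(-margin)) - y    # logistic-loss gradient factor
        for k, v in x.items():
            grads[k] += err * v
    server.push(keys, [grads[k] for k in keys])

server = ParameterServerStub()
worker_iteration(server, [({1: 1.0, 7: 0.5}, 1), ({2: 1.0, 7: 1.0}, 0)])
print(server.store)
```

In the real system many workers push such updates asynchronously, and the servers aggregate them under a chosen consistency model.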

Introduction
  • Distributed optimization and inference is becoming a prerequisite for solving large scale machine learning problems.
  • No single machine can solve these problems sufficiently rapidly, due to the growth of data and the resulting model complexity, often manifesting itself in an increased number of parameters.
  • Implementing an efficient distributed algorithm is not easy
  • Both intensive computational workloads and the volume of data communication demand careful system design.
  • Realistic quantities of training data can range between 1TB and 1PB
  • This allows one to create powerful and complex models with 10^9 to 10^12 parameters [9].
Highlights
  • Distributed optimization and inference is becoming a prerequisite for solving large scale machine learning problems
  • We evaluate our parameter server based on the use cases of Section 2 — Sparse Logistic Regression and Latent Dirichlet Allocation
  • We described a parameter server framework to solve distributed machine learning problems
  • This framework is easy to use: globally shared parameters can be used as local sparse vectors or matrices to perform linear algebra operations with local training data (see the sketch below)
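A hedged sketch of what the last highlight means in practice: a pulled set of (key, value) pairs can be viewed as an ordinary sparse vector and multiplied against the worker's local (sparse) training data. The key ids, values, and the use of SciPy here are illustrative assumptions, not the paper's API.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical pulled state: the worker asked the servers for the weights of
# the feature keys that appear in its local shard of training data.
local_keys = np.array([3, 17, 42, 99])            # global feature ids
pulled_vals = np.array([0.2, -1.3, 0.05, 0.7])    # their current values

# View the pulled (key, value) pairs as a sparse weight vector in a
# (conceptually huge) global feature space, here truncated to 100 dims.
w = csr_matrix((pulled_vals, (np.zeros(4, dtype=int), local_keys)), shape=(1, 100))

# Local training data for this worker, sparse in the same key space.
X = csr_matrix((np.array([1.0, 0.5, 2.0]),
                (np.array([0, 0, 1]), np.array([3, 42, 17]))), shape=(2, 100))

# Ordinary linear algebra with local data: margins X @ w^T for a linear model.
scores = X @ w.T
print(scores.toarray().ravel())
```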
Methods
  • Sparse logistic regression is run with the bounded-delay consistency model and a KKT filter, using the ℓ1 regularizer of Section 2.2 (a simplified sketch of the filter idea follows this list)
  • The latter biases towards a compact solution with a large portion of zero-valued entries.
  • The authors collected an ad click prediction dataset with 170 billion examples and 65 billion unique features
  • This dataset is 636 TB uncompressed (141 TB compressed).
  • The authors ran the parameter server on 1000 machines, each with 16 physical cores, 192GB DRAM, and connected by 10 Gb Ethernet.
  • 800 machines acted as workers, and 200 were parameter servers.
  • The cluster was in concurrent use by other tasks during operation
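The "KKT filter" mentioned in the first bullet of this list can be read as follows: under ℓ1 regularization, a coordinate with w_k = 0 stays at zero whenever |g_k| <= λ (the subgradient/KKT condition), so a worker can skip sending such gradient entries. The sketch below is a simplified version of that idea with assumed parameter names; it is not the paper's exact filter.

```python
import numpy as np

def kkt_filter(weights, grads, lam, slack=0.0):
    """Return the indices whose gradients are worth sending to the servers.

    Simplified rule: if a weight is exactly zero and its gradient satisfies
    |g_k| <= lam - slack, the l1 subgradient condition already holds, so the
    coordinate would stay at zero and the update can be filtered out.
    """
    keep = (weights != 0.0) | (np.abs(grads) > lam - slack)
    return np.nonzero(keep)[0]

w = np.array([0.0, 0.0, 0.3, 0.0])
g = np.array([0.05, 0.90, 0.10, -0.20])
idx = kkt_filter(w, g, lam=0.5)
print(idx)   # -> [1 2]: only these (key, gradient) pairs are pushed
```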
Results
  • The authors first compare these three systems by running them to reach the same objective value.
  • The parameter server, in turn, outperforms System B while using the same algorithm
  • It does so because it reduces network traffic effectively and uses a relaxed consistency model.
  • Asynchronous updates with the parameter server require more iterations to achieve the same objective value.
  • The system achieves very high insert rates, which are shown in Table 5
  • It performs well for two reasons: first, bulk communication reduces the communication cost; second, message compression reduces the average (key,value) size to around 50 bits (see the sketch after this list).
  • Table 5 reports the peak inserts per second, average inserts per second, peak network bandwidth per machine, and time to recover a failed node
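The distributed CountMin workload behind Table 5 can be pictured with a small single-machine sketch. The width, depth, hashing via Python's built-in hash, and the batching buffer are all illustrative assumptions; the buffer only demonstrates why bulk (batched) communication amortizes per-message overhead, as noted in the results above.

```python
import numpy as np

class CountMinSketch:
    """Minimal CountMin sketch: approximate counts with depth x width counters."""
    def __init__(self, width=2048, depth=4, seed=0):
        rng = np.random.default_rng(seed)
        self.width = width
        self.table = np.zeros((depth, width), dtype=np.int64)
        # Per-row hash salts; a real system would use proper hash functions.
        self.salts = rng.integers(1, 2**31 - 1, size=depth)

    def _cols(self, key):
        return [hash((int(s), key)) % self.width for s in self.salts]

    def insert(self, key, count=1):
        for row, col in enumerate(self._cols(key)):
            self.table[row, col] += count

    def query(self, key):
        return min(self.table[row, col] for row, col in enumerate(self._cols(key)))

class BulkInserter:
    """Buffer (key, count) pairs and apply them in one batch.

    Batching stands in for the parameter server's bulk messages: one large
    push replaces many tiny ones, cutting per-message overhead.
    """
    def __init__(self, sketch, batch_size=1000):
        self.sketch, self.batch_size, self.buf = sketch, batch_size, []

    def insert(self, key):
        self.buf.append(key)
        if len(self.buf) >= self.batch_size:
            self.flush()

    def flush(self):
        for k in self.buf:       # in a real system this is one network push
            self.sketch.insert(k)
        self.buf.clear()

cm = CountMinSketch()
bulk = BulkInserter(cm, batch_size=4)
for k in ["a", "b", "a", "c", "a"]:
    bulk.insert(k)
bulk.flush()
print(cm.query("a"))   # approximate (never under-) count for "a", here 3
```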
Conclusion
  • The authors described a parameter server framework to solve distributed machine learning problems
  • This framework is easy to use: Globally shared parameters can be used as local sparse vectors or matrices to perform linear algebra operations with local training data.
  • The authors show experiments for several challenging tasks on real datasets with billions of variables to demonstrate its efficiency
  • The authors believe that this third generation parameter server is an important building block for scalable machine learning.
Tables
  • Table 1: Statistics of machine learning jobs for a three month period in a data center
  • Table 2: Attributes of distributed data analysis systems
  • Table 3: Systems evaluated
  • Table 4: Example topics learned using LDA over the .5 billion dataset. Each topic represents a user interest
  • Table 5: Results of distributed CountMin. When a server node was terminated during insertion, the parameter server recovered the failed node within 1 second, making the system well equipped for real-time use
Related work
  • Related systems have been implemented at Amazon, Baidu, Facebook, Google [13], Microsoft, and Yahoo [1]. Open source implementations also exist, such as YahooLDA [1] and Petuum [24]. Furthermore, GraphLab [34] supports parameter synchronization on a best-effort model.

    The first generation of such parameter servers, as introduced by [43], lacked flexibility and performance: it repurposed a memcached distributed (key,value) store as its synchronization mechanism. YahooLDA improved this design by implementing a dedicated server with user-definable update primitives (set, get, update) and a more principled load distribution algorithm [1]. This second generation of application-specific parameter servers can also be found in DistBelief [13] and the synchronization mechanism of [33]. A first step towards a general platform was undertaken by Petuum [24]. It improves upon YahooLDA with a bounded-delay model, while placing further constraints on the worker threading model. We describe a third-generation system overcoming these limitations.
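A rough sketch of the "dedicated server with user-definable update primitives (set, get, update)" idea attributed above to the second-generation systems; the class name and the merge callback are assumptions for illustration, not YahooLDA's actual interface.

```python
class UpdateServer:
    """Toy key-value server with a user-definable update primitive.

    The application supplies merge(old_value, delta), so the same server
    skeleton can implement counters, summed gradients, topic-count tables, etc.
    """
    def __init__(self, merge, default=0.0):
        self.merge = merge
        self.default = default
        self.store = {}

    def set(self, key, value):
        self.store[key] = value

    def get(self, key):
        return self.store.get(key, self.default)

    def update(self, key, delta):
        self.store[key] = self.merge(self.get(key), delta)

# Example: an additive merge turns `update` into a distributed counter,
# roughly what LDA-style topic-count synchronization needs.
server = UpdateServer(merge=lambda old, delta: old + delta, default=0)
server.update(("word:dog", "topic:3"), 2)
server.update(("word:dog", "topic:3"), 1)
print(server.get(("word:dog", "topic:3")))   # -> 3
```

Because the merge rule is supplied by the application, the server itself stays generic; the third-generation system described in this paper generalizes this further with ranged (key,value) vectors and flexible consistency.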
Funding
  • This work was supported in part by gifts and/or machine time from Google, Amazon, Baidu, PRObE, and Microsoft; by NSF award 1409802; and by the Intel Science and Technology Center for Cloud Computing
References
  • [1] A. Ahmed, M. Aly, J. Gonzalez, S. Narayanamurthy, and A. J. Smola. Scalable inference in latent variable models. In Proceedings of the 5th ACM International Conference on Web Search and Data Mining (WSDM), 2012.
  • [2] A. Ahmed, Y. Low, M. Aly, V. Josifovski, and A. J. Smola. Scalable inference of dynamic user interests for behavioural targeting. In Knowledge Discovery and Data Mining, 2011.
  • [3] E. Anderson, Z. Bai, C. Bischof, J. Demmel, J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney, S. Ostrouchov, and D. Sorensen. LAPACK Users' Guide. SIAM, Philadelphia, second edition, 1995.
  • [4] Apache Foundation. Mahout project, 2012. http://mahout.apache.org.
  • [5] R. Berinde, G. Cormode, P. Indyk, and M. J. Strauss. Space-optimal heavy hitters with strong error bounds. In J. Paredaens and J. Su, editors, Proceedings of the Twenty-Eighth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS, pages 157–166. ACM, 2009.
  • [6] C. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
  • [7] D. Blei, A. Ng, and M. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, January 2003.
  • [8] J. Byers, J. Considine, and M. Mitzenmacher. Simple load balancing for distributed hash tables. In Peer-to-peer systems II, pages 80–87.
  • [9] K. Canini. Sibyl: A system for large scale supervised machine learning. Technical Talk, 2012.
  • [10] B.-G. Chun, T. Condie, C. Curino, C. Douglas, S. Matusevych, B. Myers, S. Narayanamurthy, R. Ramakrishnan, S. Rao, J. Rosen, R. Sears, and M. Weimer. REEF: Retainable evaluator execution framework. Proceedings of the VLDB Endowment, 6(12):1370–1373, 2013.
  • [11] G. Cormode and S. Muthukrishnan. Summarizing and mining skewed data streams. In SDM, 2005.
  • [12] W. Dai, J. Wei, X. Zheng, J. K. Kim, S. Lee, J. Yin, Q. Ho, and E. P. Xing. Petuum: A framework for iterative-convergent distributed ML. arXiv preprint arXiv:1312.7651, 2013.
  • [13] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, Q. Le, M. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, and A. Ng. Large scale distributed deep networks. In Neural Information Processing Systems, 2012.
  • [14] J. Dean and S. Ghemawat. MapReduce: simplified data processing on large clusters. CACM, 51(1):107–113, 2008.
  • [15] G. DeCandia, D. Hastorun, M. Jampani, G. Kakulapati, A. Lakshman, A. Pilchin, S. Sivasubramanian, P. Vosshall, and W. Vogels. Dynamo: Amazon's highly available key-value store. In T. C. Bressoud and M. F. Kaashoek, editors, Symposium on Operating Systems Principles, pages 205–220. ACM, 2007.
  • [16] J. J. Dongarra, J. Du Croz, S. Hammarling, and R. J. Hanson. An extended set of Fortran basic linear algebra subprograms. ACM Transactions on Mathematical Software, 14:18–32, 1988.
  • [17] The Apache Software Foundation. Apache Hadoop NextGen MapReduce (YARN).
  • [18] The Apache Software Foundation. Apache Hadoop, 2009. http://hadoop.apache.org/core/.
  • [19] F. Girosi, M. Jones, and T. Poggio. Priors, stabilizers and basis functions: From regularization to radial, tensor and additive splines. A.I. Memo 1430, Artificial Intelligence Laboratory, Massachusetts Institute of Technology, 1993.
  • [20] T. L. Griffiths and M. Steyvers. Finding scientific topics. Proceedings of the National Academy of Sciences, 101:5228–5235, 2004.
  • [21] S. H. Gunderson. Snappy: A fast compressor/decompressor. https://code.google.com/p/snappy/.
  • [22] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning. Springer, New York, second edition, 2009.
  • [23] B. Hindman, A. Konwinski, M. Zaharia, A. Ghodsi, A. D. Joseph, R. Katz, S. Shenker, and I. Stoica. Mesos: A platform for fine-grained resource sharing in the data center. In Proceedings of the 8th USENIX Conference on Networked Systems Design and Implementation, pages 22–22, 2011.
  • [24] Q. Ho, J. Cipar, H. Cui, S. Lee, J. Kim, P. Gibbons, G. Gibson, G. Ganger, and E. Xing. More effective distributed ML via a stale synchronous parallel parameter server. In NIPS, 2013.
  • [25] M. Hoffman, D. M. Blei, C. Wang, and J. Paisley. Stochastic variational inference. In International Conference on Machine Learning, 2012.
  • [26] W. Karush. Minima of functions of several variables with inequalities as side constraints. Master's thesis, Dept. of Mathematics, Univ. of Chicago, 1939.
  • [27] L. Kim. How many ads does Google serve in a day?, 2012. http://goo.gl/oIidXO.
  • [28] D. Koller and N. Friedman. Probabilistic Graphical Models: Principles and Techniques. MIT Press, 2009.
  • [29] T. Kraska, A. Talwalkar, J. C. Duchi, R. Griffith, M. J. Franklin, and M. I. Jordan. MLbase: A distributed machine-learning system. In CIDR, 2013.
  • [30] L. Lamport. Paxos made simple. ACM SIGACT News, 32(4):18–25, 2001.
  • [31] M. Li, D. G. Andersen, and A. J. Smola. Distributed delayed proximal gradient methods. In NIPS Workshop on Optimization for Machine Learning, 2013.
  • [32] M. Li, D. G. Andersen, and A. J. Smola. Communication efficient distributed machine learning with the parameter server. In Neural Information Processing Systems, 2014.
  • [33] M. Li, L. Zhou, Z. Yang, A. Li, F. Xia, D. G. Andersen, and A. J. Smola. Parameter server for distributed machine learning. In Big Learning NIPS Workshop, 2013.
  • [34] Y. Low, J. Gonzalez, A. Kyrola, D. Bickson, C. Guestrin, and J. M. Hellerstein. Distributed GraphLab: A framework for machine learning and data mining in the cloud. In PVLDB, 2012.
  • [35] H. B. McMahan, G. Holt, D. Sculley, M. Young, D. Ebner, J. Grady, L. Nie, T. Phillips, E. Davydov, and D. Golovin. Ad click prediction: a view from the trenches. In KDD, 2013.
  • [36] K. P. Murphy. Machine Learning: A Probabilistic Perspective. MIT Press, 2012.
  • [37] D. G. Murray, F. McSherry, R. Isaacs, M. Isard, P. Barham, and M. Abadi. Naiad: a timely dataflow system. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles, pages 439–455. ACM, 2013.
  • [38] A. Phanishayee, D. G. Andersen, H. Pucha, A. Povzner, and W. Belluomini. Flex-KV: Enabling high-performance and flexible KV systems. In Proceedings of the 2012 Workshop on Management of Big Data Systems, pages 19–24. ACM, 2012.
  • [39] R. Power and J. Li. Piccolo: Building fast, distributed programs with partitioned tables. In R. H. Arpaci-Dusseau and B. Chen, editors, Operating Systems Design and Implementation, OSDI, pages 293–306. USENIX Association, 2010.
  • [40] PRObE Project. Parallel Reconfigurable Observational Environment. https://www.nmc-probe.org/wiki/Machines:Susitna.
  • [41] A. Rowstron and P. Druschel. Pastry: Scalable, decentralized object location and routing for large-scale peer-to-peer systems. In IFIP/ACM International Conference on Distributed Systems Platforms (Middleware), pages 329–350, Heidelberg, Germany, November 2001.
  • [42] B. Scholkopf and A. J. Smola. Learning with Kernels. MIT Press, Cambridge, MA, 2002.
  • [43] A. J. Smola and S. Narayanamurthy. An architecture for parallel topic models. In Very Large Databases (VLDB), 2010.
  • [44] E. Sparks, A. Talwalkar, V. Smith, J. Kottalam, X. Pan, J. Gonzalez, M. J. Franklin, M. I. Jordan, and T. Kraska. MLI: An API for distributed machine learning. 2013.
  • [45] I. Stoica, R. Morris, D. Karger, M. F. Kaashoek, and H. Balakrishnan. Chord: A scalable peer-to-peer lookup service for internet applications. ACM SIGCOMM Computer Communication Review, 31(4):149–160, 2001.
  • [46] C. H. Teo, Q. Le, A. J. Smola, and S. V. N. Vishwanathan. A scalable modular convex solver for regularized risk minimization. In Proc. ACM Conf. Knowledge Discovery and Data Mining (KDD). ACM, 2007.
  • [47] R. van Renesse and F. B. Schneider. Chain replication for supporting high throughput and availability. In OSDI, volume 4, pages 91–104, 2004.
  • [48] V. Vapnik. The Nature of Statistical Learning Theory. Springer, New York, 1995.
  • [49] R. C. Whaley, A. Petitet, and J. J. Dongarra. Automated empirical optimization of software and the ATLAS project. Parallel Computing, 27(1–2):3–35, 2001.
  • [50] M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. M. Ma, M. McCauley, M. J. Franklin, S. Shenker, and I. Stoica. Fast and interactive analytics over Hadoop data with Spark. USENIX ;login:, 37(4):45–51, August 2012.