# Scaling distributed machine learning with the parameter server

OSDI '14, pp. 583–598, 2014.

Abstract:

We propose a parameter server framework for distributed machine learning problems. Both data and workloads are distributed over worker nodes, while the server nodes maintain globally shared parameters, represented as dense or sparse vectors and matrices. The framework manages asynchronous data communication between nodes, and supports flexible consistency models, elastic scalability, and continuous fault tolerance.

Introduction

- Distributed optimization and inference is becoming a prerequisite for solving large scale machine learning problems.
- No single machine can solve these problems sufficiently rapidly, due to the growth of data and the resulting model complexity, often manifesting itself in an increased number of parameters.
- Implementing an efficient distributed algorithm is not easy.
- Both intensive computational workloads and the volume of data communication demand careful system design.
- Realistic quantities of training data can range between 1TB and 1PB
- This allows one to create powerful and complex models with 10⁹ to 10¹² parameters [9].
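The scale argument above can be made concrete with a back-of-the-envelope calculation (illustrative only; the byte count per parameter is an assumption, not a figure from the paper):

```python
# Raw storage for the model parameters alone, ignoring optimizer state,
# gradients, and replication -- a model with 1e12 float32 parameters
# already needs ~4 TB, far beyond the DRAM of a single commodity machine.
def model_size_gb(num_params, bytes_per_param=4):
    """Raw parameter storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9

print(model_size_gb(1e9))    # 1e9 params  -> 4.0 GB
print(model_size_gb(1e12))   # 1e12 params -> 4000.0 GB, i.e. 4 TB
```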

Highlights

- Distributed optimization and inference is becoming a prerequisite for solving large scale machine learning problems
- We evaluate our parameter server on the use cases of Section 2: Sparse Logistic Regression and Latent Dirichlet Allocation
- We described a parameter server framework to solve distributed machine learning problems
- This framework is easy to use: Globally shared parameters can be used as local sparse vectors or matrices to perform linear algebra operations with local training data
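The "local sparse view of global parameters" idea above can be sketched in a single process. `ParamServer`, `push`, and `pull` here are hypothetical stand-ins mirroring the paper's push/pull semantics; the real system shards keys across many server nodes and communicates over the network:

```python
# Minimal in-process sketch of the push/pull interface (illustrative only).
class ParamServer:
    def __init__(self):
        self.weights = {}          # sparse: feature id -> value

    def pull(self, keys):
        """Workers fetch only the entries their local data touches."""
        return {k: self.weights.get(k, 0.0) for k in keys}

    def push(self, grads, lr=0.1):
        """Workers send sparse gradients; the server applies the update."""
        for k, g in grads.items():
            self.weights[k] = self.weights.get(k, 0.0) - lr * g

server = ParamServer()
local_keys = [3, 17, 42]               # features present in this worker's shard
w = server.pull(local_keys)            # local sparse view of shared parameters
grads = {k: 1.0 for k in local_keys}   # placeholder gradient computation
server.push(grads)
```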

Methods

- Sparse logistic regression combines the bounded-delay consistency model and a KKT filter with the ℓ1-regularized objective of Section 2.2.
- The ℓ1 regularizer biases the solution toward a compact one with a large fraction of zero-valued entries.
- The authors collected an ad click prediction dataset with 170 billion examples and 65 billion unique features
- This dataset is 636 TB uncompressed (141 TB compressed).
- The authors ran the parameter server on 1000 machines, each with 16 physical cores, 192GB DRAM, and connected by 10 Gb Ethernet.
- 800 machines acted as workers, and 200 were parameter servers.
- The cluster was in concurrent use by other tasks during operation
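The KKT filter mentioned above can be sketched as follows. This is a simplification for illustration, not the paper's exact implementation: for an l1-regularized objective, a zero-valued weight stays at zero unless the magnitude of its gradient exceeds the regularization weight lambda, so such gradient entries need not be communicated at all:

```python
# Sketch of a KKT-style filter for an l1-regularized objective.
# Entries that provably cannot move the solution are dropped before
# being sent over the network, cutting communication for sparse models.
def kkt_filter(weights, grads, lam):
    """Return only the gradient entries that can change the solution."""
    kept = {}
    for k, g in grads.items():
        w = weights.get(k, 0.0)
        # Nonzero weights always need their gradient; zero weights only
        # matter once |gradient| exceeds the l1 penalty lam.
        if w != 0.0 or abs(g) > lam:
            kept[k] = g
    return kept

grads = {1: 0.05, 2: 0.8, 3: -0.3}
weights = {3: 0.2}                           # only key 3 is currently nonzero
print(kkt_filter(weights, grads, lam=0.5))   # keys 2 and 3 survive, key 1 is filtered
```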

Results

- The authors first compare the three systems of Table 3 by running them to reach the same objective value.
- The parameter server, in turn, outperforms System B while using the same algorithm
- It does so because of the efficacy of reducing the network traffic and the relaxed consistency model.
- Asynchronous updates with the parameter server require more iterations to achieve the same objective value.
- The system achieves very high insert rates, which are shown in Table 5
- It performs well for two reasons: first, bulk communication reduces the communication cost; second, message compression reduces the average (key,value) size to around 50 bits.
- Table 5 reports the peak and average inserts per second, the peak network bandwidth per machine, and the time to recover a failed node.
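The sketching workload behind these insert-rate numbers can be illustrated with a toy, single-machine CountMin sketch (the real system shards the sketch across server nodes and streams bulk inserts; the class and hash choices here are illustrative):

```python
import hashlib

# Toy CountMin sketch: a depth x width counter table where each row uses
# an independent hash. Queries take the minimum over rows, so the sketch
# never under-counts and collisions only inflate estimates.
class CountMin:
    def __init__(self, depth=4, width=1024):
        self.depth, self.width = depth, width
        self.table = [[0] * width for _ in range(depth)]

    def _hash(self, key, row):
        h = hashlib.blake2b(f"{row}:{key}".encode(), digest_size=8)
        return int.from_bytes(h.digest(), "big") % self.width

    def insert(self, key, count=1):
        for row in range(self.depth):
            self.table[row][self._hash(key, row)] += count

    def query(self, key):
        return min(self.table[row][self._hash(key, row)]
                   for row in range(self.depth))

cm = CountMin()
for _ in range(5):
    cm.insert("example.com")
print(cm.query("example.com"))   # 5: exact here, since only one key was inserted
```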

Conclusion

- The authors described a parameter server framework to solve distributed machine learning problems
- This framework is easy to use: Globally shared parameters can be used as local sparse vectors or matrices to perform linear algebra operations with local training data.
- The authors show experiments for several challenging tasks on real datasets with billions of variables to demonstrate its efficiency
- The authors believe that this third generation parameter server is an important building block for scalable machine learning.

- Table 1: Statistics of machine learning jobs over a three-month period in a data center
- Table 2: Attributes of distributed data analysis systems
- Table 3: Systems evaluated
- Table 4: Example topics learned using LDA over the 0.5 billion dataset. Each topic represents a user interest
- Table 5: Results of distributed CountMin. Message compression reduced the average (key,value) size to around 50 bits; when a server node was terminated during insertion, the parameter server recovered it within 1 second, making the system well equipped for real-time workloads

Related work

- Related systems have been implemented at Amazon, Baidu, Facebook, Google [13], Microsoft, and Yahoo [1]. Open-source implementations also exist, such as YahooLDA [1] and Petuum [24]. Furthermore, GraphLab [34] supports parameter synchronization on a best-effort model.

The first generation of such parameter servers, as introduced by [43], lacked flexibility and performance: it repurposed a memcached distributed (key,value) store as its synchronization mechanism. YahooLDA improved this design by implementing a dedicated server with user-definable update primitives (set, get, update) and a more principled load-distribution algorithm [1]. This second generation of application-specific parameter servers can also be found in DistBelief [13] and the synchronization mechanism of [33]. A first step towards a general platform was undertaken by Petuum [24], which improves on YahooLDA with a bounded-delay model while placing further constraints on the worker threading model. We describe a third-generation system that overcomes these limitations.

Funding

- This work was supported in part by gifts and/or machine time from Google, Amazon, Baidu, PRObE, and Microsoft; by NSF award 1409802; and by the Intel Science and Technology Center for Cloud Computing

References

- A. Ahmed, M. Aly, J. Gonzalez, S. Narayanamurthy, and A. J. Smola. Scalable inference in latent variable models. In Proceedings of The 5th ACM International Conference on Web Search and Data Mining (WSDM), 2012.
- A. Ahmed, Y. Low, M. Aly, V. Josifovski, and A. J. Smola. Scalable inference of dynamic user interests for behavioural targeting. In Knowledge Discovery and Data Mining, 2011.
- E. Anderson, Z. Bai, C. Bischof, J. Demmel, J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney, S. Ostrouchov, and D. Sorensen. LAPACK Users’ Guide. SIAM, Philadelphia, second edition, 1995.
- Apache Foundation. Mahout project, 2012. http://mahout.apache.org.
- R. Berinde, G. Cormode, P. Indyk, and M. J. Strauss. Space-optimal heavy hitters with strong error bounds. In J. Paredaens and J. Su, editors, Proceedings of the Twenty-Eighth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS, pages 157–166. ACM, 2009.
- C. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
- D. Blei, A. Ng, and M. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, January 2003.
- J. Byers, J. Considine, and M. Mitzenmacher. Simple load balancing for distributed hash tables. In Peer-to-peer systems II, pages 80–87.
- K. Canini. Sibyl: A system for large scale supervised machine learning. Technical Talk, 2012.
- B.-G. Chun, T. Condie, C. Curino, C. Douglas, S. Matusevych, B. Myers, S. Narayanamurthy, R. Ramakrishnan, S. Rao, J. Rosen, R. Sears, and M. Weimer. Reef: Retainable evaluator execution framework. Proceedings of the VLDB Endowment, 6(12):1370–1373, 2013.
- G. Cormode and S. Muthukrishnan. Summarizing and mining skewed data streams. In SDM, 2005.
- W. Dai, J. Wei, X. Zheng, J. K. Kim, S. Lee, J. Yin, Q. Ho, and E. P. Xing. Petuum: A framework for iterative-convergent distributed ml. arXiv preprint arXiv:1312.7651, 2013.
- J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, Q. Le, M. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, and A. Ng. Large scale distributed deep networks. In Neural Information Processing Systems, 2012.
- J. Dean and S. Ghemawat. MapReduce: simplified data processing on large clusters. CACM, 51(1):107–113, 2008.
- G. DeCandia, D. Hastorun, M. Jampani, G. Kakulapati, A. Lakshman, A. Pilchin, S. Sivasubramanian, P. Vosshall, and W. Vogels. Dynamo: Amazon's highly available key-value store. In T. C. Bressoud and M. F. Kaashoek, editors, Symposium on Operating Systems Principles, pages 205–220. ACM, 2007.
- J. J. Dongarra, J. Du Croz, S. Hammarling, and R. J. Hanson. An extended set of fortran basic linear algebra subprograms. ACM Transactions on Mathematical Software, 14:18–32, 1988.
- The Apache Software Foundation. Apache Hadoop NextGen MapReduce (YARN).
- The Apache Software Foundation. Apache Hadoop, 2009. http://hadoop.apache.org/core/.
- F. Girosi, M. Jones, and T. Poggio. Priors, stabilizers and basis functions: From regularization to radial, tensor and additive splines. A.I. Memo 1430, Artificial Intelligence Laboratory, Massachusetts Institute of Technology, 1993.
- T.L. Griffiths and M. Steyvers. Finding scientific topics. Proceedings of the National Academy of Sciences, 101:5228–5235, 2004.
- S. H. Gunderson. Snappy: A fast compressor/decompressor. https://code.google.com/p/snappy/.
- T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning. Springer, New York, 2 edition, 2009.
- B. Hindman, A. Konwinski, M. Zaharia, A. Ghodsi, A. D. Joseph, R. Katz, S. Shenker, and I. Stoica. Mesos: A platform for fine-grained resource sharing in the data center. In Proceedings of the 8th USENIX conference on Networked systems design and implementation, pages 22–22, 2011.
- Q. Ho, J. Cipar, H. Cui, S. Lee, J. Kim, P. Gibbons, G. Gibson, G. Ganger, and E. Xing. More effective distributed ml via a stale synchronous parallel parameter server. In NIPS, 2013.
- M. Hoffman, D. M. Blei, C. Wang, and J. Paisley. Stochastic variational inference. In International Conference on Machine Learning, 2012.
- W. Karush. Minima of functions of several variables with inequalities as side constraints. Master’s thesis, Dept. of Mathematics, Univ. of Chicago, 1939.
- L. Kim. How many ads does Google serve in a day?, 2012. http://goo.gl/oIidXO.
- D. Koller and N. Friedman. Probabilistic Graphical Models: Principles and Techniques. MIT Press, 2009.
- T. Kraska, A. Talwalkar, J. C. Duchi, R. Griffith, M. J. Franklin, and M. I. Jordan. Mlbase: A distributed machine-learning system. In CIDR, 2013.
- L. Lamport. Paxos made simple. ACM Sigact News, 32(4):18–25, 2001.
- M. Li, D. G. Andersen, and A. J. Smola. Distributed delayed proximal gradient methods. In NIPS Workshop on Optimization for Machine Learning, 2013.
- M. Li, D. G. Andersen, and A. J. Smola. Communication Efficient Distributed Machine Learning with the Parameter Server. In Neural Information Processing Systems, 2014.
- M. Li, L. Zhou, Z. Yang, A. Li, F. Xia, D.G. Andersen, and A. J. Smola. Parameter server for distributed machine learning. In Big Learning NIPS Workshop, 2013.
- Y. Low, J. Gonzalez, A. Kyrola, D. Bickson, C. Guestrin, and J. M. Hellerstein. Distributed GraphLab: A framework for machine learning and data mining in the cloud. In PVLDB, 2012.
- H. B. McMahan, G. Holt, D. Sculley, M. Young, D. Ebner, J. Grady, L. Nie, T. Phillips, E. Davydov, and D. Golovin. Ad click prediction: a view from the trenches. In KDD, 2013.
- K. P. Murphy. Machine learning: a probabilistic perspective. MIT Press, 2012.
- D. G. Murray, F. McSherry, R. Isaacs, M. Isard, P. Barham, and M. Abadi. Naiad: a timely dataflow system. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles, pages 439–455. ACM, 2013.
- A. Phanishayee, D. G. Andersen, H. Pucha, A. Povzner, and W. Belluomini. Flex-KV: Enabling high-performance and flexible KV systems. In Proceedings of the 2012 Workshop on Management of Big Data Systems, pages 19–24. ACM, 2012.
- R. Power and J. Li. Piccolo: Building fast, distributed programs with partitioned tables. In R. H. Arpaci-Dusseau and B. Chen, editors, Operating Systems Design and Implementation, OSDI, pages 293–306. USENIX Association, 2010.
- PRObE Project. Parallel Reconfigurable Observational Environment. https://www.nmc-probe.org/wiki/Machines:Susitna.
- A. Rowstron and P. Druschel. Pastry: Scalable, decentralized object location and routing for large-scale peer-to-peer systems. In IFIP/ACM International Conference on Distributed Systems Platforms (Middleware), pages 329–350, Heidelberg, Germany, November 2001.
- B. Scholkopf and A. J. Smola. Learning with Kernels. MIT Press, Cambridge, MA, 2002.
- A. J. Smola and S. Narayanamurthy. An architecture for parallel topic models. In Very Large Databases (VLDB), 2010.
- E. Sparks, A. Talwalkar, V. Smith, J. Kottalam, X. Pan, J. Gonzalez, M. J. Franklin, M. I. Jordan, and T. Kraska. MLI: An API for distributed machine learning. 2013.
- I. Stoica, R. Morris, D. Karger, M. F. Kaashoek, and H. Balakrishnan. Chord: A scalable peer-to-peer lookup service for internet applications. ACM SIGCOMM Computer Communication Review, 31(4):149–160, 2001.
- C. H. Teo, Q. Le, A. J. Smola, and S. V. N. Vishwanathan. A scalable modular convex solver for regularized risk minimization. In Proc. ACM Conf. Knowledge Discovery and Data Mining (KDD). ACM, 2007.
- R. van Renesse and F. B. Schneider. Chain replication for supporting high throughput and availability. In OSDI, volume 4, pages 91–104, 2004.
- V. Vapnik. The Nature of Statistical Learning Theory. Springer, New York, 1995.
- R. C. Whaley, A. Petitet, and J. J. Dongarra. Automated empirical optimization of software and the ATLAS project. Parallel Computing, 27(1–2):3–35, 2001.
- M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. M. Ma, M. McCauley, M. J. Franklin, S. Shenker, and I. Stoica. Fast and interactive analytics over Hadoop data with Spark. USENIX ;login:, 37(4):45–51, August 2012.
