Petuum: A New Platform for Distributed Machine Learning on Big Data

ACM Knowledge Discovery and Data Mining, pp. 49-67, 2015.


Abstract:

How can one build a distributed framework that allows efficient deployment of a wide spectrum of modern advanced machine learning (ML) programs for industrial-scale problems using Big Models (100s of billions of parameters) on Big Data (terabytes or petabytes)? Contemporary parallelization strategies employ fine-grained operations and sch...

Introduction
  • Machine Learning (ML) is becoming a primary mechanism for extracting information from data.
  • On the Big Model front, state-of-the-art image recognition systems have embraced large-scale deep learning models with billions of parameters [1]; topic models with up to 10^6 topics can cover long-tail semantic word sets for substantially improved online advertising [2], [3]; and very-high-rank matrix factorization yields improved prediction on collaborative filtering problems [4].
  • Training such big models with a single machine can be prohibitively slow, if not impossible.
  • While careful model design and feature engineering can certainly reduce the size of the model, they require domain-specific expertise and are fairly labor-intensive; hence the recent appeal of building high-capacity Big Models in order to substitute computation cost for labor cost.
Highlights
  • Machine Learning (ML) is becoming a primary mechanism for extracting information from data
  • We evaluated Petuum Distance Metric Learning's convergence speed on 1-4 machines (Figure 16): compared to a single machine, it achieves a 3.8 times speedup with 4 machines and a 1.9 times speedup with 2 machines
  • Petuum provides ML practitioners with an ML library and ML programming platform, capable of handling Big Data and Big ML Models with performance that is competitive with specialized implementations, while running on reasonable cluster sizes (10-100 machines)
  • This is made possible by systematically exploiting the unique properties of iterative-convergent ML algorithms — error tolerance, dependency structures and uneven convergence; these properties have yet to be thoroughly explored in general-purpose Big Data platforms such as Hadoop and Spark (see the illustrative error-tolerance sketch below)
  • In terms of feature set, Petuum is still relatively immature compared to Hadoop and Spark, and lacks the following: fault recovery from partial program state, ability to adjust resource usage on-the-fly in running jobs, scheduling jobs for multiple users, a unified data interface that closely integrates with databases and distributed file systems, and support for interactive scripting languages such as Python and R
  • The lack of these features imposes a barrier to entry for new users, and future work on Petuum will address these issues — but in a manner consistent with Petuum’s focus on iterative-convergent ML properties
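  • Illustrative sketch (error tolerance): under a bounded-staleness ("stale synchronous parallel") consistency model, as in the parameter server of Ho et al. (NIPS 2013, listed in the references), a worker may read parameter values that lag the freshest updates by at most a fixed number of clock ticks, so fast workers rarely wait for slow ones while convergence is still preserved. The Python code below is a hypothetical, single-process sketch for exposition only, not Petuum's actual C++ API; the class and method names (BoundedStalenessTable, inc, get, clock) are assumptions.

    # Minimal, hypothetical sketch of bounded-staleness parameter access.
    import threading
    from collections import defaultdict

    class BoundedStalenessTable:
        """Shared key-value parameter table with one clock per worker.

        A worker at clock c may read only while c - min(all clocks) <= staleness,
        so its reads are at most `staleness` iterations out of date.
        """
        def __init__(self, num_workers, staleness):
            self.params = defaultdict(float)   # model parameters: key -> value
            self.clocks = [0] * num_workers    # iteration counter per worker
            self.staleness = staleness
            self.cond = threading.Condition()

        def inc(self, key, delta):
            # Additive updates commute, which is what makes stale reads tolerable.
            with self.cond:
                self.params[key] += delta

        def get(self, worker_id, key):
            # Block only if this worker has run too far ahead of the slowest one.
            with self.cond:
                while self.clocks[worker_id] - min(self.clocks) > self.staleness:
                    self.cond.wait()
                return self.params[key]

        def clock(self, worker_id):
            # Mark the end of one iteration for this worker and wake any waiters.
            with self.cond:
                self.clocks[worker_id] += 1
                self.cond.notify_all()

    A data-parallel worker would loop: get() the parameters it needs, compute an update on its data shard, apply it with inc(), and call clock(); staleness = 0 recovers fully synchronous execution, while larger values trade parameter freshness for less waiting.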
Conclusion
  • SUMMARY AND FUTURE WORK

    Petuum provides ML practitioners with an ML library and ML programming platform, capable of handling Big Data and Big ML Models with performance that is competitive with specialized implementations, while running on reasonable cluster sizes (10-100 machines).
  • In terms of feature set, Petuum is still relatively immature compared to Hadoop and Spark, and lacks the following: fault recovery from partial program state, ability to adjust resource usage on-the-fly in running jobs, scheduling jobs for multiple users, a unified data interface that closely integrates with databases and distributed file systems, and support for interactive scripting languages such as Python and R
  • The lack of these features imposes a barrier to entry for new users, and future work on Petuum will address these issues — but in a manner consistent with Petuum’s focus on iterative-convergent ML properties (see the uneven-convergence sketch below).
  • This in turn opens up new ways to achieve on-the-fly resource adjustment and multi-tenancy
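  • Illustrative sketch (uneven convergence): the iterative-convergent ML properties above include uneven convergence, i.e. different model parameters approach their final values at different rates, which a system can exploit by spending more update cycles on the parameters that are still changing the most (in the spirit of prioritized computation, as in PrIter, listed in the references). The Python sketch below is hypothetical and single-process, meant only to illustrate the selection criterion; the function and argument names are assumptions, not Petuum code.

    # Hypothetical sketch: prioritize parameters by the size of their last change,
    # so nearly-converged parameters receive fewer update cycles.
    import heapq

    def prioritized_updates(params, compute_delta, num_rounds, batch_size):
        """params: dict name -> value; compute_delta(name, params) -> additive update."""
        # Every parameter starts at maximal priority so each is touched at least once.
        priority = {name: float("inf") for name in params}
        for _ in range(num_rounds):
            # Select the parameters whose most recent update changed them the most.
            chosen = heapq.nlargest(batch_size, priority, key=priority.get)
            for name in chosen:
                delta = compute_delta(name, params)
                params[name] += delta
                priority[name] = abs(delta)   # small recent change => low priority
        return params

    In a real distributed system this selection would be combined with dependency-aware scheduling across machines; the sketch above deliberately omits that.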
Tables
  • Table 1: Petuum ML Library (PMLlib): ML applications and achievable problem scale for a given cluster size. Petuum’s goal is to solve large model and data problems using medium-sized clusters with only 10s of machines (100-1000 cores, 1TB+ memory). Running time varies between 10s of minutes to several days, depending on the application
Funding
  • This work is supported in part by DARPA FA87501220324 and NSF IIS1447676 grants to Eric P. Xing.
Reference
  • Q. Le, M. Ranzato, R. Monga, M. Devin, K. Chen, G. Corrado, J. Dean, and A. Ng, “Building high-level features using large scale unsupervised learning,” in ICML, 2012.
  • Y. Wang, X. Zhao, Z. Sun, H. Yan, L. Wang, Z. Jin, L. Wang, Y. Gao, J. Zeng, Q. Yang et al., “Towards topic modeling for big data,” arXiv preprint arXiv:1405.4402, 2014.
  • J. Yuan, F. Gao, Q. Ho, W. Dai, J. Wei, X. Zheng, E. P. Xing, T.-Y. Liu, and W.-Y. Ma, “Lightlda: Big topic models on modest compute clusters,” in Accepted to International World Wide Web Conference, 2015.
  • Y. Zhou, D. Wilkinson, R. Schreiber, and R. Pan, “Large-scale parallel collaborative filtering for the netflix prize,” in Algorithmic Aspects in Information and Management, 2008.
  • J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, Q. Le, M. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, and A. Ng, “Large scale distributed deep networks,” in NIPS 2012, 2012.
  • S. A. Williamson, A. Dubey, and E. P. Xing, “Parallel markov chain monte carlo for nonparametric mixture models,” in ICML, 2013.
  • M. D. Hoffman, D. M. Blei, C. Wang, and J. Paisley, “Stochastic variational inference,” JMLR, vol. 14, 2013.
  • M. Zinkevich, J. Langford, and A. J. Smola, “Slow learners are fast,” in NIPS, 2009.
  • A. Agarwal and J. C. Duchi, “Distributed delayed stochastic optimization,” in NIPS, 2011.
  • J. K. Bradley, A. Kyrola, D. Bickson, and C. Guestrin, “Parallel coordinate descent for l1-regularized loss minimization,” in ICML, 2011.
  • T. White, Hadoop: The definitive guide. O’Reilly Media, Inc., 2012.
  • M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica, “Spark: cluster computing with working sets,” in HotCloud, 2010.
  • Y. Low, J. Gonzalez, A. Kyrola, D. Bickson, C. Guestrin, and J. M. Hellerstein, “Distributed GraphLab: A Framework for Machine Learning and Data Mining in the Cloud,” PVLDB, 2012.
  • W. Dai, A. Kumar, J. Wei, Q. Ho, G. Gibson, and E. P. Xing, “High-performance distributed ml at scale through parameter server consistency models,” in AAAI, 2015.
  • G. Malewicz, M. H. Austern, A. J. Bik, J. C. Dehnert, I. Horn, N. Leiser, and G. Czajkowski, “Pregel: a system for large-scale graph processing,” in ACM SIGMOD International Conference on Management of data. ACM, 2010.
  • R. Power and J. Li, “Piccolo: building fast, distributed programs with partitioned tables,” in OSDI. USENIX Association, 2010.
  • M. Li, D. G. Andersen, J. W. Park, A. J. Smola, A. Ahmed, V. Josifovski, J. Long, E. J. Shekita, and B.-Y. Su, “Scaling distributed machine learning with the parameter server,” in OSDI, 2014.
  • R. Gemulla, E. Nijkamp, P. J. Haas, and Y. Sismanis, “Large-scale matrix factorization with distributed stochastic gradient descent,” in KDD, 2011. [Online]. Available: http://doi.acm.org/10.1145/2020408.2020426
  • X. Chen, Q. Lin, S. Kim, J. Carbonell, and E. Xing, “Smoothing proximal gradient method for general structured sparse learning,” in UAI, 2011.
  • T. L. Griffiths and M. Steyvers, “Finding scientific topics,” PNAS, vol. 101, no. Suppl 1, pp. 5228–5235, 2004.
  • S. Lee, J. K. Kim, X. Zheng, Q. Ho, G. Gibson, and E. P. Xing, “On model parallelism and scheduling strategies for distributed machine learning,” in NIPS, 2014.
  • C. Scherrer, A. Tewari, M. Halappanavar, and D. Haglin, “Feature clustering for accelerating parallel coordinate descent,” NIPS, 2012.
  • Q. Ho, J. Cipar, H. Cui, J.-K. Kim, S. Lee, P. B. Gibbons, G. Gibson, G. R. Ganger, and E. P. Xing, “More effective distributed ml via a stale synchronous parallel parameter server,” in NIPS, 2013.
  • Y. Zhang, Q. Gao, L. Gao, and C. Wang, “Priter: A distributed framework for prioritized iterative computations,” in SOCC, 2011.
  • E. P. Xing, M. I. Jordan, S. Russell, and A. Y. Ng, “Distance metric learning with application to clustering with side-information,” in Advances in neural information processing systems, 2002, pp. 505–512.
  • J. V. Davis, B. Kulis, P. Jain, S. Sra, and I. S. Dhillon, “Information-theoretic metric learning,” in Proceedings of the 24th international conference on Machine learning. ACM, 2007, pp. 209–216.
  • H. B. McMahan et al., “Ad click prediction: a view from the trenches,” in KDD, 2013.
  • A. Ahmed, M. Aly, J. Gonzalez, S. Narayanamurthy, and A. J. Smola, “Scalable inference in latent variable models,” in WSDM, 2012.
  • K. P. Murphy, Machine learning: a probabilistic perspective, Cambridge, MA, 2012.
  • L. Yao, D. Mimno, and A. McCallum, “Efficient methods for topic model inference on streaming document collections,” in Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, ser. KDD ’09. New York, NY, USA: ACM, 2009, pp. 937–946.
  • A. Kumar, A. Beutel, Q. Ho, and E. P. Xing, “Fugue: Slow-worker-agnostic distributed learning for big models on big data,” in AISTATS, 2014.
  • J. Zhu, X. Zheng, L. Zhou, and B. Zhang, “Scalable inference in max-margin topic models,” in Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2013, pp. 964–972.
  • H.-F. Yu, C.-J. Hsieh, S. Si, and I. Dhillon, “Scalable coordinate descent approaches to parallel matrix factorization for recommender systems,” in Data Mining (ICDM), 2012 IEEE 12th International Conference on. IEEE, 2012, pp. 765–774.
  • A. Kumar, A. Beutel, Q. Ho, and E. P. Xing, “Fugue: Slow-worker-agnostic distributed learning for big models on big data,” in Proceedings of the Seventeenth International Conference on Artificial Intelligence and Statistics, 2014, pp. 531–539.
  • F. Niu, B. Recht, C. Re, and S. J. Wright, “Hogwild!: A lock-free approach to parallelizing stochastic gradient descent,” in NIPS, 2011.
  • P. Richtarik and M. Takac, “Parallel coordinate descent methods for big data optimization,” arXiv preprint arXiv:1212.0873, 2012.
  • Y. Zhang, Q. Gao, L. Gao, and C. Wang, “Priter: A distributed framework for prioritizing iterative computations,” Parallel and Distributed Systems, IEEE Transactions on, vol. 24, no. 9, pp. 1884– 1893, 2013.
  • J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on. IEEE, 2009, pp. 248–255.
Author Biography
  • Qirong Ho: Dr. Qirong Ho is a scientist at the Institute for Infocomm Research, A*STAR, Singapore, and an adjunct assistant professor at the Singapore Management University School of Information Systems. His primary research focus is distributed cluster software systems for Machine Learning at Big Data scales, with a view towards correctness and performance guarantees. In addition, Dr. Ho has performed research on statistical models for large-scale network analysis — particularly latent space models for visualization, community detection, user personalization and interest prediction — as well as social media analysis on hyperlinked documents with text and network data. Dr. Ho received his PhD in 2014, under Eric P. Xing at Carnegie Mellon University’s Machine Learning Department. He is a recipient of the Singapore A*STAR National Science Search Undergraduate and PhD fellowships.