MLlib*: Fast Training of GLMs Using Spark MLlib

2019 IEEE 35th International Conference on Data Engineering (ICDE), 2019

Abstract
At Tencent Inc., more than 80% of the data are extracted and transformed using Spark. However, the machine learning systems commonly used there are TensorFlow, XGBoost, and Angel, whereas Spark MLlib, the official Spark package for machine learning, is seldom used. One reason for this neglect is the general belief that Spark is slow at distributed machine learning. Users therefore undergo the painful procedure of moving data into and out of Spark. Why Spark is slow, however, remains elusive. In this paper, we study the performance of MLlib with a focus on training generalized linear models using gradient descent. Based on a detailed examination, we identify two bottlenecks in MLlib: its pattern of model updates and its pattern of communication. To address these two bottlenecks, we tweak the implementation of MLlib with two state-of-the-art and well-known techniques, model averaging and AllReduce. We show that the resulting system, which we call MLlib*, significantly improves over MLlib and achieves similar or even better performance than specialized distributed machine learning systems (such as Petuum and Angel) on both public and Tencent workloads.
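To make the model-averaging idea concrete, here is a minimal single-machine sketch of the technique the abstract names. The linear-regression setup, the partition count, the number of local steps, and the learning rate are all illustrative assumptions, not details from the paper. Each simulated worker runs several gradient steps on its own data partition without communicating; the driver then averages the worker models once per round, in contrast to MLlib's default pattern of aggregating a gradient across workers at every iteration.

```python
import numpy as np

# Hypothetical simulation of model averaging for a linear model.
rng = np.random.default_rng(0)
n, d, workers = 4000, 5, 4
true_w = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = X @ true_w + 0.01 * rng.normal(size=n)

parts = np.array_split(np.arange(n), workers)  # one partition per worker
w = np.zeros(d)                                # global model held by the driver

for _round in range(30):
    local_models = []
    for idx in parts:
        w_local = w.copy()
        for _ in range(10):  # local gradient steps, no communication
            grad = X[idx].T @ (X[idx] @ w_local - y[idx]) / len(idx)
            w_local -= 0.05 * grad
        local_models.append(w_local)
    w = np.mean(local_models, axis=0)  # one averaging step per round

print(np.linalg.norm(w - true_w))  # residual shrinks toward the noise floor
```

The communication saving is the point: the driver exchanges models once per round of ten local steps rather than once per gradient step, which is the update-pattern change the paper attributes its speedup to.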
Keywords
Spark, Machine learning, Biological system modeling, Computational modeling, Training, Data models, Mathematical model