HarpLDA+: Optimizing latent dirichlet allocation for parallel efficiency

2017 IEEE International Conference on Big Data (Big Data)(2017)

Cited 12 | Viewed 46
Abstract
Latent Dirichlet Allocation (LDA) is a widely used machine learning technique for topic modeling and data analysis. Training large LDA models on big datasets involves dynamic and irregular computation patterns and poses a major challenge to both algorithm optimization and system design. In this paper, we present a comprehensive benchmarking of our novel synchronized LDA training system, HarpLDA+, built on Hadoop and Java. It demonstrates impressive performance compared to three state-of-the-art MPI/C++ based systems: LightLDA, F+NomadLDA, and WarpLDA. HarpLDA+ uses optimized collective communication with timer-controlled load balancing, leading to stable scalability on both shared-memory and distributed systems. Our experiments show that HarpLDA+ effectively reduces synchronization and communication overhead and outperforms the other three LDA training systems.
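The synchronized training the abstract describes is built on top of the standard collapsed Gibbs sampling loop for LDA, in which each token's topic assignment is resampled from its full conditional given doc-topic and topic-word counts. The sketch below is a minimal single-threaded illustration of that base algorithm in plain Python; it is not HarpLDA+'s implementation (which is a parallel Java/Hadoop system), and all names here are illustrative.

```python
import random

def gibbs_lda(docs, V, K, alpha=0.1, beta=0.01, iters=50, seed=0):
    """Illustrative collapsed Gibbs sampler for LDA (not HarpLDA+'s code).

    docs: list of documents, each a list of word ids in [0, V)
    Returns (doc-topic counts, topic-word counts) after `iters` sweeps.
    """
    rng = random.Random(seed)
    ndk = [[0] * K for _ in docs]        # per-document topic counts
    nkw = [[0] * V for _ in range(K)]    # per-topic word counts
    nk = [0] * K                         # per-topic totals
    z = []                               # topic assignment for every token

    # Random initialization of topic assignments and counts.
    for d, doc in enumerate(docs):
        zd = []
        for w in doc:
            k = rng.randrange(K)
            zd.append(k)
            ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1
        z.append(zd)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the current token from the counts.
                k = z[d][i]
                ndk[d][k] -= 1; nkw[k][w] -= 1; nk[k] -= 1
                # Full conditional: p(z=t) ∝ (ndk+α)(nkw+β)/(nk+Vβ).
                weights = [(ndk[d][t] + alpha) * (nkw[t][w] + beta)
                           / (nk[t] + V * beta) for t in range(K)]
                r = rng.random() * sum(weights)
                for t, wgt in enumerate(weights):
                    r -= wgt
                    if r <= 0:
                        k = t
                        break
                # Re-add the token under its new topic.
                z[d][i] = k
                ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1
    return ndk, nkw
```

The irregular access pattern the abstract refers to is visible here: each token touches a different row of the topic-word table, which is what makes parallel synchronization of these counts the central systems problem.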
Keywords
parallel efficiency,latent dirichlet allocation,topic modeling,data analysis,LDA models,big datasets,irregular computation patterns,algorithm optimization,system design,comprehensive benchmarking,MPI/C++ based state-of-the-art systems,shared-memory,distributed systems,LDA training systems,machine learning technique,HarpLDA+,dynamic computation patterns,LDA training system,load balancing,Hadoop,Java