HarpLDA+: Optimizing Latent Dirichlet Allocation for Parallel Efficiency
2017 IEEE International Conference on Big Data (Big Data), 2017
Abstract
Latent Dirichlet Allocation (LDA) is a widely used machine learning technique in topic modeling and data analysis. Training large LDA models on big datasets involves dynamic and irregular computation patterns, which poses a major challenge to both algorithm optimization and system design. In this paper, we present a comprehensive benchmarking of our novel synchronized LDA training system, HarpLDA+, built on Hadoop and Java. It demonstrates impressive performance compared to three state-of-the-art MPI/C++ based systems: LightLDA, F+NomadLDA, and WarpLDA. HarpLDA+ uses optimized collective communication with timer control for load balancing, leading to stable scalability on both shared-memory and distributed systems. Our experiments demonstrate that HarpLDA+ is effective in reducing synchronization and communication overhead and outperforms the other three LDA training systems.
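The timer-controlled synchronization mentioned above can be illustrated with a minimal sketch: each worker samples tokens only until a fixed time budget expires, then yields to the collective communication step, so that a slow worker cannot delay the synchronization point for the others. This is an illustrative sketch only; the class and constant names (`TimedSampler`, `SYNC_INTERVAL_MS`) are assumptions and do not come from the HarpLDA+ codebase.

```java
import java.util.concurrent.TimeUnit;

// Illustrative sketch of timer-bounded local computation before each
// collective model synchronization. Not the actual HarpLDA+ implementation;
// all names here are hypothetical.
public class TimedSampler {
    // Assumed per-slice time budget; HarpLDA+ would tune this value.
    static final long SYNC_INTERVAL_MS = 50;

    // Sample tokens starting at 'start' until the time budget expires or
    // the slice is exhausted; return the resume position so the caller can
    // trigger a collective sync with all workers at roughly the same time.
    static int sampleUntilTimeout(int[] tokens, int start) {
        long deadline = System.nanoTime()
                + TimeUnit.MILLISECONDS.toNanos(SYNC_INTERVAL_MS);
        int i = start;
        while (i < tokens.length && System.nanoTime() < deadline) {
            // placeholder for one Gibbs sampling step on tokens[i]
            i++;
        }
        return i;
    }

    public static void main(String[] args) {
        int[] tokens = new int[1_000_000];
        int pos = 0;
        while (pos < tokens.length) {
            pos = sampleUntilTimeout(tokens, pos);
            // collective communication (e.g., model rotation) would run here,
            // bounded in latency because every worker stopped on the timer
        }
        System.out.println("processed=" + pos);
    }
}
```

The design point this sketch captures is that synchronization frequency is governed by wall-clock time rather than token count, which keeps workers aligned even when per-token sampling cost is irregular.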
Keywords
parallel efficiency, latent dirichlet allocation, topic modeling, data analysis, LDA models, big datasets, irregular computation patterns, algorithm optimization, system design, comprehensive benchmarking, MPI/C++ based state-of-the-art systems, shared-memory, distributed systems, LDA training systems, machine learning technique, HarpLDA+, dynamic computation patterns, LDA training system, load balancing, Hadoop, Java