Distributing the Stochastic Gradient Sampler for Large-Scale LDA

KDD 2016

Cited by 18
Abstract
Learning large-scale Latent Dirichlet Allocation (LDA) models is beneficial for many applications that involve large collections of documents. Recent work has focused on developing distributed algorithms in the batch setting, leaving behind stochastic methods, which can effectively exploit statistical redundancy in big data and are thereby complementary to distributed computing. Distributed stochastic gradient Langevin dynamics (DSGLD) represents one attempt to combine stochastic sampling and distributed computing, but it suffers from drawbacks such as excessive communication and sensitivity to how datasets are partitioned across nodes; DSGLD is typically limited to learning small models with about $10^3$ topics and a $10^3$ vocabulary. In this paper, we present embarrassingly parallel SGLD (EPSGLD), a novel distributed stochastic gradient sampling method for topic models. Our sampler is built upon a divide-and-conquer architecture that enables us to produce robust and asymptotically exact samples with less communication overhead than DSGLD. We further propose several techniques to reduce the overhead in I/O and memory usage. Experiments on Wikipedia and ClueWeb12 documents demonstrate that EPSGLD can scale up to large models with $10^{10}$ parameters (i.e., $10^5$ topics and a $10^5$ vocabulary), four orders of magnitude larger than DSGLD, and converge faster.
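
For context on the sampling machinery the abstract refers to, the following is a minimal sketch of a single stochastic gradient Langevin dynamics (SGLD) update, the building block that DSGLD and EPSGLD distribute. The function name, the Gaussian toy model in the usage example, and all parameter choices are illustrative assumptions; the paper's actual sampler operates on LDA's topic-word parameters and adds the divide-and-conquer partitioning described in the abstract, none of which is shown here.

import numpy as np

def sgld_step(theta, minibatch, grad_log_prior, grad_log_lik, step_size, data_size):
    # One generic SGLD update (a sketch, not the paper's LDA sampler).
    # The minibatch gradient is rescaled by data_size / |minibatch| to form an
    # unbiased estimate of the full-data log-likelihood gradient, and Gaussian
    # noise with variance equal to the step size is injected so that, with a
    # decaying step size, the iterates asymptotically sample from the posterior.
    grad = grad_log_prior(theta)
    grad = grad + (data_size / len(minibatch)) * sum(
        grad_log_lik(theta, x) for x in minibatch
    )
    noise = np.random.normal(0.0, np.sqrt(step_size), size=theta.shape)
    return theta + 0.5 * step_size * grad + noise

# Hypothetical usage: sampling the mean of a unit-variance Gaussian
# with a standard normal prior, from minibatches of the data.
data = np.random.normal(2.0, 1.0, size=1000)
theta = np.zeros(1)
for t in range(2000):
    batch = data[np.random.choice(len(data), 50, replace=False)]
    theta = sgld_step(
        theta, batch,
        grad_log_prior=lambda th: -th,        # d/dtheta log N(theta; 0, 1)
        grad_log_lik=lambda th, x: x - th,    # d/dtheta log N(x; theta, 1)
        step_size=1e-4, data_size=len(data),
    )

In the embarrassingly parallel setting the abstract suggests, each worker would run such a chain on its own shard of documents and the resulting sub-posterior samples would be combined afterwards; that combination step is part of EPSGLD and is not reproduced in this sketch.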
Keywords
large-scale LDA, stochastic gradient sampler, embarrassingly parallel MCMC