Using PageRank for Characterizing Topic Quality in LDA.

ICTIR (2018)

Abstract
Topic models based on Latent Dirichlet Allocation (LDA) are employed effectively in various information retrieval and data mining tasks. Despite their popularity and widespread application, the question of how to assess the quality of topics extracted by LDA models is still not completely resolved. While various measures have been proposed to quantify the thematic coherence and interpretability of a topic extracted by LDA, they do not address this problem sufficiently. We observe that existing quality measures select top topic words based on their topic-word co-occurrence without considering word co-occurrences within the same context. We incorporate precisely this information by constructing topic-specific graphs that capture the neighborhood of words in an LDA-modeled corpus. Next, the PageRank algorithm is applied to these graphs to assign word importance scores based on centrality. We propose two measures to compute topic quality: (1) the Aggregate PageRank of Top-words of a topic and (2) the PageRank Centralization Index of a topic-specific word graph. Our experiments across three datasets show that, unlike existing quality measures, our proposed measures are able to identify topics that are discriminative as well as interpretable and yield superior performance on both classification and intruder word identification tasks.
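
The pipeline outlined in the abstract (topic-specific word graph, PageRank, two graph-level scores) could be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract does not give the exact graph construction or formulas, so the sliding-window co-occurrence rule, the plain sum used for the aggregate score, and the Freeman-style centralization formula are all assumptions, and every function name here is hypothetical.

```python
from collections import defaultdict


def build_topic_graph(docs, topic, window=2):
    """Weighted undirected co-occurrence graph for one topic.

    docs: list of documents, each a list of (word, topic_id) pairs, i.e.
    per-token topic assignments from a fitted LDA model (assumed input).
    Two distinct words are linked when both are assigned to `topic` and
    lie at most `window` positions apart (assumed neighborhood rule).
    """
    graph = defaultdict(lambda: defaultdict(float))
    for doc in docs:
        for i, (w1, t1) in enumerate(doc):
            if t1 != topic:
                continue
            for j in range(i + 1, min(i + window + 1, len(doc))):
                w2, t2 = doc[j]
                if t2 == topic and w2 != w1:
                    graph[w1][w2] += 1.0
                    graph[w2][w1] += 1.0
    # Freeze into plain dicts so later lookups cannot create new nodes.
    return {w: dict(nbrs) for w, nbrs in graph.items()}


def pagerank(graph, damping=0.85, iters=100, tol=1e-10):
    """Weighted PageRank by power iteration.

    Every node produced by build_topic_graph has at least one neighbor,
    so no dangling-node correction is needed here.
    """
    nodes = list(graph)
    n = len(nodes)
    if n == 0:
        return {}
    rank = {w: 1.0 / n for w in nodes}
    out_weight = {w: sum(graph[w].values()) for w in nodes}
    for _ in range(iters):
        new = {w: (1.0 - damping) / n for w in nodes}
        for w in nodes:
            for nbr, weight in graph[w].items():
                new[nbr] += damping * rank[w] * weight / out_weight[w]
        delta = sum(abs(new[w] - rank[w]) for w in nodes)
        rank = new
        if delta < tol:
            break
    return rank


def aggregate_pagerank(rank, top_words):
    """Sum of PageRank scores over a topic's top words (assumed: plain sum)."""
    return sum(rank.get(w, 0.0) for w in top_words)


def centralization_index(rank):
    """Freeman-style centralization: mean gap to the most central word.

    The normalization by (n - 1) is an assumption made for this sketch.
    """
    if len(rank) < 2:
        return 0.0
    top = max(rank.values())
    return sum(top - r for r in rank.values()) / (len(rank) - 1)


# Toy corpus: each token carries a hypothetical LDA topic assignment.
docs = [
    [("ball", 0), ("team", 0), ("vote", 1), ("game", 0)],
    [("team", 0), ("game", 0), ("ball", 0)],
]
graph = build_topic_graph(docs, topic=0, window=2)
rank = pagerank(graph)
print(aggregate_pagerank(rank, ["team", "game"]))
print(centralization_index(rank))
```

Under these assumptions, a topic whose top words are also central in its co-occurrence graph receives a high aggregate score, while the centralization index grows as importance concentrates on a few words. A graph library could replace the hand-written power iteration; it is written out here only to keep the sketch dependency-free.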
Keywords
Topic Modeling, Topic Quality Measures, PageRank