Clustering for information analysis and retrieval: algorithms and applications (2011)

Abstract

This work focuses on three main questions. First, we ask which clustering objectives are best suited to which information retrieval problems. Prior work has evaluated which algorithms are best for particular problems, but this does not explain why an algorithm performs well: it could be designed for a good objective, or it could be doing something unrelated. We design an experiment to see past the algorithms and determine which objectives are best to optimize. Second, we ask what can be done to deal with information overload. A few previous algorithms can be applied to the popular k-means objective, but these are either designed for another objective (and thus do not benefit from properties specific to k-means), inaccurate, or in need of significant post-processing time. We design an algorithm that suffers from none of these drawbacks and demonstrate its effectiveness. We then improve it to compute a solution in less time than evaluating an existing solution takes. Finally, we demonstrate additional uses of the k-means objective.
We begin with an overview of previous work in data clustering and management. The history of such formulations as k-median, k-center, k-means, and facility location is discussed, and relevant existing results are surveyed. We then address the problem of comparing objectives for information retrieval. We first evaluate three common formulations for this problem, assessing the effectiveness of each in its own right. We then consider the value of knowing the true number of clusters a priori, versus being forced to determine it through multiple runs or while an algorithm is running. This leads into determining which objectives are better for information retrieval effectiveness. Finally, we consider the practical difference between two methods of preparing documents for clustering.
We also make progress in addressing the write-only memory problem by providing an algorithm for many clustering formulations that gives a constant-factor approximation despite reading the data only once and using very little main memory relative to the input size. In the popular k-means formulation, we achieve an approximation ratio that approaches one as the available memory increases and as the input data becomes more suitable for clustering. Finally, we investigate an additional use of clustering: gaining information rather than coping with an overload of it. We use k-means to address collaborative filtering.
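
The abstract compares several clustering objectives without stating them. As a rough illustration only, the sketch below evaluates the textbook forms of the k-means, k-median, k-center, and (uniform-cost) facility location objectives for a fixed set of centers. The function names, the Euclidean metric, the toy data, and the `opening_cost` parameter are assumptions made for this example and are not taken from the work itself.

```python
import math

def nearest_distances(points, centers):
    """Distance from each point to its closest center.

    Evaluating a solution this way takes n * k distance computations.
    """
    return [min(math.dist(p, c) for c in centers) for p in points]

def k_means_cost(points, centers):
    # Sum of squared distances to the nearest center.
    return sum(d * d for d in nearest_distances(points, centers))

def k_median_cost(points, centers):
    # Sum of (unsquared) distances to the nearest center.
    return sum(nearest_distances(points, centers))

def k_center_cost(points, centers):
    # Largest distance from any point to its nearest center.
    return max(nearest_distances(points, centers))

def facility_location_cost(points, centers, opening_cost):
    # Assumed uniform opening cost per center, plus the connection cost.
    return opening_cost * len(centers) + k_median_cost(points, centers)

# Hypothetical toy data: two well-separated groups of two points each.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 4.0)]
centers = [(0.0, 1.0), (10.0, 2.0)]

print(k_means_cost(points, centers))                 # 10.0
print(k_median_cost(points, centers))                # 6.0
print(k_center_cost(points, centers))                # 2.0
print(facility_location_cost(points, centers, 3.0))  # 12.0
```

The same assignment of points to centers scores differently under each objective, which is why the choice of objective can be studied separately from the choice of algorithm.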

Keywords

clustering purpose, clustering formulation, information retrieval, clustering objective, information retrieval effectiveness, popular k-means objective, popular k-means formulation, information analysis, clustering increase, information overload, k-means objective