Estimating the Number of Near-Duplicate Document Pairs for Massive Data Sets using Small Space
msra
Abstract
Detecting similar or near-duplicate pairs in a large collection is an important problem with widespread applications; the problem has been studied for different data types (e.g. textual documents, spatial points and relational records) and under different settings. A more recent manifestation of the problem is efficiently finding near-duplicate Web pages, which is particularly challenging at Web scale because of the huge data volume and the high dimensionality of documents. More precisely, finding Web pages that are almost, but not exactly, the same among billions of documents is a very time-consuming task. In practice, the task can easily take days (if not weeks, depending on the data set size), even with powerful distributed computing infrastructures and after trading accuracy for efficiency (e.g. by reducing document dimensionality). Under such circumstances, we believe that it would be of great help if one could quickly predict the running time and the result set size, both of which rely on an important statistic (i.e. the number of near-duplicate pairs), using an inexpensive method before starting the slow near-duplicate detection operation. In this paper, we propose an efficient and elegant probabilistic algorithm to approximate the number of near-duplicate pairs. By scanning the input data set once, our algorithm gives a provably accurate estimate with high probability using only small constant space, independent of the number of objects in the data set. Both theoretical analysis and experimental evaluation on real and synthetic data show that our algorithm significantly outperforms the alternative random-sampling method (which is the only competitor to ours) when the dimensionality is reasonably small. Furthermore, the proposed algorithm is fully parallelizable and is well suited to the practical distributed computing environments of search engines (such as Google) because of its small space cost, which in turn results in reduced communication traffic.
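For orientation, the random-sampling baseline mentioned in the abstract can be sketched roughly as follows. This is only an illustrative sketch, not the algorithm proposed in the paper: it assumes documents are represented as token sets, that similarity is measured by Jaccard similarity against a fixed threshold, and that pairs are sampled uniformly at random. The function names (`jaccard`, `estimate_pairs_by_sampling`, `exact_pairs`) are hypothetical.

```python
import random
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two sets (defined as 1.0 when both are empty)."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def estimate_pairs_by_sampling(docs, threshold, num_samples, seed=0):
    """Estimate the number of pairs with similarity >= threshold.

    Assumption: uniform sampling of document pairs; the fraction of sampled
    pairs that are near-duplicates is scaled up to the total pair count.
    """
    rng = random.Random(seed)
    n = len(docs)
    total_pairs = n * (n - 1) // 2
    if total_pairs == 0 or num_samples <= 0:
        return 0.0
    hits = 0
    for _ in range(num_samples):
        i, j = rng.sample(range(n), 2)  # two distinct document indices
        if jaccard(docs[i], docs[j]) >= threshold:
            hits += 1
    return total_pairs * hits / num_samples


def exact_pairs(docs, threshold):
    """Brute-force count over all pairs; only feasible for small inputs."""
    return sum(
        1 for a, b in combinations(docs, 2) if jaccard(a, b) >= threshold
    )
```

Because near-duplicate pairs are typically a tiny fraction of all pairs, a sampler of this kind may need a very large number of samples before it sees any hit at all, which is the sort of weakness a dedicated estimator aims to avoid.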