TxtAlign: Efficient Near-Duplicate Text Alignment Search via Bottom-k Sketches for Plagiarism Detection

Proceedings of the 2022 International Conference on Management of Data (SIGMOD '22), 2022

Abstract
In this paper, we study the near-duplicate text alignment search problem: given a collection of source (data) documents and a suspicious (query) document, find all near-duplicate passage pairs between the suspicious document and every source document. The problem has applications in plagiarism detection, whose first two steps are source retrieval and text alignment. Source retrieval finds candidate source documents in a corpus that share content with the suspicious document, while text alignment finds all similar passage pairs between the suspicious document and every candidate source document. This problem is computation-intensive, especially for long documents, because there are O(n^2 m^2) passage pairs between a single source document with n words and a suspicious document with m words, not to mention the large number of source documents in a corpus. Due to the high computation cost, existing solutions primarily rely on heuristic rules, such as the "seeding-extension-filtering" pipeline, and involve many hard-to-tune hyper-parameters. To address these issues, a recent work, ALLIGN, leverages the min-wise hash sketch for the text alignment problem. However, ALLIGN only works for two documents and leaves the source retrieval problem unattended. In this paper, we propose to leverage the bottom-k sketch (a.k.a. conditional random sampling) to estimate the similarity of two passages. We observe that many nearby passages in a document share the same bottom-k sketch, so we propose to group all the passages in a document by their sketches. We prove that all O(n^2) passages in a document with n words can be partitioned into O(nk) groups, and we develop an algorithm that generates these groups in O(n log n + nk) time. Then, to address the source retrieval problem, we only need to find groups of passages with "similar" bottom-k sketches: every passage pair drawn from two groups with "similar" sketches is a near-duplicate pair.
Experimental results on real-world datasets show that our techniques are highly efficient.
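To make the two ideas in the abstract concrete, the following is a minimal Python sketch, not the paper's algorithm. It builds a bottom-k sketch of a passage, estimates Jaccard similarity from two sketches with the standard bottom-k estimator, and enumerates all passages of a tiny document by brute force to show that many of them share one sketch. The function names (`token_hash`, `bottom_k`, `estimate_jaccard`, `distinct_sketches`), the use of BLAKE2b as the hash function, and whitespace-split words as tokens are all assumptions made for illustration; the brute-force enumeration takes quadratic time and does not reproduce the O(n log n + nk) grouping algorithm described above.

```python
import hashlib


def token_hash(token: str) -> int:
    """Deterministic 64-bit hash of a token (BLAKE2b is an illustrative choice)."""
    digest = hashlib.blake2b(token.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def bottom_k(tokens, k):
    """Bottom-k sketch: the k smallest distinct hash values, in sorted order."""
    return sorted({token_hash(t) for t in tokens})[:k]


def estimate_jaccard(sketch_a, sketch_b, k):
    """Standard bottom-k estimator of the Jaccard similarity of two token sets.

    The k smallest values of the merged sketches form the bottom-k sketch of
    the union; the estimate is the fraction of them present in both sketches.
    """
    a, b = set(sketch_a), set(sketch_b)
    union_k = sorted(a | b)[:k]
    if not union_k:
        return 0.0
    shared = sum(1 for v in union_k if v in a and v in b)
    return shared / len(union_k)


def distinct_sketches(words, k):
    """Brute force: collect the distinct sketches over all O(n^2) passages.

    Hypothetical helper for illustration only; it is NOT the paper's
    O(n log n + nk) grouping algorithm.
    """
    seen = set()
    n = len(words)
    for i in range(n):
        for j in range(i + 1, n + 1):
            seen.add(tuple(bottom_k(words[i:j], k)))
    return seen


if __name__ == "__main__":
    k = 4
    source = "the quick brown fox jumps over the lazy dog".split()
    query = "the quick brown fox leaps over a lazy dog".split()
    print(estimate_jaccard(bottom_k(source, k), bottom_k(query, k), k))

    # In a document that alternates two words, all 21 passages collapse
    # onto only three distinct sketches when k = 2.
    doc = ["a", "b", "a", "b", "a", "b"]
    print(len(distinct_sketches(doc, 2)))
```

A larger k lowers the variance of the similarity estimate at the cost of bigger sketches, which is consistent with k appearing in the O(nk) group bound stated in the abstract.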
Keywords
Near-Duplicate Detection, Text Alignment, Plagiarism Detection, Conditional Random Sampling, Bottom-k Sketch