Similarity Search For Dynamic Data Streams

Marc Bury,Chris Schwiegelshohn,Mara Sorella

IEEE Transactions on Knowledge and Data Engineering（2020）

引用 7|浏览83

暂无评分

摘要

Nearest neighbor searching systems are an integral part of many online applications, including but not limited to pattern recognition, plagiarism detection, and recommender systems. With increasingly larger data sets, scalability has become an important issue. Many of the most space and running time efficient algorithms are based on locality-sensitive hashing. Here, we view the data set as an n by vertical bar U vertical bar matrix where each row corresponds to one of n users and the columns correspond to items drawn from a universe U. The de-facto standard approach to quickly answer nearest neighbor queries on such a data set is usually a form of min-hashing. Not only is min-hashing very fast, but it is also space efficient and can be implemented in many computational models aimed at dealing with large data sets such as MapReduce and streaming. However, a significant drawback is that minhashing and related methods are only able to handle insertions to user profiles and tend to perform poorly when items may be removed. We initiate the study of scalable locality-sensitive hashing (LSH) for fully dynamic data-streams. Specifically, using the Jaccard index as similarity measure, we design (1) a collaborative filtering mechanism maintainable in dynamic data streams and (2) a sketching algorithm for similarity estimation. Our algorithms have little overhead in terms of running time compared to previous LSH approaches for the insertion only case, and drastically outperform previous algorithms in case of deletions.

查看译文

关键词

Heuristic algorithms,Hash functions,Approximation algorithms,Indexes,Measurement,Knowledge engineering,Data engineering,Dynamic streaming,locality-sensitive hashing,nearest neighbor searching

AI 理解论文

溯源树

样例

生成溯源树，研究论文发展脉络

Chat Paper

正在生成论文摘要