Exploiting Database Similarity Joins for Metric Spaces.

Proceedings of the VLDB Endowment (2012)

Abstract
Similarity Joins are recognized among the most useful data processing and analysis operations and are used extensively in multiple application domains. They retrieve all data pairs whose distances are smaller than a predefined threshold ε. Multiple Similarity Join algorithms and implementation techniques have been proposed. They range from out-of-database approaches for in-memory and external-memory data to techniques that use standard database operators to answer similarity joins. Recent work has shown that this operation can be efficiently implemented as a physical database operator. However, the proposed operator only supports 1D numeric data. This paper presents DBSimJoin, a physical Similarity Join database operator for datasets that lie in any metric space. DBSimJoin is a non-blocking operator that prioritizes the early generation of results. We implemented the proposed operator in PostgreSQL, an open-source database system. We show how this operator can be used in multiple real-world data analysis scenarios with multiple data types and distance functions. In particular, we show the use of DBSimJoin to identify similar images represented as feature vectors, and similar publications in a bibliographic database. We also show that DBSimJoin scales very well when important parameters, e.g., ε and data size, increase.
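
To make the definition in the abstract concrete, the following is a minimal sketch of a similarity join as a brute-force nested loop: it returns every pair whose distance is below ε. It is not the DBSimJoin algorithm, whose internals the abstract does not describe. The function name `naive_similarity_join`, the self-join setting, and the sample points are assumptions made for illustration only. Any metric can be passed as `distance`; Euclidean distance is used here as a default.

```python
from itertools import combinations
from math import dist
from typing import Callable, List, Sequence, Tuple

Point = Sequence[float]


def naive_similarity_join(
    points: Sequence[Point],
    eps: float,
    distance: Callable[[Point, Point], float] = dist,
) -> List[Tuple[int, int]]:
    """Return all index pairs (i, j), i < j, with distance(points[i], points[j]) < eps.

    Brute-force O(n^2) baseline that only illustrates the join semantics;
    it is NOT the DBSimJoin operator from the paper.
    """
    return [
        (i, j)
        for (i, p), (j, q) in combinations(enumerate(points), 2)
        if distance(p, q) < eps
    ]


# Hypothetical 2D feature vectors forming two tight clusters.
points = [(0.0, 0.0), (0.5, 0.0), (3.0, 4.0), (3.2, 4.1)]
pairs = naive_similarity_join(points, eps=1.0)
print(pairs)
```

Because the distance function is a parameter, the same sketch applies to other metric spaces mentioned in the abstract, for example an edit distance over publication titles, as long as the supplied function satisfies the metric axioms.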
Keywords
proposed operator, data pair, data size, external memory data, multiple data type, multiple real-world data analysis, numeric data, useful data processing, non-blocking operator, physical database operator, exploiting database similarity, metric space