C2Net: A Network-Efficient Approach to Collision Counting LSH Similarity Join(Extended Abstract)

Hangyu Li,Sarana Nutanong,Hong Xu,Chenyun Yu,Foryu Ha

2019 IEEE 35th International Conference on Data Engineering (ICDE)（2019）

引用 10|浏览12

暂无评分

摘要

Similarity join of two datasets P and Q is a primitive operation that is useful in many application domains. The operation involves identifying pairs (p, q), in the Cartesian product of P and Q such that (p, q) satisfies a stipulated similarity condition. In a high-dimensional space, an approximate similarity join based on locality-sensitive hashing (LSH) provides a good solution while reducing the processing cost with a predictable loss of accuracy. A distributed processing framework such as MapReduce allows the handling of large and high-dimensional datasets. However, network cost frequently turns into a bottleneck in a distributed processing environment, thus resulting in a challenge of achieving faster and more efficient similarity join [2]. This paper focuses on collision counting LSH-based similarity join in MapReduce and proposes a network-efficient solution called C2Net to improve the utilization of MapReduce combiners. The solution uses two graph partitioning schemes: (i) minimum spanning tree for organizing LSH buckets replication; and (ii) spectral clustering for runtime collision counting task scheduling. Experiments have shown that, in comparison to the state of the art, the proposed solution is able to achieve 20% data reduction and 50% reduction in shuffle time.

查看译文

关键词

Runtime,Task analysis,Distributed databases,Distributed processing,Scheduling,Acceleration,Time measurement

AI 理解论文

溯源树

样例

生成溯源树，研究论文发展脉络

Chat Paper

正在生成论文摘要