C2Net: A Network-Efficient Approach to Collision Counting LSH Similarity Join (Extended Abstract)

2019 IEEE 35th International Conference on Data Engineering (ICDE) (2019)

Abstract
Similarity join of two datasets P and Q is a primitive operation that is useful in many application domains. The operation identifies the pairs (p, q) in the Cartesian product of P and Q that satisfy a stipulated similarity condition. In a high-dimensional space, an approximate similarity join based on locality-sensitive hashing (LSH) is a good solution because it reduces the processing cost with a predictable loss of accuracy. A distributed processing framework such as MapReduce allows the handling of large, high-dimensional datasets. However, network cost frequently becomes a bottleneck in a distributed processing environment, which makes a faster and more efficient similarity join challenging to achieve [2]. This paper focuses on collision counting LSH-based similarity join in MapReduce and proposes a network-efficient solution called C2Net that improves the utilization of MapReduce combiners. The solution uses two graph partitioning schemes: (i) a minimum spanning tree for organizing LSH bucket replication; and (ii) spectral clustering for scheduling collision counting tasks at runtime. Experiments show that, in comparison to the state of the art, the proposed solution achieves a 20% data reduction and a 50% reduction in shuffle time.
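As a rough illustration of the collision counting idea that the abstract builds on, the following is a minimal single-machine sketch in Python. It is not the C2Net algorithm and involves no MapReduce, combiners, or graph partitioning. It assumes a random-projection hash family of the form floor((a·x + b) / w), a collision threshold, and a Euclidean radius as the similarity condition. The function names and all parameter values are hypothetical and chosen only for illustration.

```python
import math
import random
from collections import defaultdict


def make_hashes(dim, m, w, seed=0):
    """Draw m random-projection hash functions (assumed family, not from the paper)."""
    rng = random.Random(seed)
    return [
        ([rng.gauss(0.0, 1.0) for _ in range(dim)], rng.uniform(0.0, w))
        for _ in range(m)
    ]


def bucket(point, a, b, w):
    """Bucket id of a point under one hash function: floor((a . x + b) / w)."""
    return math.floor((sum(ai * xi for ai, xi in zip(a, point)) + b) / w)


def collision_counting_join(P, Q, m=16, w=4.0, threshold=8, radius=1.0, seed=0):
    """Return sorted index pairs (i, j) with dist(P[i], Q[j]) <= radius.

    A pair becomes a candidate only if it collides under at least
    `threshold` of the m hash functions; candidates are then verified
    with an exact distance check, so no false positives are returned.
    """
    hashes = make_hashes(len(P[0]), m, w, seed)
    counts = defaultdict(int)  # (i, j) -> number of hash functions with a collision
    for a, b in hashes:
        # Group Q by bucket id under this hash function.
        buckets = defaultdict(list)
        for j, q in enumerate(Q):
            buckets[bucket(q, a, b, w)].append(j)
        # Every point of P collides with the Q points in its own bucket.
        for i, p in enumerate(P):
            for j in buckets.get(bucket(p, a, b, w), ()):
                counts[(i, j)] += 1
    return sorted(
        (i, j)
        for (i, j), c in counts.items()
        if c >= threshold and math.dist(P[i], Q[j]) <= radius
    )


if __name__ == "__main__":
    P = [(0.0, 0.0), (10.0, 10.0)]
    Q = [(0.0, 0.0), (50.0, 50.0)]
    print(collision_counting_join(P, Q))
```

In a distributed setting, the per-pair counts accumulated in `counts` correspond to the intermediate records that would have to be shuffled across the network, which is the cost the abstract says C2Net aims to reduce.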
Keywords
Runtime,Task analysis,Distributed databases,Distributed processing,Scheduling,Acceleration,Time measurement