Dataset Similarity Detection for Global Deduplication in the DD File System.

Tony Wong, Smriti Thakkar, Kao-Feng Hsieh, Zachary Tom, Hetaben Saraiya, Philip Shilane


Deduplication has become a widely used technique for reducing the space requirements of storage systems by replacing redundant chunks of data with references. While storage systems continue to grow in size, there remain practical limits to the size of any deduplication node, and enterprise businesses may have dozens to hundreds of nodes. Placing datasets on the right nodes in a multi-node environment is therefore important for capturing deduplication savings globally. For customers of the DD File System (DDFS), we provide the Global Deduplication Service, which advises customers on data placement to maximize deduplication-related space savings. This paper describes our currently shipping approach, which uses a Fingerprint Dictionary to intelligently cluster customer data and generate a plan to relocate datasets to improve global deduplication. We report results from thousands of deployed systems at customer sites. We have also developed a further improvement using MinHashes that lowers resource requirements, and we provide proofs of the similarity estimates. Our results on a real-world dataset show that MinHashes improve clustering speed by up to 400X relative to our previous method and reduce memory consumption by up to 260X.
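To make the MinHash idea concrete, the following is a minimal, generic sketch of how a MinHash signature can estimate the Jaccard similarity between two datasets represented as sets of chunk fingerprints. It is not the construction used in DDFS or in the paper, whose details the abstract does not give. The function names, the use of BLAKE2b as the hash family, the choice of 128 hash functions, and the synthetic fingerprints are all assumptions made for illustration.

```python
import hashlib


def fingerprint_hash(fp: bytes, seed: int) -> int:
    """Hash a chunk fingerprint with a seeded 64-bit hash (hypothetical choice: BLAKE2b)."""
    h = hashlib.blake2b(fp, digest_size=8, salt=seed.to_bytes(8, "little"))
    return int.from_bytes(h.digest(), "little")


def minhash_signature(fingerprints, num_hashes: int = 128):
    """Return, for each seeded hash function, the minimum hash over the set."""
    fps = list(fingerprints)
    if not fps:
        raise ValueError("cannot build a MinHash signature for an empty dataset")
    return [min(fingerprint_hash(fp, seed) for fp in fps) for seed in range(num_hashes)]


def estimate_jaccard(sig_a, sig_b) -> float:
    """Fraction of signature positions that agree; an estimate of Jaccard similarity."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


def exact_jaccard(a, b) -> float:
    """Exact Jaccard similarity |A & B| / |A | B|, for comparison."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)


# Synthetic stand-ins for chunk fingerprints (assumption: real ones would be
# content hashes of data chunks). The two datasets share half their chunks.
dataset_a = {f"chunk-{i}".encode() for i in range(0, 1000)}
dataset_b = {f"chunk-{i}".encode() for i in range(500, 1500)}

sig_a = minhash_signature(dataset_a)
sig_b = minhash_signature(dataset_b)

print("exact Jaccard:    ", exact_jaccard(dataset_a, dataset_b))
print("estimated Jaccard:", estimate_jaccard(sig_a, sig_b))
```

The signature has a fixed length regardless of dataset size, so pairwise comparisons during clustering touch only `num_hashes` integers per dataset rather than the full fingerprint sets. That is the general reason a MinHash-based method can need less memory and time than comparing complete fingerprint collections. Increasing `num_hashes` tightens the estimate at the cost of a larger signature.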
Keywords: MinHash, Clustering, Jaccard similarity, Data Placement, Deduplication File System