RapidCDC: Leveraging Duplicate Locality to Accelerate Chunking in CDC-based Deduplication Systems

SoCC '19: ACM Symposium on Cloud Computing, Santa Cruz, CA, USA, November 2019

Abstract
I/O deduplication is a key technique for improving storage systems' space and I/O efficiency. Among various deduplication techniques, content-defined chunking (CDC) based deduplication is the most desirable for its high deduplication ratio. However, CDC is compute-intensive and time-consuming, and has been recognized as a major performance bottleneck of CDC-based deduplication systems. In this paper we leverage a property of duplicate data, named duplicate locality, which reveals that multiple duplicate chunks are likely to occur together: one duplicate chunk is likely to be immediately followed by a sequence of contiguous duplicate chunks, and the longer the sequence, the stronger the locality. After a quantitative analysis of duplicate locality in real-world data, we propose a suite of chunking techniques that exploit this locality to remove almost all chunking cost for deduplicatable chunks in CDC-based deduplication systems. The resulting deduplication method, named RapidCDC, has two salient features. One is that its efficiency is positively correlated with the deduplication ratio: RapidCDC can be as fast as a fixed-size chunking method when applied to data sets with high data redundancy. The other is that its high efficiency does not rely on high duplicate locality strength. These features make RapidCDC's effectiveness almost guaranteed for datasets with a high deduplication ratio. Our experimental results with synthetic and real-world datasets show that RapidCDC's chunking speed can be up to 33x higher than regular CDC's, while maintaining (nearly) the same deduplication ratio.
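To make the idea in the abstract concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes a Gear-style rolling hash for regular CDC and SHA-1 chunk fingerprints, and it models the locality exploitation as a per-fingerprint list of recorded next-chunk sizes: after a chunk is found to be a duplicate, the hinted boundary for the following chunk is tried and verified by a single fingerprint lookup, skipping the byte-by-byte rolling-hash scan. All names and parameters here are illustrative choices.

```python
import hashlib
import random

random.seed(7)
GEAR = [random.getrandbits(64) for _ in range(256)]  # random gear table
MASK = (1 << 10) - 1              # expected average chunk size ~1 KiB
MIN_CHUNK, MAX_CHUNK = 256, 8192  # hard bounds on chunk size
U64 = (1 << 64) - 1

def find_cut(data, start):
    """Scan forward with a Gear rolling hash; return the next cut point."""
    fp = 0
    end = min(start + MAX_CHUNK, len(data))
    for i in range(start, end):
        fp = ((fp << 1) + GEAR[data[i]]) & U64
        if i + 1 - start >= MIN_CHUNK and (fp & MASK) == 0:
            return i + 1
    return end                    # forced cut at MAX_CHUNK or end of data

def cdc_chunks(data):
    """Regular CDC: every byte passes through the rolling hash."""
    chunks, start = [], 0
    while start < len(data):
        cut = find_cut(data, start)
        chunks.append(data[start:cut])
        start = cut
    return chunks

def sha(b):
    return hashlib.sha1(b).digest()

def rapidcdc_chunks(data, known_fps, size_hints):
    """Locality-aware sketch: after a duplicate chunk, try the recorded
    size(s) of the chunk that followed it last time; fall back to the
    rolling-hash scan only when no hint verifies."""
    chunks, start, prev_fp, hint_hits = [], 0, None, 0
    while start < len(data):
        cut = None
        for size in size_hints.get(prev_fp, ()):
            cand = data[start:start + size]
            # Verify the hinted boundary by one fingerprint lookup,
            # without hashing every byte of the chunk with the rolling hash.
            if len(cand) == size and sha(cand) in known_fps:
                cut, hint_hits = start + size, hint_hits + 1
                break
        if cut is None:           # no usable hint: regular CDC scan
            cut = find_cut(data, start)
        chunk = data[start:cut]
        chunks.append(chunk)
        prev_fp = sha(chunk)
        start = cut
    return chunks, hint_hits

# Demo: chunk version 1 normally, record fingerprints and next-chunk
# size hints, then re-chunk an identical version 2 using the hints.
v1 = bytes(random.getrandbits(8) for _ in range(64 * 1024))
v1_chunks = cdc_chunks(v1)
known_fps = {sha(c) for c in v1_chunks}
size_hints = {}
for cur, nxt in zip(v1_chunks, v1_chunks[1:]):
    size_hints.setdefault(sha(cur), []).append(len(nxt))

v2_chunks, hint_hits = rapidcdc_chunks(v1, known_fps, size_hints)
```

In this toy run every chunk after the first is accepted through a size hint, so only the first chunk pays the rolling-hash cost, which mirrors the abstract's claim that efficiency grows with the deduplication ratio.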
Keywords
storage systems, deduplication, CDC, content-defined chunking, locality