UltraCDC: A Fast and Stable Content-Defined Chunking Algorithm for Deduplication-based Backup Storage Systems

2022 IEEE International Performance, Computing, and Communications Conference (IPCCC), 2022

Abstract
Content-Defined Chunking (CDC) is a key stage of data deduplication because it strongly affects both the throughput and the deduplication efficiency of a deduplication system. However, existing CDC algorithms suffer from high computation overhead, weak stability, and a poor ability to handle low-entropy strings. In this paper, we propose UltraCDC, a fast and stable CDC algorithm for deduplication-based storage systems that also handles low-entropy strings efficiently. Four key techniques underlie UltraCDC: rolling computation of boundary conditions, skipping the sub-minimum chunk size, normalized chunking, and jumping to detect low-entropy strings. Using a sliding window to compute boundary conditions in a rolling fashion not only accelerates the chunking stage but also makes it more resistant to boundary shift. Skipping the sub-minimum chunk size and normalized chunking complement each other to speed up chunking without sacrificing much deduplication ratio. Jumping detection finds more low-entropy strings than AE-opt2 without affecting chunking speed. We implemented UltraCDC in Destor, and the experimental results show that, with these four techniques, its chunking speed is 1.5-10x faster than that of state-of-the-art CDC approaches, while its deduplication ratio is comparable to or even higher than that of the classic Rabin-based CDC. UltraCDC is also the CDC approach with the highest ability to detect low-entropy strings, 10²x and 2x higher than Rabin-based CDC and AE-opt2, respectively.
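To make two of the techniques named in the abstract more concrete, the following is a minimal sketch of content-defined chunking with sub-minimum skipping and normalized chunking. It is not UltraCDC itself: the abstract does not specify its boundary condition, so a generic gear-style rolling hash is swapped in here. All names and constants (`MIN_SIZE`, `NORMAL_SIZE`, `MAX_SIZE`, the two masks, `next_boundary`, `chunk`) are assumptions chosen for illustration, and low-entropy jump detection is not modeled.

```python
import random

# Hypothetical parameters for illustration only; none come from the paper.
MIN_SIZE = 2048        # no boundary is allowed before this many bytes
NORMAL_SIZE = 8192     # target size around which chunk lengths are "normalized"
MAX_SIZE = 65536       # a boundary is forced at this size
MASK_STRICT = (1 << 15) - 1   # harder to satisfy: used before NORMAL_SIZE
MASK_LOOSE = (1 << 11) - 1    # easier to satisfy: used after NORMAL_SIZE

# Fixed pseudo-random table so that chunking is deterministic across runs.
_rng = random.Random(0)
GEAR = [_rng.getrandbits(64) for _ in range(256)]


def next_boundary(data: bytes, start: int) -> int:
    """Return the exclusive end offset of the chunk that begins at `start`."""
    if len(data) - start <= MIN_SIZE:
        return len(data)  # the tail is too short to split further
    end = min(start + MAX_SIZE, len(data))
    normal = min(start + NORMAL_SIZE, end)
    h = 0
    # Sub-minimum skipping: bytes before MIN_SIZE are never examined,
    # because a boundary could not be placed there anyway.
    for i in range(start + MIN_SIZE, end):
        h = ((h << 1) + GEAR[data[i]]) & 0xFFFFFFFFFFFFFFFF
        # Normalized chunking: a strict mask discourages small chunks,
        # a loose mask encourages a cut soon after NORMAL_SIZE.
        mask = MASK_STRICT if i < normal else MASK_LOOSE
        if h & mask == 0:
            return i + 1
    return end  # no content-defined boundary found: cut at MAX_SIZE


def chunk(data: bytes) -> list:
    """Split `data` into content-defined chunks."""
    chunks = []
    pos = 0
    while pos < len(data):
        end = next_boundary(data, pos)
        chunks.append(data[pos:end])
        pos = end
    return chunks
```

Under these assumptions, every chunk except possibly the last is longer than `MIN_SIZE` and no longer than `MAX_SIZE`, and concatenating the chunks reproduces the input. Because the loop starts at `start + MIN_SIZE`, the skipped bytes cost no hashing work, which is the source of the speedup that skipping is meant to provide.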
Key words
Data deduplication, content-defined chunking algorithm, storage system, performance evaluation