A Bloom Filter Based Scalable Data Integrity Check Tool For Large-Scale Dataset

SC16: The International Conference for High Performance Computing, Networking, Storage and Analysis, Salt Lake City, Utah, November 2016

Large-scale HPC applications are becoming increasingly data intensive. At the Oak Ridge Leadership Computing Facility (OLCF), we are observing that the number of files curated under an individual project can reach as high as 200 million, and that project data sizes exceed petabytes. These simulation datasets, once validated, often need to be transferred to an archival system for long-term storage or shared with the rest of the research community. Ensuring the integrity of the full dataset at this scale is of paramount importance but also a daunting task. This is especially true considering that most conventional tools are serial and file-based, unwieldy to use, and/or unable to scale to meet users' demands. To tackle this particular challenge, this paper presents the design, implementation, and evaluation of a scalable parallel checksumming tool, fsum, which we developed at OLCF. It is built upon the principles of parallel tree walk and the work-stealing pattern to maximize parallelism, and it is capable of generating a single, consistent signature for the entire dataset at extreme scale. We also applied a novel Bloom-filter-based technique for aggregating signatures to overcome the signature ordering requirement. Given the probabilistic nature of Bloom filters, we provided a detailed error and trade-off analysis. Using multiple datasets from a production environment, we demonstrated that our tool can efficiently handle both very large files and datasets made up of many small files. Our preliminary tests showed that, on the same hardware, it outperforms a conventional tool by as much as 4x. It also exhibited near-linear scaling when provisioned with more compute resources.
Bloom filter-based scalable data integrity check tool, large-scale dataset, large-scale HPC applications, Oak Ridge Leadership Computing Facility, OLCF, project data size, fsum scalable parallel checksumming tool, parallel tree walk, work-stealing pattern, extreme scale, signature ordering requirement, very-large file dataset, small-file based dataset, near-linear scaling properties
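The abstract's idea of aggregating signatures through a Bloom filter, so that the final signature does not depend on the order in which parallel workers finish, can be illustrated with a minimal Python sketch. This is not fsum's actual implementation: the class and function names (`BloomSignature`, `chunk_digests`, `dataset_signature`), the choice of SHA-1, the double-hashing scheme, and the filter parameters are all assumptions made for illustration. The last function gives the standard approximation of a Bloom filter's false-positive rate, (1 - e^(-kn/m))^k, which is the kind of quantity an error and trade-off analysis would consider.

```python
import hashlib
import math


class BloomSignature:
    """Hypothetical order-independent aggregator for chunk digests."""

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        # num_bits is assumed to be a multiple of 8 for simple byte packing.
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, digest):
        # Derive k bit positions from one digest via double hashing
        # (an assumption here, not necessarily what the paper uses).
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, digest):
        # Setting bits is commutative, so insertion order does not matter.
        for pos in self._positions(digest):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def signature(self):
        # A single signature for the whole dataset: hash of the bit array.
        return hashlib.sha1(bytes(self.bits)).hexdigest()


def chunk_digests(path, data, chunk_size=4):
    """Yield one digest per chunk, bound to its file path and offset."""
    for offset in range(0, len(data), chunk_size):
        h = hashlib.sha1()
        h.update(f"{path}:{offset}:".encode())
        h.update(data[offset:offset + chunk_size])
        yield h.digest()


def dataset_signature(files, order):
    """Aggregate all chunk digests, visiting files in the given order."""
    bloom = BloomSignature()
    for path in order:
        for digest in chunk_digests(path, files[path]):
            bloom.add(digest)
    return bloom.signature()


def false_positive_rate(n, m, k):
    """Standard approximation for n items, m bits, and k hash functions."""
    return (1.0 - math.exp(-k * n / m)) ** k
```

Because each chunk digest already encodes the file path and offset, workers in this sketch can insert digests in any order and still arrive at the same bit array. The price is the probabilistic guarantee: a corrupted chunk goes undetected only if all of its bit positions happen to be set already, which is why the filter size and number of hash functions have to be chosen against the expected number of chunks.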