UniIndex: An index and query middleware for parallel file systems

Concurrency and Computation: Practice and Experience (2020)

Abstract
As data analysis workloads on high-performance computing systems keep growing, the ability to select a small fraction of data from large scientific data sets is vital to accelerating scientific discovery. However, parallel file systems cannot efficiently locate data at either file or record granularity. Existing methods for identifying and indexing data are often domain-specific and do not scale to large scientific data sets. In this paper, we describe the design and implementation of the UniIndex framework, which combines our proposed techniques for user-annotation extraction, an in-memory cache layer, in situ indexing, and parallel query processing. Acting as middleware on top of production file systems, UniIndex provides efficient data locating services with minimal user effort. Our evaluations show that UniIndex can locate target files in directories containing millions of files within microseconds. By applying in situ indexing and a lightweight range-bitmap index, record-level index building time is dramatically reduced, while queries still run up to two orders of magnitude faster than scanning the entire data set.
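To make the idea of a range-bitmap index concrete, the following is a minimal Python sketch of a generic binned bitmap index. It is not UniIndex's implementation: the class name `RangeBitmapIndex`, the `edges` parameter, and the candidate-check step are assumptions chosen for illustration. Each value range (bin) gets one bitmap, and a range query ORs the bitmaps of the overlapping bins, then verifies the candidate rows against the raw values.

```python
from bisect import bisect_right


class RangeBitmapIndex:
    """Hypothetical binned bitmap index over one numeric column."""

    def __init__(self, values, edges):
        # edges: sorted bin boundaries; bin i covers [edges[i], edges[i+1]).
        # Values outside the boundaries are clamped into the first/last bin.
        self.values = list(values)
        self.edges = list(edges)
        # One bitmap per bin, stored as a Python int; bit r is set if row r
        # falls into that bin.
        self.bitmaps = [0] * (len(self.edges) - 1)
        for row, v in enumerate(self.values):
            self.bitmaps[self._bin(v)] |= 1 << row

    def _bin(self, v):
        b = bisect_right(self.edges, v) - 1
        return min(max(b, 0), len(self.bitmaps) - 1)

    def query(self, lo, hi):
        """Return row numbers whose value v satisfies lo <= v < hi."""
        candidates = 0
        # OR together the bitmaps of every bin that can overlap [lo, hi).
        for b in range(self._bin(lo), self._bin(hi) + 1):
            candidates |= self.bitmaps[b]
        # Bins are coarse, so check each candidate row against the raw value.
        hits = []
        row = 0
        while candidates:
            if candidates & 1 and lo <= self.values[row] < hi:
                hits.append(row)
            candidates >>= 1
            row += 1
        return hits
```

With a column such as `[0.5, 3.2, 7.9, 5.0, 9.9, 2.1]` and boundaries `[0, 2, 4, 6, 8, 10]`, calling `query(2.0, 6.0)` only inspects rows whose bins overlap the range rather than scanning every row. A coarser set of bins makes the index smaller and faster to build at the cost of more candidate checks per query.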
Keywords
big data, data management, high-performance computing, index