Maintaining very large samples using the geometric file (2007)

Abstract

Sampling is one of the most fundamental data management tools available. It is one of the most powerful methods for building a one-pass synopsis of a data set, especially in a streaming environment where the assumption is that there is too much data to store all of it permanently. However, most current research involving sampling considers the problem of how to use a sample, and not how to compute one. The implicit assumption is that a “sample” is a small data structure that is easily maintained as new data are encountered, even though simple statistical arguments demonstrate that very large samples of gigabytes or terabytes in size can be necessary to provide high accuracy. No existing work tackles the problem of maintaining very large, disk-based samples in an online manner from streaming data. We present a new data organization called the geometric file and online algorithms for maintaining very large, on-disk samples. The algorithms are designed for any environment where a large sample must be maintained online in a single pass through a data set. The geometric file organization meets the strict requirement that the sample always be a true, statistically random sample (without replacement) of all of the data processed thus far. We modify the classic reservoir sampling algorithm to compute a fixed-size sample in a single pass over a data set, where the goal is to bias the sample using an arbitrary, user-defined weighting function. We also describe how the geometric file can be used to perform biased reservoir sampling. While a very large sample can be required to answer a difficult query, a huge sample may often contain too much information. We therefore develop efficient techniques that allow a geometric file to itself be sampled in order to produce smaller data objects. Efficiently searching and discovering information from the geometric file is essential for query processing. A natural way to support this is to build an index structure. We discuss three secondary index structures and their maintenance as new records are inserted into a geometric file.
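
For readers unfamiliar with the classic reservoir sampling algorithm that the abstract takes as its starting point, the following is a minimal in-memory sketch of the unbiased, textbook version. It is not the geometric file and not the biased variant described in the paper. The function name `reservoir_sample` and its parameters are hypothetical and chosen for illustration only.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Return a uniform random sample (without replacement) of up to k
    records from an iterable, using one pass and O(k) memory.

    Illustrative sketch of classic reservoir sampling; the names here are
    assumptions, not taken from the paper.
    """
    rng = rng or random.Random()
    reservoir = []
    for i, record in enumerate(stream):
        if i < k:
            # The first k records fill the reservoir directly.
            reservoir.append(record)
        else:
            # Record number i+1 is kept with probability k / (i + 1):
            # draw j uniformly from 0..i (inclusive) and overwrite slot j
            # only if it falls inside the reservoir.
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = record
    return reservoir


if __name__ == "__main__":
    sample = reservoir_sample(range(1_000_000), k=10, rng=random.Random(42))
    print(sample)
```

Each replacement overwrites a randomly chosen slot, which is cheap in memory but would translate into a random disk write per accepted record if the reservoir lived on disk. That cost is the motivation the abstract gives for a dedicated on-disk organization.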

Keywords

new data organization, large sample, fundamental data management tool, data set, disk-based sample, single pass, small data structure, new data, geometric file, smaller data object
