Stratified random sampling from streaming and stored data

DISTRIBUTED AND PARALLEL DATABASES(2020)

引用 47|浏览33
暂无评分
摘要
Stratified random sampling (SRS) is a widely used sampling technique for approximate query processing. We consider SRS on continuously arriving data streams and statically stored data sets. We present a tight lower bound showing that any streaming algorithm for SRS over the entire stream must have, in the worst case, a variance that is (r) factor away from the optimal, where r is the number of strata. We present S-VOILA, a practical streaming algorithm for SRS over the entire stream that is locally variance-optimal . We prove that any sliding window-based streaming SRS needs a workspace of (rMlog W) in the worst case, to maintain a variance-optimal SRS of size M , where W is the number of elements in the sliding window. Due to the inherent high workspace needs for sliding window-based SRS, we present SW-VOILA, a multi-layer practical sampling algorithm that uses only O ( M ) workspace but can maintain an SRS of size close to M in practice over a sliding window. Experiments show that both S-VOILA and SW-VOILA result in a variance that is typically close to their optimal offline counterparts, which was given the entire input beforehand. We also present VOILA, a variance-optimal offline algorithm for stratified random sampling. VOILA is a strict generalization of the well-known Neyman allocation , which is optimal only under the assumption that each stratum is abundant. Experiments show that VOILA can have significantly smaller variance (1.4x to 50x) than Neyman allocation on real-world data.
更多
查看译文
关键词
Stratified random sampling,Stream sampling,Sliding window sampling,Neyman allocation
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要