A Sampling-Based System For Approximate Big Data Analysis On Computing Clusters
PROCEEDINGS OF THE 28TH ACM INTERNATIONAL CONFERENCE ON INFORMATION & KNOWLEDGE MANAGEMENT (CIKM '19)(2019)
Abstract
To break the in-memory bottleneck and facilitate online sampling in cluster computing frameworks, we propose a new sampling-based system for approximate big data analysis on computing clusters. We address both computational and statistical aspects of big data across the main layers of cluster computing frameworks: big data storage, big data management, big data online sampling, big data processing, and big data exploration and analysis. We use the new Random Sample Partition (RSP) distributed data model to store a big data set as a set of ready-to-use random sample data blocks in Hadoop Distributed File System (HDFS), called RSP blocks. With this system, only a few RSP blocks are selected and processed using a sequential algorithm in a distributed data-parallel manner to produce approximate results for the entire data set. In this paper, we present a prototype RSP-based system and demonstrate its advantages. Our experiments show that RSP blocks can be used to get approximate models and summary statistics as well as estimate the proportions of inconsistent values without computing the entire data or running expensive online sampling operations. This new system enables big data exploration and analysis where the entire data set cannot be computed.
MoreTranslated text
Key words
Big Data,Cluster Computing,Approximate Computing
AI Read Science
Must-Reading Tree
Example
Generate MRT to find the research sequence of this paper
Chat Paper
Summary is being generated by the instructions you defined