A Two-Stage Data Processing Algorithm to Generate Random Sample Partitions for Big Data Analysis

IEEE International Conference on Cloud Computing (2018)

Shenzhen Univ

Abstract
To enable the individual data block files of a distributed big data set to be used as random samples for big data analysis, this paper proposes a two-stage data processing (TSDP) algorithm that converts a big data set into a random sample partition (RSP) representation. The RSP ensures that each individual data block is a random sample of the big data and can therefore be used to estimate the statistical properties of the big data. The first stage of the algorithm sequentially chunks the big data set into non-overlapping subsets and distributes these subsets as data block files to the nodes of a cluster. The second stage takes a random sample, without replacement, from each subset and combines these samples into a new subset that is saved as an RSP data block file. The random sampling step is repeated until all data records in all subsets are used up, and the resulting set of RSP data block files forms an RSP of the big data. It is formally proved that the expectation of the sample distribution function (s.d.f.) of each RSP data block equals the s.d.f. of the big data set; therefore, each RSP data block is a random sample of the big data set. An implementation of the TSDP algorithm on Apache Spark and HDFS is presented. Performance evaluations on terabyte data sets show the efficiency of this algorithm in converting HDFS big data files into HDFS RSP big data files. We also show an example that uses only a small number of RSP data blocks to build ensemble models that perform better than a single model built from the entire data set.
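The two stages described above can be sketched in a few lines of Python. This is a minimal in-memory illustration, not the paper's Spark/HDFS implementation: the function name `tsdp_partition` and its parameters are hypothetical, plain lists stand in for data block files, and the repeated sampling without replacement is assumed to be realized by shuffling each chunk once and dealing out equal slices.

```python
import random


def tsdp_partition(records, num_chunks, num_rsp_blocks, seed=0):
    """Toy two-stage partitioning (illustrative only; names are hypothetical)."""
    rng = random.Random(seed)
    n = len(records)

    # Stage 1: sequentially cut the data into non-overlapping chunks.
    chunk_size = -(-n // num_chunks)  # ceiling division
    chunks = [records[i:i + chunk_size] for i in range(0, n, chunk_size)]

    # Stage 2: shuffle each chunk, then give one slice of it to every RSP block.
    # Taking consecutive slices of a shuffled chunk is assumed here to be
    # equivalent to repeatedly sampling the chunk without replacement.
    rsp_blocks = [[] for _ in range(num_rsp_blocks)]
    for chunk in chunks:
        shuffled = chunk[:]
        rng.shuffle(shuffled)
        slice_size = -(-len(shuffled) // num_rsp_blocks)
        for k in range(num_rsp_blocks):
            rsp_blocks[k].extend(shuffled[k * slice_size:(k + 1) * slice_size])
    return rsp_blocks


# Example: 1,000 records, 10 sequential chunks, 5 RSP blocks.
records = list(range(1000))
blocks = tsdp_partition(records, num_chunks=10, num_rsp_blocks=5)
```

Every record ends up in exactly one RSP block, and each block contains records drawn from every sequential chunk, which is what allows a single block to stand in for the whole data set.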
Key words
Big data analysis, Random sample partition, RSP, HDFS, Apache Spark
Chat Paper

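Key points: This paper proposes a two-stage data processing (TSDP) algorithm that converts a big data set into a random sample partition (RSP) representation, so that the statistical properties of the big data can be estimated.

Method: The TSDP algorithm first sequentially divides the big data set into non-overlapping subsets and distributes them as data block files to the nodes of a cluster. It then draws random samples without replacement from each subset to form new RSP data block files, repeating until all data records are used, which yields an RSP of the big data.

Experiments: The TSDP algorithm is implemented on Apache Spark and HDFS and evaluated on terabyte-scale data sets, showing that it efficiently converts HDFS big data files into HDFS RSP big data files. The paper also shows that ensemble models built from a small number of RSP data blocks outperform a single model built from the entire data set.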
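To illustrate why a few randomized blocks can stand in for the whole data set, the sketch below estimates a mean from three blocks and compares it with a sequential block of sorted data. It is a toy under stated assumptions: `block_estimates` is a hypothetical helper, a single global shuffle is used as a stand-in for real RSP blocks, and a simple average replaces the ensemble models mentioned in the paper.

```python
import random
from statistics import mean


def block_estimates(blocks, num_blocks_used, seed=0):
    """Estimate the data-set mean from a few randomly chosen blocks."""
    rng = random.Random(seed)
    chosen = rng.sample(blocks, num_blocks_used)
    per_block = [mean(b) for b in chosen]
    # Combine the block-level estimates by simple averaging.
    return per_block, mean(per_block)


# Toy data set, sorted on purpose so that sequential blocks are biased.
data = list(range(10_000))
true_mean = mean(data)

# Stand-in for RSP blocks (assumption): shuffle globally, cut into 20 blocks.
rng = random.Random(42)
shuffled = data[:]
rng.shuffle(shuffled)
rsp_blocks = [shuffled[i:i + 500] for i in range(0, len(shuffled), 500)]

# Sequential blocks of the sorted data, for comparison.
sequential_blocks = [data[i:i + 500] for i in range(0, len(data), 500)]

per_block, combined = block_estimates(rsp_blocks, num_blocks_used=3)
first_sequential_mean = mean(sequential_blocks[0])
```

In this toy setting the combined estimate from three randomized blocks lands close to `true_mean`, while the mean of the first sequential block is far from it, because that block only covers the smallest values.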