An Algorithmic View of Streaming Submodular Data Summarization with A Knapsack Constraint
2022 5th International Conference on Data Science and Information Technology (DSIT)(2022)
Abstract
Data summarization, in the form of extracting a representative subset (i.e., a data summary) from a massive data set, is often used in big data processing. A good summary not only significantly reduces information redundancy but also provides a better understanding of the original data. The utility function used to evaluate the quality of a summary usually has a natural diminishing returns property, also known as submodularity. With the rapid growth of data scale, traditional offline processing has become increasingly ill-suited to massive data, and streaming methods that require less space have started to attract attention, leading to many related studies. In this paper, we first give an algorithmic overview of the methods widely used in streaming submodular maximization with a knapsack constraint. After analyzing the ideas behind them, we propose a new algorithm, called RSStream, for the same problem. RSStream is an innovative combination of the traditional sieve approach, the multi-candidate set method, and an augmentation strategy with data sampling. It achieves the state-of-the-art approximation ratio with near-linear time and space complexity. Finally, we run our algorithm on two real data summarization applications to demonstrate its effectiveness and efficiency.
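To make the diminishing returns property and the sieve idea concrete, here is a minimal Python sketch. It is not RSStream, whose details the abstract does not give. It assumes a toy coverage utility and a single fixed density threshold, and all names (`coverage`, `density_threshold_stream`, `covers`, `cost`) are hypothetical and chosen only for illustration.

```python
from typing import Dict, Iterable, List, Set


def coverage(summary: Iterable[str], covers: Dict[str, Set[int]]) -> int:
    """Toy utility f(S): number of distinct topics covered by the items in S."""
    covered: Set[int] = set()
    for item in summary:
        covered |= covers[item]
    return len(covered)


def density_threshold_stream(
    stream: Iterable[str],
    covers: Dict[str, Set[int]],
    cost: Dict[str, float],
    budget: float,
    threshold: float,
) -> List[str]:
    """One pass over the stream with a single assumed threshold.

    An item is kept if it still fits in the knapsack budget and its
    marginal gain per unit cost is at least `threshold`.
    """
    summary: List[str] = []
    spent = 0.0
    for item in stream:
        if spent + cost[item] > budget:
            continue  # item would violate the knapsack constraint
        gain = coverage(summary + [item], covers) - coverage(summary, covers)
        if gain / cost[item] >= threshold:
            summary.append(item)
            spent += cost[item]
    return summary


# Hypothetical data: each item covers a set of topic ids and has a cost.
covers = {"a": {1, 2, 3}, "b": {3, 4}, "c": {1, 2}, "d": {5}}
cost = {"a": 2.0, "b": 1.0, "c": 1.0, "d": 1.0}

# Diminishing returns: adding "b" helps less once "a" is already selected.
gain_small = coverage(["b"], covers) - coverage([], covers)
gain_large = coverage(["a", "b"], covers) - coverage(["a"], covers)

summary = density_threshold_stream(
    ["a", "b", "c", "d"], covers, cost, budget=3.0, threshold=1.0
)
```

Sieve-style methods typically run this kind of rule for several guessed thresholds in parallel and keep the best resulting summary; the single threshold here is a simplification for readability.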
Key words
streaming data, data summarization, submodularity, data mining