Continuously Distinct Sampling over Centralized and Distributed High Speed Data Streams

IEEE Transactions on Parallel and Distributed Systems (2019)

Abstract
Distinct sampling is fundamental for computing statistics (e.g., the age and gender distribution of distinct users accessing a particular website) that depend on the set of distinct keys (e.g., user IDs) in a large, high-speed data stream such as a sequence of key-update pairs. However, the major shortcoming of existing methods is their high computational cost, which is incurred by determining whether each incoming key in the data stream is currently in the set of sampled keys and by keeping track of the sampled keys' update aggregations. To address this challenge, we develop a new method, random projection and eviction (RPE), that uses a list of buckets to continuously sample distinct keys and their update aggregations. RPE processes each key-update pair with small and nearly constant time complexity, $O(1)$. Besides centralized data streams, we also develop a novel method, DRPE, to deal with distributed data streams consisting of key-update pairs observed at multiple distributed sites. We conduct extensive experiments on real-world datasets, and the results demonstrate that RPE and DRPE reduce the memory, computational, and message costs of state-of-the-art methods by several times.
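To make the problem setting concrete, the following is a minimal sketch of a generic hash-based (bottom-k) distinct sampler over key-update pairs. It is not RPE or DRPE, whose bucket layout the abstract does not specify. It only illustrates the baseline style of method the abstract contrasts against. The names `key_hash`, `BottomKDistinctSampler`, `process`, and `totals`, and the choice of summation as the update aggregation, are assumptions made for illustration.

```python
import hashlib
import heapq


def key_hash(key: str) -> float:
    """Map a key to a pseudo-random value in [0, 1) that is stable across runs."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


class BottomKDistinctSampler:
    """Keep the k distinct keys with the smallest hash values.

    Illustrative baseline only (an assumption, not the paper's RPE method).
    Every distinct key is equally likely to be sampled, no matter how often
    it appears in the stream.
    """

    def __init__(self, k: int) -> None:
        self.k = k
        self.totals = {}  # sampled key -> sum of its updates
        self._heap = []   # max-heap on hash, stored as (-hash, key)

    def process(self, key: str, update: float) -> None:
        # Membership check on every pair, followed by aggregation on a hit.
        if key in self.totals:
            self.totals[key] += update
            return
        h = key_hash(key)
        if len(self.totals) < self.k:
            self.totals[key] = update
            heapq.heappush(self._heap, (-h, key))
        elif h < -self._heap[0][0]:
            # The new key hashes below the current maximum: evict that maximum.
            _, evicted = heapq.heapreplace(self._heap, (-h, key))
            del self.totals[evicted]
            self.totals[key] = update


if __name__ == "__main__":
    sampler = BottomKDistinctSampler(k=3)
    for user, clicks in [("alice", 1), ("bob", 2), ("alice", 4), ("carol", 1), ("dave", 7)]:
        sampler.process(user, clicks)
    print(sampler.totals)
```

In this sketch, each incoming pair triggers a membership lookup and, on insertion, a heap operation that costs time logarithmic in `k`. This is the kind of per-pair work the abstract identifies as the bottleneck of existing methods.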
Keywords
Distributed databases, IP networks, Random access memory, Monitoring, Measurement, Sampling methods, Data models