Automatic IO Filtering for Optimizing Cloud Analytics
Microsoft Technical Report MSR-TR-2012-3

semanticscholar(2012)

Abstract
Network bandwidth is often the bottleneck resource for large-scale data analytics. Cloud-based analytics platforms such as Amazon’s Elastic MapReduce provide high bandwidth within a compute cluster but limited bandwidth to storage resources such as S3 servers. If data is accessed from another public cloud or a private cloud, the network is not only a performance bottleneck but also incurs egress bandwidth charges. This paper describes Rhea, a system to reduce traffic between storage and compute nodes for Hadoop MapReduce jobs. Rhea filters data read from storage by removing input rows and column values within rows. Filters are job-specific and are automatically generated from static analysis of the mapper’s bytecode. Filters are stateless and side-effect free, and they do not change the output of the MapReduce computation. Filtering is transparent to both the compute and storage nodes and is best-effort, allowing it to opportunistically use spare CPU cycles when available. Our evaluation shows that Rhea filters significantly reduce the number of bytes transferred (e.g., by a factor of more than 5 for some of our jobs), and correspondingly reduce egress bandwidth charges and overall execution time by similar factors.
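To make the row and column filtering idea concrete, the following is a minimal hand-written sketch in Python. It is not Rhea's implementation or API: Rhea generates its filters automatically from the mapper's bytecode, whereas here the toy mapper, the filter functions, the CSV record layout, and all names (`mapper`, `row_filter`, `column_filter`, `USED_COLUMNS`) are assumptions chosen for illustration. The sketch only shows the property the abstract states: a stateless filter can drop rows and blank out column values while leaving the mapper's output unchanged.

```python
# Illustrative sketch only: hypothetical names, hand-written filters.
# Rhea derives such filters by static analysis; nothing here is Rhea's API.

def mapper(line):
    """Toy mapper: emit (user, bytes) for rows whose status column is 'ERROR'."""
    cols = line.split(",")
    if len(cols) >= 4 and cols[2] == "ERROR":
        yield cols[0], int(cols[3])

def row_filter(line):
    """Stateless row filter: keep a row only if the mapper could emit output for it."""
    cols = line.split(",")
    return len(cols) >= 4 and cols[2] == "ERROR"

# Column indexes the toy mapper actually reads (assumed known for this sketch).
USED_COLUMNS = {0, 2, 3}

def column_filter(line):
    """Blank out unused column values, keeping delimiters so column indexes stay valid."""
    cols = line.split(",")
    return ",".join(c if i in USED_COLUMNS else "" for i, c in enumerate(cols))

def apply_filters(lines):
    """Apply the row filter, then the column filter, to each input record."""
    return [column_filter(line) for line in lines if row_filter(line)]

def run_job(lines):
    """Run the map phase only and collect the emitted key/value pairs."""
    return [kv for line in lines for kv in mapper(line)]

# Made-up sample records: user, timestamp, status, bytes, request.
records = [
    "alice,2012-01-01T00:00:00,ERROR,512,GET /index.html",
    "bob,2012-01-01T00:00:01,OK,128,GET /about.html",
    "carol,2012-01-01T00:00:02,ERROR,2048,POST /upload",
    "malformed line",
]

filtered = apply_filters(records)
bytes_before = sum(len(line) for line in records)
bytes_after = sum(len(line) for line in filtered)
same_output = run_job(records) == run_job(filtered)
```

In this sketch the filtered input is smaller than the original (`bytes_after < bytes_before`) and `same_output` is true, which mirrors the abstract's claim that filtering reduces bytes transferred without changing the result of the computation. Keeping the delimiters when blanking columns is a design choice of the sketch so that the unmodified mapper still finds each value at its original index.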