Reining in the Outliers in Map-Reduce Clusters using Mantri.

OSDI'10: Proceedings of the 9th USENIX conference on Operating systems design and implementation(2010)

引用 990|浏览483
暂无评分
摘要
Experience froman operational Map-Reduce cluster reveals that outliers significantly prolong job completion. The causes for outliers include run-time contention for processor, memory and other resources, disk failures, varying bandwidth and congestion along network paths and, imbalance in task workload. We present Mantri, a system that monitors tasks and culls outliers using cause- and resource-aware techniques. Mantri's strategies include restarting outliers, network-aware placement of tasks and protecting outputs of valuable tasks. Using real-time progress reports, Mantri detects and acts on outliers early in their lifetime. Early action frees up resources that can be used by subsequent tasks and expedites the job overall. Acting based on the causes and the resource and opportunity cost of actions lets Mantri improve over prior work that only duplicates the laggards. Deployment in Bing's production clusters and trace-driven simulations show that Mantri improves job completion times by 32%.
更多
查看译文
关键词
Mantri detects,job completion,job completion time,job overall,early action,culls outlier,disk failure,experience froman,network path,network-aware placement,map-reduce cluster
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要