Hound: Causal Learning for Datacenter-scale Straggler Diagnosis.
POMACS (2018)
Abstract
Stragglers are exceptionally slow tasks within a job that delay its completion. Stragglers, which are uncommon within a single job, are pervasive in datacenters with many jobs. We present Hound, a statistical machine learning framework that infers the causes of stragglers from traces of datacenter-scale jobs. Hound is designed to achieve several objectives: datacenter-scale diagnosis, unbiased inference, interpretable models, and computational efficiency. We demonstrate Hound's capabilities for a production trace from Google's warehouse-scale datacenters and two Spark traces from Amazon EC2 clusters.
Keywords
causal reasoning, datacenter, distributed system, machine learning, performance diagnosis, performance modeling, topic modeling