Machine Learning for Reliability Analysis of Large Scale Systems.
QEST(2020)
Abstract
As distributed systems grow dramatically in scale, complexity, and usage, understanding the hidden interactions among system and workload properties becomes exceedingly difficult. Machine learning models for predicting and analyzing system behavior are increasingly popular, but they are not always effective at answering the what and the why. In this talk I will present two reliability analysis studies of large distributed systems: one that examines GPGPU error prediction on Titan, a large-scale high-performance computing system at ORNL, and one that analyzes the failure characteristics of solid state drives at a Google data center and hard disk drives at the Backblaze data center. Both studies illustrate the difficulty of untangling the complex interactions of workload characteristics that lead to failures and of identifying failure root causes from monitored symptoms. Despite this difficulty, failure prediction can occasionally be remarkably accurate.
Keywords
Data centers, HPC, Storage systems, Reliability, GPUs, SSDs, HDDs