Understanding Real-World Timeout Problems in Cloud Server Systems
2018 IEEE International Conference on Cloud Engineering (IC2E) (2018)
Abstract
Timeouts are commonly used to handle unexpected failures in distributed systems. In this paper, we conduct a comprehensive study to characterize real-world timeout problems in 11 commonly used cloud server systems (e.g., Hadoop, HDFS, Spark, and Cassandra). Our study reveals that timeout problems are widespread among cloud server systems. We categorize those timeout problems along three aspects: 1) what are the root causes of those timeout problems? 2) what impact can timeout problems impose on cloud systems? 3) how are timeout problems currently diagnosed or misdiagnosed? Our results show that the root causes of timeout problems include misused timeouts, missing timeouts, improper timeout handling, unnecessary timeouts, and clock drift. We further find that timeout bugs have a serious impact (e.g., system hang or crash, job failure, performance degradation, data loss) on both applications and systems. Our study also shows that 60% of the bugs do not produce any error messages and 12% of the bugs produce misleading error messages, which makes those timeout bugs difficult to diagnose.
Keywords
Timeout, Reliability, Cloud Computing