Monitoring Cloud Service Unreachability at Scale

IEEE Conference on Computer Communications (IEEE INFOCOM 2021), 2021

Abstract
We consider the problem of network unreachability in a global-scale cloud-hosted service that caters to hundreds of millions of users. Even when the service itself is up, the "last mile" between users and the cloud is often the weak link that can render the service unreachable. We present NetDetector, a tool for detecting network unreachability based on measurements from a client-based HTTP-ping service. NetDetector employs two models. The first, GA (Gaussian Alerts), models the temporally averaged raw success rate of the HTTP pings as a Gaussian distribution and flags significant dips below the mean as unreachability episodes. The second, more sophisticated approach, BB (Beta-Binomial), models the health of network connectivity as the probability of an access request succeeding, estimates health from noisy samples, and alerts on dips in health below a client-network-specific SLO (service-level objective) derived from data. Both algorithms are enhanced by a drill-down technique that identifies a more precise scope of the unreachability event. We present promising results from GA, which has been in deployment, and the experimental BB detector over a 4-month period. For instance, GA flags 49 country-level unreachability incidents, of which 42 were labelled true positives based on investigation by on-call engineers (OCEs).
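The two detection ideas named in the abstract can be illustrated with a minimal Python sketch. It is not NetDetector's implementation: the abstract does not give thresholds, priors, window sizes, or how the SLO is derived, so the function names (`ga_alert`, `bb_posterior`, `bb_alert`), the `k` multipliers, the uniform Beta(1, 1) prior, and the "posterior mean plus k standard deviations" alert rule are all assumptions chosen for illustration.

```python
import math
import statistics


def ga_alert(history, current, k=3.0):
    """Gaussian-style alert (illustrative): flag `current` when it falls more
    than k standard deviations below the mean of past averaged success rates.
    The multiplier k is an assumed parameter, not a value from the paper."""
    mu = statistics.mean(history)
    sigma = statistics.stdev(history)
    return current < mu - k * sigma


def bb_posterior(successes, total, alpha=1.0, beta=1.0):
    """Beta posterior over the success probability ("health") after observing
    `successes` out of `total` pings, starting from an assumed Beta(alpha, beta)
    prior. Returns the posterior mean and standard deviation."""
    a = alpha + successes
    b = beta + (total - successes)
    mean = a / (a + b)
    var = a * b / ((a + b) ** 2 * (a + b + 1))
    return mean, math.sqrt(var)


def bb_alert(successes, total, slo, k=2.0, alpha=1.0, beta=1.0):
    """Alert only when health is confidently below the SLO, i.e. when even
    mean + k * std stays under it. This rule is a hypothetical stand-in for
    whatever decision rule the authors actually use."""
    mean, std = bb_posterior(successes, total, alpha, beta)
    return mean + k * std < slo


if __name__ == "__main__":
    past_rates = [0.98, 0.99, 0.97, 0.98, 0.99]
    print(ga_alert(past_rates, 0.90))      # large dip below the baseline
    print(bb_alert(80, 100, slo=0.95))     # many samples, clearly unhealthy
    print(bb_alert(4, 5, slo=0.95))        # same rate, too few samples to be sure
```

The last two calls show why estimating health from noisy samples matters: 80 of 100 and 4 of 5 are the same raw success rate, but in this sketch only the first carries enough evidence to fall confidently below the assumed SLO of 0.95.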
Keywords
OCE,on-call engineers,drill-down technique,Gaussian distribution,Gaussian alerts models,unreachability event,network connectivity,client-based HTTP-ping service,NetDetector,weak link,global-scale cloud-hosted service,network unreachability,monitoring cloud service unreachability