D3S: debugging deployed distributed systems

NSDI(2008)

引用 216|浏览130
暂无评分
摘要
Testing large-scale distributed systems is a challenge, because some errors manifest themselves only after a distributed sequence of events that involves machine and network failures. D3S is a checker that allows developers to specify predicates on distributed properties of a deployed system, and that checks these predicates while the system is running. When D3S finds a problem it produces the sequence of state changes that led to the problem, allowing developers to quickly find the root cause. Developers write predicates in a simple and sequential programming style, while D3S checks these predicates in a distributed and parallel manner to allow checking to be scalable to large systems and fault tolerant. By using binary instrumentation, D3S works transparently with legacy systems and can change predicates to be checked at runtime. An evaluation with 5 deployed systems shows that D3S can detect non-trivial correctness and performance bugs at runtime and with low performance overhead (less than 8%).
更多
查看译文
关键词
low performance overhead,performance bug,non-trivial correctness,network failure,large system,fault tolerant,legacy system,binary instrumentation,D3S work,D3S check
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要