Reliable Group Communication using Corrected Trees

ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (2019)

Abstract
Driven by the ever-increasing performance demands of compute-intensive applications, supercomputing systems comprise more and more nodes. This growth places a significant burden on fast group communication primitives and also makes those systems more susceptible to failures of individual nodes. In this paper we present a two-phase fault-tolerant scheme for group communication. Using broadcast as an example, we provide a full-spectrum discussion of our approach, from a formal analysis to LogP-based simulations to a message-passing-based implementation running on a large cluster. Ultimately, we are able to reduce the complex problem of reliable and fault-tolerant collective group communication to a graph-theoretical renumbering problem. Both simulations and measurements show that our solution achieves a latency reduction of 50% with up to six times fewer messages sent in comparison to existing schemes.

Authors: Martin Kuttler, TU Dresden; Maksym Planeta, TU Dresden, Germany; Jan Bierbaum, TU Dresden; Carsten Weinhold, TU Dresden; Hermann Hartig, TU Dresden; Amnon Barak, The He