Achieving High-Performance Fault-Tolerant Routing in HyperX Interconnection Networks
CoRR (2024)
Abstract
Interconnection networks are key actors that condition the performance of
current large datacenter and supercomputer systems. Both topology and routing
are critical aspects that must be carefully considered for a competitive system
network design. Moreover, when failures are expected daily, this tandem should
exhibit resilience and robustness. Low-diameter networks, including HyperX, are
cheaper than typical Fat Trees. However, to be truly competitive, they must
employ advanced routing algorithms that both balance traffic and tolerate
failures.
This paper introduces and evaluates SurePath, an efficient fault-tolerant
routing mechanism for the HyperX topology. SurePath leverages routes provided
by standard routing algorithms and a deadlock avoidance mechanism based on an
Up/Down escape subnetwork. This mechanism not only prevents deadlock but also
allows for a fault-tolerant solution for these networks. SurePath is thoroughly
evaluated in the paper under different traffic patterns, showing no performance
degradation under extremely faulty scenarios.