Coded Reactive Stragglers Mitigation in Distributed Computing Systems.

ICC (2023)

Abstract
In distributed computing systems, computation redundancy is used to mitigate the adverse effect of stragglers on computation time. The redundancy can be added proactively at the start of the job, or reactively after some time based on the delay pattern of the workers. While most existing work on reactive mitigation considers only task replication, we propose a coded reactive straggler mitigation strategy for distributed matrix-matrix multiplication, consisting of an uncoded phase followed by a coded phase. Specifically, in the uncoded phase, the master distributes the computational job among the workers without redundancy and waits for a chosen waiting time. After the waiting time, the master cancels the unfinished tasks, encodes them, and distributes the coded tasks among the workers that have already completed their computations. The expected execution time of the proposed method is obtained analytically. Furthermore, the optimal waiting time for the uncoded phase and the optimal code rate for the coded phase are investigated. Our simulation results demonstrate that the proposed coded reactive mitigation strategy significantly decreases the execution time compared with the proactive mitigation strategy or the repetition-based reactive mitigation strategy.
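The two-phase procedure described in the abstract can be illustrated with a small Monte Carlo sketch. This is not the authors' model or their MatDot construction: it assumes shifted-exponential worker delays, treats the coded phase as a generic "any k of m coded tasks suffice" scheme, and assumes task time scales linearly with task size. The names `reactive_trial`, `mean_time`, `shift`, `rate`, and `code_rate` are hypothetical and chosen only for illustration.

```python
import math
import random


def reactive_trial(n, wait, code_rate, rng, shift=1.0, rate=1.0):
    """Simulate one job under a simplified two-phase reactive scheme.

    Assumptions (not taken from the paper): each worker's delay for a
    unit-size task is shift + Exp(rate), and a task of size s takes s
    times as long.
    """
    # Uncoded phase: one unit-size task per worker, no redundancy.
    t1 = [shift + rng.expovariate(rate) for _ in range(n)]
    finished = sum(1 for t in t1 if t <= wait)
    remaining = n - finished
    if remaining == 0:
        return max(t1)  # every task completed within the waiting time
    if finished == 0:
        return max(t1)  # no free workers; keep the uncoded run (assumption)

    # Coded phase: cancel the unfinished work, re-split it into k parts,
    # and encode into one coded task per finished worker. Any k of the
    # coded results are assumed sufficient to recover the remaining work.
    k = max(1, math.ceil(code_rate * finished))
    size = remaining / k  # work per coded task, in unit-task units
    t2 = sorted(size * (shift + rng.expovariate(rate)) for _ in range(finished))
    return wait + t2[k - 1]  # time at which the k-th coded result arrives


def mean_time(n, wait, code_rate, trials=2000, seed=0):
    """Average completion time over independent trials."""
    rng = random.Random(seed)
    total = sum(reactive_trial(n, wait, code_rate, rng) for _ in range(trials))
    return total / trials
```

Sweeping `wait` and `code_rate` in `mean_time` gives a rough feel for the trade-off the abstract refers to: a short waiting time leaves few finished workers to share a large coded workload, while a long one lets stragglers dominate the uncoded phase.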
Keywords
Coded distributed computing, stragglers mitigation, partial recovery, reactive stragglers mitigation, MatDot code, matrix-matrix multiplication