Online Training Flow Scheduling for Geo-Distributed Machine Learning Jobs Over Heterogeneous and Dynamic Networks
IEEE TRANSACTIONS ON COGNITIVE COMMUNICATIONS AND NETWORKING (2024)
Abstract
Geo-Distributed Machine Learning (Geo-DML) is a promising technology that performs privacy-preserving collaborative learning across geographically dispersed data centers (DCs) over Wide Area Networks (WANs). Unfortunately, the limited and heterogeneous WAN bandwidth poses significant challenges to the performance of Geo-DML systems, increasing communication overhead and ultimately affecting the revenue of ISPs. In particular, when multiple online jobs coexist in a Geo-DML system, competition for bandwidth among the training flows of different jobs aggravates this negative impact. To alleviate it, this paper investigates the problem of online training flow scheduling for Geo-DML jobs. We first formulate the problem as a Linear Programming (LP) model with the objective of maximizing the revenue of ISPs. We then propose an online traffic scheduling algorithm called Training Flow Adaptive Steering (TFAS). TFAS exploits a primal-dual framework, tailored for efficient resource allocation among jobs, to schedule training flows so that system resources are maximally utilized and training procedures are expedited and completed in a timely manner. We also conduct a rigorous theoretical analysis to guarantee that the proposed algorithm achieves a good competitive ratio. Extensive evaluation results demonstrate that our algorithm performs well and outperforms commonly adopted solutions by 36.2%-49.4% on average.
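The abstract does not give the details of TFAS, so the following is only a minimal sketch of a generic online primal-dual admission rule, not the paper's algorithm. It illustrates the general idea under stated assumptions: each WAN link carries a dual "price" that rises as its bandwidth is consumed, and an arriving job is admitted only if its revenue exceeds the priced cost of the bandwidth it requests. The names `Link`, `Job`, `admit`, `schedule`, and the parameter `seed_price`, as well as the specific price update, are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Link:
    capacity: float      # total bandwidth of the link (arbitrary units)
    used: float = 0.0    # bandwidth already allocated
    price: float = 0.0   # dual variable: current price per unit of bandwidth


@dataclass
class Job:
    name: str
    revenue: float       # revenue earned if the job is admitted
    demand: float        # bandwidth required on every link of its path
    path: list           # ids of the links the training flow traverses


def admit(job, links, seed_price=1.0):
    """Decide on one arriving job; update allocations and prices if admitted.

    Assumption: a simple multiplicative price update, used here only to
    illustrate the primal-dual idea (prices grow as links fill up).
    """
    # Priced cost of the requested bandwidth under the current dual prices.
    cost = sum(links[l].price * job.demand for l in job.path)
    # Primal feasibility: the demand must fit on every link of the path.
    fits = all(links[l].used + job.demand <= links[l].capacity for l in job.path)
    if not fits or job.revenue <= cost:
        return False
    for l in job.path:
        link = links[l]
        frac = job.demand / link.capacity
        link.used += job.demand
        # Dual update: the price rises with the fraction of capacity consumed.
        link.price = link.price * (1.0 + frac) + frac * seed_price
    return True


def schedule(jobs, links, seed_price=1.0):
    """Process jobs in arrival order and return the names of admitted jobs."""
    return [job.name for job in jobs if admit(job, links, seed_price)]


if __name__ == "__main__":
    links = {"a": Link(capacity=10.0), "b": Link(capacity=10.0)}
    jobs = [
        Job("J1", revenue=5.0, demand=4.0, path=["a", "b"]),
        Job("J2", revenue=1.0, demand=4.0, path=["a"]),
        Job("J3", revenue=20.0, demand=4.0, path=["a", "b"]),
        Job("J4", revenue=100.0, demand=4.0, path=["a"]),
    ]
    print(schedule(jobs, links))
```

In this toy run, J2 is rejected because its revenue falls below the priced cost of link `a` after J1 is admitted, while J4 is rejected because link `a` no longer has enough spare capacity. Decisions are made one job at a time without knowledge of future arrivals, which is the setting in which a competitive ratio is usually analyzed.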
Key words
Training, Wide area networks, Bandwidth, Resource management, Machine learning, Synchronization, Data models, Geo-distributed machine learning, training jobs, resource allocation, online scheduling