Online Training Flow Scheduling for Geo-Distributed Machine Learning Jobs over Heterogeneous and Dynamic Networks

IEEE TRANSACTIONS ON COGNITIVE COMMUNICATIONS AND NETWORKING (2024)

University of Electronic Science and Technology of China | University of Science and Technology of China | Deakin University | University of Technology Sydney

Cited 1 | Views 37
Abstract
Geo-Distributed Machine Learning (Geo-DML) is a promising technology that performs collaborative learning across geographically dispersed data centers (DCs) in a privacy-preserving manner over Wide Area Networks (WANs). Unfortunately, the limited and heterogeneous WAN bandwidth poses significant challenges to the performance of Geo-DML systems, leading to increased communication overhead and eventually affecting the revenue of Internet Service Providers (ISPs). In particular, when multiple online jobs coexist in Geo-DML systems, the competition for bandwidth between training flows of different jobs aggravates this negative impact. To alleviate it, this paper investigates the problem of online training flow scheduling for Geo-DML jobs. We first formulate the studied problem as a Linear Programming (LP) model with the objective of maximizing the revenue of ISPs. Then, we propose an online traffic scheduling algorithm called Training Flow Adaptive Steering (TFAS), which exploits a primal-dual framework tailored for efficient resource allocation, scheduling training flows so that system resources are maximally utilized and training procedures can be expedited and completed in a timely manner. Meanwhile, we conduct rigorous theoretical analysis to guarantee that the proposed algorithm achieves a good competitive ratio. Extensive evaluation results demonstrate that our algorithm performs well and outperforms commonly adopted solutions by 36.2%-49.4% on average.
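The primal-dual flavor of the approach can be illustrated with a toy online admission rule (a minimal sketch under simplifying assumptions, not the paper's TFAS algorithm: a single bottleneck link and a multiplicative dual-price update, whereas TFAS handles heterogeneous multi-link WANs and dynamic jobs):

```python
# Toy online primal-dual admission sketch for training flows.
# Assumptions (illustrative, not from the paper): one link, known capacity,
# flows arrive one at a time with a revenue and a bandwidth demand.
import math

CAPACITY = 100.0   # link bandwidth, illustrative units
price = 0.0        # dual variable: current "price" per unit of bandwidth
used = 0.0         # bandwidth already allocated

def admit(revenue, demand):
    """Accept an arriving flow iff its revenue covers the current dual
    price of the bandwidth it asks for; then raise the price."""
    global price, used
    if used + demand > CAPACITY or revenue < price * demand:
        return False
    used += demand
    # Multiplicative update: the price grows exponentially in utilization,
    # the mechanism behind logarithmic competitive ratios in classic
    # online packing analyses.
    price = math.e ** (used / CAPACITY) - 1
    return True

# Early flows face a low price; as utilization rises, only flows with
# high revenue-per-bandwidth clear the threshold.
print(admit(revenue=5.0, demand=10.0))   # True: price is still 0
print(admit(revenue=1.0, demand=50.0))   # False: 1.0 < price * 50
```

The design choice worth noting is that admission never looks at future arrivals; the dual price alone summarizes past allocations, which is what makes the rule an online algorithm.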
Key words
Training, Wide area networks, Bandwidth, Resource management, Machine learning, Synchronization, Data models, Geo-distributed machine learning, training jobs, resource allocation, online scheduling

Key points: This paper studies the optimization of online training flow scheduling in Geo-Distributed Machine Learning (Geo-DML) systems and proposes an online traffic scheduling algorithm, Training Flow Adaptive Steering (TFAS), that aims to maximize the revenue of Internet Service Providers (ISPs); theoretical analysis confirms that the algorithm achieves a good competitive ratio.

Methods: The online training flow scheduling problem is formulated as a Linear Programming (LP) model whose objective is to maximize ISP revenue.

Experiments: The authors implemented TFAS and proved its competitiveness theoretically. Experimental results show that TFAS improves average performance over existing solutions by 36.2%-49.4%.
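For intuition, a generic fractional revenue-maximization LP of the kind described above can be sketched as follows (the notation is illustrative and not the paper's exact formulation: x_j is the fractional admission of flow j, r_j its revenue, b_j its bandwidth demand along WAN path p_j, and c_e the capacity of link e):

```latex
\begin{aligned}
\max_{x} \quad & \sum_{j} r_j\, x_j \\
\text{s.t.} \quad & \sum_{j:\, e \in p_j} b_j\, x_j \le c_e, \quad \forall e \in E, \\
& 0 \le x_j \le 1, \quad \forall j.
\end{aligned}
```

In the online setting the variables x_j arrive over time, which is why a primal-dual scheme that maintains per-link dual prices is a natural fit.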