TensorOpt: Exploring the Tradeoffs in Distributed DNN Training With Auto-Parallelism

IEEE Transactions on Parallel and Distributed Systems (2022)

Abstract
Effective parallelization strategies are crucial for the performance of distributed deep neural network (DNN) training. Recently, several methods have been proposed to search for parallelization strategies, but they all optimize a single objective (e.g., execution time or memory consumption) and produce only one strategy. We propose Frontier Tracking (FT), an efficient algorithm that finds a set of Pareto-optimal parallelization strategies to explore the best trade-offs among different objectives. FT can minimize memory consumption when the number of devices is limited and fully utilize additional resources to reduce execution time. Based on FT, we develop a user-friendly system, called TensorOpt, which allows users to run their distributed DNN training jobs without worrying about the details of searching for and coding parallelization strategies. Experimental results show that TensorOpt is more flexible in adapting to resource availability than existing frameworks.
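To make the notion of a Pareto-optimal strategy set concrete, here is a minimal, hypothetical sketch in Python (this is not the paper's FT algorithm, and all strategy names and cost numbers are invented for illustration): it filters a candidate set of parallelization strategies down to those that are non-dominated in execution time and memory.

```python
# Illustrative sketch only: FT searches the strategy space efficiently;
# this just shows the time/memory Pareto frontier FT's output represents.
# All strategy names and cost estimates below are hypothetical.

from typing import NamedTuple

class Strategy(NamedTuple):
    name: str          # hypothetical label for a parallelization plan
    time_s: float      # estimated per-iteration execution time (seconds)
    memory_gb: float   # estimated peak per-device memory (GB)

def pareto_frontier(candidates: list[Strategy]) -> list[Strategy]:
    """Keep strategies not dominated in both execution time and memory."""
    frontier: list[Strategy] = []
    for s in sorted(candidates, key=lambda c: (c.time_s, c.memory_gb)):
        # After sorting by time, a candidate survives only if it uses
        # strictly less memory than every strategy already kept.
        if not frontier or s.memory_gb < frontier[-1].memory_gb:
            frontier.append(s)
    return frontier

candidates = [
    Strategy("data-parallel",   time_s=1.0, memory_gb=30.0),
    Strategy("tensor-parallel", time_s=1.4, memory_gb=12.0),
    Strategy("hybrid",          time_s=1.2, memory_gb=18.0),
    Strategy("dominated-plan",  time_s=1.5, memory_gb=20.0),
]
print(pareto_frontier(candidates))
# Keeps data-parallel, hybrid, and tensor-parallel: the trade-off curve.
# "dominated-plan" is dropped because tensor-parallel is faster and
# uses less memory, so no user would ever prefer it.
```

A system like TensorOpt can then pick from such a frontier at launch time: the low-memory end when devices are scarce, the low-latency end when resources are plentiful.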
Keywords
Deep learning, distributed systems, large-scale model training