Deep Reinforcement Agent for Failure-aware Job scheduling in High-Performance Computing

Kang Yang, Rongyu Cao, Yueyuan Zhou, Jiawei Zhang, En Shao, Guangming Tan

2021 IEEE 27th International Conference on Parallel and Distributed Systems (ICPADS), 2021

Abstract
Job scheduling is crucial in high-performance computing (HPC): it decides when and which jobs are allocated to the system and on which resources they are placed, while considering multiple scheduling goals. With the growth of various resources and of deep learning training (DLT) workloads, job failure has become a common issue in HPC, which will affect user sa...
Keywords
Training, Deep learning, Processor scheduling, Error analysis, Computational modeling, Conferences, Neural networks