FFMK: A Fast and Fault-Tolerant Microkernel-Based System for Exascale Computing

Software for Exascale Computing - SPPEXA 2013-2015 (2016)

Abstract
In this paper, we describe the hardware and application-inherent challenges that future exascale systems pose to high-performance computing (HPC) and propose a system architecture that addresses them. This architecture is based on proven building blocks and a few principles: (1) a fast lightweight kernel supported by a virtualized Linux for tasks that are not performance-critical, (2) decentralized load and health management using fault-tolerant gossip-based information dissemination, (3) a maximally parallel checkpoint store for cheap checkpoint/restart in the presence of frequent component failures, and (4) a runtime that enables applications to interact with the underlying system platform through new interfaces. The paper discusses the vision behind FFMK and the current state of a prototype implementation of the system, which is based on a microkernel and an adapted MPI runtime.
Keywords
Gossip Algorithm, Host Thread, Exascale System, Busy Waiting, Checkpoint Store