High Fidelity Data Collection and Transport Service Applied to the Cray XE6/XK6

Jim Brandt, Tom Tucker, Ann Gentile, David Thompson, Victor Kuhns, Jason Repik

semanticscholar(2013)

Abstract
A common problem experienced by users of large-scale High Performance Computing (HPC) systems, including the Cray XE6, is the inability to gain insight into their computational environments. Our Lightweight Distributed Metric Service (LDMS) is intended to run as a continuous system service that provides low-overhead remote collection of, and on-node access to, high-fidelity data. It is capable of handling hundreds of data values per node per second, vastly exceeding the data collection sizes and rates typically handled by current HPC monitoring services while still maintaining much lower overhead. We present a case study of using LDMS on the Cray XE6 platform Cielo to enable remote storage of system resource data for post-run analysis and node-local access to data for run-time in-situ analysis and workload rebalancing. We also present information from a deployment on an XK6 system at Sandia, where we leverage RDMA over the Gemini transport to further reduce LDMS overhead.