Collective Memory Transfers For Multi-Core Chips

ICS(2014)

引用 7|浏览47
暂无评分
摘要
Future performance improvements for microprocessors have shifted from clock frequency scaling towards increases in on chip parallelism. Performance improvements for a wide variety of parallel applications require domain decomposition of data arrays from a contiguous arrangement in memory to a tiled layout for on-chip L1 caches and scratchpads. However, DRAM performance suffers under the non-streaming access patterns generated by many independent cores. In this paper, we propose collective memory scheduling (CMS) that uses simple software and inexpensive hardware to identify collective transfers and guarantee that loads and stores arrive in memory address order to the memory controller. CMS actively takes charge of collective transfers and pushes or pulls data to or from the on-chip processors according to memory address order. CMS reduces application execution time by up to 55% (20% average) compared to a state-of-the-art architecture where each processor reads and writes its data independently. CMS also reduces DRAM read power by 2.2x and write power by 50%.
更多
查看译文
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要