The Non-Uniform Compute Device (NUCD) Architecture for Lightweight Accelerator Offload

Mochamad Asri, Curtis Dunham, Roxana Rusitoru, Andreas Gerstlauer, Jonathan Beard

2020 28th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP), 2020

Abstract

Heterogeneous architectures have arisen as a well-suited approach for the post-Moore era. Among them, architectures that integrate programmable accelerators in or near memory are gaining popularity due to the potential advantages of reduced data movement. Such near-memory accelerators benefit from launching a large number of fine-grain tasks to hide memory latency while exploiting bandwidth gains. This requires low-overhead and portable mechanisms for interfacing with accelerators. If not managed carefully, the hard and soft costs of host and accelerator interactions, such as programming and device driver overheads for actuation, context transfer, and synchronization, can severely limit acceleration benefits. We present the non-uniform compute device (NUCD) system architecture as a novel lightweight and generic accelerator offload mechanism that is tightly coupled with a general-purpose processor core. Unlike conventional offload mechanisms that rely primarily on device drivers and software queues, the NUCD system architecture extends a host core microarchitecture to enable low-latency, out-of-order task offload to heterogeneous devices. In the NUCD programming model, a candidate region for offload in the code is marked with a special instruction. The NUCD microarchitecture then accelerates function offloading, actuation, and synchronization for out-of-order parallel execution in hardware with little driver or runtime software involvement, while maintaining standard sequential program semantics. Results demonstrate that the NUCD system architecture can achieve average performance improvements of 21% to 128% over a conventional driver-based offload mechanism. This in turn enables whole new forms of fine-grain task offloading that would otherwise not see any performance benefits.
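
The programming model described in the abstract (mark a region, let it run out of order, keep sequential semantics) can be illustrated with a loose software analogy. The sketch below is not the NUCD mechanism, which is implemented in the host core's hardware; it only assumes a thread pool as a stand-in for accelerators, and the names `offload_region`, `scale`, `run_sequential`, and `run_offloaded` are hypothetical and do not come from the paper.

```python
from concurrent.futures import ThreadPoolExecutor


def offload_region(pool, fn, *args):
    # Hypothetical stand-in for the special marker instruction:
    # hand the marked region to a worker and return immediately.
    return pool.submit(fn, *args)


def scale(chunk, factor):
    # A fine-grain task: only a small amount of work per offload.
    return [x * factor for x in chunk]


def run_sequential(chunks, factor):
    # Reference behavior: plain in-order execution on the host.
    return [scale(chunk, factor) for chunk in chunks]


def run_offloaded(chunks, factor, workers=4):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Offloaded tasks may start and finish in any order.
        pending = [offload_region(pool, scale, chunk, factor) for chunk in chunks]
        # Results are consumed in program order, so the caller sees
        # the same outcome as the sequential version.
        return [future.result() for future in pending]


chunks = [[1, 2], [3, 4], [5, 6]]
assert run_offloaded(chunks, 10) == run_sequential(chunks, 10)
```

In this analogy the per-task cost of `submit` and `result` plays the role of the software offload overhead that, according to the abstract, NUCD moves into hardware so that very small tasks remain worth offloading.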

Keywords

lightweight accelerator offload, post-Moore era, programmable accelerators, reduced data movement, near-memory accelerators, accelerator interactions, device driver overheads, synchronization, nonuniform compute device system architecture, general-purpose processor core, NUCD system architecture, host core microarchitecture, NUCD programming model, NUCD microarchitecture, fine-grain task offloading, sequential program semantics, driver-based offload mechanism
