PRIMO: A Full-Stack Processing-in-DRAM Emulation Framework for Machine Learning Workloads

2023 IEEE/ACM International Conference on Computer-Aided Design (ICCAD 2023)

Abstract
Recently, the size of deep learning models has increased significantly, making the excessive memory traffic between the AI processor and DRAM a major system bottleneck. The processing-in-DRAM (DRAM-PIM) concept has emerged as a promising solution: by integrating computing logic within memory, it eliminates much of the access to external memory. Although many simulators have been proposed to model and analyze the benefits of DRAM-PIM, they are often too slow to run an entire application. FPGA-based emulators have been introduced to overcome this limitation; however, no prior work includes the full software stack from the model down to the DRAM-PIM hardware. This paper presents PRIMO, a full-stack processing-in-DRAM emulation framework and the first that can model and analyze DRAM-PIM for end-to-end ML inference. PRIMO enables software developers to develop and test customized software stacks on various ML workloads without requiring a real DRAM-PIM chip. It also allows designers to explore the design space and monitor memory access patterns, facilitating software-hardware co-design of efficient DRAM-PIM architectures. To achieve these goals, we develop a real-time FPGA emulator that emulates the DRAM-PIM architecture and produces experimental results, such as predicted cycle counts and computed outputs, at speeds far beyond CPU-based simulation. In addition, we propose a software stack comprising a PIM compiler that enables the execution of various ML workloads, including end-to-end inference, and a PIM driver that runs those workloads with high bandwidth utilization by leveraging virtual-memory scatter-gather DMA. Finally, we demonstrate that PRIMO emulates DRAM-PIM 106.64-6093.56x faster than a CPU-based simulation framework across ML workloads ranging from small microbenchmarks to end-to-end inference of ResNets.
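The abstract credits the PIM driver's bandwidth utilization to virtual-memory scatter-gather DMA. Below is a minimal sketch of that general technique, not PRIMO's actual driver code: a virtually contiguous tensor is split at page boundaries into a chain of descriptors, so a single DMA kick can stream the scattered physical pages back-to-back without an intermediate copy. All names here (`sg_desc`, `build_sg_chain`, the identity `virt_to_phys`) are invented for illustration.

```c
/* Hypothetical scatter-gather descriptor chain builder; illustrative only,
 * not PRIMO's driver API. */
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define PAGE_SIZE 4096u

/* One DMA descriptor: a physical address/length pair plus a link to the
 * next descriptor, as in typical scatter-gather DMA engines. */
struct sg_desc {
    uint64_t        phys_addr;  /* physical address of this chunk   */
    uint32_t        length;     /* bytes to transfer from the chunk */
    struct sg_desc *next;       /* next descriptor; NULL terminates */
};

/* Stand-in for the driver's virtual-to-physical translation (a real
 * driver would resolve pinned pages via the OS). */
static uint64_t virt_to_phys(const void *vaddr) {
    return (uint64_t)(uintptr_t)vaddr;  /* identity map for the sketch */
}

/* Split a virtually contiguous buffer into page-sized descriptors so the
 * DMA engine can stream the whole tensor in one transfer. */
struct sg_desc *build_sg_chain(const void *vaddr, size_t len) {
    struct sg_desc *head = NULL, **tail = &head;
    const uint8_t *p = vaddr;

    while (len > 0) {
        /* Transfer at most up to the next page boundary. */
        size_t offset = (uintptr_t)p & (PAGE_SIZE - 1);
        size_t chunk  = PAGE_SIZE - offset;
        if (chunk > len)
            chunk = len;

        struct sg_desc *d = malloc(sizeof(*d));
        if (!d)
            return head;  /* sketch: real code would unwind the chain */
        d->phys_addr = virt_to_phys(p);
        d->length    = (uint32_t)chunk;
        d->next      = NULL;

        *tail = d;        /* append to the chain */
        tail  = &d->next;
        p    += chunk;
        len  -= chunk;
    }
    return head;
}
```

The payoff of this style of driver is that tensors allocated in ordinary pageable virtual memory need not be staged into a physically contiguous bounce buffer, which is what keeps the DMA engine saturated during PIM command streaming.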