PRIMATE: Processing in Memory Acceleration for Dynamic Token-pruning Transformers

Asia and South Pacific Design Automation Conference (2024)

Abstract
Attention-based models such as Transformers represent the state of the art for various machine learning (ML) tasks. Their superior performance is often overshadowed by substantial memory requirements and low data reuse opportunities. Processing in Memory (PIM) is a promising solution for accelerating Transformer models due to its massive parallelism, low data movement costs, and high memory bandwidth utilization. However, existing PIM accelerators lack support for algorithmic optimizations such as dynamic token pruning that can significantly improve the efficiency of Transformers. We identify two challenges to enabling dynamic token pruning on PIM-based architectures: the lack of an in-memory top-k token selection mechanism and the memory underutilization caused by pruning. To address these challenges, we propose PRIMATE, a software-hardware co-design PIM framework based on High Bandwidth Memory (HBM). We introduce minor hardware modifications to conventional HBM to enable Transformer model computation and top-k selection. On the software side, we introduce a pipelined mapping scheme and an optimization framework for maximum throughput and efficiency. PRIMATE achieves a 30.6x improvement in throughput, a 29.5x improvement in space efficiency, and 4.3x better energy efficiency compared to the current state-of-the-art PIM accelerator for Transformers.
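To make the top-k token selection underlying dynamic token pruning concrete, below is a minimal PyTorch sketch. The importance heuristic (attention received by each token, averaged over heads and query positions) and the function name `prune_tokens_topk` are illustrative assumptions for a software-level view; they are not PRIMATE's in-memory selection hardware, whose exact scoring is described in the paper itself.

```python
import torch

def prune_tokens_topk(hidden_states, attn_probs, k):
    """Keep the k most important tokens of each sequence.

    Assumption: importance of a token is estimated as the attention it
    receives, averaged over heads and query positions -- a common heuristic
    in token-pruning Transformers, used here only for illustration.

    hidden_states: (batch, seq_len, dim)
    attn_probs:    (batch, heads, seq_len, seq_len)
    """
    # Average over heads, then over query positions -> (batch, seq_len)
    importance = attn_probs.mean(dim=1).mean(dim=1)

    # Select the k highest-scoring tokens per sequence
    topk_idx = importance.topk(k, dim=-1).indices          # (batch, k)
    topk_idx, _ = topk_idx.sort(dim=-1)                     # keep original token order

    batch_idx = torch.arange(hidden_states.size(0)).unsqueeze(-1)
    return hidden_states[batch_idx, topk_idx]               # (batch, k, dim)
```

Pruned sequences shrink layer by layer, which is what creates the memory underutilization problem the abstract mentions: memory regions provisioned for the full sequence length sit idle once tokens are dropped.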
Keywords
Energy Efficiency, Memory Processes, Load Data, Transformer Model, High Bandwidth, High Memory, Throughput Improvement, Space Efficiency, Primates, Lookup Table, Design Space, Importance Scores, Partitioning Scheme, Pareto Front, Partial Sums, Attention Scores, Resistive Random Access Memory, Written Back, Area Overhead, Pipelining, Global Adjustment, Sense Amplifier, Partition Size, Word Line, Memory Bank, Critical Layer, Partitioning Problem, Feed-forward Layer, Input Sequence, Encoder Layer