34.2 A 16nm 96Kb Integer/Floating-Point Dual-Mode-Gain-Cell-Computing-in-Memory Macro Achieving 73.3-163.3TOPS/W and 33.2-91.2TFLOPS/W for AI-Edge Devices

Win-San Khwa, Ping-Chun Wu, Jui-Jen Wu, Jian-Wei Su, Ho-Yu Chen, Zhao-En Ke, Ting-Chien Chiu, Jun-Ming Hsu, Chiao-Yen Cheng, Yu-Chen Chen, Chung-Chuan Lo, Ren-Shuo Liu, Chih-Cheng Hsieh, Kea-Tiong Tang, Meng-Fan Chang

2024 IEEE International Solid-State Circuits Conference (ISSCC), 2024

Abstract
Advanced AI-edge chips require computational flexibility and high energy efficiency (EEF) with sufficient inference accuracy for a variety of applications. Floating-point (FP) numerical representation can be used for complex neural networks (NN) requiring high inference accuracy; however, it requires more energy and more parameter storage than a fixed-point integer (INT) numerical representation. Many compute-in-memory (CIM) designs achieve a good EEF for INT multiply-and-accumulate (MAC) operations; however, few support FP-MAC operations [1–3]. Implementing INT/FP dual-mode (DM) MAC operations presents challenges (Fig. 34.2.1), including (1) low area efficiency, since FP-MAC functions become idle during INT-MAC operations; (2) high system-level latency, due to NN data-update interruptions on small-capacity SRAM-CIM without concurrent write-and-compute functionality; and (3) high energy consumption, due to repeated system-to-CIM data transfers during computation. This work presents an INT/FP DM macro featuring (1) a DM zone-based input (IN) processing scheme (ZB-IPS) that eliminates subtraction in exponent (EXP) computation while reusing the alignment circuit in INT mode, improving EEF and area efficiency (AEF); (2) a DM local-computing cell (DM-LCC), which reuses the EXP addition as an adder-tree stage for INT-MAC to improve AEF in INT mode; and (3) a stationary-based two-port gain-cell (GC) array (SB-TP-GCA) that supports concurrent data updates and computation while reducing system-to-CIM and internal data accesses, improving EEF and MAC latency (T_MAC). A 16nm 96Kb INT/FP DM GC-CIM macro with 4T GCs is fabricated to support FP-MAC with 64 accumulations (N_ACCU) for BF16-IN, BF16-W, and FP32-OUT, as well as INT-MAC with N_ACCU = 128 for 8b-IN, 8b-W, and 23b-OUT. This CIM macro achieves an INT-MAC EEF of 163.3 TOPS/W and an FP-MAC EEF of 91.2 TFLOPS/W.
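To make the FP-MAC dataflow in the abstract concrete, the sketch below is a minimal software model of alignment-based BF16 × BF16 → FP32 accumulation: each product's exponent is formed by addition, the mantissa products are right-shifted to a shared maximum exponent, and the aligned terms are summed with integer arithmetic before a single final scaling. This illustrates only the general exponent-add-and-align principle; it is not the macro's ZB-IPS or DM-LCC circuits. The function names (bf16_decompose, fp_mac_aligned) and simplifications (no handling of subnormals, zero, infinity, or NaN; truncating shifts) are assumptions made for clarity.

```python
import random
import struct

def bf16_decompose(x):
    """Split a float into (sign, unbiased exponent, 8-bit mantissa with hidden 1),
    BF16-style. Illustrative only; ignores subnormals, zero, inf, and NaN."""
    bits = struct.unpack('<I', struct.pack('<f', x))[0]
    sign = -1 if bits >> 31 else 1
    exp = ((bits >> 23) & 0xFF) - 127          # unbiased exponent
    man = 0x80 | ((bits >> 16) & 0x7F)         # BF16 keeps 7 mantissa bits + hidden 1
    return sign, exp, man

def fp_mac_aligned(inputs, weights):
    """Alignment-based BF16 x BF16 -> FP32-style dot product.

    Each product's mantissa is right-shifted to the group's maximum exponent and
    summed with integer arithmetic, then scaled back once at the end. This is a
    software illustration of max-exponent alignment, not the paper's circuits.
    """
    terms = []
    for a, b in zip(inputs, weights):
        sa, ea, ma = bf16_decompose(a)
        sb, eb, mb = bf16_decompose(b)
        terms.append((sa * sb, ea + eb, ma * mb))   # exponent add + mantissa multiply

    e_max = max(e for _, e, _ in terms)
    acc = 0
    for s, e, m in terms:
        acc += s * (m >> (e_max - e))               # align to e_max, integer-accumulate
    return acc * 2.0 ** (e_max - 14)                # 14 = 2 x 7 fractional mantissa bits

# 64-term accumulation (N_ACCU = 64), compared against a double-precision reference
random.seed(0)
x = [random.gauss(0, 1) for _ in range(64)]
w = [random.gauss(0, 1) for _ in range(64)]
print(fp_mac_aligned(x, w), sum(a * b for a, b in zip(x, w)))
```

The two results agree up to the precision lost by truncating inputs to BF16 and by the truncating alignment shifts; the final scaling by 2^(e_max - 14) plays the role of a single normalization after the integer accumulation.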
Keywords
Neural Network, Processing Unit, Data Transfer, Inverter, Processing Strategies, High Energy Consumption, Updated Data, Inference Accuracy, High Energy Efficiency, Numerical Representation, Area Efficiency, Area Overhead, Bit-width, CIFAR-100 Dataset, Bit-shift