REDUCT: keep it close, keep it cool!: efficient scaling of DNN inference on multi-core CPUs with near-cache compute

International Symposium on Computer Architecture (2021)

Abstract
Deep Neural Networks (DNNs) are used in a variety of applications and services. With the evolving nature of DNNs, the race to build optimal hardware (both in the datacenter and at the edge) continues. General-purpose multi-core CPUs offer unique, attractive advantages for DNN inference in both the datacenter [60] and at the edge [71]. Most of the CPU pipeline's design complexity targets general-purpose single-thread performance and is overkill for the relatively simpler, but still hugely important, data-parallel DNN inference workloads. Addressing this disparity efficiently can enable both raw performance scaling and overall performance/Watt improvements for multi-core CPU DNN inference. We present REDUCT, in which we build innovative solutions that bypass traditional CPU resources that impact DNN inference power and limit its performance. Fundamentally, REDUCT's "keep it close" policy enables consecutive pieces of work to be executed close to each other. REDUCT enables instruction delivery/decode close to execution and instruction execution close to data. Simple ISA extensions encode the fixed-iteration-count, loopy workload behavior, enabling an effective bypass of many power-hungry front-end stages of the wide out-of-order (OoO) CPU pipeline. Per-core performance scales efficiently by distributing lightweight tensor compute near all caches in a multi-level cache hierarchy. This maximizes the cumulative utilization of the existing architectural bandwidth resources in the system and minimizes data movement. Across a number of DNN models, REDUCT achieves a 2.3X increase in convolution performance/Watt with a 2X to 3.94X scaling in raw performance. Similarly, REDUCT achieves a 1.8X increase in inner-product performance/Watt with a 2.8X scaling in performance. REDUCT's performance/power scaling is achieved with no increase in cache capacity or bandwidth and a mere 2.63% increase in area. Crucially, REDUCT operates entirely within the CPU programming and memory model, simplifying software development while achieving performance similar to or better than state-of-the-art Domain-Specific Accelerators (DSAs) for DNN inference, providing fresh design choices in the AI era.
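To illustrate the "fixed-iteration-count, loopy workload behavior" the abstract refers to, the sketch below shows an inner-product (fully connected) layer as a plain C loop nest: all trip counts are known before the loops begin and there is no data-dependent control flow, which is the structural property that such ISA extensions could encode to bypass the OoO front end. The function name and layer sizes here are hypothetical and purely for illustration; this is not REDUCT's ISA or code from the paper.

```c
/*
 * Illustrative sketch only: a fixed-trip-count inner-product layer.
 * IN and OUT are compile-time constants, so every loop's iteration
 * count is known before execution starts and the loop body contains
 * no data-dependent branches. Names and sizes are hypothetical.
 */
#include <stdint.h>

#define IN  1024   /* number of input activations (assumed)  */
#define OUT 4096   /* number of output neurons   (assumed)  */

void fc_layer(const int8_t in[IN],
              const int8_t wt[OUT][IN],
              int32_t out[OUT])
{
    for (int o = 0; o < OUT; o++) {        /* trip count fixed at OUT */
        int32_t acc = 0;
        for (int i = 0; i < IN; i++) {     /* trip count fixed at IN  */
            acc += (int32_t)in[i] * (int32_t)wt[o][i];
        }
        out[o] = acc;                      /* straight-line epilogue   */
    }
}
```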
Keywords
REDUCT,architectural bandwidth resources,light-weight tensor,power-hungry front-end stages,DNN inference power,data parallel DNN inference workloads,memory model,DNN models,multilevel cache hierarchy,out-of-order CPU pipeline,fixed-iteration count loop-y workload behavior,CPU resources,multicore CPU DNN inference,raw performance scaling,general-purpose single thread performance,CPU pipeline design complexity,datacenter,general purpose multicore CPUs,deep neural networks,near-cache compute,DNN inference scaling