Customization of a Deep Learning Accelerator

2019 International Symposium on VLSI Design, Automation and Test (VLSI-DAT)(2019)

Abstract
Deep-learning algorithms require large numbers of parallel multiply-and-accumulate (MAC) operations, which are well suited to hardware accelerators consisting of parallel processing elements (PEs). As the PE count increases, distributing data to the PEs in time becomes the key problem. Improving accelerator performance requires balancing compute power against data communication bandwidth, which forms a roofline model of throughput. Beyond the hardware setup, the combination of neural network operators and layers also shifts the roofline curve. Optimizing the performance, power, and cost of an accelerator requires linking the neural network models to the physical hardware setup, which indicates that custom, model-specific designs are essential. This presentation introduces an example of a custom deep-learning acceleration system developed from the open-source NVIDIA Deep Learning Accelerator (DLA). We supplement an environment for developing the system, spanning model quantization, model compilation, test generation, and driver tools. FPGA prototypes and test chips were designed and run an object-detection application, achieving 70% MAC utilization across the network.
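The roofline model mentioned above can be sketched as a simple formula: attainable throughput is the minimum of the hardware's peak compute rate and the product of memory bandwidth and arithmetic intensity (operations per byte moved). The sketch below illustrates this relationship; the peak and bandwidth numbers are hypothetical, not taken from the paper.

```python
# Roofline model sketch: performance is capped either by peak compute
# (compute-bound plateau) or by memory bandwidth times arithmetic
# intensity (bandwidth-bound slope). Numbers below are illustrative.

def attainable_gflops(peak_gflops, bandwidth_gbs, intensity_flops_per_byte):
    """Attainable performance (GFLOP/s) under the roofline model."""
    return min(peak_gflops, bandwidth_gbs * intensity_flops_per_byte)

# Hypothetical accelerator: 512 GFLOP/s peak, 25.6 GB/s memory bandwidth.
PEAK = 512.0
BW = 25.6

# Low arithmetic intensity lands on the bandwidth-bound slope.
print(attainable_gflops(PEAK, BW, 4.0))   # -> 102.4 (bandwidth-bound)

# High arithmetic intensity reaches the compute-bound plateau.
print(attainable_gflops(PEAK, BW, 40.0))  # -> 512.0 (compute-bound)
```

Shifting the operator/layer mix of a network changes its arithmetic intensity, which moves the workload along this curve; that is why the abstract notes the roofline shifts with the model, not just the hardware.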
Keywords
deep learning accelerator, hardware accelerators, parallel processing elements, data communication bandwidth, roofline model, neural network operators, MAC network utilization, parallel multiplication and accumulation, optimization, open-source NVIDIA, DLA, FPGA prototypes