A Framework for Acceleration of CNN Training on Deeply-Pipelined FPGA Clusters with Work and Weight Load Balancing

28th International Conference on Field Programmable Logic and Applications (FPL), 2018

Citations: 53 | Views: 18
Abstract
To improve the flexibility and energy efficiency of Convolutional Neural Networks (CNNs), a number of cloud computing service providers, including Microsoft, Amazon, and Alibaba, are using FPGA-based CNN accelerators. However, the growing size and complexity of neural networks, coupled with communication and off-chip memory bottlenecks, make it increasingly difficult for multi-FPGA designs to achieve high resource utilization and performance, especially during training. In this work, we present new results for a scalable framework, FPDeep, which helps users efficiently map CNN training logic to multiple FPGAs and automatically generates the resulting RTL implementation. FPDeep is equipped with two mechanisms to facilitate high-performance and energy-efficient training. First, FPDeep improves DSP slice utilization across FPGAs by balancing the workload using dedicated partition and mapping strategies. Second, only on-chip memory is used in the CONV layers: (a) FPDeep balances CNN weight allocation among FPGAs to improve BRAM utilization; (b) training is executed in a fine-grained pipelined manner, minimizing the time features must be cached while waiting for back-propagation and thereby reducing storage demand. We evaluate the framework by training AlexNet, VGG-16, and VGG-19. Experimental results show that FPDeep scales well to a large number of FPGAs, with inter-FPGA bandwidth being the limiting factor. With 6 transceivers per FPGA, FPDeep scales almost linearly up to 83 FPGAs. On average, FPDeep provides 6.36x higher energy efficiency than GPU servers.
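The workload-balancing idea in the abstract can be illustrated with a short sketch. The Python snippet below is a minimal illustration, not FPDeep's published partition algorithm: it splits CONV-layer multiply-accumulate (MAC) work across FPGAs so that every device carries an equal arithmetic share, allowing a layer to straddle a device boundary in the spirit of FPDeep's fine-grained mapping. All layer shapes and function names here are hypothetical (the shapes loosely follow early VGG-16 layers).

```python
# Sketch only: equal-MAC partitioning of CONV layers across FPGAs.
# Layer shapes and names are illustrative, not from the paper.

def macs(c_in, c_out, k, h, w):
    """Multiply-accumulate count for one CONV layer."""
    return c_in * c_out * k * k * h * w

# (c_in, c_out, kernel, out_h, out_w) -- hypothetical values
layers = [
    (3, 64, 3, 224, 224),
    (64, 64, 3, 224, 224),
    (64, 128, 3, 112, 112),
    (128, 128, 3, 112, 112),
]

def partition(layers, n_fpgas):
    """Assign each FPGA a list of (layer_index, fraction) pairs such
    that every device carries total_work / n_fpgas MACs; a layer may
    be split across a device boundary."""
    work = [macs(*l) for l in layers]
    target = sum(work) / n_fpgas
    assignment = [[] for _ in range(n_fpgas)]
    dev, room = 0, target
    for i, w in enumerate(work):
        remaining = w
        while remaining > 0:
            if dev == n_fpgas - 1:
                room = remaining  # last device absorbs rounding error
            take = min(remaining, room)
            assignment[dev].append((i, take / w))
            remaining -= take
            room -= take
            if room == 0 and dev < n_fpgas - 1:
                dev, room = dev + 1, target
    return assignment

for d, shares in enumerate(partition(layers, 3)):
    print(f"FPGA {d}:", [(i, round(f, 2)) for i, f in shares])
```

FPDeep's actual mapping must additionally balance weight storage against per-device BRAM capacity; this sketch covers only the DSP (arithmetic) side of the balance.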
Keywords
CNN Training, FPGA Cluster, High Performance Computing