Parallel DNN Inference Framework Leveraging a Compact RISC-V ISA-based Multi-core System

KDD '20: The 26th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Virtual Event, CA, USA, July 2020, pp. 627-635.

DOI: https://doi.org/10.1145/3394486.3403105

Abstract:

RISC-V is an open-source instruction set that is now being examined as a universal standard for unifying heterogeneous platforms. However, current research focuses primarily on the design and fabrication of general-purpose processors based on RISC-V, despite the fact that in the era of IoT (Internet of Things), the fusion of heterogeneous ...

Introduction
  • Growing workloads impose heavy pressure on computing platforms. These platforms are highly sensitive to energy efficiency, since neural network inference is a computation-intensive task.
  • The remainder of this paper is organized as follows: Section 2 reviews the related work; Section 3 introduces basic knowledge of the DNN model and a popular lightweight architecture; Section 4 introduces the RISC-V instructions involved in the design; Section 5 formally presents the proposed approach to establishing a collaborative system with the multi-core and the AI SoC; Section 6 demonstrates the experimental exploration on a Xilinx FPGA with HLS.
Highlights
  • Input: an instruction popped from the cache (uint_32). Output: Load: memory → register; Issue: register → Process Engine (PE) cache.
    1: The Instruction Fetch (IF) register fetches the instruction from the instruction cache;
    2: IF transfers the instruction to the Instruction Decode (ID) register when ID is valid;
    3: if inst[6:0] == '0001111' then
    4:   ID: R[rd] = Mem[R[rs1] + imm][31:0]; (Loading)
    5:   flag = 0;
    6: else if inst[6:0] == '0100111' then
    7:   PE_Cache[R[rs1] + imm][31:0] = R[rd]; (Issuing)
    8:   flag = 1;
    9: end if
    10: return flag.
    Since all data must go through the Load and Store instructions before being formally written to the PE, and these two instructions do not have to be strictly tied to the time sequence, we can analyze the impact of different scheduling algorithms and assess their performance
  • We design a RISC-V core that implements the basic Load and Issue instructions using Xilinx HLS, alongside a statistical analysis of the potential multi-core system built on this approach
  • Our analysis reveals that when designing the multi-core system, we should not increase the number of cores without considering a reasonable ratio of cores to PEs or to the convolutional kernel
  • Our experiment demonstrates that a single PE running the CONV4 layer of YOLO-Lite matches well with two cores, which makes the whole timing sequence more compact and results in fewer idle slots for the PEs
Results
  • Since the authors use the RISC-V core as the coordinator between memory and the AI accelerators, the choice of scheduling algorithm has a major impact on overall performance.
  • In the design, a group of RISC-V cores schedules the flow of data traffic from main memory to the accelerators.
  • Since all data must go through the Load and Store instructions before being formally written to the PE, and these two instructions do not have to be strictly tied to the time sequence, the authors can analyze the impact of different scheduling algorithms and assess their performance.
  • To address the aforementioned problems, the authors modify the sequence of the system to eliminate the idle slots and data collisions on the bus. Their arrangement helps the RISC-V core alternately utilize the RD and WR buses, which greatly increases the utilization of on-board resources.
  • Figure 10 presents the read clock cycles needed to fill up the PEs. The RISC-V core design uses fewer read time slots as long as the compiler fully exploits the overlapping mechanism and intelligently schedules the instruction order.
Conclusion
  • The authors design a RISC-V core that implements the basic Load and Issue instructions using Xilinx HLS, alongside a statistical analysis of the potential multi-core system built on this approach.
  • The authors' paper may provide future researchers with innovative hints regarding the cross-disciplinary study of machine learning and computer architecture.
Tables
  • Table 1: YOLO-Lite Architecture
  • Table 2: Resource and latency comparison
  • Table 3: Comparison of RISC & RISC-V latency
Related work
  • In this section, we review the history and current progress of RISC-V. RISC-V, first proposed by UC Berkeley, is an open-source instruction set oriented toward all platforms, and it attracted global attention as soon as it was introduced. The founders of RISC-V also presented a methodology for rapidly building a RISC-V processor [18], which can be considered the manual of their proposed ISA (Instruction Set Architecture).

    Many works have since been launched to expand the RISC-V family. Lee [17] addressed a basic but very important problem, namely vector calculation, which has great significance for matrix operations. That work first discusses the possibility of designing a Vector Processor System (VPS) under the RISC-V standard, since a VPS is renowned for the high parallelism of its architecture and is valuable for linear algebra and fast computation in the frequency domain. As IoT and 5G now play an important role in nearly every scenario, low-latency requirements drive the pursuit of highly parallel devices.
Funding
  • This work was supported in part by the National Natural Science Foundation of China under Grants 61822113 and 62041105, the Natural Science Foundation of Hubei Province under Grant 2018CFA050, the National Key R & D Program of China under Grant 2018YFA060550, Science and Technology Major Project of Hubei Province (Next-Generation AI Technologies) under Grant 2019AEA170
Reference
  • Xiaobing Chen, Shaohui Peng, Luyang Jin, Yimin Zhuang, Jin Song, Weijian Du, Shaoli Liu, and Tian Zhi. 2019. Partition and Scheduling Algorithms for Neural Network Accelerators. In Advanced Parallel Processing Technologies, 13th International Symposium (APPT 2019), LNCS Vol. 11719. Springer, 55-67.
  • Tomasz S. Czajkowski, Utku Aydonat, Dmitry Denisenko, John Freeman, Michael Kinsner, David Neto, Jason Wong, Peter Yiannacouras, and Deshanand P. Singh. 2012. From OpenCL to high-performance hardware on FPGAs. In International Conference on Field Programmable Logic and Applications (FPL).
  • Absalom E. Ezugwu, Marc Frîncu, Aderemi Oluyinka Adewumi, Seyed M. Buhari, and Sahalu B. Junaidu. 2017. Neural network-based multi-agent approach for scheduling in distributed systems. Concurrency and Computation: Practice and Experience 29, 1 (2017).
  • Angelo Garofalo, Manuele Rusci, Francesco Conti, Davide Rossi, and Luca Benini. 2019. PULP-NN: A Computing Library for Quantized Neural Network inference at the edge on RISC-V Based Parallel Ultra Low Power Clusters. In 26th IEEE International Conference on Electronics, Circuits and Systems (ICECS 2019). IEEE, 33-36.
  • Jan Gray. 2016. GRVI Phalanx: A Massively Parallel RISC-V FPGA Accelerator Accelerator. In IEEE International Symposium on Field-Programmable Custom Computing Machines.
  • C. B. Hsu, Y. S. Hong, and J. B. Kuo. 2015. MTCMOS low-power optimization technique (LPOT) for 1V pipelined RISC CPU circuit. In IEEE International Conference on Electronics.
  • Rachel Huang, Jonathan Pedoeem, and Cuixian Chen. 2018. YOLO-LITE: A Real-Time Object Detection Algorithm Optimized for Non-GPU Computers. In IEEE International Conference on Big Data (Big Data 2018). 2503-2510.
  • Goshgar Ismayilov and Haluk Rahmi Topcuoglu. 2020. Neural network based multi-objective evolutionary algorithm for dynamic workflow scheduling in cloud computing. Future Generation Computer Systems 102 (2020), 307-322.
  • Michael A. Iverson, Füsun Özgüner, and Gregory J. Follen. 1995. Parallelizing existing applications in a distributed heterogeneous environment. In 4th Heterogeneous Computing Workshop (HCW '95).
  • Hayato Kato and Hiroshi Saito. 2019. Design of Asynchronous CNN Circuits on Commercial FPGA from Synchronous CNN Circuits. In 13th IEEE International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC 2019). IEEE, 61-67.
  • Jungwon Kim, Sangmin Seo, Jun Lee, Jeongho Nah, Gangwon Jo, and Jaejin Lee. 2012. SnuCL: an OpenCL framework for heterogeneous CPU/GPU clusters. In Proceedings of the 26th ACM International Conference on Supercomputing (ICS 2012).
  • Thaddeus Koehn and Peter Athanas. 2019. Scheduling Data in Neural Network Applications. In Proceedings of the 2019 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA 2019). ACM, 116.
  • Ryosuke Kuramochi, Youki Sada, Masayuki Shimoda, Shimpei Sato, and Hiroki Nakahara. 2019. Many Universal Convolution Cores for Ensemble Sparse Convolutional Neural Networks. In 13th IEEE International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC 2019). IEEE, 93-100.
  • C. Lamb. 2009. OpenCL for NVIDIA GPUs. In 2009 IEEE Hot Chips 21 Symposium (HCS). 1-24.
  • Yann LeCun, Bernhard E. Boser, John S. Denker, Donnie Henderson, Richard E. Howard, Wayne E. Hubbard, and Lawrence D. Jackel. 1989. Handwritten Digit Recognition with a Back-Propagation Network. In Advances in Neural Information Processing Systems 2 (NIPS 1989). Morgan Kaufmann, 396-404.
  • Jaejin Lee, Jungwon Kim, Sangmin Seo, Seungkyun Kim, Jung Ho Park, Honggyu Kim, Thanh Tuan Dao, Yongjin Cho, Sung Jong Seo, and Seung Hak Lee. 2010. An OpenCL framework for heterogeneous multicores with local memory. In International Conference on Parallel Architectures and Compilation Techniques (PACT 2010).
  • Yunsup Lee, Andrew Waterman, Rimas Avizienis, Henry Cook, Chen Sun, Vladimir Stojanović, and Krste Asanović. 2014. A 45nm 1.3GHz 16.7 double-precision GFLOPS/W RISC-V processor with vector accelerators. In European Solid-State Circuits Conference (ESSCIRC).
  • Yunsup Lee, Andrew Waterman, Henry Cook, Brian Zimmer, and Krste Asanovic. 2016. An Agile Approach to Building RISC-V Microprocessors. IEEE Micro 36, 2 (2016), 8-20.
  • Shijie Li, Xiaolong Shen, Yong Dou, Shi-Ce Ni, Jinwei Xu, Ke Yang, Qiang Wang, and Xin Niu. 2019. A Novel Memory-Scheduling Strategy for Large Convolutional Neural Network on Memory-Limited Devices. Computational Intelligence and Neuroscience 2019 (2019), 4328653:1-4328653:12.
  • Longlong Liao, Kenli Li, Keqin Li, Canqun Yang, and Qi Tian. 2018. UHCL-Darknet: An OpenCL-based Deep Neural Network Framework for Heterogeneous Multi-/Many-core Clusters. In Proceedings of the 47th International Conference on Parallel Processing (ICPP 2018). ACM, 44:1-44:10.
  • Katie Lim, Jonathan Balkind, and David Wentzlaff. 2019. JuxtaPiton: Enabling Heterogeneous-ISA Research with RISC-V and SPARC FPGA Soft-cores. In Proceedings of the 2019 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA 2019). 184.
  • Wenqi Lou, Chao Wang, Lei Gong, and Xuehai Zhou. 2019. RV-CNN: Flexible and Efficient Instruction Set for CNNs Based on RISC-V Processors. In Advanced Parallel Processing Technologies, 13th International Symposium (APPT 2019). 3-14.
  • Hyeongyun Moon, Jeonghun Cho, and Daejin Park. 2019. Reconfigurable Fault-Safe Processor Platform Based on RISC-V for Large-Scaled IoT-Driven Applications. In 2019 IEEE DASC/PiCom/CBDCom/CyberSciTech. 627-632.
  • Lucas Morais, Vitor Silva, Alfredo Goldman, Carlos Álvarez, Jaume Bosch, Michael Frank, and Guido Araujo. 2019. Adding Tightly-Integrated Task Scheduling Acceleration to a RISC-V Multi-core Processor. In Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2019). ACM, 861-872.
  • Vinayak Patil, Aneesh Raveendran, P. M. Sobha, A. David Selvakumar, and D. Vivian. 2015. Out of order floating point coprocessor for RISC V ISA. In 19th International Symposium on VLSI Design and Test (VDAT 2015). 1-7.
  • Karyofyllis Patsidis, Dimitris Konstantinou, Chrysostomos Nicopoulos, and Giorgos Dimitrakopoulos. 2018. A low-cost synthesizable RISC-V dual-issue processor core leveraging the compressed Instruction Set Extension. Microprocessors and Microsystems 61 (2018), 1-10.
  • Jagadish Kumar Ranbirsingh, Hanke Kimm, and Haklin Kimm. 2019. Distributed Neural Networks using TensorFlow over Multicore and Many-Core Systems. In 13th IEEE International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC 2019). IEEE, 101-107.
  • Joseph Redmon, Santosh Kumar Divvala, Ross B. Girshick, and Ali Farhadi. 2016. You Only Look Once: Unified, Real-Time Object Detection. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2016). 779-788.
  • Colin Shea and Tinoosh Mohsenin. 2019. Heterogeneous Scheduling of Deep Neural Networks for Low-power Real-time Designs. ACM JETC 15, 4 (2019), 36:1-36:31.
  • Dongjoo Shin, Jinmook Lee, Jinsu Lee, Juhyoung Lee, and Hoi-Jun Yoo. 2017. An energy-efficient deep learning processor with heterogeneous multi-core architecture for convolutional neural networks and recurrent neural networks. In 2017 IEEE Symposium in Low-Power and High-Speed Chips (COOL Chips 2017). IEEE Computer Society, 1-2.
  • Haluk Topcuoglu, Salim Hariri, and Min-You Wu. 2002. Performance-Effective and Low-Complexity Task Scheduling for Heterogeneous Computing. IEEE Trans. Parallel Distrib. Syst. 13, 3 (2002), 260-274.
  • Bay Vo, Loan T. T. Nguyen, Trinh D. D. Nguyen, Philippe Fournier-Viger, and Unil Yun. 2020. A Multi-Core Approach to Efficiently Mining High-Utility Itemsets in Dynamic Profit Databases. IEEE Access 8 (2020), 85890-85899.
  • Zeyang Ye, Lihao Zhang, Keli Xiao, Wenjun Zhou, Yong Ge, and Yuefan Deng. 2018. Multi-User Mobile Sequential Recommendation: An Efficient Parallel Computing Paradigm. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD 2018). ACM, 2624-2633.
  • Yipeng Zhang, Bo Du, Lefei Zhang, Rongchun Li, and Yong Dou. 2019. Accelerated Inference Framework of Sparse Neural Network Based on Nested Bitmask Structure. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence (IJCAI 2019). 4355-4361.