# FloatPIM: in-memory acceleration of deep neural network training with high precision

Proceedings of the 46th International Symposium on Computer Architecture, pp. 802-815, 2019.

Abstract:

Processing In-Memory (PIM) has shown great potential to accelerate inference tasks of Convolutional Neural Networks (CNNs). However, existing PIM architectures do not support high-precision computation, e.g., in floating-point precision, which is essential for training accurate CNN models. In addition, most of the existing PIM approaches ...

Introduction

- Artificial neural networks, in particular deep learning [3, 4], have a wide range of applications in diverse areas, including object detection [5], self-driving cars, and translation [6].
- On-chip caches do not have enough capacity to store all the data for large CNNs with hundreds of layers and millions of weights.
- This creates a large amount of data movement between the processing cores and memory units, which significantly slows down the computation.
- The result of accumulation passes through an activation function (g).
- This function was traditionally a sigmoid [29], but the Rectified Linear Unit (ReLU) is now the most commonly used [3].
- The activation results are used as the input for the neurons in the next layer.
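
The accumulate-then-activate step described in the last three bullets can be sketched in a few lines of Python. This is only an illustration of the general neuron model, not FloatPIM's in-memory implementation; the function names (`neuron`, `layer`) and all weights and inputs are hypothetical values chosen for the example.

```python
import math


def sigmoid(x: float) -> float:
    """Traditional activation: squashes the accumulated value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def relu(x: float) -> float:
    """Rectified Linear Unit: passes positive values, clamps negatives to 0."""
    return max(0.0, x)


def neuron(inputs, weights, bias, activation):
    """Accumulate weighted inputs, then apply the activation function g."""
    acc = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(acc)


def layer(inputs, weight_rows, biases, activation):
    """One output per neuron; the list feeds the neurons of the next layer."""
    return [neuron(inputs, w, b, activation) for w, b in zip(weight_rows, biases)]


# Hypothetical two-layer network (weights are illustrative, not from the paper).
x = [1.0, -2.0]
hidden = layer(x, [[0.5, -0.25], [-1.0, 1.0]], [0.0, 0.5], relu)
output = layer(hidden, [[2.0, 3.0]], [-2.0], sigmoid)
print(hidden, output)
```

Each layer's activation results become the inputs of the following layer, which is why weights and intermediate activations for deep networks quickly exceed on-chip cache capacity.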

Highlights

- FloatPIM works with any bipolar resistive technology, which is the most commonly used in existing Non-Volatile Memories (NVMs).
- Our evaluation shows that FloatPIM-HP provides an 8.2× higher energy-delay product (EDP) improvement than FloatPIM-LP while requiring 3.9× more memory.
- Our evaluation shows that FloatPIM in high-performance and low-power modes achieves power efficiencies of 818.4 GFLOPS/s/W and 695.1 GFLOPS/s/W, respectively, both higher than the ISAAC (380.7 GOPS/s/W) and PipeLayer (142.9 GOPS/s/W) designs.
- FloatPIM is a flexible Processing In-Memory (PIM)-based accelerator that works with floating-point as well as fixed-point precision.
- All existing PIM architectures support Convolutional Neural Network (CNN) acceleration using only fixed-point values, which results in up to 5.1% lower classification accuracy than the floating-point precision supported by FloatPIM.
- Our evaluation shows that, during training, FloatPIM achieves on average a 4.3× speedup and 15.8× higher energy efficiency compared to PipeLayer (6.3× and 21.6× compared to ISAAC), the state-of-the-art PIM accelerators.
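
To put the reported power-efficiency figures side by side, the short Python sketch below computes the ratios implied by the numbers quoted above. The ratios are derived here and are not stated in the source; note also that the FloatPIM figures count floating-point operations while the ISAAC and PipeLayer figures count fixed-point operations, so this is a rough comparison only. The `edp` helper uses the standard definition of energy-delay product (energy × delay) with purely hypothetical inputs.

```python
# Power efficiency as quoted in the highlights (GFLOPS/s/W or GOPS/s/W).
reported = {
    "FloatPIM-HP": 818.4,
    "FloatPIM-LP": 695.1,
    "ISAAC": 380.7,
    "PipeLayer": 142.9,
}


def efficiency_ratio(design: str, baseline: str) -> float:
    """Ratio of reported power efficiencies (derived, not from the paper)."""
    return reported[design] / reported[baseline]


def edp(energy_joules: float, delay_seconds: float) -> float:
    """Energy-delay product: lower is better."""
    return energy_joules * delay_seconds


for design in ("FloatPIM-HP", "FloatPIM-LP"):
    for baseline in ("ISAAC", "PipeLayer"):
        ratio = efficiency_ratio(design, baseline)
        print(f"{design} vs {baseline}: {ratio:.2f}x")

# Hypothetical example: halving both energy and delay improves EDP by 4x.
print(edp(2.0, 2.0) / edp(1.0, 1.0))
```

Because EDP multiplies energy and delay, a design that is both faster and more frugal improves on both factors at once, which is how a mode that uses more memory can still come out well ahead on this metric.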

Results

7.1 Experimental Setup

- The authors designed and used a cycle-accurate simulator based on TensorFlow [45, 46] that emulates the memory functionality during the DNN training and testing phases.
- The authors use HSPICE for circuit-level simulations to measure the energy consumption and performance of all the FloatPIM floating-point/fixed-point operations in 28nm technology.
- FloatPIM works with any bipolar resistive technology, which is the most commonly used in existing NVMs. Here, the authors adopt a memristor device with a VTEAM model [36].
- The model parameters of the memristor, as listed in Table 1, are chosen to produce a switching delay of 1ns with voltage pulses of 1V and 2V for RESET and SET operations, respectively, in order to fit practical devices [30].

Conclusion

- The authors proposed FloatPIM, the first PIM-based DNN training architecture that exploits analog properties of the memory without explicitly converting data into the analog domain.
- FloatPIM is a flexible PIM-based accelerator that works with floating-point as well as fixed-point precision.
- The authors' evaluation shows that, during training, FloatPIM achieves on average a 4.3× speedup and 15.8× higher energy efficiency compared to PipeLayer (6.3× and 21.6× compared to ISAAC), the state-of-the-art PIM accelerators.

- Table 1: VTEAM Model Parameters for Memristor (kon, koff, αon, αoff)
- Table 2: FloatPIM Parameters
- Table 3: Workloads
- Table 4: Error rate comparison and PIM supports

Related work

- Several recent studies adopt alternative low-precision arithmetic for DNN training [51]. Work in [52, 53] proposed DNN training on hardware with hybrid dynamic fixed-point and floating-point precision. However, for convolutional neural networks, the work in [14, 54] showed that fixed-point is not the most suitable representation for CNN training. Instead, training can be performed with floating-point values that use fewer bits.
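
The trade-off between fixed-point and reduced-width floating point can be illustrated with a small Python sketch. It assumes bfloat16 (the top 16 bits of an IEEE float32) as the reduced floating-point format and a 16-bit Q8.8 layout as the fixed-point format; both choices are assumptions made for this example, and the helper names are hypothetical. The conversion truncates rather than rounds to nearest, which keeps the code short.

```python
import struct


def to_bfloat16(x: float) -> float:
    """Truncate to bfloat16: keep sign, 8 exponent bits, 7 mantissa bits."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return struct.unpack(">f", struct.pack(">I", bits & 0xFFFF0000))[0]


def to_fixed_point(x: float, frac_bits: int = 8, total_bits: int = 16) -> float:
    """Quantize to signed fixed point (Q8.8 by default), saturating on overflow."""
    scale = 1 << frac_bits
    lo = -(1 << (total_bits - 1))
    hi = (1 << (total_bits - 1)) - 1
    quantized = max(lo, min(hi, round(x * scale)))
    return quantized / scale


# Small gradients, ordinary weights, and large activations in one sweep.
for value in (1e-4, 0.1, 300.0):
    print(value, to_bfloat16(value), to_fixed_point(value))
```

With the same 16 bits, the fixed-point format flushes the small value to zero and saturates on the large one, while the floating-point format keeps both within a bounded relative error. This dynamic-range gap is one common explanation for why training, where gradient magnitudes vary widely, tends to favor floating-point representations.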

Modern neural network algorithms are executed on different types of platforms such as GPU, FPGAs, and ASIC chips [55,56,57,58,59,60,61,62,63].

Prior work attempted to fully utilize existing cores to accelerate neural networks. However, in these designs the main computation still relies on CMOS-based cores and thus has limited parallelism. To address the data movement issue, work in [64] proposed a neural cache architecture that repurposes caches for parallel in-memory computing. Work in [65] modified the DRAM architecture to accelerate DNN inference by supporting matrix multiplication in memory. In contrast, FloatPIM performs row-parallel, non-destructive bitwise operations inside a non-volatile memory block without using any sense amplifier. FloatPIM also accelerates DNNs in both training and testing modes.
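
The idea of row-parallel bitwise computation can be sketched in software. The Python model below assumes that the in-memory primitive is a NOR gate, as in memristor-aided logic (MAGIC) [30]; this summary does not state which gate FloatPIM uses, so treat that as an assumption. Each memory column is modeled as one Python integer whose bit r holds the value stored in row r, so a single bitwise operation acts on every row at once. All function names are hypothetical.

```python
def nor(a: int, b: int, mask: int) -> int:
    """Assumed in-memory primitive: NOR of two columns, all rows at once."""
    return ~(a | b) & mask


def not_(a, mask):
    return nor(a, a, mask)


def or_(a, b, mask):
    return not_(nor(a, b, mask), mask)


def and_(a, b, mask):
    return nor(not_(a, mask), not_(b, mask), mask)


def xor_(a, b, mask):
    return and_(or_(a, b, mask), not_(and_(a, b, mask), mask), mask)


def to_columns(values, width):
    """Column j packs bit j of every row's operand into one integer."""
    return [sum(((v >> j) & 1) << r for r, v in enumerate(values))
            for j in range(width)]


def row_parallel_add(xs, ys, width):
    """Bit-serial addition: one pass over bit positions adds all rows."""
    rows = len(xs)
    mask = (1 << rows) - 1
    carry = 0
    sum_columns = []
    for a, b in zip(to_columns(xs, width), to_columns(ys, width)):
        a_xor_b = xor_(a, b, mask)
        sum_columns.append(xor_(a_xor_b, carry, mask))
        carry = or_(and_(a, b, mask), and_(a_xor_b, carry, mask), mask)
    sum_columns.append(carry)  # final carry-out becomes the top bit
    return [sum(((col >> r) & 1) << j for j, col in enumerate(sum_columns))
            for r in range(rows)]


print(row_parallel_add([3, 5, 250], [4, 7, 10], width=8))
```

The number of gate steps depends on the operand width, not on the number of rows, so adding thousands of rows costs the same number of steps as adding one. That property is what makes a row-parallel design attractive compared with routing operands through a limited number of CMOS cores.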

Funding

- This work was partially supported by CRISP, one of six centers in JUMP, an SRC program sponsored by DARPA, and also NSF grants #1730158 and #1527034

Reference

- [1] L. Song, X. Qian, H. Li, and Y. Chen, “Pipelayer: A pipelined reram-based accelerator for deep learning,” in High Performance Computer Architecture (HPCA), 2017 IEEE International Symposium on, IEEE, 2017.
- [2] A. Shafiee, A. Nag, N. Muralimanohar, R. Balasubramonian, J. P. Strachan, M. Hu, R. S. Williams, and V. Srikumar, “Isaac: A convolutional neural network accelerator with in-situ analog arithmetic in crossbars,” in Proceedings of the 43rd International Symposium on Computer Architecture, pp. 14–26, IEEE Press, 2016.
- [3] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, p. 436, 2015.
- [4] J. Schmidhuber, “Deep learning in neural networks: An overview,” Neural networks, vol. 61, pp. 85–117, 2015.
- [5] C. Dong, C. C. Loy, K. He, and X. Tang, “Learning a deep convolutional network for image super-resolution,” in European conference on computer vision, pp. 184–199, Springer, 2014.
- [6] L. Deng, D. Yu, et al., “Deep learning: methods and applications,” Foundations and Trends® in Signal Processing, vol. 7, no. 3–4, pp. 197–387, 2014.
- [7] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, et al., “Mastering the game of go with deep neural networks and tree search,” Nature, vol. 529, no. 7587, p. 484, 2016.
- [8] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
- [9] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi, “Xnor-net: Imagenet classification using binary convolutional neural networks,” in European Conference on Computer Vision, pp. 525–542, Springer, 2016.
- [10] M. N. Bojnordi and E. Ipek, “Memristive boltzmann machine: A hardware accelerator for combinatorial optimization and deep learning,” in High Performance Computer Architecture (HPCA), 2016 IEEE International Symposium on, pp. 1–13, IEEE, 2016.
- [11] P. Chi, S. Li, C. Xu, T. Zhang, J. Zhao, Y. Liu, Y. Wang, and Y. Xie, “Prime: A novel processing-in-memory architecture for neural network computation in reram-based main memory,” in Proceedings of the 43rd International Symposium on Computer Architecture, pp. 27–39, IEEE Press, 2016.
- [12] S. Han, H. Mao, and W. J. Dally, “Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding,” arXiv preprint arXiv:1510.00149, 2015.
- [13] M. Courbariaux, Y. Bengio, and J.-P. David, “Training deep neural networks with low precision multiplications,” arXiv preprint arXiv:1412.7024, 2014.
- [14] C. Louizos, M. Reisser, T. Blankevoort, E. Gavves, and M. Welling, “Relaxed quantization for discretized neural networks,” arXiv preprint arXiv:1810.01875, 2018.
- [15] “Bfloat16 floating point format..” https://en.wikipedia.org/wiki/Bfloat16_
- [16] “Intel xeon processors and intel fpgas..” https://venturebeat.com/2018/05/23/
- “Intel xeon and fpga lines.” https://www.top500.org/news/ https://www.tomshardware.com/news/
- [19] “Google cloud..” https://cloud.google.com/tpu/docs/tensorflow-ops.
- [20] “Tpu repository with tensorflow 1.7.0..” https://blog.riseml.com/
- [21] J. V. Dillon, I. Langmore, D. Tran, E. Brevdo, S. Vasudevan, D. Moore, B. Patton, A. Alemi, M. Hoffman, and R. A. Saurous, “Tensorflow distributions,” arXiv preprint arXiv:1711.10604, 2017.
- [22] “Google. 2018-05-08. retrieved 2018-05-23. in many models this is a drop-in replacement for float-32..” https://www.youtube.com/watch?v=vm67WcLzfvc&
- [23] B. Feinberg, U. K. R. Vengalam, N. Whitehair, S. Wang, and E. Ipek, “Enabling scientific computing on memristive accelerators,” in 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA), pp. 367–382, IEEE, 2018.
- [24] http://newsroom.intel.com/community/intel_newsroom/blog/2015/07/28/
- [25] M. Cheng, L. Xia, Z. Zhu, Y. Cai, Y. Xie, Y. Wang, and H. Yang, “Time: A training-in-memory architecture for memristor-based deep neural networks,” in Proceedings of the 54th Annual Design Automation Conference 2017, p. 26, ACM, 2017.
- [26] Y. Cai, T. Tang, L. Xia, M. Cheng, Z. Zhu, Y. Wang, and H. Yang, “Training low bitwidth convolutional neural network on rram,” in Proceedings of the 23rd Asia and South Pacific Design Automation Conference, pp. 117–122, IEEE Press, 2018.
- [27] Y. Cai, Y. Lin, L. Xia, X. Chen, S. Han, Y. Wang, and H. Yang, “Long live time: improving lifetime for training-in-memory engines by structured gradient sparsification,” in Proceedings of the 55th Annual Design Automation Conference, p. 107, ACM, 2018.
- [28] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems, pp. 1097–1105, 2012.
- [29] L. K. Hansen and P. Salamon, “Neural network ensembles,” IEEE transactions on pattern analysis and machine intelligence, vol. 12, no. 10, pp. 993–1001, 1990.
- [30] S. Kvatinsky, D. Belousov, S. Liman, G. Satat, N. Wald, E. G. Friedman, A. Kolodny, and U. C. Weiser, “Magic–memristor-aided logic,” IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 61, no. 11, pp. 895–899, 2014.
- [31] S. Gupta, M. Imani, and T. Rosing, “Felix: Fast and energy-efficient logic in memory,” in 2018 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), pp. 1–7, IEEE, 2018.
- [32] A. Siemon, S. Menzel, R. Waser, and E. Linn, “A complementary resistive switch-based crossbar array adder,” IEEE journal on emerging and selected topics in circuits and systems, vol. 5, no. 1, pp. 64–74, 2015.
- [33] S. Kvatinsky, G. Satat, N. Wald, E. G. Friedman, A. Kolodny, and U. C. Weiser, “Memristor-based material implication (IMPLY) logic: design principles and methodologies,” TVLSI, vol. 22, no. 10, pp. 2054–2066, 2014.
- [34] J. Borghetti, G. S. Snider, P. J. Kuekes, J. J. Yang, D. R. Stewart, and R. S. Williams, “Memristive switches enable stateful logic operations via material implication,” Nature, vol. 464, no. 7290, pp. 873–876, 2010.
- [35] B. C. Jang, Y. Nam, B. J. Koo, J. Choi, S. G. Im, S.-H. K. Park, and S.-Y. Choi, “Memristive logic-in-memory integrated circuits for energy-efficient flexible electronics,” Advanced Functional Materials, vol. 28, no. 2, 2018.
- [36] S. Kvatinsky, M. Ramadan, E. G. Friedman, and A. Kolodny, “Vteam: A general model for voltage-controlled memristors,” IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 62, no. 8, pp. 786–790, 2015.
- [37] N. Talati, S. Gupta, P. Mane, and S. Kvatinsky, “Logic design within memristive memories using memristor-aided logic (magic),” IEEE Transactions on Nanotechnology, vol. 15, no. 4, pp. 635–650, 2016.
- [38] M. Imani, S. Gupta, and T. Rosing, “Ultra-efficient processing in-memory for data intensive applications,” in Proceedings of the 54th Annual Design Automation Conference 2017, p. 6, ACM, 2017.
- [39] A. Haj-Ali et al., “Efficient algorithms for in-memory fixed point multiplication using magic,” in IEEE ISCAS, IEEE, 2018.
- [40] M. Imani, D. Peroni, Y. Kim, A. Rahimi, and T. Rosing, “Efficient neural network acceleration on gpgpu using content addressable memory,” in 2017 Design, Automation & Test in Europe Conference & Exhibition (DATE), pp. 1026–1031, IEEE, 2017.
- [41] C. Xu, D. Niu, N. Muralimanohar, R. Balasubramonian, T. Zhang, S. Yu, and Y. Xie, “Overcoming the challenges of crossbar resistive memory architectures,” in 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA), pp. 476–488, IEEE, 2015.
- [42] A. Nag, R. Balasubramonian, V. Srikumar, R. Walker, A. Shafiee, J. P. Strachan, and N. Muralimanohar, “Newton: Gravitating towards the physical limits of crossbar acceleration,” IEEE Micro, vol. 38, no. 5, pp. 41–49, 2018.
- [43] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778, 2016.
- [44] A. Ghofrani, A. Rahimi, M. A. Lastras-Montaño, L. Benini, R. K. Gupta, and K.-T. Cheng, “Associative memristive memory for approximate computing in gpus,” IEEE Journal on Emerging and Selected Topics in Circuits and Systems, vol. 6, no. 2, pp. 222–234, 2016.
- [45] F. Chollet, “keras.” https://github.com/fchollet/keras, 2015.
- [46] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, et al., “Tensorflow: Large-scale machine learning on heterogeneous distributed systems,” arXiv preprint arXiv:1603.04467, 2016.
- [47] X. Dong, C. Xu, N. Jouppi, and Y. Xie, “Nvsim: A circuit-level performance, energy, and area model for emerging non-volatile memory,” in Emerging Memory Technologies, pp. 15–50, Springer, 2014.
- [48] D. Compiler, R. User, and M. Guide, “Synopsys,” Inc., see http://www.synopsys.com, 2000.
- [49] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1–9, 2015.
- [50] F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally, and K. Keutzer, “Squeezenet: Alexnet-level accuracy with 50x fewer parameters and <0.5 MB model size,” arXiv preprint arXiv:1602.07360, 2016.
- [51] P. Micikevicius, S. Narang, J. Alben, G. Diamos, E. Elsen, D. Garcia, B. Ginsburg, M. Houston, O. Kuchaev, G. Venkatesh, et al., “Mixed precision training,” arXiv preprint arXiv:1710.03740, 2017.
- [52] M. Drumond, T. Lin, M. Jaggi, and B. Falsafi, “End-to-end dnn training with block floating point arithmetic,” arXiv preprint arXiv:1804.01526, 2018.
- [53] D. Das, N. Mellempudi, D. Mudigere, D. Kalamkar, S. Avancha, K. Banerjee, S. Sridharan, K. Vaidyanathan, B. Kaul, E. Georganas, et al., “Mixed precision training of convolutional neural networks using integer operations,” arXiv preprint arXiv:1802.00930, 2018.
- [54] C. De Sa, M. Leszczynski, J. Zhang, A. Marzoev, C. R. Aberger, K. Olukotun, and C. Ré, “High-accuracy low-precision training,” arXiv preprint arXiv:1803.03383, 2018.
- [55] H. Sharma, J. Park, D. Mahajan, E. Amaro, J. K. Kim, C. Shao, A. Mishra, and H. Esmaeilzadeh, “From high-level deep neural models to fpgas,” in The 49th Annual IEEE/ACM International Symposium on Microarchitecture, p. 17, IEEE Press, 2016.
- [56] J. Albericio, A. Delmás, P. Judd, S. Sharify, G. O’Leary, R. Genov, and A. Moshovos, “Bit-pragmatic deep neural network computing,” in Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 382–394, ACM, 2017.
- [57] T. Chen, Z. Du, N. Sun, J. Wang, C. Wu, Y. Chen, and O. Temam, “Diannao: A small-footprint high-throughput accelerator for ubiquitous machine-learning,” in ACM Sigplan Notices, vol. 49, pp. 269–284, ACM, 2014.
- [58] V. Aklaghi, A. Yazdanbakhsh, K. Samadi, H. Esmaeilzadeh, and R. Gupta, “Snapea: Predictive early activation for reducing computation in deep convolutional neural networks,” ISCA, 2018.
- [59] K. Hegde, J. Yu, R. Agrawal, M. Yan, M. Pellauer, and C. W. Fletcher, “Ucnn: Exploiting computational reuse in deep neural networks via weight repetition,” arXiv preprint arXiv:1804.06508, 2018.
- [60] C. Ding, S. Liao, Y. Wang, Z. Li, N. Liu, Y. Zhuo, C. Wang, X. Qian, Y. Bai, G. Yuan, et al., “CirCNN: accelerating and compressing deep neural networks using block-circulant weight matrices,” in Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 395–408, ACM, 2017.
- [61] E. Nurvitadhi, G. Venkatesh, J. Sim, D. Marr, R. Huang, J. Ong Gee Hock, Y. T. Liew, K. Srivatsan, D. Moss, S. Subhaschandra, et al., “Can fpgas beat gpus in accelerating next-generation deep neural networks?,” in Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pp. 5–14, ACM, 2017.
- [62] S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz, and W. J. Dally, “Eie: efficient inference engine on compressed deep neural network,” in Computer Architecture (ISCA), 2016 ACM/IEEE 43rd Annual International Symposium on, pp. 243–254, IEEE, 2016.
- [63] Y. Chen, T. Luo, S. Liu, S. Zhang, L. He, J. Wang, L. Li, T. Chen, Z. Xu, N. Sun, et al., “Dadiannao: A machine-learning supercomputer,” in Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 609–622, IEEE Computer Society, 2014.
- [64] C. Eckert, X. Wang, J. Wang, A. Subramaniyan, R. Iyer, D. Sylvester, D. Blaauw, and R. Das, “Neural cache: Bit-serial in-cache acceleration of deep neural networks,” arXiv preprint arXiv:1805.03718, 2018.
- [65] S. Li, D. Niu, K. T. Malladi, H. Zheng, B. Brennan, and Y. Xie, “Drisa: A dram-based reconfigurable in-situ accelerator,” in Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 288–301, ACM, 2017.
- [66] M. N. Bojnordi and E. Ipek, “The memristive boltzmann machines,” IEEE Micro, vol. 37, no. 3, pp. 22–29, 2017.
- [67] M. Imani, M. Samragh, Y. Kim, S. Gupta, F. Koushanfar, and T. Rosing, “Rapidnn: In-memory deep neural network acceleration framework,” arXiv preprint arXiv:1806.05794, 2018.
- [68] S. Gupta, M. Imani, H. Kaur, and T. S. Rosing, “Nnpim: A processing in-memory architecture for neural network acceleration,” IEEE Transactions on Computers, 2019.
- [69] M. Imani, S. Gupta, and T. Rosing, “Genpim: Generalized processing in-memory to accelerate data intensive applications,” in 2018 Design, Automation & Test in Europe Conference & Exhibition (DATE), pp. 1155–1158, IEEE, 2018.
- [70] S. Salamat, M. Imani, S. Gupta, and T. Rosing, “Rnsnet: In-memory neural network acceleration using residue number system,” in 2018 IEEE International Conference on Rebooting Computing (ICRC), pp. 1–12, IEEE, 2018.
- [71] M. Imani, A. Rahimi, D. Kong, T. Rosing, and J. M. Rabaey, “Exploring hyperdimensional associative memory,” in 2017 IEEE International Symposium on High Performance Computer Architecture (HPCA), pp. 445–456, IEEE, 2017.
- [72] Y. Kim, M. Imani, and T. Rosing, “Orchard: Visual object recognition accelerator based on approximate in-memory processing,” in 2017 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), pp. 25–32, IEEE, 2017.
- [73] M. Zhou, M. Imani, S. Gupta, and T. Rosing, “Gas: A heterogeneous memory architecture for graph processing,” in Proceedings of the International Symposium on Low Power Electronics and Design, p. 27, ACM, 2018.
- [74] M. Zhou, M. Imani, S. Gupta, Y. Kim, and T. Rosing, “Gram: graph processing in a reram-based computational memory,” in Proceedings of the 24th Asia and South Pacific Design Automation Conference, pp. 591–596, ACM, 2019.
- [75] M. Imani, S. Gupta, S. Sharma, and T. Rosing, “Nvquery: Efficient query processing in non-volatile memory,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2018.
