
# Training Linear Finite-State Machines

NeurIPS 2020 (2020)

Abstract

A finite-state machine (FSM) is a computation model to process binary strings in sequential circuits. Hence, a single-input linear FSM is conventionally used to implement complex single-input functions, such as tanh and exponentiation functions, in the stochastic computing (SC) domain where continuous values are represented by sequences of r…
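
The bipolar representation mentioned in the abstract can be sketched in a few lines. This is the standard SC convention (a value x ∈ [−1, 1] is encoded as a Bernoulli bit stream with P(bit = 1) = (x + 1)/2), not code from the paper; function names and the stream length are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def to_bipolar_stream(x, l):
    """Encode x in [-1, 1] as a length-l stochastic bit stream.

    In bipolar format each bit is 1 with probability (x + 1) / 2, so the
    stream's mean recovers x via x = 2 * E(bit) - 1.
    """
    return (rng.random(l) < (x + 1) / 2).astype(np.int8)

def from_bipolar_stream(bits):
    """Decode a bipolar stochastic stream back to a continuous value."""
    return 2 * bits.mean() - 1

stream = to_bipolar_stream(0.5, 100_000)
print(from_bipolar_stream(stream))  # approximately 0.5, up to sampling noise
```

The longer the stream, the lower the decoding variance, which is why SC trades latency for extremely cheap logic.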

Introduction

- In the paradigm of deep learning, deep neural networks (DNNs) and recurrent neural networks (RNNs) deliver state-of-the-art accuracy across various non-sequential and temporal tasks, respectively.
- They require a considerable amount of storage and computational resources for efficient deployment on different hardware platforms during both training and inference.
- Even though the aforementioned approaches managed to significantly reduce the complexity of DNNs and RNNs, they fail to completely remove multiplications.

Highlights

- In the paradigm of deep learning, deep neural networks (DNNs) and recurrent neural networks (RNNs) deliver state-of-the-art accuracy across various non-sequential and temporal tasks, respectively
- Given a temporal task that makes a decision at each time step, the backpropagation of the finite-state machine (FSM)-based model is performed at the end of each time step
- This is in striking contrast to long short-term memories (LSTMs), where the network is unrolled for each time step and backpropagation is applied to the unrolled network
- Figure 4 shows the memory usage of the LSTM model versus the FSM-based model, and their corresponding test accuracy, on the GeForce GTX 1080 Ti for different numbers of time steps, when both models have the same number of weights and use a batch size of 100 for character-level language modeling (CLLM) on the Penn Treebank dataset [32]
- Training FSM-based models of size 1000 with a batch size of 100 draws roughly 160 W for all time steps ranging from 100 to 2500, whereas training LSTM models of the same size consumes between 205 W and 245 W, based on our measurements from the NVIDIA system management interface
- Even though the aforementioned approaches managed to significantly reduce the complexity of DNNs and RNNs, they fail to completely remove multiplications
- We showed that the required storage for training our FSM-based models is independent of the number of time steps, as opposed to LSTMs

Methods

- Performing the computations for every entry of the input stochastic vector x ∈ {0, 1}^l yields a stochastic output vector y ∈ {0, 1}^l representing the continuous value y ∈ R in bipolar format, such that y = 2 × E(y) − 1.
- The authors train FSM-based networks on the continuous values of the stochastic streams, while the inference computations are still performed on stochastic bit streams.
- Given the occurrence probability p_{ψ_i} of state ψ_i for i ∈ {0, 1, . . . , N − 1}, the authors obtain the continuous value of the WLFSM's output (i.e., y ∈ [−1, 1]) as a weighted sum of the state occurrence probabilities
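
The Methods bullets can be made concrete with a small simulation. This sketch assumes the standard saturating up/down-counter transition rule for a linear FSM and uses illustrative, untrained state weights; the actual trained weights and exact output rule of the paper's WLFSM are only approximated here:

```python
import numpy as np

def run_linear_fsm(bits, n_states):
    """Run an n-state linear FSM (saturating up/down counter) over a bit
    stream and return the empirical occurrence probability of each state.

    A 1-bit moves one state up, a 0-bit one state down, saturating at
    both ends -- the classic linear FSM of stochastic computing.
    """
    counts = np.zeros(n_states)
    state = n_states // 2                      # start in the middle state
    for b in bits:
        state = min(state + 1, n_states - 1) if b else max(state - 1, 0)
        counts[state] += 1
    return counts / len(bits)

def wlfsm_output(bits, state_weights):
    """Weighted linear FSM output: sum_i p_{psi_i} * w_i, a weighted sum
    of the state occurrence probabilities. The w_i here are illustrative
    placeholders, not trained weights from the paper."""
    p = run_linear_fsm(bits, len(state_weights))
    return float(p @ state_weights)

rng = np.random.default_rng(1)
bits = (rng.random(10_000) < 0.8).astype(np.int8)  # input stream biased toward 1
w = np.linspace(-1.0, 1.0, 8)                      # hypothetical 8-state weights
print(f"y = {wlfsm_output(bits, w):.3f}")          # lands in [-1, 1]
```

Because the input stream is biased toward 1, the FSM spends most of its time in the upper states, so the weighted output sits close to the upper end of [−1, 1].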

Results

- States of the FSM-based model are updated based on the present input only.
- Given a temporal task that makes a decision at each time step, the backpropagation of the FSM-based model is performed at the end of each time step.
- In this way, the storage required for intermediate values during training is reduced by a factor of l×, allowing the FSM-based model to process extremely long data sequences.
- Finally, increasing the number of time steps severely impacts the convergence rate of the LSTM model, whereas the convergence rate of the FSM-based model remains unchanged.
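
The storage argument above can be illustrated with a back-of-the-envelope accounting. The model size of 1000 and batch size of 100 mirror the experiment described in the Highlights; the 4-byte activation size is an assumption, and this is a scaling illustration, not the paper's measured memory model:

```python
def bptt_activation_memory(time_steps, hidden, batch, bytes_per_value=4):
    """Unrolled RNN/LSTM training: intermediate activations of every time
    step must be kept until backpropagation through time runs."""
    return time_steps * hidden * batch * bytes_per_value

def fsm_activation_memory(time_steps, hidden, batch, bytes_per_value=4):
    """FSM-based model: backpropagation is applied at the end of each
    time step, so only the current step's intermediates are stored --
    the footprint does not grow with time_steps."""
    return hidden * batch * bytes_per_value

for steps in (100, 1000, 2500):
    unrolled = bptt_activation_memory(steps, hidden=1000, batch=100)
    fsm = fsm_activation_memory(steps, hidden=1000, batch=100)
    print(f"T={steps}: unrolled {unrolled / 2**20:.0f} MiB vs FSM {fsm / 2**20:.2f} MiB")
```

The unrolled cost grows linearly with the number of time steps while the FSM-based cost is flat, which matches the claim that the FSM-based model's training storage is independent of the sequence length.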

Conclusion

- The authors introduced a method to train WLFSMs, computation models that can process sequences of data.
- Networks that contain only WLFSMs and perform their computations on stochastic bit streams are called FSM-based networks.
- As the second application of FSM-based networks, the authors performed a classification task on the MNIST dataset and showed that FSM-based networks significantly outperform conventional SC-based implementations in terms of both misclassification error and number of operations.
- As the final contribution of this paper, the authors introduced an FSM-based model that can perform temporal tasks.
- The authors' FSM-based model can learn extremely long data dependencies while reducing the storage required for intermediate training values by a factor of l×, the power consumption of training by 33%, and the number of inference operations by a factor of 7×.


Tables

- Table 1: Performance of our FSM-based network compared to SC-based implementations on the test set of the MNIST dataset
- Table 2: Performance of our FSM-based model when performing the CLLM task on the test set

References

- M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi, “Xnor-net: Imagenet classification using binary convolutional neural networks,” in Computer Vision – ECCV 2016, B. Leibe, J. Matas, N. Sebe, and M. Welling, Eds. Cham: Springer International Publishing, 2016.
- I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio, “Binarized Neural Networks,” in Advances in Neural Information Processing Systems 29, D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, Eds. Curran Associates, Inc., 2016, pp. 4107–4115. [Online]. Available: http://papers.nips.cc/paper/6573-binarized-neural-networks.pdf
- C. Xu, J. Yao, Z. Lin, W. Ou, Y. Cao, Z. Wang, and H. Zha, “Alternating Multi-bit Quantization for Recurrent Neural Networks,” in International Conference on Learning Representations, 2018. [Online]. Available: https://openreview.net/forum?id=S19dR9x0b
- L. Hou, Q. Yao, and J. T. Kwok, “Loss-aware Binarization of Deep Networks,” CoRR, vol. abs/1611.01600, 2016. [Online]. Available: http://arxiv.org/abs/1611.01600
- P. Wang, X. Xie, L. Deng, G. Li, D. Wang, and Y. Xie, “HitNet: Hybrid Ternary Recurrent Neural Network,” in Advances in Neural Information Processing Systems 31. Curran Associates, Inc., 2018, pp. 604–614. [Online]. Available: http://papers.nips.cc/paper/7341-hitnet-hybrid-ternary-recurrent-neural-network.pdf
- A. Ardakani, Z. Ji, A. Ardakani, and W. Gross, “The Synthesis of XNOR Recurrent Neural Networks with Stochastic Logic,” in Thirty-third Conference on Neural Information Processing Systems, 2019.
- Y. Wang, Z. Zhan, J. Li, J. Tang, B. Yuan, L. Zhao, W. Wen, S. Wang, and X. Lin, “Universal approximation property and equivalence of stochastic computing-based neural networks and binary neural networks,” vol. 33, 2019.
- A. Ardakani, F. Leduc-Primeau, N. Onizawa, T. Hanyu, and W. J. Gross, “VLSI Implementation of Deep Neural Network Using Integral Stochastic Computing,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 25, Oct 2017.
- S. R. Faraji, M. Hassan Najafi, B. Li, D. J. Lilja, and K. Bazargan, “Energy-efficient convolutional neural networks with deterministic bit-stream processing,” in 2019 Design, Automation Test in Europe Conference Exhibition (DATE), 2019.
- S. Liu, H. Jiang, L. Liu, and J. Han, “Gradient descent using stochastic circuits for efficient training of learning machines,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 37, 2018.
- Y. Liu, S. Liu, Y. Wang, F. Lombardi, and J. Han, “A stochastic computational multi-layer perceptron with backward propagation,” IEEE Transactions on Computers, vol. 67, 2018.
- S. Han, J. Pool, J. Tran, and W. Dally, “Learning both weights and connections for efficient neural network,” in Advances in Neural Information Processing Systems 28, C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, Eds. Curran Associates, Inc., 2015, pp. 1135–1143.
- S. Han, H. Mao, and W. J. Dally, “Deep Compression: compressing Deep Neural Network with Pruning, Trained Quantization and Huffman Coding,” CoRR, vol. abs/1510.00149, 2015.
- A. Ardakani, C. Condo, and W. J. Gross, “Sparsely-Connected Neural Networks: Towards Efficient VLSI Implementation of Deep Neural Networks,” Proc. 5th Int. Conf. Learn. Represent. (ICLR), Nov. 2016.
- Y. He, J. Lin, Z. Liu, H. Wang, L.-J. Li, and S. Han, “Amc: Automl for model compression and acceleration on mobile devices,” in ECCV (7), 2018. [Online]. Available: https://doi.org/10.1007/978-3-030-01234-2_48
- F. N. Iandola, M. W. Moskewicz, K. Ashraf, S. Han, W. J. Dally, and K. Keutzer, “Squeezenet: Alexnet-level accuracy with 50x fewer parameters and <1mb model size,” CoRR, vol. abs/1602.07360, 2016.
- A. Gholami, K. Kwon, B. Wu, Z. Tai, X. Yue, P. Jin, S. Zhao, and K. Keutzer, “Squeezenext: Hardwareaware neural network design,” in 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2018.
- A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, “Mobilenets: Efficient convolutional neural networks for mobile vision applications,” 2017, cite arxiv:1704.04861.
- M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L. Chen, “Mobilenetv2: Inverted residuals and linear bottlenecks,” in 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018.
- S. Hochreiter and J. Schmidhuber, “Long Short-Term Memory,” Neural Computation, vol. 9, 1997.
- P. Li, D. J. Lilja, W. Qian, K. Bazargan, and M. Riedel, “The synthesis of complex arithmetic computation on stochastic bit streams using sequential logic,” in 2012 IEEE/ACM International Conference on ComputerAided Design (ICCAD), Nov 2012.
- P. Li, D. J. Lilja, W. Qian, M. D. Riedel, and K. Bazargan, “Logical computation on stochastic bit streams with linear finite-state machines,” IEEE Transactions on Computers, vol. 63, 2014.
- Y. LeCun and C. Cortes, “MNIST handwritten digit database,” 2010.
- S. Brown and Z. Vranesic, Fundamentals of Digital Logic with VHDL Design with CD-ROM, 2nd ed. USA: McGraw-Hill, Inc., 2004.
- K. Cho, B. van Merriënboer, Ç. Gülçehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, “Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation,” in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Doha, Qatar: Association for Computational Linguistics, Oct. 2014. [Online]. Available: http://www.aclweb.org/anthology/D14-1179
- B. R. Gaines, Stochastic Computing Systems. Boston, MA: Springer US, 1969, pp. 37–172.
- A. Alaghi and J. P. Hayes, “Survey of Stochastic Computing,” ACM Trans. Embed. Comput. Syst., vol. 12, May 2013. [Online]. Available: http://doi.acm.org/10.1145/2465787.2465794
- B. D. Brown and H. C. Card, “Stochastic neural computation i: Computational elements,” IEEE Trans. Comput., vol. 50, Sep. 2001. [Online]. Available: http://dx.doi.org/10.1109/12.954505
- G. Arfken, H. J. Weber, and F. Harris, Mathematical Methods for Physicists. Elsevier, 2005. [Online]. Available: https://books.google.ca/books?id=f3aCnXWV1CcC
- J. Mutch and D. G. Lowe, “Object class recognition and localization using sparse features with limited receptive fields,” Int. J. Comput. Vision, vol. 80, Oct. 2008. [Online]. Available: https://doi.org/10.1007/s11263-007-0118-0
- N. Onizawa, D. Katagiri, K. Matsumiya, W. J. Gross, and T. Hanyu, “An accuracy/energy-flexible configurable gabor-filter chip based on stochastic computation with dynamic voltage–frequency–length scaling,” IEEE Journal on Emerging and Selected Topics in Circuits and Systems, vol. 8, 2018.
- M. P. Marcus, M. A. Marcinkiewicz, and B. Santorini, “Building a Large Annotated Corpus of English: The Penn Treebank,” Comput. Linguist., vol. 19, Jun. 1993. [Online]. Available: http://dl.acm.org/citation.cfm?id=972470.972475
- A. Karpathy, J. Johnson, and F.-F. Li, “Visualizing and Understanding Recurrent Networks,” CoRR, vol. abs/1506.02078, 2015. [Online]. Available: http://dblp.uni-trier.de/db/journals/corr/corr1506.html#KarpathyJL15
- E. Strubell, A. Ganesh, and A. McCallum, “Energy and policy considerations for deep learning in NLP,” in Proceedings of the Annual Meeting of the Association for Computational Linguistics, 2019.
