Training Linear Finite-State Machines

NIPS 2020

Abstract

A finite-state machine (FSM) is a computation model for processing binary strings in sequential circuits. Hence, a single-input linear FSM is conventionally used to implement complex single-input functions, such as tanh and exponentiation, in the stochastic computing (SC) domain, where continuous values are represented by sequences of r…
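The abstract's mention of tanh as a function conventionally realized by a single-input linear FSM can be illustrated with a short Python sketch in the style of Brown and Card's stochastic tanh (cited in the reference list below); the state count, stream length, and test value are illustrative choices, not settings from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def to_stream(x, length):
    """Bipolar SC encoding: each bit is 1 with probability (x + 1) / 2,
    so the stream represents x = 2 * E(bits) - 1."""
    return (rng.random(length) < (x + 1) / 2).astype(np.uint8)

def from_stream(bits):
    """Decode a bipolar stream back to a value in [-1, 1]."""
    return 2 * bits.mean() - 1

def fsm_tanh(bits, num_states=8):
    """K-state saturating up/down counter (a single-input linear FSM).
    The output bit is 1 whenever the state sits in the upper half of the
    chain; the decoded output approximately follows tanh(K/2 * x)."""
    state = num_states // 2
    out = np.empty_like(bits)
    for t, b in enumerate(bits):
        state = min(state + 1, num_states - 1) if b else max(state - 1, 0)
        out[t] = state >= num_states // 2
    return out

x = 0.3
y = from_stream(fsm_tanh(to_stream(x, 100_000), num_states=8))
print(y, np.tanh(4 * x))   # the two values should be roughly close
```

Both the encoder and the decoder use the bipolar mapping, under which a stream with P(bit = 1) = p represents the value 2p − 1; the nonlinearity comes entirely from the FSM's saturating state chain, with no multipliers involved.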

Introduction
  • In the paradigm of deep learning, deep neural networks (DNNs) and recurrent neural networks (RNNs) deliver state-of-the-art accuracy across various non-sequential and temporal tasks, respectively.
  • They require a considerable amount of storage and computational resources for efficient deployment on different hardware platforms during both training and inference.
  • Even though the aforementioned approaches significantly reduce the complexity of DNNs and RNNs, they fail to remove multiplications completely.
Highlights
  • In the paradigm of deep learning, deep neural networks (DNNs) and recurrent neural networks (RNNs) deliver state-of-the-art accuracy across various non-sequential and temporal tasks, respectively
  • Given a temporal task that makes a decision at each time step, the backpropagation of the finite-state machine (FSM)-based model is performed at the end of each time step.
  • This is in striking contrast to long short-term memories (LSTMs), where the network is unrolled over the time steps and backpropagation is applied to the unrolled network.
  • Figure 4 shows the memory usage of the LSTM model versus the FSM-based model, and their corresponding test accuracy, on the GeForce GTX 1080 Ti for different numbers of time steps when both models have the same number of weights and use a batch size of 100 for character-level language modeling (CLLM) on the Penn Treebank dataset [32].
  • Training FSM-based models of size 1000 with a batch size of 100 draws roughly 160 W for all time steps ranging from 100 to 2500, whereas training LSTM models of the same size consumes between 205 W and 245 W, according to our measurements obtained from the NVIDIA system management interface.
  • Even though the aforementioned approaches significantly reduce the complexity of DNNs and RNNs, they fail to remove multiplications completely.
  • We showed that the required storage for training our FSM-based models is independent of the number of time steps, as opposed to LSTMs.
Methods
  • Performing the computations for every single entry of the input stochastic vector x ∈ {0, 1}^l yields a stochastic output vector y ∈ {0, 1}^l representing the continuous value y ∈ R in bipolar format, such that y = 2 × E(y) − 1.
  • The authors train FSM-based networks on the continuous values of the stochastic streams, while the inference computations are still performed on stochastic bit streams.
  • Given the occurrence probability of state ψ_i as p_{ψ_i} for i ∈ {0, 1, . . . , N − 1}, the authors can obtain the continuous value of the WLFSM's output (i.e., y ∈ [−1, 1]) from these state occupancy probabilities (a sketch of this readout follows this list).
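As a companion to the bullets above, here is a small sketch of the bipolar convention and a state-occupancy readout for a 4-state linear FSM (the standard saturating up/down chain). Since this summary does not reproduce the paper's readout equation, the per-state weights w and the weighted-sum form used below are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

def bipolar_stream(x, length):
    """Bipolar SC encoding: a bit is 1 with probability (x + 1) / 2,
    so the represented value is y = 2 * E(bits) - 1."""
    return (rng.random(length) < (x + 1) / 2).astype(np.uint8)

def state_occupancy(bits, num_states):
    """Drive a linear FSM (saturating up/down chain) with the stream and
    estimate the occurrence probability p_psi[i] of each state psi_i."""
    counts = np.zeros(num_states)
    state = num_states // 2
    for b in bits:
        state = min(state + 1, num_states - 1) if b else max(state - 1, 0)
        counts[state] += 1
    return counts / counts.sum()

N = 4
w = np.array([-1.0, -0.3, 0.3, 1.0])       # hypothetical per-state weights
p_psi = state_occupancy(bipolar_stream(0.2, 50_000), N)
y = float(p_psi @ w)                       # assumed readout: weighted sum over states
print(p_psi, y)                            # y stays in [-1, 1] because |w_i| <= 1
```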
Results
  • States of the FSM-based model are updated based on the present state and the present input only.
  • Given a temporal task that makes a decision at each time step, the backpropagation of the FSM-based model is performed at the end of each time step.
  • In this way, the storage required for the intermediate values during the training stage is reduced by a factor of l×, allowing extremely long data sequences to be processed with the FSM-based model (see the sketch after this list).
  • As a final point, increasing the number of time steps severely degrades the convergence rate of the LSTM model, whereas the convergence rate of the FSM-based model remains unchanged.
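The storage argument in the bullets above can be made concrete with a schematic PyTorch sketch: unrolled backpropagation through time keeps every intermediate state alive until the end of the sequence, whereas backpropagating at the end of each time step and detaching the state keeps memory flat. The toy recurrence and sizes below are illustrative only; they are not the authors' FSM-based model or training code.

```python
import torch

# Toy recurrence h_{t+1} = tanh(h_t W^T + x_t U^T); sizes are illustrative.
W = torch.randn(64, 64, requires_grad=True)
U = torch.randn(64, 8, requires_grad=True)

def step(h, x):
    return torch.tanh(h @ W.T + x @ U.T)

x_seq = torch.randn(1000, 1, 8)        # l = 1000 time steps
target = torch.zeros(1, 64)

# (a) Unrolled BPTT (LSTM-style): the autograd graph retains all l
#     intermediate states until backward(), so activation memory grows with l.
h, loss = torch.zeros(1, 64), 0.0
for x in x_seq:
    h = step(h, x)
    loss = loss + ((h - target) ** 2).mean()
loss.backward()

W.grad = None
U.grad = None                          # reset accumulated gradients

# (b) Backpropagation at the end of every time step, with the state detached:
#     only one step's graph exists at a time, so memory does not depend on l.
h = torch.zeros(1, 64)
for x in x_seq:
    h = step(h, x)
    ((h - target) ** 2).mean().backward()
    h = h.detach()                     # keep the value, drop the graph
```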
Conclusion
  • The authors introduced a method to train WLFSMs, which are computation models that can process sequences of data.
  • Networks that contain only WLFSMs and perform their computations on stochastic bit streams are called FSM-based networks.
  • As the second application of FSM-based networks, the authors performed a classification task on the MNIST dataset and showed that FSM-based networks significantly outperform their conventional SC-based implementations in terms of both misclassification error and the number of operations.
  • As the final contribution of this paper, the authors introduced an FSM-based model that can perform temporal tasks.
  • The authors' FSM-based model can learn extremely long data dependencies while reducing the storage required for the intermediate values of training by a factor of l×, the power consumption of training by 33%, and the number of inference operations by a factor of 7×.
Tables
  • Table1: Performance of our FSM-based network compared to SC-based implementations on the test set of the MNIST dataset
  • Table2: Performance of our FSM-based model when performing the CLLM task on the test set
Study subjects and analysis
  • Network configurations evaluated on MNIST: 784-250-250-10 and 784-70-70-10, each with l = ∞ and with N = 2.
  • The next-state function determines the next state of the machine based on the present state and the present input.

Reference
  • [1] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi, “XNOR-Net: ImageNet classification using binary convolutional neural networks,” in Computer Vision – ECCV 2016, B. Leibe, J. Matas, N. Sebe, and M. Welling, Eds. Cham: Springer International Publishing, 2016.
  • [2] I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio, “Binarized Neural Networks,” in Advances in Neural Information Processing Systems 29, D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, Eds. Curran Associates, Inc., 2016, pp. 4107–4115. [Online]. Available: http://papers.nips.cc/paper/6573-binarized-neural-networks.pdf
  • [3] C. Xu, J. Yao, Z. Lin, W. Ou, Y. Cao, Z. Wang, and H. Zha, “Alternating Multi-bit Quantization for Recurrent Neural Networks,” in International Conference on Learning Representations, 2018. [Online]. Available: https://openreview.net/forum?id=S19dR9x0b
  • [4] L. Hou, Q. Yao, and J. T. Kwok, “Loss-aware Binarization of Deep Networks,” CoRR, vol. abs/1611.01600, 2016. [Online]. Available: http://arxiv.org/abs/1611.01600
  • [5] P. Wang, X. Xie, L. Deng, G. Li, D. Wang, and Y. Xie, “HitNet: Hybrid Ternary Recurrent Neural Network,” in Advances in Neural Information Processing Systems 31. Curran Associates, Inc., 2018, pp. 604–614. [Online]. Available: http://papers.nips.cc/paper/7341-hitnet-hybrid-ternary-recurrent-neural-network.pdf
  • [6] A. Ardakani, Z. Ji, A. Ardakani, and W. Gross, “The Synthesis of XNOR Recurrent Neural Networks with Stochastic Logic,” in Thirty-third Conference on Neural Information Processing Systems, 2019.
  • [7] Y. Wang, Z. Zhan, J. Li, J. Tang, B. Yuan, L. Zhao, W. Wen, S. Wang, and X. Lin, “Universal approximation property and equivalence of stochastic computing-based neural networks and binary neural networks,” vol. 33, 2019.
  • [8] A. Ardakani, F. Leduc-Primeau, N. Onizawa, T. Hanyu, and W. J. Gross, “VLSI Implementation of Deep Neural Network Using Integral Stochastic Computing,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 25, Oct. 2017.
  • [9] S. R. Faraji, M. Hassan Najafi, B. Li, D. J. Lilja, and K. Bazargan, “Energy-efficient convolutional neural networks with deterministic bit-stream processing,” in 2019 Design, Automation & Test in Europe Conference & Exhibition (DATE), 2019.
  • [10] S. Liu, H. Jiang, L. Liu, and J. Han, “Gradient descent using stochastic circuits for efficient training of learning machines,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 37, 2018.
  • [11] Y. Liu, S. Liu, Y. Wang, F. Lombardi, and J. Han, “A stochastic computational multi-layer perceptron with backward propagation,” IEEE Transactions on Computers, vol. 67, 2018.
  • [12] S. Han, J. Pool, J. Tran, and W. Dally, “Learning both weights and connections for efficient neural network,” in Advances in Neural Information Processing Systems 28, C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, Eds. Curran Associates, Inc., 2015, pp. 1135–1143.
  • [13] S. Han, H. Mao, and W. J. Dally, “Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding,” CoRR, vol. abs/1510.00149, 2015.
  • [14] A. Ardakani, C. Condo, and W. J. Gross, “Sparsely-Connected Neural Networks: Towards Efficient VLSI Implementation of Deep Neural Networks,” Proc. 5th Int. Conf. Learn. Represent. (ICLR), Nov. 2016.
  • [15] Y. He, J. Lin, Z. Liu, H. Wang, L.-J. Li, and S. Han, “AMC: AutoML for model compression and acceleration on mobile devices,” in ECCV (7), 2018. [Online]. Available: https://doi.org/10.1007/978-3-030-01234-2_48
  • [16] F. N. Iandola, M. W. Moskewicz, K. Ashraf, S. Han, W. J. Dally, and K. Keutzer, “SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <1MB model size,” CoRR, vol. abs/1602.07360, 2016.
  • [17] A. Gholami, K. Kwon, B. Wu, Z. Tai, X. Yue, P. Jin, S. Zhao, and K. Keutzer, “SqueezeNext: Hardware-aware neural network design,” in 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2018.
  • [18] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, “MobileNets: Efficient convolutional neural networks for mobile vision applications,” arXiv:1704.04861, 2017.
  • [19] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L. Chen, “MobileNetV2: Inverted residuals and linear bottlenecks,” in 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018.
  • [20] S. Hochreiter and J. Schmidhuber, “Long Short-Term Memory,” Neural Computation, vol. 9, 1997.
  • [21] P. Li, D. J. Lilja, W. Qian, K. Bazargan, and M. Riedel, “The synthesis of complex arithmetic computation on stochastic bit streams using sequential logic,” in 2012 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), Nov. 2012.
  • [22] P. Li, D. J. Lilja, W. Qian, M. D. Riedel, and K. Bazargan, “Logical computation on stochastic bit streams with linear finite-state machines,” IEEE Transactions on Computers, vol. 63, 2014.
  • [23] Y. LeCun and C. Cortes, “MNIST handwritten digit database,” 2010.
  • [24] S. Brown and Z. Vranesic, Fundamentals of Digital Logic with VHDL Design with CD-ROM, 2nd ed. USA: McGraw-Hill, Inc., 2004.
  • [25] K. Cho, B. van Merriënboer, Ç. Gülçehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, “Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation,” in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Doha, Qatar: Association for Computational Linguistics, Oct. 2014. [Online]. Available: http://www.aclweb.org/anthology/D14-1179
  • [26] B. R. Gaines, Stochastic Computing Systems. Boston, MA: Springer US, 1969, pp. 37–172.
  • [27] A. Alaghi and J. P. Hayes, “Survey of Stochastic Computing,” ACM Trans. Embed. Comput. Syst., vol. 12, May 2013. [Online]. Available: http://doi.acm.org/10.1145/2465787.2465794
  • [28] B. D. Brown and H. C. Card, “Stochastic neural computation I: Computational elements,” IEEE Trans. Comput., vol. 50, Sep. 2001. [Online]. Available: http://dx.doi.org/10.1109/12.954505
  • [29] G. Arfken, H.-J. Weber, and F. Harris, Mathematical Methods for Physicists. Elsevier, 2005. [Online]. Available: https://books.google.ca/books?id=f3aCnXWV1CcC
  • [30] J. Mutch and D. G. Lowe, “Object class recognition and localization using sparse features with limited receptive fields,” Int. J. Comput. Vision, vol. 80, Oct. 2008. [Online]. Available: https://doi.org/10.1007/s11263-007-0118-0
  • [31] N. Onizawa, D. Katagiri, K. Matsumiya, W. J. Gross, and T. Hanyu, “An accuracy/energy-flexible configurable Gabor-filter chip based on stochastic computation with dynamic voltage–frequency–length scaling,” IEEE Journal on Emerging and Selected Topics in Circuits and Systems, vol. 8, 2018.
  • [32] M. P. Marcus, M. A. Marcinkiewicz, and B. Santorini, “Building a Large Annotated Corpus of English: The Penn Treebank,” Comput. Linguist., vol. 19, Jun. 1993. [Online]. Available: http://dl.acm.org/citation.cfm?id=972470.972475
  • [33] A. Karpathy, J. Johnson, and F.-F. Li, “Visualizing and Understanding Recurrent Networks,” CoRR, vol. abs/1506.02078, 2015. [Online]. Available: http://dblp.uni-trier.de/db/journals/corr/corr1506.html#KarpathyJL15
  • [34] E. Strubell, A. Ganesh, and A. McCallum, “Energy and policy considerations for deep learning in NLP,” in Proceedings of the Annual Meeting of the Association for Computational Linguistics, 2019.
Author
Amir Ardakani