Vector-Vector-Matrix Architecture: A Novel Hardware-Aware Framework for Low-Latency Inference in NLP Applications
EMNLP 2020, pp. 7975–7984.
Deep neural networks have become the standard approach to building reliable Natural Language Processing (NLP) applications, ranging from Neural Machine Translation (NMT) to dialogue systems. However, improving accuracy by increasing the model size requires a large number of hardware computations, which can slow down NLP applications significantly [...]
- The authors look at efficient models from the software and the hardware side, and the authors discuss the advantages of merging them in a co-design manner.
- Zhang et al. (2018) added a new type of layer, a channel shuffle layer, to neural networks that use group convolution.
- Gao et al. (2018) used a technique similar to group convolution, but applied it to recurrent neural networks.
- They used shuffling operations with a group recurrent neural network and showed improvements for NMT and text summarization.
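The channel shuffle described in the bullets above is just a reshape-transpose-reshape over the channel axis; a minimal NumPy sketch (function name and shapes are illustrative, not taken from the paper):

```python
import numpy as np

def channel_shuffle(x, groups):
    """Shuffle channels across groups so information can flow between
    the channel groups of a preceding group convolution.

    x: array of shape (batch, channels, height, width)
    groups: number of groups used by the group convolution
    """
    batch, channels, h, w = x.shape
    assert channels % groups == 0
    # split channels into (groups, channels_per_group)
    x = x.reshape(batch, groups, channels // groups, h, w)
    # swap the group and within-group axes, then flatten back
    x = x.transpose(0, 2, 1, 3, 4)
    return x.reshape(batch, channels, h, w)

# With 2 groups and 4 channels, channel order (0, 1 | 2, 3) becomes (0, 2, 1, 3),
# so each group now sees channels originating from both groups.
x = np.arange(4).reshape(1, 4, 1, 1)
print(channel_shuffle(x, 2).ravel())  # [0 2 1 3]
```

Without this interleaving, a stack of group convolutions would keep each group's information isolated from the others.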
- We present empirical results suggesting that our framework can reduce the latency of sequence-to-sequence and Transformer models used for Neural Machine Translation (NMT) by a factor of four.
- We present empirical results showing that the vector-vector-matrix architecture (VVMA) can substitute different types of weight matrices in neural networks (NNs).
- We report some theoretical speedups that VVMAs provide when using a TPU-style architecture.
- Even though the main focus of this paper is the contribution of VVMA to neural machine translation, we demonstrate that VVMA is compatible with state-of-the-art language modelling architectures.
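To get a rough feel for the theoretical speedups mentioned above, one can use a back-of-the-envelope cycle model for a TPU-style b x b systolic array, where each b x b tile of a weight matrix streams through the array in about b cycles. The model below is our own illustrative assumption, not the paper's exact cost formula:

```python
import math

def systolic_matvec_cycles(n, m, b=32):
    """Rough cycle count for an n x m matrix-vector product on a
    b x b systolic array: each b x b tile takes ~b cycles to stream.
    Illustrative cost model only; the paper's accounting may differ."""
    return math.ceil(n / b) * math.ceil(m / b) * b

# A dense 1024x1024 layer vs. a crude stand-in for a VVMA-style layer
# that, per block-row, only multiplies by a small shared 32x32 matrix
# (modeled here as a 1024x32 stream; a deliberate simplification).
dense = systolic_matvec_cycles(1024, 1024)
vvma = systolic_matvec_cycles(1024, 32)
print(dense, vvma, dense // vvma)  # 32768 1024 32
```

The point of the model is only that the shared small matrix removes the quadratic dependence on the hidden size; the realized end-to-end speedup in the paper is smaller because other layers and element-wise work remain.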
- We have proposed a novel vector-vector-matrix architecture for low-latency inference, and we have demonstrated theoretical and empirical speed-ups for Seq2seq-LSTM and Transformer models, with application to neural machine translation, language modeling, and image classification.
- The authors use VVMAs in Seq2seq-LSTM and Transformer NMT.
- The authors present a small ablation study in which they modify the VVMAs by removing the diagonal terms D_{i,j} or by varying the value of k.
- The authors compare VVMA to standard low-rank approximations.
- The authors show that the technique extends to language modelling with Transformer-XL, as well as to tasks beyond NLP.
- The authors discuss new AI hardware that could optimize inference for neural networks via VVMAs.
- The authors plan to optimize the low-level code and to develop new hardware to deploy VVMAs in real-world applications.
- Distilling models to their VVMA counterparts would be an interesting experiment, and potentially an orthogonal enhancement to pre-existing frameworks (Sanh et al., 2019).
- VVMAs could be an orthogonal contribution to other factorizations of NLP models, such as in Lan et al. (2020).
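As a concrete reading of the bullets above, here is a minimal NumPy sketch of a vector-vector-matrix style layer. We assume each k x k block W_{i,j} of a weight matrix is approximated as diag(a_{i,j}) M diag(b_{i,j}) + diag(d_{i,j}), with the small k x k matrix M shared across all blocks; this parameterization is our illustration of the idea, and the exact formulation is in the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
k, blocks = 4, 3                 # block size k, number of k-blocks per side
n = k * blocks
M = rng.standard_normal((k, k))  # small matrix shared by every block

# Per-block parameters: two vectors (the "vector-vector" part of the
# name) and a diagonal correction term d (the D_{i,j} of the ablation).
a = rng.standard_normal((blocks, blocks, k))
b = rng.standard_normal((blocks, blocks, k))
d = rng.standard_normal((blocks, blocks, k))

def vvma_matvec(x):
    """Per k-block: y_i += a_ij * (M @ (b_ij * x_j)) + d_ij * x_j,
    i.e. two element-wise vector products around one small fixed matrix."""
    y = np.zeros(n)
    for i in range(blocks):
        for j in range(blocks):
            xj = x[j * k:(j + 1) * k]
            y[i * k:(i + 1) * k] += a[i, j] * (M @ (b[i, j] * xj)) + d[i, j] * xj
    return y

# Equivalent dense matrix, to verify the structure:
# W_ij = diag(a_ij) @ M @ diag(b_ij) + diag(d_ij)
W = np.block([[np.diag(a[i, j]) @ M @ np.diag(b[i, j]) + np.diag(d[i, j])
               for j in range(blocks)] for i in range(blocks)])
x = rng.standard_normal(n)
assert np.allclose(W @ x, vvma_matvec(x))
```

The appeal is that hardware with a fixed fast multiply by M (e.g. photonic or systolic accelerators) only has to perform cheap element-wise vector operations per block at inference time.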
- Table 1: Comparison of the original Seq2seq-LSTM and Transformer models to their VVMA counterparts. Shown are the number of parameters, the BLEU score, and the estimated numbers of clock cycles and floating point operations.
- Table 2: Ablation study for English-Vietnamese NMT with Seq2seq-LSTM models. Here, k is the size of M in VVMAs, and Diags shows whether the diagonal terms are present (T=true, F=false); then follow the number of parameters, the BLEU score, and the numbers of clock cycles and FLOPs. The clock count for Original is measured on a TPU with a block size of 32.
- Table 3: Language modeling on WikiText-103 using Transformer-XL with and without VVMA, as well as using QRNN. (Original: TPU with a block size of 32.)
- Table 4: VVMA's closeness of fit to a target matrix is comparable to that of (i) a standard low-rank approximation and (ii) the optimal approximation, but it is orders of magnitude faster at inference time.
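The standard low-rank baseline referenced in Table 4 is typically a truncated SVD, which by the Eckart-Young theorem gives the best rank-r fit in the Frobenius norm. A minimal sketch (the target matrix here is random, standing in for a trained weight matrix):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64))  # stand-in for a trained weight matrix

def low_rank(W, r):
    """Best rank-r approximation of W in the Frobenius norm
    (Eckart-Young), computed via truncated SVD."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vt[:r]

for r in (4, 16, 64):
    err = np.linalg.norm(W - low_rank(W, r)) / np.linalg.norm(W)
    print(r, round(err, 3))  # error shrinks as the rank grows; exact at full rank
```

Measuring VVMA's fit against this baseline is a fair comparison of approximation quality, while the latency advantage comes from VVMA's hardware-friendly structure rather than from a better fit.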
- Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In Proceedings of the 3rd International Conference on Learning Representations, ICLR ’15, San Diego, CA, US.
- Iz Beltagy, Matthew E. Peters, and Arman Cohan. 2020. Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150.
- Wenlin Chen, James T. Wilson, Stephen Tyree, Kilian Q. Weinberger, and Yixin Chen. 2015. Compressing neural networks with the hashing trick. In Proceedings of the 32nd International Conference on International Conference on Machine Learning, ICML ’15, pages 2285–2294, Lille, France.
- Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. 2019. Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509.
- Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP ’14, pages 1724–1734, Doha, Qatar.
- Zihang Dai, Zhilin Yang, Yiming Yang, Jaime G. Carbonell, Quoc V. Le, and Ruslan Salakhutdinov. 2019. Transformer-XL: Attentive language models beyond a fixed-length context. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, ACL ’19, Florence, Italy.
- Prakhar Ganesh, Yao Chen, Xin Lou, Mohammad Ali Khan, Yin Yang, Deming Chen, Marianne Winslett, Hassan Sajjad, and Preslav Nakov. 2020. Compressing large-scale transformer-based models: A case study on BERT. arXiv preprint arXiv:2002.11985.
- Fei Gao, Lijun Wu, Lu Zhao, Tao Qin, Xueqi Cheng, and Tie-Yan Liu. 2018. Efficient sequence learning with group recurrent networks. In Proceedings of the 16th Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT ’18, pages 799–808, New Orleans, LA, US.
- Yunhui Guo. 2018. A survey on methods and theories of quantized neural networks. arXiv preprint arXiv:1808.04752.
- Song Han, Junlong Kang, Huizi Mao, Yiming Hu, Xin Li, Yubin Li, Dongliang Xie, Hong Luo, Song Yao, Yu Wang, Huazhong Yang, and William J. Dally. 2016a. ESE: Efficient speech recognition engine with compressed LSTM on FPGA. In Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, FPGA ’17, pages 75–84, Monterey, CA, US.
- Song Han, Huizi Mao, and William J. Dally. 2016b. Deep compression: Compressing deep neural network with pruning, trained quantization and Huffman coding. In Proceedings of the 4th International Conference on Learning Representations, ICLR ’16, San Juan, Puerto Rico.
- Song Han, Jeff Pool, John Tran, and William Dally. 2015. Learning both weights and connections for efficient neural networks. In Proceedings of the Annual Conference on Neural Information Processing Systems: Advances in Neural Information Processing Systems 28, NIPS ’15, pages 1135–1143, Montreal, Canada.
- Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR ’16, Las Vegas, NV, US.
- Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In Proceedings of the 3rd International Conference on Learning Representations, ICLR ’15, San Diego, CA, US.
- Nikita Kitaev, Łukasz Kaiser, and Anselm Levskaya. 2020. Reformer: The efficient transformer. In Proceedings of the 8th International Conference on Learning Representations, ICLR ’20, Addis Ababa, Ethiopia.
- Guillaume Klein, Yoon Kim, Yuntian Deng, Josep Maria Crego, Jean Senellart, and Alexander M. Rush. 2017. OpenNMT: Open-source toolkit for neural machine translation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, System Demonstrations, ACL ’17, pages 67–72, Vancouver, Canada.
- Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2020. ALBERT: A lite BERT for self-supervised learning of language representations. In Proceedings of the 8th International Conference on Learning Representations, ICLR ’20, Addis Ababa, Ethiopia.
- Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. 2015. Deep learning. Nature, 521(7553):436–444.
- Yann LeCun, John S. Denker, and Sara A. Solla. 1990. Optimal brain damage. In Advances in Neural Information Processing Systems 2, NIPS ’89, pages 598–605, Denver, CO, US.
- Peter J. Liu, Mohammad Saleh, Etienne Pot, Ben Goodrich, Ryan Sepassi, Łukasz Kaiser, and Noam Shazeer. 2018. Generating Wikipedia by summarizing long sequences. In Proceedings of the 6th International Conference on Learning Representations, ICLR ’18, Vancouver, Canada.
- Minh-Thang Luong, Eugene Brevdo, and Rui Zhao. 2017. Neural machine translation (seq2seq) tutorial. https://github.com/tensorflow/nmt.
- Stefano Markidis, Steven Wei Der Chien, Erwin Laure, Ivy Bo Peng, and Jeffrey S. Vetter. 2018. NVIDIA tensor core programmability, performance & precision. In Proceedings of the 2018 IEEE International Parallel and Distributed Processing Symposium Workshops, IPDPSW ’18, pages 522–531.
- Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. 2017. Pointer sentinel mixture models. In Proceedings of the 5th International Conference on Learning Representations, ICLR ’17, Toulon, France.
- Ramesh Nallapati, Bowen Zhou, Cıcero Nogueira dos Santos, Caglar Gulcehre, and Bing Xiang. 2016. Abstractive text summarization using sequence-to-sequence RNNs and beyond. In Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning, CoNLL ’16, pages 280–290.
- Jerry Quinn and Miguel Ballesteros. 2018. Pieces of eight: 8-bit neural machine translation. In Proceedings of the 16th Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT ’18, pages 114–120, New Orleans, LA, US.
- Alexander M. Rush, Sumit Chopra, and Jason Weston. 2015. A neural attention model for abstractive sentence summarization. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, EMNLP ’15, pages 379–389, Lisbon, Portugal.
- Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2019. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108.
- Yichen Shen, Nicholas C. Harris, Scott Skirlo, Mihika Prabhu, Tom Baehr-Jones, Michael Hochberg, Xin Sun, Shijie Zhao, Hugo Larochelle, Dirk Englund, and Marin Soljacic. 2017. Deep learning with coherent nanophotonic circuits. Nature Photonics, 11:441–446.
- Sainbayar Sukhbaatar, Edouard Grave, Piotr Bojanowski, and Armand Joulin. 2019. Adaptive attention span in transformers. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, ACL ’19, pages 331–335, Florence, Italy.
- Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Proceedings of the Annual Conference on Neural Information Processing Systems: Advances in Neural Information Processing Systems 27, NIPS ’14, pages 3104–3112, Montreal, Canada.
- Vivienne Sze, Yu-Hsin Chen, Tien-Ju Yang, and Joel S. Emer. 2017. Efficient processing of deep neural networks: A tutorial and survey. Proceedings of IEEE, 105(12):2295–2329.
- Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proceedings of the Annual Conference on Neural Information Processing Systems: Advances in Neural Information Processing Systems, NIPS ’17, pages 5998–6008, Long Beach, CA, US.
- Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Łukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2016. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.
- Manzil Zaheer, Guru Guruganesh, Kumar Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, and Amr Ahmed. 2020. Big Bird: Transformers for longer sequences. arXiv preprint arXiv:2007.14062.
- Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, and Jian Sun. 2018. Shufflenet: An extremely efficient convolutional neural network for mobile devices. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR ’18, pages 6848–6856, Salt Lake City, UT, US.