Time-aware Large Kernel Convolutions
ICML, pp. 6172-6183, 2020.
Abstract:
To date, most state-of-the-art sequence modelling architectures use attention to build generative models for language-based tasks. Some of these models use all the available sequence tokens to generate an attention distribution, which results in time complexity of $O(n^2)$. Alternatively, they utilize depthwise convolutions with softmax-normalized kernels acting as attention over a fixed context window, resulting in time complexity of $O(k \cdot n)$, where $k$ is the kernel size. This paper introduces Time-aware Large Kernel (TaLK) Convolutions, a novel adaptive convolution operation that learns the size of a summation kernel for each time-step instead of learning fixed kernel weights, reducing the encoding time complexity to $O(n)$.
Introduction
- Sequence modelling has seen great breakthroughs in recent years with the introduction of neural networks.
- All modern approaches to sequence encoding rely on attention to “filter” the excessive information available at the current time-step.
- The transformer network (Vaswani et al., 2017) assigns attention weights for a given time-step to all available context token representations, while the newly proposed dynamic convolution (Wu et al., 2019) only computes attention over a fixed context window.
- The more recent dynamic convolution (Wu et al., 2019) reduced the time complexity to O(k·n), where k is the kernel size specified for each layer.
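As a rough, self-contained illustration of these complexity figures (not code from the paper), the snippet below contrasts full self-attention, which materializes an n × n score matrix, with a fixed-window aggregation standing in for dynamic convolutions; the uniform window weights and the toy sizes n, d, k are assumptions made only for the example.

```python
import torch
import torch.nn.functional as F

n, d, k = 1000, 64, 31            # toy sequence length, model dim, kernel size
x = torch.randn(n, d)

# Full self-attention: every time-step attends to every other time-step,
# so the score matrix alone holds n * n entries -> O(n^2) time and memory.
scores = (x @ x.t()) / d ** 0.5   # (n, n)
attn_out = torch.softmax(scores, dim=-1) @ x

# Fixed-window aggregation (uniform weights stand in for the learned,
# softmax-normalized kernels of dynamic convolutions): each time-step mixes
# only k neighbours -> O(k * n) time, independent of the full sequence length.
conv_out = F.avg_pool1d(x.t().unsqueeze(0), kernel_size=k,
                        stride=1, padding=k // 2).squeeze(0).t()

print(scores.numel(), k * n)      # n^2 pairwise scores vs. k*n window entries
```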
Highlights
- Sequence modelling has seen great breakthroughs in recent years with the introduction of neural networks.
- We introduce a novel type of adaptive convolution, Time-aware Large Kernel (TaLK) convolutions, which learns the kernel size of a summation kernel for each time-step instead of learning the kernel weights as in a typical convolution operation.
- We introduce a novel adaptive convolution based on a summation kernel for sequence encoding.
- The key of the proposed method is an adaptive time-aware large kernel convolution operation whose kernel sizes vary over time as a learned function of the individual time-steps; that is, we propose to learn the offsets of the summation kernel for each time-step (see the illustrative sketch after this list).
- On the machine translation task, we report results on three mainstream benchmark datasets: WMT English to German (En-De), WMT English to French (En-Fr), and IWSLT German to English (De-En).
- We presented Time-aware Large Kernel Convolutions, a novel adaptive convolution method based on a summation kernel for sequence representation and encoding.
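A minimal sketch of the idea in the highlights above, not the authors' implementation: a small layer predicts left and right boundary offsets for every time-step, and the output is the average of the input representations inside those learned boundaries, computed in O(n) overall via a cumulative sum over time. The module name, the sigmoid offset head, the maximum offset, the mean normalization, and the hard rounding of the boundaries are all assumptions made for readability (a differentiable variant would interpolate between adjacent prefix-sum entries rather than round).

```python
import torch
import torch.nn as nn

class ToyTaLKConv(nn.Module):
    """Sketch of a time-aware summation kernel with learned, per-time-step boundaries.

    For each time-step t, two offsets (left, right) are predicted from x_t and the
    output is the mean of the inputs inside [t - left, t + right]. A cumulative sum
    over time lets every window sum be read off in O(1), so the whole sequence is
    encoded in O(n) regardless of how large the learned windows become.
    """

    def __init__(self, dim, max_offset=8):
        super().__init__()
        self.max_offset = max_offset
        # Predicts relative left/right boundary offsets in [0, 1] for every time-step.
        self.offsets = nn.Linear(dim, 2)

    def forward(self, x):
        # x: (batch, seq_len, dim)
        B, T, D = x.shape
        rel = torch.sigmoid(self.offsets(x))                # (B, T, 2)
        left = rel[..., 0] * self.max_offset                # learned reach to the left
        right = rel[..., 1] * self.max_offset               # learned reach to the right

        t = torch.arange(T, device=x.device, dtype=x.dtype)
        lo = (t - left).clamp(min=0).round().long()         # window start per step
        hi = (t + right).clamp(max=T - 1).round().long()    # window end per step

        # Prefix sums over time: S[:, i] = sum of x[:, :i], so any window sum is a difference.
        S = torch.cat([torch.zeros(B, 1, D, device=x.device, dtype=x.dtype),
                       x.cumsum(dim=1)], dim=1)             # (B, T + 1, D)
        window_sum = (S.gather(1, (hi + 1).unsqueeze(-1).expand(B, T, D))
                      - S.gather(1, lo.unsqueeze(-1).expand(B, T, D)))
        window_len = (hi - lo + 1).unsqueeze(-1).to(x.dtype)

        # NOTE: hard rounding of the boundaries is a simplification for readability;
        # a differentiable version would interpolate between adjacent prefix-sum entries.
        return window_sum / window_len


# Usage: encode a batch of 2 sequences of length 10 with 16-dimensional tokens.
y = ToyTaLKConv(dim=16)(torch.randn(2, 10, 16))
print(y.shape)  # torch.Size([2, 10, 16])
```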
Methods
- Throughput and memory comparison (see Table 4): Self-Attention, DynamicConv (k = 3), DynamicConv (k = 31), and TaLK Convolution are benchmarked at sequence lengths n = 10 and n = 10,000 in terms of iterations/sec and memory decrease; at n = 10,000, Self-Attention runs out of memory (OOM).
- Language modeling comparison (see Table 5): against Grave et al. (2017), Dauphin et al. (2017), Merity et al. (2018), Rae et al. (2018), and Baevski & Auli (2019), TaLK Convolution (Ours) uses 240M parameters and reaches 20.3 test perplexity.
Results
- Results on Language Modeling: The authors evaluated the method on the task of language modeling.
- The authors use fewer parameters than the best comparison method.
- Table 2 shows that the method is able to achieve comparable results to current state-of-the-art methods.
- The authors' method was able to match the state-of-the-art score on WMT En-Fr, a benchmark dataset considered indicative of a method's effectiveness due to the large number of training examples (36M) it contains.
- The authors' method was able to outperform all other methods, setting a new state-of-the-art result.
Conclusion
- The authors presented Time-aware Large Kernel Convolutions, a novel adaptive convolution method based on summation kernel for sequence representation and encoding.
- It learns to predict the kernel boundaries for each time-step of the sequence.
- The authors will explore this novel convolution mechanism in the area of computer vision.
Summary
Objectives:
- The goal of this paper is to reduce the encoding time complexity for sequence modeling to O(n).
Tables
- Table1: Maximum path lengths, per-layer complexity and minimum number of sequential operations for different layer types. n is the sequence length, d is the representation dimension and k is the kernel size of convolutions
- Table2: Machine translation accuracy in terms of BLEU for WMT En-De and WMT En-Fr on newstest2014
- Table3: Machine translation accuracy in terms of BLEU on IWSLT De-En
- Table4: Throughput and memory-consumption decrease measured for different sequence lengths (n) on a batch of size 10, with each token represented with d = 1024 and H = 16. Throughput is calculated across 100K iterations of a single input-encoding execution for each method. Memory decrease is computed as how many times less memory is needed to encode the input embedding compared to Self-Attention. Larger numbers indicate better performance (a measurement sketch follows this list)
- Table5: Test perplexity on WikiText-103. We used adaptive inputs similar to Baevski & Auli (2019) and show that our method yields better perplexity than self-attention using adaptive inputs
- Table6: Ablation on IWSLT De-En validation set. (+) indicates that a result includes all preceding features
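The measurement protocol described for Table 4 could be reproduced along the following lines; this is only a sketch, and the encoder placeholders (`SelfAttentionEncoder`, `TaLKEncoder`), the reduced iteration count, and the CUDA-only memory accounting are assumptions rather than the authors' benchmarking code.

```python
import time
import torch

def benchmark(encoder, n, d=1024, batch=10, iters=1000, device="cuda"):
    """Measure iterations/sec and peak GPU memory for encoding a (batch, n, d) input."""
    x = torch.randn(batch, n, d, device=device)
    encoder = encoder.to(device).eval()
    torch.cuda.reset_peak_memory_stats(device)
    torch.cuda.synchronize(device)
    start = time.time()
    with torch.no_grad():
        for _ in range(iters):          # Table 4 reports 100K iterations; fewer here
            encoder(x)
    torch.cuda.synchronize(device)
    elapsed = time.time() - start
    return iters / elapsed, torch.cuda.max_memory_allocated(device)

# Hypothetical usage: `SelfAttentionEncoder` and `TaLKEncoder` are placeholders for
# the layers under comparison; memory decrease is reported relative to Self-Attention.
# ips_attn, mem_attn = benchmark(SelfAttentionEncoder(d=1024, heads=16), n=10_000)
# ips_talk, mem_talk = benchmark(TaLKEncoder(d=1024, heads=16), n=10_000)
# print(f"{ips_talk:.0f} iter/sec, memory decrease: {mem_attn / mem_talk:.1f}x")
```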
Related work
- In this section, we provide a brief review of related sequence modeling methods and of methods that enlarge the receptive field of a convolution operation.
2.1. Sequence Modeling
Sequence modeling is an important task in machine learning. An effective system should be able to comprehend and generate sequences similar to real data. Traditional approaches typically rely on various kinds of recurrent neural networks, such as long short-term memory networks (Hochreiter & Schmidhuber, 1997; Sutskever et al., 2014; Li et al., 2016; 2018) and gated recurrent unit networks (Cho et al., 2014; Nabil et al., 2016). These recurrent approaches are auto-regressive, which slows the process down for long sequences since each output linearly depends on the previous output tokens. Recent work focuses on convolutional neural network (CNN) methods (Kalchbrenner et al., 2016; Gehring et al., 2017; Wu et al., 2019) or self-attention methods (Vaswani et al., 2017; Dai et al., 2019; Kitaev et al., 2020), both of which facilitate parallelization of the encoding process. In addition, since they are not auto-regressive, they allow the encoding process to capture stronger global and local dependencies.
Reference
- Aharoni, R., Johnson, M., and Firat, O. Massively multilingual neural machine translation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 2019.
- Ahmed, K., Keskar, N. S., and Socher, R. Weighted transformer network for machine translation, 2017. URL https://arxiv.org/abs/1711.02132.
- Baevski, A. and Auli, M. Adaptive input representations for neural language modeling. In International Conference on Learning Representations, 2019.
- Bahdanau, D., Cho, K., and Bengio, Y. Neural machine translation by jointly learning to align and translate, 2014. URL https://arxiv.org/abs/1409.0473.
- Britz, D., Goldie, A., Luong, M.-T., and Le, Q. Massive exploration of neural machine translation architectures. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, 2017.
- Burkov, E. and Lempitsky, V. Deep neural networks with box convolutions. In Advances in Neural Information Processing Systems 31. 2018.
- Celikyilmaz, A., Bosselut, A., He, X., and Choi, Y. Deep communicating agents for abstractive summarization. Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), 2018.
- Chen, M. X., Firat, O., Bapna, A., Johnson, M., Macherey, W., Foster, G., Jones, L., Schuster, M., Shazeer, N., Parmar, N., Vaswani, A., Uszkoreit, J., Kaiser, L., Chen, Z., Wu, Y., and Hughes, M. The best of both worlds: Combining recent advances in neural machine translation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2018.
- Cheng, J., Dong, L., and Lapata, M. Long short-term memory-networks for machine reading. Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, 2016.
- Cho, K., van Merrienboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., and Bengio, Y. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014.
- Crow, F. C. Summed-area tables for texture mapping. In Proceedings of the 11th Annual Conference on Computer Graphics and Interactive Techniques, 1984.
- Dai, Z., Yang, Z., Yang, Y., Carbonell, J., Le, Q., and Salakhutdinov, R. Transformer-xl: Attentive language models beyond a fixed-length context. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019.
- Dauphin, Y. N., Fan, A., Auli, M., and Grangier, D. Language modeling with gated convolutional networks. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, 2017.
- Deng, Y., Kim, Y., Chiu, J., Guo, D., and Rush, A. M. Latent alignment and variational attention. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, 2018.
- Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT, 2019.
- Fan, A., Grangier, D., and Auli, M. Controllable abstractive summarization. In Proceedings of the 2nd Workshop on Neural Machine Translation and Generation. Association for Computational Linguistics, 2018.
- Gehring, J., Auli, M., Grangier, D., Yarats, D., and Dauphin, Y. Convolutional sequence to sequence learning. In ICML, 2017.
- Grave, E., Joulin, A., and Usunier, N. Improving neural language models with a continuous cache. In International Conference on Learning Representations, 2017.
- He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
- Hinton, G. E., Srivastava, N., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. R. Improving neural networks by preventing co-adaptation of feature detectors, 2012. URL https://arxiv.org/abs/1207.0580.
- Hochreiter, S. and Schmidhuber, J. Long short-term memory. Neural Computation, 1997.
- Kalchbrenner, N., Grefenstette, E., and Blunsom, P. A convolutional neural network for modelling sentences. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, 2014.
- Kalchbrenner, N., Espeholt, L., Simonyan, K., van den Oord, A., Graves, A., and Kavukcuoglu, K. Neural machine translation in linear time, 2016. URL https://arxiv.org/abs/1610.10099.
- Kim, Y. Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, 2014.
- Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015.
- Kitaev, N., Kaiser, L., and Levskaya, A. Reformer: The efficient transformer. In International Conference on Learning Representations, 2020.
- Micikevicius, P., Narang, S., Alben, J., Diamos, G., Elsen, E., Garcia, D., Ginsburg, B., Houston, M., Kuchaiev, O., Venkatesh, G., and Wu, H. Mixed precision training. International Conference on Learning Representations, 2018.
- Nabil, M., Atyia, A., and Aly, M. CUFE at SemEval-2016 task 4: A gated recurrent model for sentiment classification. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), 2016.
- Ott, M., Edunov, S., Grangier, D., and Auli, M. Scaling neural machine translation. Proceedings of the Third Conference on Machine Translation: Research Papers, 2018.
- Kolen, J. F. and Kremer, S. C. Gradient Flow in Recurrent Nets: The Difficulty of Learning Long-Term Dependencies. IEEE, 2001.
- Paulus, R., Xiong, C., and Socher, R. A deep reinforced model for abstractive summarization. In International Conference on Learning Representations, 2018.
- Ladner, R. E. and Fischer, M. J. Parallel prefix computation. J. ACM, 1980.
- Lample, G., Ballesteros, M., Subramanian, S., Kawakami, K., and Dyer, C. Neural architectures for named entity recognition. Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2016.
- Lecun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradientbased learning applied to document recognition. Proceedings of the IEEE, 1998.
- Lewis, J. Fast template matching. Vis. Interface, 1994.
- Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., and Sutskever, I. Language models are unsupervised multitask learners, 2019.
- Rae, J. W., Dyer, C., Dayan, P., and Lillicrap, T. P. Fast parametric learning with activation memorization. In ICML, 2018.
- Ramachandran, P., Zoph, B., and Le, Q. V. Searching for activation functions, 2017. URL https://arxiv.org/abs/1710.05941.
- Sachan, D. S., Zaheer, M., and Salakhutdinov, R. Revisiting lstm networks for semi-supervised text classification via mixed objective function. In AAAI, 2019.
- Li, J., Galley, M., Brockett, C., Spithourakis, G., Gao, J., and Dolan, B. A persona-based neural conversation model. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2016.
- Sennrich, R., Haddow, B., and Birch, A. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, 2016.
- Li, Y., Pan, Q., Wang, S., Yang, T., and Cambria, E. A generative model for category text generation. Information Sciences, 2018.
- Loshchilov, I. and Hutter, F. Sgdr: Stochastic gradient descent with warm restarts. In International Conference on Learning Representations, 2017.
- Merity, S., Xiong, C., Bradbury, J., and Socher, R. Pointer sentinel mixture models. In International Conference on Learning Representations, 2017.
- Merity, S., Keskar, N. S., and Socher, R. An analysis of neural language modeling at multiple scales, 2018. URL http://arxiv.org/abs/1803.08240.
- Shaw, P., Uszkoreit, J., and Vaswani, A. Self-attention with relative position representations. Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), 2018.
- Shen, T., Zhou, T., Long, G., Jiang, J., and Zhang, C. Bi-directional block self-attention for fast and memoryefficient sequence modeling. In International Conference on Learning Representations, 2018.
- Springenberg, J. T., Dosovitskiy, A., Brox, T., and Riedmiller, M. Striving for simplicity: The all convolutional net. International Conference on Learning Representations, 2015.
- Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 2014.
- Sundermeyer, M., Schluter, R., and Ney, H. Lstm neural networks for language modeling. In INTERSPEECH, 2012.
- Sutskever, I., Vinyals, O., and Le, Q. V. Sequence to sequence learning with neural networks. In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, 2014.
- Xu, J., Chen, D., Qiu, X., and Huang, X. Cached long short-term memory neural networks for document-level sentiment classification. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, 2016.
- Yu, F. and Koltun, V. Multi-scale context aggregation by dilated convolutions. International Conference on Learning Representations, 2016.
- Zhang, L., Halber, M., and Rusinkiewicz, S. Accelerating large-kernel convolution using summed-area tables, 2019. URL https://arxiv.org/abs/1906.11367.
- Szegedy, C., Ioffe, S., Vanhoucke, V., and Alemi, A. Inception-v4, inception-resnet and the impact of residual connections on learning. In AAAI, 2016.
- Tang, G., Müller, M., Rios, A., and Sennrich, R. Why self-attention? A targeted evaluation of neural machine translation architectures. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 2018.
- Tran, K., Bisazza, A., and Monz, C. Recurrent memory networks for language modeling. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, 2016.
- Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L. u., and Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems 30. 2017.
- Viola, P. and Jones, M. Robust real-time object detection. In International Journal of Computer Vision, 2001.
- Vishkin, U. Prefix sums and an application thereof. U.S. Patent Application 09/224,104, 2003. URL http://www.google.com/patents?id=qCAPAAAAEBAJ.
- Wu, F., Fan, A., Baevski, A., Dauphin, Y., and Auli, M. Pay less attention with lightweight and dynamic convolutions. In International Conference on Learning Representations, 2019.
- Wu, Y., Schuster, M., Chen, Z., Le, Q. V., Norouzi, M., Macherey, W., Krikun, M., Cao, Y., Gao, Q., Macherey, K., Klingner, J., Shah, A., Johnson, M., Liu, X., Kaiser, Ł., Gouws, S., Kato, Y., Kudo, T., Kazawa, H., Stevens, K., Kurian, G., Patil, N., Wang, W., Young, C., Smith, J., Riesa, J., Rudnick, A., Vinyals, O., Corrado, G., Hughes, M., and Dean, J. Google's neural machine translation system: Bridging the gap between human and machine translation, 2016. URL https://arxiv.org/abs/1609.08144.