Time-aware Large Kernel Convolutions

Lioutas Vasileios

ICML, pp. 6172-6183, 2020.


Abstract:

To date, most state-of-the-art sequence modelling architectures use attention to build generative models for language-based tasks. Some of these models use all the available sequence tokens to generate an attention distribution, which results in time complexity of $O(n^2)$. Alternatively, they utilize depthwise convolutions with softmax …

Introduction
  • Sequence modelling has seen great breakthroughs in recent years with the introduction of neural networks.
  • Modern approaches to sequence encoding rely on attention to “filter” the excessive information available at a given time-step.
  • The transformer network (Vaswani et al, 2017) assigns attention weights for a given time-step to all available context token representations, while the newly proposed dynamic convolution (Wu et al, 2019) only computes an attention over a fixed context window.
  • The more recent approach of dynamic convolution (Wu et al, 2019) successfully reduced the time complexity to O(k·n), where k is the kernel size specified for each layer.
Highlights
  • Sequence modelling has seen great breakthroughs in recent years with the introduction of neural networks.
  • We introduce a novel type of adaptive convolution, Time-aware Large Kernel (TaLK) convolutions, that learns the kernel size of a summation kernel for each time-step instead of learning the kernel weights as in a typical convolution operation.
  • We introduce a novel adaptive convolution based on a summation kernel for sequence encoding.
  • The key of the proposed method is an adaptive, time-aware large-kernel convolution operation whose kernel size varies over time as a learned function of the individual time-steps; that is, we propose to learn the offsets of the summation kernel for each time-step (an illustrative sketch follows this list).
  • On the machine translation task, we report results on three mainstream benchmark datasets: WMT English to German (En-De), WMT English to French (En-Fr) and IWSLT German to English (De-En).
  • We presented Time-aware Large Kernel Convolutions, a novel adaptive convolution method based on a summation kernel for sequence representation and encoding.
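    As a concrete illustration of the adaptive summation kernel described above, the following is a minimal PyTorch-style sketch. It is not the authors' implementation: the module name TaLKConvSketch, the sigmoid parameterization of the offsets and the max_offset bound are illustrative assumptions, and the hard rounding of the offsets (used here for simplicity) would need a differentiable relaxation for end-to-end training.

    import torch
    import torch.nn as nn

    class TaLKConvSketch(nn.Module):
        # Illustrative time-aware summation kernel: for each time-step t,
        # predict a left and a right offset and output the normalized sum of
        # the inputs inside [t - left_t, t + right_t], using prefix sums so
        # that the whole layer costs O(n) per channel.
        def __init__(self, dim, max_offset=15):
            super().__init__()
            self.max_offset = max_offset        # assumed bound on the half-width
            self.offsets = nn.Linear(dim, 2)    # predicts (left, right) per step

        def forward(self, x):
            # x: (batch, time, dim)
            batch, n, dim = x.shape
            rel = torch.sigmoid(self.offsets(x))                    # (B, T, 2), values in [0, 1]
            left = (rel[..., 0] * self.max_offset).round().long()   # hard rounding is a
            right = (rel[..., 1] * self.max_offset).round().long()  # simplification only

            # Prefix sums with a leading zero row: S[:, t] = sum of x[:, :t].
            S = torch.cumsum(x, dim=1)
            S = torch.cat([torch.zeros_like(S[:, :1]), S], dim=1)   # (B, T+1, D)

            t = torch.arange(n, device=x.device).unsqueeze(0)       # (1, T)
            lo = (t - left).clamp(min=0)                            # window start
            hi = (t + right).clamp(max=n - 1)                       # window end (inclusive)

            idx_hi = (hi + 1).unsqueeze(-1).expand(-1, -1, dim)
            idx_lo = lo.unsqueeze(-1).expand(-1, -1, dim)
            window_sum = S.gather(1, idx_hi) - S.gather(1, idx_lo)  # sum over [lo, hi]

            # Normalize by the window length to keep the output scale stable.
            length = (hi - lo + 1).unsqueeze(-1).to(x.dtype)
            return window_sum / length

    Calling TaLKConvSketch(dim=512) on a (batch, time, 512) tensor returns a tensor of the same shape; the benefit of the prefix-sum (summed-area) formulation is that the cost stays linear in the sequence length no matter how wide the learned windows become.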
Methods
Results
  • Results on Language Modeling

    The authors evaluated the method on the task of language modeling.
  • The authors' method uses fewer parameters than the best comparison method.
  • Table 2 shows that the method achieves results comparable to current state-of-the-art methods.
  • The authors' method matches the state-of-the-art score on WMT En-Fr, a benchmark considered indicative of a method's effectiveness due to the large number of training examples (36M) it contains.
  • The authors' method outperforms all other methods, setting a new state-of-the-art result.
Conclusion
  • The authors presented Time-aware Large Kernel Convolutions, a novel adaptive convolution method based on a summation kernel for sequence representation and encoding.
  • It learns to predict the kernel boundaries for each time-step of the sequence.
  • The authors will explore this novel convolution mechanism in the area of computer vision.
Summary
  • Introduction:

    Sequence modelling has seen great breakthroughs in recent years with the introduction of neural networks.
  • Modern approaches to sequence encoding rely on attention to “filter” the excessive information available at a given time-step.
  • The transformer network (Vaswani et al, 2017) assigns attention weights for a given time-step to all available context token representations, while the newly proposed dynamic convolution (Wu et al, 2019) only computes an attention over a fixed context window.
  • The more recent approach of dynamic convolution (Wu et al, 2019) successfully reduced the time complexity to O(k·n), where k is the kernel size specified for each layer.
  • Objectives:

    The goal of this paper is to reduce the encoding time complexity for sequence modeling to O(n); a worked sketch of the prefix-sum formulation behind this is given after this summary.
  • Methods:

    Throughput and memory comparison of Self-Attention, DynamicConv (k = 3 and k = 31) and TaLK Convolution for sequence lengths n = 10 and n = 10,000 (see Table 4), and WikiText-103 test perplexity against Grave et al (2017), Dauphin et al (2017), Merity et al (2018), Rae et al (2018) and Baevski & Auli (2019), where TaLK Convolution (Ours) reaches a test perplexity of 20.3 with 240M parameters (see Table 5).
  • Results:

    Results on Language Modeling

    The authors evaluated the method on the task of language modeling.
  • The authors' method uses fewer parameters than the best comparison method.
  • Table 2 shows that the method achieves results comparable to current state-of-the-art methods.
  • The authors' method matches the state-of-the-art score on WMT En-Fr, a benchmark considered indicative of a method's effectiveness due to the large number of training examples (36M) it contains.
  • The authors' method outperforms all other methods, setting a new state-of-the-art result.
  • Conclusion:

    The authors presented Time-aware Large Kernel Convolutions, a novel adaptive convolution method based on a summation kernel for sequence representation and encoding.
  • It learns to predict the kernel boundaries for each time-step of the sequence.
  • The authors will explore this novel convolution mechanism in the area of computer vision.
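    To make the O(n) objective concrete, here is a minimal worked sketch, in our own notation rather than the paper's exact formulation, of how a summation kernel with learned per-step offsets can be evaluated with prefix sums (summed-area tables; cf. Crow, 1984; Ladner & Fischer, 1980):

    \[
      S_0 = 0, \qquad S_t = \sum_{i=1}^{t} x_i \quad (t = 1, \dots, n)
    \]
    \[
      o_t = \frac{S_{\min(t + r_t,\, n)} - S_{\max(t - l_t,\, 1) - 1}}{\min(t + r_t,\, n) - \max(t - l_t,\, 1) + 1}
    \]

    Here l_t and r_t denote the learned left and right offsets for time-step t. All S_t are computed in a single pass over the sequence, and each o_t is then a constant-time difference of two prefix sums, so the whole operation costs O(n) regardless of how large the learned windows become.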
Tables
  • Table1: Maximum path lengths, per-layer complexity and minimum number of sequential operations for different layer types. n is the sequence length, d is the representation dimension and k is the kernel size of convolutions
  • Table2: Machine translation accuracy in terms of BLEU for WMT En-De and WMT En-Fr on newstest2014
  • Table3: Machine translation accuracy in terms of BLEU on IWSLT De-En
  • Table4: Throughput and memory consumption decrease measured for different sequence lengths (n) on a batch of size 10, with each token represented with d = 1024 and H = 16. Throughput is calculated across 100K iterations of a single input encoding execution for each method. Memory decrease is computed as how many times less memory is needed to encode the input embedding compared to Self-Attention. Larger numbers indicate better performance
  • Table5: Test perplexity on WikiText-103. We used adaptive inputs similar to Baevski & Auli (2019) and show that our method yields better perplexity than self-attention using adaptive inputs
  • Table6: Ablation on IWSLT De-En validation set. (+) indicates that a result includes all preceding features
Download tables as Excel
Related work
  • In this section, we provide a brief review of various related sequence modeling methods, and of related methods that enlarge the receptive field of a convolution operation.

    2.1. Sequence Modeling

    Sequence modeling is an important task in machine learning. An effective system should be able to comprehend and generate sequences similar to real data. Traditional approaches typically rely on various kinds of recurrent neural networks such as long short-term memory networks (Hochreiter & Schmidhuber, 1997; Sutskever et al, 2014; Li et al, 2016; 2018) and gated recurrent unit networks (Cho et al, 2014; Nabil et al, 2016). These recurrent approaches are auto-regressive, which slows the process down for long sequences since each step depends on the previously generated output tokens. Recent work focuses on convolutional neural network (CNN) methods (Kalchbrenner et al, 2016; Gehring et al, 2017; Wu et al, 2019) or self-attention methods (Vaswani et al, 2017; Dai et al, 2019; Kitaev et al, 2020), both of which facilitate parallelization of the encoding process. In addition, since they are not auto-regressive, they allow the encoding process to capture stronger global and local dependencies.
Reference
  • Aharoni, R., Johnson, M., and Firat, O. Massively multilingual neural machine translation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 2019.
  • Ahmed, K., Keskar, N. S., and Socher, R. Weighted transformer network for machine translation, 2017. URL https://arxiv.org/abs/1711.02132.
  • Baevski, A. and Auli, M. Adaptive input representations for neural language modeling. In International Conference on Learning Representations, 2019.
  • Bahdanau, D., Cho, K., and Bengio, Y. Neural machine translation by jointly learning to align and translate, 2014. URL https://arxiv.org/abs/1409.0473.
  • Britz, D., Goldie, A., Luong, M.-T., and Le, Q. Massive exploration of neural machine translation architectures. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, 2017.
  • Burkov, E. and Lempitsky, V. Deep neural networks with box convolutions. In Advances in Neural Information Processing Systems 31, 2018.
  • Celikyilmaz, A., Bosselut, A., He, X., and Choi, Y. Deep communicating agents for abstractive summarization. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), 2018.
  • Chen, M. X., Firat, O., Bapna, A., Johnson, M., Macherey, W., Foster, G., Jones, L., Schuster, M., Shazeer, N., Parmar, N., Vaswani, A., Uszkoreit, J., Kaiser, L., Chen, Z., Wu, Y., and Hughes, M. The best of both worlds: Combining recent advances in neural machine translation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2018.
  • Cheng, J., Dong, L., and Lapata, M. Long short-term memory-networks for machine reading. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, 2016.
  • Cho, K., van Merrienboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., and Bengio, Y. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014.
  • Crow, F. C. Summed-area tables for texture mapping. In Proceedings of the 11th Annual Conference on Computer Graphics and Interactive Techniques, 1984.
  • Dai, Z., Yang, Z., Yang, Y., Carbonell, J., Le, Q., and Salakhutdinov, R. Transformer-XL: Attentive language models beyond a fixed-length context. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019.
  • Dauphin, Y. N., Fan, A., Auli, M., and Grangier, D. Language modeling with gated convolutional networks. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, 2017.
  • Deng, Y., Kim, Y., Chiu, J., Guo, D., and Rush, A. M. Latent alignment and variational attention. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, 2018.
  • Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT, 2019.
  • Fan, A., Grangier, D., and Auli, M. Controllable abstractive summarization. In Proceedings of the 2nd Workshop on Neural Machine Translation and Generation. Association for Computational Linguistics, 2018.
  • Gehring, J., Auli, M., Grangier, D., Yarats, D., and Dauphin, Y. Convolutional sequence to sequence learning. In ICML, 2017.
  • Grave, E., Joulin, A., and Usunier, N. Improving neural language models with a continuous cache. In International Conference on Learning Representations, 2017.
  • He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  • Hinton, G. E., Srivastava, N., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. R. Improving neural networks by preventing co-adaptation of feature detectors, 2012. URL https://arxiv.org/abs/1207.0580.
  • Hochreiter, S. and Schmidhuber, J. Long short-term memory. Neural Computation, 1997.
  • Kalchbrenner, N., Grefenstette, E., and Blunsom, P. A convolutional neural network for modelling sentences. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, 2014.
  • Kalchbrenner, N., Espeholt, L., Simonyan, K., van den Oord, A., Graves, A., and Kavukcuoglu, K. Neural machine translation in linear time, 2016. URL https://arxiv.org/abs/1610.10099.
  • Kim, Y. Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, 2014.
  • Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015.
  • Kitaev, N., Kaiser, L., and Levskaya, A. Reformer: The efficient transformer. In International Conference on Learning Representations, 2020.
  • Micikevicius, P., Narang, S., Alben, J., Diamos, G., Elsen, E., Garcia, D., Ginsburg, B., Houston, M., Kuchaiev, O., Venkatesh, G., and Wu, H. Mixed precision training. In International Conference on Learning Representations, 2018.
  • Nabil, M., Atyia, A., and Aly, M. CUFE at SemEval-2016 task 4: A gated recurrent model for sentiment classification. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), 2016.
  • Ott, M., Edunov, S., Grangier, D., and Auli, M. Scaling neural machine translation. In Proceedings of the Third Conference on Machine Translation: Research Papers, 2018.
  • Kolen, J. F. and Kremer, S. C. Gradient Flow in Recurrent Nets: The Difficulty of Learning Long-Term Dependencies. IEEE, 2001.
  • Paulus, R., Xiong, C., and Socher, R. A deep reinforced model for abstractive summarization. In International Conference on Learning Representations, 2018.
  • Ladner, R. E. and Fischer, M. J. Parallel prefix computation. J. ACM, 1980.
  • Lample, G., Ballesteros, M., Subramanian, S., Kawakami, K., and Dyer, C. Neural architectures for named entity recognition. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2016.
  • Lecun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 1998.
  • Lewis, J. Fast template matching. Vis. Interface, 1994.
  • Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., and Sutskever, I. Language models are unsupervised multitask learners, 2019.
  • Rae, J. W., Dyer, C., Dayan, P., and Lillicrap, T. P. Fast parametric learning with activation memorization. In ICML, 2018.
  • Ramachandran, P., Zoph, B., and Le, Q. V. Searching for activation functions, 2017. URL https://arxiv.org/abs/1710.05941.
  • Sachan, D. S., Zaheer, M., and Salakhutdinov, R. Revisiting LSTM networks for semi-supervised text classification via mixed objective function. In AAAI, 2019.
  • Li, J., Galley, M., Brockett, C., Spithourakis, G., Gao, J., and Dolan, B. A persona-based neural conversation model. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2016.
  • Sennrich, R., Haddow, B., and Birch, A. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, 2016.
  • Li, Y., Pan, Q., Wang, S., Yang, T., and Cambria, E. A generative model for category text generation. Information Sciences, 2018.
  • Loshchilov, I. and Hutter, F. SGDR: Stochastic gradient descent with warm restarts. In International Conference on Learning Representations, 2017.
  • Merity, S., Xiong, C., Bradbury, J., and Socher, R. Pointer sentinel mixture models. In International Conference on Learning Representations, 2017.
  • Merity, S., Keskar, N. S., and Socher, R. An analysis of neural language modeling at multiple scales, 2018. URL http://arxiv.org/abs/1803.08240.
  • Shaw, P., Uszkoreit, J., and Vaswani, A. Self-attention with relative position representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), 2018.
  • Shen, T., Zhou, T., Long, G., Jiang, J., and Zhang, C. Bi-directional block self-attention for fast and memory-efficient sequence modeling. In International Conference on Learning Representations, 2018.
  • Springenberg, J. T., Dosovitskiy, A., Brox, T., and Riedmiller, M. Striving for simplicity: The all convolutional net. In International Conference on Learning Representations, 2015.
  • Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 2014.
  • Sundermeyer, M., Schlüter, R., and Ney, H. LSTM neural networks for language modeling. In INTERSPEECH, 2012.
  • Sutskever, I., Vinyals, O., and Le, Q. V. Sequence to sequence learning with neural networks. In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, 2014.
  • Xu, J., Chen, D., Qiu, X., and Huang, X. Cached long short-term memory neural networks for document-level sentiment classification. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, 2016.
  • Yu, F. and Koltun, V. Multi-scale context aggregation by dilated convolutions. In International Conference on Learning Representations, 2016.
  • Zhang, L., Halber, M., and Rusinkiewicz, S. Accelerating large-kernel convolution using summed-area tables, 2019. URL https://arxiv.org/abs/1906.11367.
  • Szegedy, C., Ioffe, S., Vanhoucke, V., and Alemi, A. Inception-v4, Inception-ResNet and the impact of residual connections on learning. In AAAI, 2016.
  • Tang, G., Müller, M., Rios, A., and Sennrich, R. Why self-attention? A targeted evaluation of neural machine translation architectures. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 2018.
  • Tran, K., Bisazza, A., and Monz, C. Recurrent memory networks for language modeling. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, 2016.
  • Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems 30, 2017.
  • Viola, P. and Jones, M. Robust real-time object detection. International Journal of Computer Vision, 2001.
  • Vishkin, U. Prefix sums and an application thereof. US Patent 09/224,104, 2003. URL http://www.google.com/patents?id=qCAPAAAAEBAJ.
  • Wu, F., Fan, A., Baevski, A., Dauphin, Y., and Auli, M. Pay less attention with lightweight and dynamic convolutions. In International Conference on Learning Representations, 2019.
  • Wu, Y., Schuster, M., Chen, Z., Le, Q. V., Norouzi, M., Macherey, W., Krikun, M., Cao, Y., Gao, Q., Macherey, K., Klingner, J., Shah, A., Johnson, M., Liu, X., Kaiser, Ł., Gouws, S., Kato, Y., Kudo, T., Kazawa, H., Stevens, K., Kurian, G., Patil, N., Wang, W., Young, C., Smith, J., Riesa, J., Rudnick, A., Vinyals, O., Corrado, G., Hughes, M., and Dean, J. Google’s neural machine translation system: Bridging the gap between human and machine translation, 2016. URL https://arxiv.org/abs/1609.08144.