We further examine greedy algorithms for pruning down models, and the potential speed, memory efficiency, and accuracy improvements obtainable therefrom

Are Sixteen Heads Really Better than One?

Advances in Neural Information Processing Systems 32 (NeurIPS 2019): 14014-14024

Cited by: 178 | Views: 106 | Indexed: EI

Abstract

Attention is a powerful and ubiquitous mechanism for allowing neural models to focus on particular salient pieces of information by taking their weighted average when making predictions. In particular, multi-headed attention is a driving force behind many recent state-of-the-art natural language processing (NLP) models such as Transformer-...

Introduction
  • Transformers (Vaswani et al., 2017) have shown state-of-the-art performance across a variety of NLP tasks, including, but not limited to, machine translation (Vaswani et al., 2017; Ott et al., 2018), question answering (Devlin et al., 2018), text classification (Radford et al., 2018), and semantic role labeling (Strubell et al., 2018).
  • However, it is still not entirely clear what the multiple heads in these models buy us.
  • Given a sequence of $n$ vectors $x_1, \ldots, x_n \in \mathbb{R}^d$ and a query vector $q \in \mathbb{R}^d$, the attention layer parametrized by $W_k, W_q, W_v, W_o \in \mathbb{R}^{d \times d}$ computes the weighted sum $\mathrm{Att}_{W_k, W_q, W_v, W_o}(\mathbf{x}, q) = W_o \sum_{i=1}^{n} \alpha_i\, W_v x_i$, where $\alpha_i = \operatorname{softmax}\big(q^\top W_q^\top W_k x_i / \sqrt{d}\big)$.
  • In multi-headed attention (MHA), $N_h$ independently parameterized attention layers are applied in parallel and summed to obtain the final result: $\mathrm{MHAtt}(\mathbf{x}, q) = \sum_{h=1}^{N_h} \mathrm{Att}_{W_k^h, W_q^h, W_v^h, W_o^h}(\mathbf{x}, q)$; a minimal code sketch of this computation appears after this list.
  • In self-attention, every $x_i$ is used as the query $q$ to compute a new sequence of representations, whereas in sequence-to-sequence models $q$ is typically a decoder state while $\mathbf{x}$ corresponds to the encoder output.
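
The two equations above are exactly what head pruning manipulates: each head contributes one additive term to the MHA output. Below is a minimal NumPy sketch of that computation (illustrative only, not the authors' code); the optional mask vector xi, with one 0/1 entry per head, zeroes out individual terms of the sum, which is the operation the pruning experiments rely on.

```python
# Minimal NumPy sketch of the attention equations above (illustrative only,
# not the authors' code). `heads` holds one (Wk, Wq, Wv, Wo) tuple per head;
# `xi` is an optional 0/1 mask over heads, mirroring the xi_h mask variables
# used for pruning.
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def att(x, q, Wk, Wq, Wv, Wo):
    """Single-head attention: Att(x, q) = Wo @ sum_i alpha_i * (Wv @ x_i)."""
    d = q.shape[0]
    scores = np.array([q @ Wq.T @ Wk @ x_i / np.sqrt(d) for x_i in x])
    alpha = softmax(scores)
    return Wo @ sum(a * (Wv @ x_i) for a, x_i in zip(alpha, x))

def mh_att(x, q, heads, xi=None):
    """Multi-head attention: MHAtt(x, q) = sum_h xi_h * Att_h(x, q)."""
    if xi is None:
        xi = np.ones(len(heads))
    return sum(m * att(x, q, *W) for m, W in zip(xi, heads))

# Toy usage: 16 heads, then the same layer with all but one head masked out.
rng = np.random.default_rng(0)
d, n, Nh = 8, 5, 16
x = [rng.normal(size=d) for _ in range(n)]
q = rng.normal(size=d)
heads = [tuple(rng.normal(size=(d, d)) for _ in range(4)) for _ in range(Nh)]
mask = np.zeros(Nh)
mask[0] = 1.0                     # keep only the first head
full_output = mh_att(x, q, heads)
pruned_output = mh_att(x, q, heads, xi=mask)
```

Because heads enter the output only through this sum, a head can be masked at test time without retraining and without touching the remaining parameters.
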
Highlights
  • Transformers (Vaswani et al., 2017) have shown state-of-the-art performance across a variety of natural language processing (NLP) tasks, including, but not limited to, machine translation (Vaswani et al., 2017; Ott et al., 2018), question answering (Devlin et al., 2018), text classification (Radford et al., 2018), and semantic role labeling (Strubell et al., 2018)
  • It is still not entirely clear what the multiple heads in these models buy us. In this paper, we make the surprising observation that – in both Transformer-based models for machine translation and BERT-based (Devlin et al., 2018) natural language inference – most attention heads can be individually removed after training without any significant downside in terms of test performance (§3.2)
  • Many attention layers can even be individually reduced to a single attention head without impacting test performance (§3.3)
  • Concurrently, Voita et al. (2019) study the same question; their approach involves using LRP (Binder et al., 2016) for determining important heads and looking at specific properties such as attending to adjacent positions, rare words, or syntactically related words. They propose an alternate pruning mechanism based on doing gradient descent on the mask variables ξh (the masked form of MHA and the head-importance score derived from it are written out after this list). While their approach and results are complementary to this paper, our study provides additional evidence of this phenomenon beyond neural machine translation (NMT), as well as an analysis of the training dynamics of pruning attention heads
  • We demonstrated that in a variety of settings, several heads can be removed from trained transformer models without statistically significant degradation in test performance, and that some layers can be reduced to only one head
  • Pruning more than 60% of the encoder-decoder (Enc-Dec) attention heads will result in catastrophic performance degradation, while the encoder and decoder self-attention layers can still produce reasonable translations with only 20% of the original attention heads
  • We have shown that in machine translation models, the encoder-decoder attention layers are much more reliant on multi-headedness than the self-attention layers, and provided evidence that the relative importance of each head is determined in the early stages of training
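
For reference, the mask variables ξh mentioned above, together with the head-importance score I_h that the greedy pruning results below rely on, can be written out as follows. This is our reconstruction from the notation used in the paper rather than a verbatim copy, so the exact form (in particular the expectation over held-out data X) should be read as a sketch.

```latex
% Masked multi-head attention: \xi_h \in \{0, 1\} switches head h on or off.
\mathrm{MHAtt}(\mathbf{x}, q) = \sum_{h=1}^{N_h} \xi_h \,
    \mathrm{Att}_{W_k^h, W_q^h, W_v^h, W_o^h}(\mathbf{x}, q)

% Head importance: expected sensitivity of the loss to the mask variable,
% estimated on held-out data X (e.g. newstest2009-2012 for the WMT model).
I_h = \mathbb{E}_{x \sim X}
      \left| \frac{\partial \mathcal{L}(x)}{\partial \xi_h} \right|
    = \mathbb{E}_{x \sim X}
      \left| \mathrm{Att}_h(\mathbf{x}, q)^{\top}
             \frac{\partial \mathcal{L}(x)}{\partial \mathrm{Att}_h(\mathbf{x}, q)} \right|
```

Heads are then pruned greedily in increasing order of I_h, which the Results below report to be both faster and more accurate than ranking heads by their individual ablation scores.
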
Methods
  • The authors consider two trained models. WMT: the original “large” Transformer architecture from Vaswani et al. (2017), with 6 layers and 16 heads per layer, trained on the WMT2014 English-to-French corpus.
  • The authors use the pretrained model of Ott et al. (2018) and report BLEU scores on the newstest2013 test set.
  • BERT: BERT (Devlin et al., 2018) is a single Transformer pre-trained on an unsupervised cloze-style “masked language modeling” task and fine-tuned on specific tasks.
  • The authors use the pre-trained base-uncased model of Devlin et al. (2018), with 12 layers and 12 attention heads, which they fine-tune and evaluate on MultiNLI (Williams et al., 2018); a hedged loading-and-masking sketch follows this list.
  • In contrast with the WMT model, BERT features only one type of attention (self-attention).
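
To make the BERT setup concrete, the sketch below loads the same pre-trained base-uncased checkpoint (12 layers, 12 heads) and ablates a single attention head with a mask. The use of the Hugging Face transformers API here is our assumption rather than the authors' tooling, and the classification head is untrained: it would still need the MultiNLI fine-tuning step described above before its predictions mean anything.

```python
# Hedged sketch (not the authors' code): load bert-base-uncased (12 layers,
# 12 heads) and run it with one attention head masked out. Assumes the
# Hugging Face `transformers` package; the classification head is untrained
# here and would need fine-tuning on MultiNLI, as in the paper.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=3)           # 3 MultiNLI labels
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model.eval()

# head_mask has shape (num_layers, num_heads); 1.0 keeps a head, 0.0 ablates it.
n_layers = model.config.num_hidden_layers
n_heads = model.config.num_attention_heads
head_mask = torch.ones(n_layers, n_heads)
head_mask[0, 3] = 0.0                            # ablate head 4 of layer 1

premise = "Some men are playing a sport outside."    # illustrative pair only
hypothesis = "A sport is being played."
inputs = tokenizer(premise, hypothesis, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs, head_mask=head_mask).logits
```

For an actual latency win rather than merely zeroed-out head outputs, the same library also exposes model.prune_heads({layer_index: [head_indices]}), which physically removes the corresponding slices of the projection matrices; structural removal of this kind is the sort of change behind the inference-speed gains reported below.
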
Results
  • The authors further find that this has significant benefits for inference-time efficiency, resulting in up to a 17.5% increase in inference speed for a BERT-based model.
  • The authors report results when the pruning order is determined by the score difference from §3.2, but find that using the importance score I_h is faster and yields better results.
  • The authors observe that this approach allows them to prune up to 20% and 40% of heads from WMT and BERT, respectively, without incurring any noticeable negative impact (a sketch of this greedy pruning loop follows the list).
  • From epoch 10 onwards, there is a concentration of unimportant heads that can be pruned while staying within 85–90% of the original BLEU score
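
The greedy procedure behind these numbers — estimate head importance, zero out the least important heads a chunk at a time, stop once the metric degrades too far — can be sketched as follows. importance_fn and eval_fn are hypothetical stand-ins for the gradient-based I_h estimate and for BLEU/accuracy evaluation, and the 10% step size and 85% floor merely echo the figures quoted above.

```python
# Hedged sketch of a greedy head-pruning loop; not the authors' implementation.
# `importance_fn(head_mask)` should return a (num_layers, num_heads) tensor of
# estimated I_h values, and `eval_fn(head_mask)` the task metric (BLEU/accuracy).
import torch

def greedy_prune(num_layers, num_heads, importance_fn, eval_fn,
                 prune_per_step=0.1, floor=0.85):
    """Zero out heads in increasing order of importance until the metric
    drops below `floor` times its unpruned value."""
    head_mask = torch.ones(num_layers, num_heads)
    baseline = eval_fn(head_mask)
    step = max(1, int(prune_per_step * num_layers * num_heads))
    while int(head_mask.sum()) > step:
        importance = importance_fn(head_mask).clone()
        importance[head_mask == 0] = float("inf")   # skip already-pruned heads
        order = importance.view(-1).argsort()       # ascending importance
        head_mask.view(-1)[order[:step]] = 0.0      # drop the least important
        if eval_fn(head_mask) < floor * baseline:
            break
    return head_mask
```

Whether importance is re-estimated after every pruning step, as in this sketch, or computed once up front is a design choice; the sketch is only meant to show the shape of the loop.
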
Conclusion
  • The authors have observed that MHA does not always leverage its theoretically superior expressiveness over vanilla attention to the fullest extent.
  • The authors demonstrated that in a variety of settings, several heads can be removed from trained transformer models without statistically significant degradation in test performance, and that some layers can be reduced to only one head.
  • The authors have shown that in machine translation models, the encoder-decoder attention layers are much more reliant on multi-headedness than the self-attention layers, and provided evidence that the relative importance of each head is determined in the early stages of training.
  • The authors hope that these observations will advance the understanding of MHA and inspire models that invest their parameters and attention more efficiently
Tables
  • Table 1: Difference in BLEU score for each head of the encoder's self-attention mechanism. Underlined numbers indicate that the change is statistically significant with p < 0.01. The base BLEU score is 36.05
  • Table 2: Best delta BLEU by layer when only one head is kept in the WMT model. Underlined numbers indicate that the change is statistically significant with p < 0.01
  • Table 3: Best delta accuracy by layer when only one head is kept in the BERT model. None of these results are statistically significant with p < 0.01
  • Table 4: Average inference speed of BERT on the MNLI-matched validation set in examples per second (± standard deviation). The speedup relative to the original model is indicated in parentheses
Funding
  • This research was supported in part by a gift from Facebook
Study Subjects and Analysis
Additional datasets: 4
Performance drops sharply when pruning further, meaning that neither model can be reduced to a purely single-head attention model without retraining or incurring substantial losses to performance. We refer to Appendix B for experiments on four additional datasets. For the WMT model, we use all newstest2009–2012 sets to estimate I_h.

References
  • Karim Ahmed, Nitish Shirish Keskar, and Richard Socher. Weighted transformer network for machine translation. arXiv preprint arXiv:1711.02132, 2017.
  • Sajid Anwar, Kyuyeon Hwang, and Wonyong Sung. Structured pruning of deep convolutional neural networks. J. Emerg. Technol. Comput. Syst., pages 32:1–32:18, 2017.
  • Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. In Proceedings of the International Conference on Learning Representations (ICLR), 2015.
  • Alexander Binder, Grégoire Montavon, Sebastian Lapuschkin, Klaus-Robert Müller, and Wojciech Samek. Layer-wise relevance propagation for neural networks with local renormalization layers. In International Conference on Artificial Neural Networks, pages 63–71, 2016.
  • Mauro Cettolo, Jan Niehues, Sebastian Stüker, Luisa Bentivogli, and Marcello Federico. Report on the 11th IWSLT evaluation campaign, IWSLT 2014. In Proceedings of the 2014 International Workshop on Spoken Language Translation (IWSLT), 2015.
  • Jianpeng Cheng, Li Dong, and Mirella Lapata. Long short-term memory-networks for machine reading. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 551–561, 2016.
  • Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1724–1734, 2014.
  • Zihang Dai, Zhilin Yang, Yiming Yang, William W Cohen, Jaime Carbonell, Quoc V Le, and Ruslan Salakhutdinov. Transformer-XL: Attentive language models beyond a fixed-length context. arXiv preprint arXiv:1901.02860, 2019.
  • Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), 2018.
  • William B. Dolan and Chris Brockett. Automatically constructing a corpus of sentential paraphrases. In Proceedings of the 3rd International Workshop on Paraphrasing (IWP), 2005. URL http://aclweb.org/anthology/I05-5002.
  • Song Han, Jeff Pool, John Tran, and William Dally. Learning both weights and connections for efficient neural network. In Proceedings of the 29th Annual Conference on Neural Information Processing Systems (NIPS), pages 1135–1143, 2015.
  • Babak Hassibi and David G. Stork. Second order derivatives for network pruning: Optimal brain surgeon. In Proceedings of the 5th Annual Conference on Neural Information Processing Systems (NIPS), pages 164–171, 1993.
  • Yoon Kim and Alexander M. Rush. Sequence-level knowledge distillation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1317–1327, 2016.
  • Philipp Koehn. Statistical significance tests for machine translation evaluation. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 388–395, 2004.
  • Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics (ACL), pages 177–180, 2007.
  • Yann LeCun, John S. Denker, and Sara A. Solla. Optimal brain damage. In Proceedings of the 2nd Annual Conference on Neural Information Processing Systems (NIPS), pages 598–605, 1990.
  • Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet, and Hans Peter Graf. Pruning filters for efficient convnets. arXiv preprint arXiv:1608.08710, 2016.
  • Thang Luong, Hieu Pham, and Christopher D. Manning. Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1412–1421, 2015.
  • Paul Michel and Graham Neubig. MTNT: A testbed for machine translation of noisy text. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 543–553, 2018.
  • Pavlo Molchanov, Stephen Tyree, Tero Karras, Timo Aila, and Jan Kautz. Pruning convolutional neural networks for resource efficient inference. In Proceedings of the International Conference on Learning Representations (ICLR), 2017.
  • Kenton Murray and David Chiang. Auto-sizing neural networks: With applications to n-gram language models. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 908–916, 2015.
  • Graham Neubig, Zi-Yi Dou, Junjie Hu, Paul Michel, Danish Pruthi, and Xinyi Wang. compare-mt: A tool for holistic comparison of language generation systems. In Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL) Demo Track, Minneapolis, USA, June 2019. URL http://arxiv.org/abs/1903.07926.
  • Myle Ott, Sergey Edunov, David Grangier, and Michael Auli. Scaling neural machine translation. In Proceedings of the 3rd Conference on Machine Translation (WMT), pages 1–9, 2018.
  • Ankur Parikh, Oscar Täckström, Dipanjan Das, and Jakob Uszkoreit. A decomposable attention model for natural language inference. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2249–2255, 2016.
  • Romain Paulus, Caiming Xiong, and Richard Socher. A deep reinforced model for abstractive summarization. In Proceedings of the International Conference on Learning Representations (ICLR), 2017.
  • Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language understanding with unsupervised learning. Technical report, OpenAI, 2018.
  • Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. OpenAI Blog, 1:8, 2019.
  • Alessandro Raganato and Jörg Tiedemann. An analysis of encoder representations in transformer-based machine translation. In Proceedings of the Workshop on BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 287–297, 2018.
  • Abigail See, Minh-Thang Luong, and Christopher D. Manning. Compression of neural machine translation models via pruning. In Proceedings of the Conference on Computational Natural Language Learning (CoNLL), pages 291–301, 2016.
  • Tao Shen, Tianyi Zhou, Guodong Long, Jing Jiang, Shirui Pan, and Chengqi Zhang. DiSAN: Directional self-attention network for RNN/CNN-free language understanding. In Proceedings of the 32nd AAAI Conference on Artificial Intelligence (AAAI), 2018.
  • Ravid Shwartz-Ziv and Naftali Tishby. Opening the black box of deep neural networks via information. arXiv preprint arXiv:1703.00810, 2017.
  • Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, and Christopher Potts. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1631–1642, 2013.
  • Emma Strubell, Patrick Verga, Daniel Andor, David Weiss, and Andrew McCallum. Linguistically-informed self-attention for semantic role labeling. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 5027–5038, 2018.
  • Gongbo Tang, Mathias Müller, Annette Rios, and Rico Sennrich. Why self-attention? A targeted evaluation of neural machine translation architectures. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 4263–4272, 2018.
  • Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Proceedings of the 30th Annual Conference on Neural Information Processing Systems (NIPS), pages 5998–6008, 2017.
  • Elena Voita, David Talbot, Fedor Moiseev, Rico Sennrich, and Ivan Titov. Analyzing multi-head self-attention: Specialized heads do the heavy lifting, the rest can be pruned. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL), 2019.
  • Alex Warstadt, Amanpreet Singh, and Samuel R. Bowman. Neural network acceptability judgments. arXiv preprint arXiv:1805.12471, 2018.
  • Adina Williams, Nikita Nangia, and Samuel Bowman. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), pages 1112–1122, 2018.