Self-Attention Guided Copy Mechanism for Abstractive Summarization

ACL, pp. 1355-1362, 2020.


Abstract:

The copy module has been widely adopted in recent abstractive summarization models; it allows the decoder to extract words from the source directly into the summary. Generally, the encoder-decoder attention serves as the copy distribution, but how to guarantee that important words in the source are copied remains a challenge. In this...

Introduction
  • The explosion of information has expedited the rapid development of text summarization technology, which can help readers grasp the key points of miscellaneous information quickly.
  • One of the most successful frameworks for the summarization task is the Pointer-Generator Network (See et al., 2017), which combines extractive and abstractive techniques with a pointer (Vinyals et al., 2015) that enables the model to copy words from the source text directly.
  • As shown in Table 1, words like “nominees” and “obama” are ignored by the standard copy mechanism.
  • To tackle this problem, the authors intend to get some clues about the importance of words from the self-attention graph.
Highlights
  • The explosion of information has expedited the rapid development of text summarization technology, which can help us to grasp the key points from miscellaneous information quickly
  • We propose a Self-Attention Guided Copy mechanism (SAGCopy) that aims to encourage the summarizer to copy important source words
  • We introduce an auxiliary loss computed by the divergence between the copy distribution and the centrality distribution, which aims to encourage the model to focus on important words
  • We present a guided copy mechanism based on source word centrality that is obtained by the indegree or outdegree centrality measures
  • We propose the Self-Attention Guided Copy (SAGCopy) summarization model, which acquires guidance signals for the copy mechanism from the encoder self-attention graph (a minimal sketch of this idea appears after this list)
  • The experimental results show the effectiveness of our model
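The highlights above describe the approach only at a high level. The sketch below is a rough illustration, not the authors' exact formulation: it derives a source-word centrality distribution from an encoder self-attention matrix (basic indegree/outdegree, or a TextRank-style iteration as in the Indegree-i variants of Table 2), uses the centrality to re-weight the encoder-decoder attention into a copy distribution, and computes an auxiliary divergence loss. The function names and the specific KL form of the divergence are illustrative assumptions.

    # Illustrative sketch only; not the authors' released code.
    import numpy as np

    def centrality_from_self_attention(attn, mode="indegree", iterations=1, damping=0.85):
        """attn: [src_len, src_len] row-stochastic self-attention matrix,
        where attn[i, j] is the attention word i pays to word j."""
        n = attn.shape[0]
        if mode == "outdegree":
            # Outdegree centrality: total attention a word distributes to others.
            scores = attn.sum(axis=1)
        else:
            # Indegree centrality refined TextRank-style; a single iteration
            # corresponds to the basic attention-received score (cf. Indegree-1).
            scores = np.full(n, 1.0 / n)
            for _ in range(iterations):
                scores = (1.0 - damping) / n + damping * (attn.T @ scores)
        return scores / scores.sum()  # normalize into a distribution

    def guided_copy_distribution(cross_attn, centrality):
        """cross_attn: [src_len] encoder-decoder attention at one decoding step.
        Re-weight it by source-word centrality so important words are favored."""
        copy = cross_attn * centrality
        return copy / copy.sum()

    def auxiliary_loss(copy_dist, centrality, eps=1e-12):
        """One possible divergence, KL(centrality || copy), pushing the copy
        distribution toward important source words."""
        return float(np.sum(centrality * (np.log(centrality + eps) - np.log(copy_dist + eps))))

In the full model, the centrality-guided copy distribution would be mixed with the vocabulary distribution through the usual pointer-generator gate; that part is omitted here.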
Methods
  • The authors evaluate the model on the CNN/Daily Mail dataset (Hermann et al., 2015) and the Gigaword dataset (Rush et al., 2015).
  • The authors adopt a 6-layer encoder and a 6-layer decoder with 12 attention heads and a hidden size of 768.
  • Byte Pair Encoding (BPE) (Sennrich et al., 2016) word segmentation is used for data pre-processing.
  • The authors warm-start the model parameters with the MASS pre-trained base model and train for about 10 epochs until convergence.
  • The authors use beam search with a beam size of 5 at decoding time (these settings are collected in the configuration sketch below).
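For reference, the hyperparameters reported above can be gathered into a single configuration object; the field names below are hypothetical and not taken from the authors' code.

    # Hypothetical configuration collecting the reported hyperparameters.
    from dataclasses import dataclass

    @dataclass
    class SAGCopyConfig:
        encoder_layers: int = 6
        decoder_layers: int = 6
        attention_heads: int = 12
        hidden_size: int = 768         # reported as hmodel = 768
        tokenization: str = "BPE"      # Sennrich et al., 2016
        init_from: str = "MASS-base"   # warm-start from the MASS pre-trained base model
        train_epochs: int = 10         # approximate; trained until convergence
        beam_size: int = 5             # beam search at decoding time

    config = SAGCopyConfig()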
Results
  • The authors compare the proposed Self-Attention Guided Copy (SAGCopy) model with the following baseline models.

    Lead-3 uses the first three sentences of the article as its summary.

    PGNet (See et al., 2017) is the Pointer-Generator Network.

    Bottom-Up (Gehrmann et al., 2018) is a sequence-to-sequence model augmented with a bottom-up content selector.

    MASS (Song et al., 2019) is a sequence-to-sequence pre-trained model based on the Transformer.

    ABS (Rush et al., 2015) relies on a CNN encoder and an NNLM decoder.
Conclusion
  • The authors propose the SAGCopy summarization model that acquires guidance signals for the copy mechanism from the encoder self-attention graph.
  • The authors first calculate the centrality score for each source word.
  • The authors incorporate the importance score into the copy module.
  • The experimental results show the effectiveness of the model.
  • The authors intend to apply the method to other Transformer-based summarization models.
Tables
  • Table 1: Yellow shades represent overlap with the reference. The above summary generated by the standard copy mechanism misses some important words, such as “obama” and “nominees”.
  • Table 2: ROUGE F1 scores on the CNN/Daily Mail dataset. Results marked with * are taken from the corresponding papers. Indegree-i denotes the indegree centrality obtained by TextRank with i iterations; note that Indegree-1 is the basic indegree centrality, which is equivalent to TextRank with one iteration.
  • Table 3: Experimental results on the Gigaword dataset.
  • Table 4: Human evaluation results on the Gigaword dataset. “Win” denotes that the summary generated by SAGCopy is better than that of MASS+Copy. Agreement is measured by Fleiss’ kappa (Fleiss, 1971); a small reference implementation is sketched below.
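Table 4 measures inter-annotator agreement with Fleiss' kappa (Fleiss, 1971). Below is a small self-contained implementation for reference; the example counts are purely illustrative.

    # Fleiss' kappa for agreement among a fixed number of raters per item.
    import numpy as np

    def fleiss_kappa(counts):
        """counts: [n_items, n_categories]; counts[i, c] is the number of
        annotators who assigned item i to category c (same raters per item)."""
        counts = np.asarray(counts, dtype=float)
        n_items, _ = counts.shape
        n_raters = counts[0].sum()
        # Observed agreement: average fraction of agreeing rater pairs per item.
        p_i = np.sum(counts * (counts - 1), axis=1) / (n_raters * (n_raters - 1))
        p_bar = p_i.mean()
        # Chance agreement from the marginal category proportions.
        p_j = counts.sum(axis=0) / (n_items * n_raters)
        p_e = np.sum(p_j ** 2)
        return (p_bar - p_e) / (1.0 - p_e)

    # Example: 3 annotators judge 4 summary pairs as Win / Tie / Lose.
    print(round(fleiss_kappa([[3, 0, 0], [2, 1, 0], [1, 1, 1], [0, 3, 0]]), 3))  # -> 0.268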
Funding
  • This work is partially supported by Beijing Academy of Artificial Intelligence (BAAI)
Reference
  • Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings.
  • Hangbo Bao, Li Dong, Furu Wei, Wenhui Wang, Nan Yang, Xiaodong Liu, Yu Wang, Songhao Piao, Jianfeng Gao, Ming Zhou, et al. 2020. UniLMv2: Pseudo-masked language models for unified language model pre-training. arXiv preprint arXiv:2002.12804.
  • Phillip Bonacich. 1987. Power and centrality: A family of measures. American journal of sociology, 92(5):1170–1182.
  • Stephen P. Borgatti and Martin G. Everett. 2006. A graph-theoretic perspective on centrality. Soc. Networks, 28(4):466–484.
  • Sumit Chopra, Michael Auli, and Alexander M. Rush. 2016. Abstractive sentence summarization with attentive recurrent neural networks. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 93–98, San Diego, California. Association for Computational Linguistics.
  • Arman Cohan, Iz Beltagy, Daniel King, Bhavana Dalvi, and Dan Weld. 2019. Pretrained language models for sequential sentence classification. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3693–3699, Hong Kong, China. Association for Computational Linguistics.
  • Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Łukasz Kaiser. 2019. Universal transformers. In Proceedings of ICLR.
  • Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.
  • Li Dong, Nan Yang, Wenhui Wang, Furu Wei, Xiaodong Liu, Yu Wang, Jianfeng Gao, Ming Zhou, and Hsiao-Wuen Hon. 2019. Unified language model pre-training for natural language understanding and generation. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, 8-14 December 2019, Vancouver, BC, Canada, pages 13042–13054.
  • Gunes Erkan and Dragomir R. Radev. 2004. LexRank: Graph-based lexical centrality as salience in text summarization. J. Artif. Intell. Res., 22:457–479.
  • Joseph L Fleiss. 1971. Measuring nominal scale agreement among many raters. Psychological bulletin, 76(5):378.
  • Linton C Freeman. 1978. Centrality in social networks conceptual clarification. Social networks, 1(3):215–239.
  • Sebastian Gehrmann, Yuntian Deng, and Alexander Rush. 2018. Bottom-up abstractive summarization. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4098–4109, Brussels, Belgium. Association for Computational Linguistics.
  • Jiatao Gu, Zhengdong Lu, Hang Li, and Victor O.K. Li. 2016. Incorporating copying mechanism in sequence-to-sequence learning. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1631–1640, Berlin, Germany. Association for Computational Linguistics.
  • Caglar Gulcehre, Sungjin Ahn, Ramesh Nallapati, Bowen Zhou, and Yoshua Bengio. 2016. Pointing the unknown words. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 140–149, Berlin, Germany. Association for Computational Linguistics.
  • Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching machines to read and comprehend. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, pages 1693–1701.
  • Christine Kiss and Martin Bichler. 2008. Identification of influencers—measuring influence in customer networks. Decision Support Systems, 46(1):233–253.
  • Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2019. BART: denoising sequence-to-sequence pretraining for natural language generation, translation, and comprehension. CoRR, abs/1910.13461.
  • Haoran Li, Peng Yuan, Song Xu, Youzheng Wu, Xiaodong He, and Bowen Zhou. 2020a. Aspect-aware multimodal summarization for Chinese e-commerce products. In Proceedings of the Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI).
  • Haoran Li, Junnan Zhu, Tianshang Liu, Jiajun Zhang, and Chengqing Zong. 2018a. Multi-modal sentence summarization with modality attention and image filtering. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI 2018, July 13-19, 2018, Stockholm, Sweden, pages 4152–4158.
  • Haoran Li, Junnan Zhu, Jiajun Zhang, and Chengqing Zong. 2018b. Ensure the correctness of the summary: Incorporate entailment knowledge into abstractive sentence summarization. In Proceedings of the 27th International Conference on Computational Linguistics, pages 1430–1441, Santa Fe, New Mexico, USA. Association for Computational Linguistics.
  • Haoran Li, Junnan Zhu, Jiajun Zhang, Chengqing Zong, and Xiaodong He. 2020b. Keywords-guided abstractive sentence summarization. In Proceedings of the Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI).
  • Yung-Ming Li, Cheng-Yang Lai, and Ching-Wen Chen. 2011. Discovering influencers for marketing in the blogosphere. Information Sciences, 181(23):5143–5157.
  • Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81, Barcelona, Spain. Association for Computational Linguistics.
  • Rada Mihalcea and Paul Tarau. 2004. TextRank: Bringing order into text. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, pages 404–411, Barcelona, Spain. Association for Computational Linguistics.
  • Ramesh Nallapati, Feifei Zhai, and Bowen Zhou. 2017. SummaRuNNer: A recurrent neural network based sequence model for extractive summarization of documents. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, February 4-9, 2017, San Francisco, California, USA, pages 3075–3081.
  • Ramesh Nallapati, Bowen Zhou, Cicero dos Santos, Caglar Gulcehre, and Bing Xiang. 2016. Abstractive text summarization using sequence-to-sequence RNNs and beyond. In Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning, pages 280–290, Berlin, Germany. Association for Computational Linguistics.
  • Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. 1999. The pagerank citation ranking: Bringing order to the web. Technical report, Stanford InfoLab.
  • Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training.
  • Alexander M. Rush, Sumit Chopra, and Jason Weston. 2015. A neural attention model for abstractive sentence summarization. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 379–389, Lisbon, Portugal. Association for Computational Linguistics.
  • Abigail See, Peter J. Liu, and Christopher D. Manning. 2017. Get to the point: Summarization with pointer-generator networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1073–1083, Vancouver, Canada. Association for Computational Linguistics.
  • Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725, Berlin, Germany. Association for Computational Linguistics.
  • Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, and Tie-Yan Liu. 2019. MASS: Masked sequence to sequence pre-training for language generation. In Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, pages 5926–5936.
  • Jiwei Tan, Xiaojun Wan, and Jianguo Xiao. 2017. Abstractive document summarization with a graph-based attentional neural model. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1171–1181, Vancouver, Canada. Association for Computational Linguistics.
  • Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, pages 5998–6008.
  • Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly. 2015. Pointer networks. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, pages 2692–2700.
  • Dongling Xiao, Han Zhang, Yu-Kun Li, Yu Sun, Hao Tian, Hua Wu, and Haifeng Wang. 2020. ERNIE-GEN: An enhanced multi-flow pre-training and fine-tuning framework for natural language generation. CoRR, abs/2001.11314.
  • Xingxing Zhang, Furu Wei, and Ming Zhou. 2019. HIBERT: Document level pre-training of hierarchical bidirectional transformers for document summarization. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5059–5069, Florence, Italy. Association for Computational Linguistics.
  • Hao Zheng and Mirella Lapata. 2019. Sentence centrality revisited for unsupervised summarization. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 6236–6247, Florence, Italy. Association for Computational Linguistics.
  • Qingyu Zhou, Nan Yang, Furu Wei, and Ming Zhou. 2017. Selective encoding for abstractive sentence summarization. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1095–1104, Vancouver, Canada. Association for Computational Linguistics.
  • Qingyu Zhou, Nan Yang, Furu Wei, and Ming Zhou. 2018. Sequential copying networks. In Thirty-Second AAAI Conference on Artificial Intelligence.
  • Junnan Zhu, Qian Wang, Yining Wang, Yu Zhou, Jiajun Zhang, Shaonan Wang, and Chengqing Zong. 2019. NCLS: Neural cross-lingual summarization. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3054–3064, Hong Kong, China. Association for Computational Linguistics.