
Text Summarization with Pretrained Encoders

EMNLP/IJCNLP (1), pp. 3728–3738, 2019

Abstract

Bidirectional Encoder Representations from Transformers (BERT) represents the latest incarnation of pretrained language models which have recently advanced a wide range of natural language processing tasks. In this paper, we showcase how BERT can be usefully applied in text summarization and propose a general framework for both extractive and abstractive models.

Introduction
  • Language model pretraining has advanced the state of the art in many NLP tasks, ranging from sentiment analysis and question answering to natural language inference, named entity recognition, and textual similarity.
  • Each token w_i is assigned three kinds of embeddings: token embeddings indicate the meaning of each token, segmentation embeddings discriminate between two sentences, and position embeddings indicate the position of each token within the text sequence.
  • These three embeddings are summed into a single input vector x_i and fed to a bidirectional Transformer with multiple layers: h^l = LN(h^{l-1} + MHAtt(h^{l-1})) (1), where LN is layer normalization and MHAtt is multi-head attention (see the sketch after this list)
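
For intuition, here is a minimal PyTorch sketch of this input representation (summed token, segment, and position embeddings) and of the attention sublayer written in Equation (1). The feed-forward sublayer, dimensions, and hyperparameters are generic assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """One encoder block: multi-head self-attention with a residual connection
    and LayerNorm, as in Equation (1), followed by a standard position-wise
    feed-forward sublayer (assumed here; not spelled out in the summary above)."""
    def __init__(self, d_model=768, n_heads=12, d_ff=3072, dropout=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
        self.ln1 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.ln2 = nn.LayerNorm(d_model)

    def forward(self, h):
        a, _ = self.attn(h, h, h)            # MHAtt(h^{l-1})
        h = self.ln1(h + a)                  # LN(h^{l-1} + MHAtt(h^{l-1})), Equation (1)
        return self.ln2(h + self.ff(h))      # feed-forward sublayer with its own residual + LN

# Input representation: x_i = token embedding + segment embedding + position embedding.
vocab_size, max_len, d_model = 30522, 512, 768
tok_emb = nn.Embedding(vocab_size, d_model)
seg_emb = nn.Embedding(2, d_model)
pos_emb = nn.Embedding(max_len, d_model)

token_ids = torch.randint(0, vocab_size, (1, 16))        # toy sequence of 16 tokens
segment_ids = torch.zeros(1, 16, dtype=torch.long)       # all tokens in "sentence A"
positions = torch.arange(16).unsqueeze(0)
x = tok_emb(token_ids) + seg_emb(segment_ids) + pos_emb(positions)

h = x
for layer in [TransformerBlock() for _ in range(2)]:     # a small stack of layers
    h = layer(h)
print(h.shape)                                           # torch.Size([1, 16, 768])
```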
Highlights
  • Language model pretraining has advanced the state of the art in many NLP tasks, ranging from sentiment analysis and question answering to natural language inference, named entity recognition, and textual similarity
  • In most cases, pretrained language models have been employed as encoders for sentence- and paragraph-level natural language understanding problems (Devlin et al., 2019) involving various classification tasks
  • We examine the influence of language model pretraining on text summarization
  • Bidirectional Encoder Representations from Transformers (BERT; Devlin et al. 2019) is a new language representation model which is trained with a masked language modeling and a "next sentence prediction" task on a corpus of 3,300M words
  • We showcased how pretrained BERT can be usefully applied in text summarization
  • Experimental results across three datasets show that our model achieves state-of-the-art results across the board under automatic and human-based evaluation protocols
  • We mainly focused on document encoding for summarization; in the future, we would like to take advantage of the capabilities of BERT for language generation
Methods
  • The authors evaluated the model on three benchmark datasets, namely the CNN/DailyMail news highlights dataset (Hermann et al., 2015), the New York Times Annotated Corpus (NYT; Sandhaus 2008), and XSum (Narayan et al., 2018a).
  • These datasets represent different summary styles, ranging from highlights to very brief one-sentence summaries.
Results
  • The authors evaluated summarization quality automatically using ROUGE (Lin, 2004).
  • The authors report unigram and bigram overlap (ROUGE-1 and ROUGE-2) as a means of assessing informativeness, and the longest common subsequence (ROUGE-L) as a means of assessing fluency (a toy computation of these metrics is sketched after this list).
  • Table 2 summarizes the results on the CNN/DailyMail dataset.
  • The first block in the table includes the results of an extractive ORACLE system as an upper bound.
  • The second block in the table includes various extractive models trained on the CNN/DailyMail dataset.
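
As a concrete illustration of what these scores measure, the following self-contained Python sketch computes simplified ROUGE-1/ROUGE-2 (n-gram overlap) and an LCS-based ROUGE-L F1. It omits stemming and the recall-weighted F-measure of the official ROUGE toolkit, so it is not the evaluation script used in the paper.

```python
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(candidate, reference, n):
    """Simplified ROUGE-N F1: n-gram overlap between candidate and reference summaries."""
    c, r = ngrams(candidate.split(), n), ngrams(reference.split(), n)
    overlap = sum((c & r).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / sum(c.values()), overlap / sum(r.values())
    return 2 * precision * recall / (precision + recall)

def lcs_length(a, b):
    """Longest common subsequence length via dynamic programming."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l(candidate, reference):
    """Simplified ROUGE-L F1 based on the longest common subsequence of tokens."""
    c, r = candidate.split(), reference.split()
    lcs = lcs_length(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)

candidate = "the cat was found under the bed"
reference = "the cat was under the bed"
print(rouge_n(candidate, reference, 1), rouge_n(candidate, reference, 2), rouge_l(candidate, reference))
```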
Conclusion
  • The authors showcased how pretrained BERT can be usefully applied in text summarization.
  • The authors introduced a novel document-level encoder and proposed a general framework for both abstractive and extractive summarization (a schematic sketch of the extractive setup follows this list).
  • Experimental results across three datasets show that the model achieves state-of-the-art results across the board under automatic and human-based evaluation protocols.
  • The authors mainly focused on document encoding for summarization; in the future, they would like to take advantage of the capabilities of BERT for language generation.
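
For intuition only, here is a minimal sketch of one way a pretrained BERT encoder can be adapted to score sentences for extraction: each sentence receives its own [CLS] vector, and a small head scores it. The input layout, the randomly initialised scoring head, and the omission of the paper's inter-sentence Transformer layers are simplifications for illustration, not a reproduction of the authors' released model.

```python
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")

sentences = [
    "The cat sat on the mat.",
    "It then fell asleep for several hours.",
    "Nothing else of note happened that day.",
]
# Give every sentence its own [CLS]/[SEP] pair so each sentence gets a dedicated vector.
text = " ".join(f"[CLS] {s} [SEP]" for s in sentences)
enc = tokenizer(text, add_special_tokens=False, return_tensors="pt",
                truncation=True, max_length=512)

with torch.no_grad():
    hidden = bert(**enc).last_hidden_state                            # (1, seq_len, 768)

cls_positions = (enc["input_ids"][0] == tokenizer.cls_token_id).nonzero(as_tuple=True)[0]
sentence_vectors = hidden[0, cls_positions]                           # one 768-d vector per sentence

score_head = nn.Linear(768, 1)        # illustrative head; it would be trained with a binary objective
scores = torch.sigmoid(score_head(sentence_vectors)).squeeze(-1)
print(scores)                         # per-sentence scores; the top-scoring sentences form the summary
```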
Tables
  • Table 1: Comparison of summarization datasets: size of training, validation, and test sets and average document and summary length (in terms of words and sentences). The proportion of novel bi-grams that do not appear in source documents but do appear in the gold summaries quantifies corpus bias towards extractive methods (a small helper for computing this proportion is sketched after this list)
  • Table 2: ROUGE F1 results on the CNN/DailyMail test set (R1 and R2 are shorthands for unigram and bigram overlap; RL is the longest common subsequence). Results for comparison systems are taken from the authors’ respective papers or obtained on our data by running publicly released software
  • Table 3: ROUGE Recall results on the NYT test set. Results for comparison systems are taken from the authors’ respective papers or obtained on our data by running publicly released software
  • Table 4: ROUGE F1 results on the XSum test set. Results for comparison systems are taken from the authors’ respective papers or obtained on our data by running publicly released software
  • Table 5: Model perplexity (CNN/DailyMail; validation set) under different combinations of encoder and
  • Table 6: QA-based evaluation. Models with † are significantly different from BERTSUM (using a paired Student's t-test; p < 0.05)
  • Table 7: QA-based and ranking-based evaluation. Models with † are significantly different from BERTSUM (using a paired Student's t-test; p < 0.05)
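
As a quick illustration of the abstractiveness measure described for Table 1, the helper below computes the proportion of summary bigrams that never occur in the source document. Whitespace tokenization and lowercasing are simplifying assumptions, not the paper's preprocessing.

```python
def novel_ngram_proportion(document: str, summary: str, n: int = 2) -> float:
    """Fraction of summary n-grams (bigrams by default) that do not appear in the source document."""
    def ngrams(text):
        tokens = text.lower().split()              # whitespace tokenization is a simplification
        return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    doc_ngrams = set(ngrams(document))
    summary_ngrams = ngrams(summary)
    if not summary_ngrams:
        return 0.0
    novel = sum(1 for gram in summary_ngrams if gram not in doc_ngrams)
    return novel / len(summary_ngrams)

document = "the committee approved the new budget after a long debate on spending"
summary = "lawmakers passed the budget following lengthy spending talks"
print(f"{novel_ngram_proportion(document, summary):.2f}")   # higher value -> more abstractive summary
```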
Funding
  • This research is supported by a Google PhD Fellowship to the first author
  • We gratefully acknowledge the support of the European Research Council (Lapata, award number 681760, “Translating Multiple Modalities into Text”)
Study subjects and analysis
datasets: 3
We also demonstrate that a two-staged fine-tuning approach can further boost the quality of the generated summaries. Experiments on three datasets show that our model achieves state-of-the-art results across the board in both extractive and abstractive settings. In this section, we describe the summarization datasets used in our experiments and discuss various implementation details.

4.1 Summarization Datasets

We evaluated our model on three benchmark datasets, namely the CNN/DailyMail news highlights dataset (Hermann et al., 2015), the New York Times Annotated Corpus (NYT; Sandhaus 2008), and XSum (Narayan et al., 2018a)

benchmark datasets: 3
Experiments on three datasets show that our model achieves state-of-the-art results across the board in both extractive and abstractive settings. In this section, we describe the summarization datasets used in our experiments and discuss various implementation details.

4.1 Summarization Datasets

We evaluated our model on three benchmark datasets, namely the CNN/DailyMail news highlights dataset (Hermann et al., 2015), the New York Times Annotated Corpus (NYT; Sandhaus 2008), and XSum (Narayan et al., 2018a). These datasets represent different summary styles, ranging from highlights to very brief one-sentence summaries

datasets: 3
We also demonstrate that a two-staged fine-tuning approach can further boost the quality of the generated summaries. Experiments on three datasets show that our model achieves state-of-the-art results across the board in both extractive and abstractive settings. Language model pretraining has advanced the state of the art in many NLP tasks, ranging from sentiment analysis and question answering to natural language inference, named entity recognition, and textual similarity

single-document news summarization datasets: 3
Finally, motivated by previous work showing that the combination of extractive and abstractive objectives can help generate better summaries (Gehrmann et al., 2018), we present a two-stage approach where the encoder is fine-tuned twice, first with an extractive objective and subsequently on the abstractive summarization task (a schematic training loop is sketched below). We evaluate the proposed approach on three single-document news summarization datasets representative of different writing conventions (e.g., important information is concentrated at the beginning of the document or distributed more evenly throughout) and summary styles (e.g., verbose vs. more telegraphic; extractive vs. abstractive). Across datasets, we experimentally show that the proposed models achieve state-of-the-art results under both extractive and abstractive settings
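
The two-stage schedule described above can be pictured as the following training skeleton: the shared encoder is first fine-tuned with an extractive (binary) objective and then reused for the abstractive generation objective. All modules, losses, and hyperparameters here are toy placeholders invented for illustration; they are not the authors' architecture or code.

```python
import torch
import torch.nn as nn

# Toy stand-ins: in the paper the shared encoder is a pretrained BERT and the
# abstractive model adds a Transformer decoder; tiny linear layers keep this
# skeleton runnable while showing only the two-stage control flow.
encoder = nn.Linear(32, 64)      # shared document encoder (placeholder)
ext_head = nn.Linear(64, 1)      # sentence-level extraction scorer (placeholder)
decoder = nn.Linear(64, 100)     # abstractive decoder over a toy vocabulary of 100 (placeholder)

def random_batch():
    x = torch.randn(8, 32)                              # 8 "sentences" with toy features
    ext_labels = torch.randint(0, 2, (8, 1)).float()    # 1 = sentence belongs in the summary
    abs_labels = torch.randint(0, 100, (8,))            # toy generation targets
    return x, ext_labels, abs_labels

# Stage 1: fine-tune the encoder with the extractive objective.
opt = torch.optim.Adam(list(encoder.parameters()) + list(ext_head.parameters()), lr=2e-3)
for _ in range(100):
    x, ext_labels, _ = random_batch()
    loss = nn.functional.binary_cross_entropy_with_logits(ext_head(encoder(x)), ext_labels)
    opt.zero_grad(); loss.backward(); opt.step()

# Stage 2: reuse the extractively fine-tuned encoder and train for abstractive summarization.
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=2e-3)
for _ in range(100):
    x, _, abs_labels = random_batch()
    loss = nn.functional.cross_entropy(decoder(encoder(x)), abs_labels)
    opt.zero_grad(); loss.backward(); opt.step()
```

The intended benefit, per the passage above, is that the encoder absorbs extraction-relevant signal in the first stage before the harder generation task is attempted in the second.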

benchmark datasets: 3
4.1 Summarization Datasets. We evaluated our model on three benchmark datasets, namely the CNN/DailyMail news highlights dataset (Hermann et al., 2015), the New York Times Annotated Corpus (NYT; Sandhaus 2008), and XSum (Narayan et al., 2018a). These datasets represent different summary styles, ranging from highlights to very brief one-sentence summaries

articles with abstractive summaries: 110540
Input documents were truncated to 512 tokens. NYT contains 110,540 articles with abstractive summaries. Following Durrett et al. (2016), we split these into 100,834/9,706 training/test examples, based on the date of publication (the test set contains all articles published from January 1, 2007 onward)

news articles: 226711
Input documents were truncated to 800 tokens. XSum contains 226,711 news articles, each accompanied by a one-sentence summary answering the question "What is this article about?". We used the splits of Narayan et al. (2018a) for training, validation, and testing (204,045/11,332/11,334) and followed the pre-processing introduced in their work

datasets: 3
Input documents were truncated to 512 tokens. Aside from various statistics on the three datasets, Table 1 also reports the proportion of novel bi-grams in gold summaries as a measure of their abstractiveness. We would expect models with extractive biases to perform better on datasets with (mostly) extractive summaries, and abstractive models to perform more rewrite operations on datasets with abstractive summaries

documents: 20
For the CNN/DailyMail and NYT datasets we used the same documents (20 in total) and questions from previous work (Narayan et al., 2018b; Liu et al., 2019). For XSum, we randomly selected 20 documents (and their questions) from the release of Narayan et al. (2018a). We elicited 3 responses per HIT; significance of the resulting scores is assessed with a paired Student's t-test (see the sketch below)
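
Tables 6 and 7 report significance for this QA-based evaluation with a paired Student's t-test. A minimal way to run such a paired comparison with scipy is sketched below; the per-document scores are synthetic placeholders, not numbers from the paper.

```python
from scipy import stats

# Synthetic per-document QA scores for two systems on the same 20 documents
# (illustrative placeholders only, not results reported in the paper).
bertsum_scores = [0.62, 0.58, 0.71, 0.66, 0.60, 0.55, 0.68, 0.64, 0.59, 0.70,
                  0.63, 0.57, 0.69, 0.61, 0.65, 0.56, 0.72, 0.60, 0.67, 0.62]
baseline_scores = [0.55, 0.52, 0.66, 0.60, 0.57, 0.50, 0.61, 0.63, 0.54, 0.64,
                   0.58, 0.51, 0.65, 0.57, 0.60, 0.52, 0.66, 0.55, 0.62, 0.56]

# Paired test: both systems are scored on the same documents, so we compare per-document differences.
t_stat, p_value = stats.ttest_rel(bertsum_scores, baseline_scores)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
if p_value < 0.05:
    print("Difference is significant at p < 0.05 under this illustrative test.")
```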

datasets: 3
We introduced a novel document-level encoder and proposed a general framework for both abstractive and extractive summarization. Experimental results across three datasets show that our model achieves state-of-the-art results across the board under automatic and human-based evaluation protocols. Although we mainly focused on document encoding for summarization, in the future, we would like to take advantage of the capabilities of BERT for language generation

References
  • Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. 2016. Layer normalization. arXiv preprint arXiv:1607.06450.
  • Jaime G Carbonell and Jade Goldstein. 1998. The use of MMR and diversity-based reranking for reordering documents and producing summaries. In Proceedings of the 21st Annual International ACL SIGIR Conference on Research and Development in Information Retrieval, pages 335–336, Melbourne, Australia.
  • Asli Celikyilmaz, Antoine Bosselut, Xiaodong He, and Yejin Choi. 2018. Deep communicating agents for abstractive summarization. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1662–1675, New Orleans, Louisiana.
  • James Clarke and Mirella Lapata. 2010. Discourse constraints for document compression. Computational Linguistics, 36(3):411–441.
  • Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota.
  • Li Dong, Nan Yang, Wenhui Wang, Furu Wei, Xiaodong Liu, Yu Wang, Jianfeng Gao, Ming Zhou, and Hsiao-Wuen Hon. 2019. Unified language model pre-training for natural language understanding and generation. arXiv preprint arXiv:1905.03197.
  • Yue Dong, Yikang Shen, Eric Crawford, Herke van Hoof, and Jackie Chi Kit Cheung. 2018. BanditSum: Extractive summarization as a contextual bandit. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3739–3748, Brussels, Belgium.
  • Greg Durrett, Taylor Berg-Kirkpatrick, and Dan Klein. 2016. Learning-based single-document summarization with compression and anaphoricity constraints. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1998–2008, Berlin, Germany.
  • Sergey Edunov, Alexei Baevski, and Michael Auli. 2019. Pre-trained language model representations for language generation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4052–4059, Minneapolis, Minnesota.
  • Sebastian Gehrmann, Yuntian Deng, and Alexander Rush. 2018. Bottom-up abstractive summarization. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4098–4109, Brussels, Belgium.
  • Jiatao Gu, Zhengdong Lu, Hang Li, and Victor O.K. Li. 2016. Incorporating copying mechanism in sequence-to-sequence learning. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1631–1640, Berlin, Germany. Association for Computational Linguistics.
  • Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching machines to read and comprehend. In Advances in Neural Information Processing Systems, pages 1693– 1701.
  • Svetlana Kiritchenko and Saif Mohammad. 2017. Best-worst scaling more reliable than rating scales: A case study on sentiment intensity annotation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 465–470, Vancouver, Canada.
  • Guillaume Klein, Yoon Kim, Yuntian Deng, Jean Senellart, and Alexander Rush. 2017. OpenNMT: Open-source toolkit for neural machine translation. In Proceedings of ACL 2017, System Demonstrations, pages 67–72, Vancouver, Canada.
  • Wei Li, Xinyan Xiao, Yajuan Lyu, and Yuanzhuo Wang. 2018. Improving neural abstractive document summarization with explicit information selection modeling. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1787–1796, Brussels, Belgium.
  • Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81, Barcelona, Spain.
  • Yang Liu, Ivan Titov, and Mirella Lapata. 2019. Single document summarization as tree induction. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 1745–1755, Minneapolis, Minnesota.
  • Christopher Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 55–60, Baltimore, Maryland.
  • Ramesh Nallapati, Feifei Zhai, and Bowen Zhou. 2017. SummaRuNNer: A recurrent neural network based sequence model for extractive summarization of documents. In Proceedings of the 31st AAAI Conference on Artificial Intelligence, pages 3075–3081, San Francisco, California.
  • Ramesh Nallapati, Bowen Zhou, Cicero dos Santos, Caglar Gulcehre, and Bing Xiang. 2016. Abstractive text summarization using sequence-tosequence RNNs and beyond. In Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning, pages 280–290, Berlin, Germany.
  • Shashi Narayan, Shay B. Cohen, and Mirella Lapata. 2018a. Don’t give me the details, just the summary! topic-aware convolutional neural networks for extreme summarization. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1797–1807, Brussels, Belgium.
  • Shashi Narayan, Shay B. Cohen, and Mirella Lapata. 2018b. Ranking sentences for extractive summarization with reinforcement learning. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1747–1759, New Orleans, Louisiana.
  • Romain Paulus, Caiming Xiong, and Richard Socher. 2018. A deep reinforced model for abstractive summarization. In Proceedings of the 6th International Conference on Learning Representations, Vancouver, Canada.
  • Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237, New Orleans, Louisiana.
  • Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. In CoRR, abs/1704.01444, 2017.
  • Abigail See, Peter J. Liu, and Christopher D. Manning. 2017. Get to the point: Summarization with pointergenerator networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1073– 1083, Vancouver, Canada.
  • Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. 2016. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2818–2826, Las Vegas, Nevada.
  • Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.
  • Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. 2016. Google’s neural machine translation system: Bridging the gap between human and machine translation. In arXiv preprint arXiv:1609.08144.
  • Xingxing Zhang, Mirella Lapata, Furu Wei, and Ming Zhou. 2018. Neural latent extractive document summarization. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 779–784, Brussels, Belgium.
  • Xingxing Zhang, Furu Wei, and Ming Zhou. 2019. HIBERT: Document level pre-training of hierarchical bidirectional transformers for document summarization. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5059–5069, Florence, Italy. Association for Computational Linguistics.
  • Qingyu Zhou, Nan Yang, Furu Wei, Shaohan Huang, Ming Zhou, and Tiejun Zhao. 2018. Neural document summarization by jointly learning to score and select sentences. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 654–663, Melbourne, Australia.
  • Sascha Rothe, Shashi Narayan, and Aliaksei Severyn. 2019. Leveraging pre-trained checkpoints for sequence generation tasks. arXiv preprint arXiv:1907.12461.
  • Alexander M. Rush, Sumit Chopra, and Jason Weston. 2015. A neural attention model for abstractive sentence summarization. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 379–389, Lisbon, Portugal.
  • Evan Sandhaus. 2008. The New York Times Annotated Corpus. Linguistic Data Consortium, Philadelphia, 6(12).
Author
Yang Liu
Mirella Lapata