On Extractive and Abstractive Neural Document Summarization with Transformer Language Models

EMNLP 2020, pp. 9308–9319.


Abstract

We present a method to produce abstractive summaries of long documents that exceed several thousand words via neural abstractive summarization. We perform a simple extractive step before generating a summary, which is then used to condition the transformer language model on relevant information before being tasked with generating a summary.

Introduction
  • Language models (LMs) are trained to estimate the joint probability of an arbitrary sequence of words or characters using a large corpus of text.
  • Markovian assumptions and the curse of dimensionality make it hard for n-gram LMs to model long-range dependencies and to learn smooth functions that capture similarities between words in the vocabulary (see the factorization sketch after this list).
  • This has led to a preference for recurrent or feed-forward neural language models (Bengio et al. 2003; Mikolov et al. 2010) in recent years, due to their ability to learn expressive conditional probability distributions (Merity et al. 2017; Radford et al. 2019).
  • RNNs are limited by their sequential nature, making them (1) difficult to optimize and train on long sequences with long-range dependencies (Hochreiter 1998; Pascanu et al. 2013), and (2) hard to parallelize on modern hardware such as GPUs, limiting their scalability.
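As a quick illustration of the first two points (standard notation, not taken from the paper): a language model factorizes the joint probability of a sequence with the chain rule, while an n-gram model truncates the conditioning context to the previous n-1 tokens, which is the Markovian assumption referred to above.

```latex
% Chain-rule factorization estimated by (neural) language models
p(w_1, \dots, w_T) = \prod_{t=1}^{T} p(w_t \mid w_1, \dots, w_{t-1})

% n-gram (Markov) approximation: keep only the previous n-1 tokens of context
p(w_t \mid w_1, \dots, w_{t-1}) \approx p(w_t \mid w_{t-n+1}, \dots, w_{t-1})
```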
Highlights
  • Language models (LMs) are trained to estimate the joint probability of an arbitrary sequence of words or characters using a large corpus of text
  • A language model serves as a “decoder” that is typically conditioned on a representation of an input sequence produced by an encoder neural network
  • We demonstrate that Transformer language models are extremely promising at summarizing long texts, and provide a new approach to deep summarization that can be used to generate more "abstractive" summaries
  • We discuss a simple neural extractive model based on pointer networks, trained on documents and their salient sentences. We show that this model can be used to augment Transformer language models to generate better summarization results (a minimal sketch of this two-step pipeline follows this list).
  • We demonstrate that Transformer language models are surprisingly effective at summarizing long scientific articles and outperform typical seq2seq approaches, even without a copy mechanism.
  • We show that our architecture achieves state-of-the-art performance on a large suite of tasks, outperforming many systems with task-specific architectures.
  • We have demonstrated that Transformer language models can generate high-quality summaries of long sequences of text via an extractive step followed by an abstractive step
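A minimal sketch of the extract-then-abstract pipeline described in these highlights; the selector and language-model callables are illustrative stand-ins, not the authors' released implementation.

```python
# Hedged sketch of the two-step pipeline: an extractive step picks salient
# sentences, then a Transformer language model is conditioned on the
# introduction plus those sentences to write the abstractive summary.
# The callables passed in are placeholders, not real models.

def summarize(introduction, document_sentences, select_sentences, generate,
              num_extract=5, max_new_tokens=200):
    # Step 1 (extractive): pick the most salient sentences, e.g. with a
    # pointer network or a hierarchical sentence classifier.
    extracted = select_sentences(document_sentences, num_extract)

    # Step 2 (abstractive): condition the language model on the introduction
    # and the extracted sentences, then decode the summary.
    conditioning = introduction + " " + " ".join(extracted)
    return generate(conditioning, max_new_tokens)


if __name__ == "__main__":
    # Trivial stand-ins so the sketch runs end to end (not real models).
    doc = [
        "We evaluate on four long-document datasets.",
        "An unrelated aside about notation.",
        "The extractive step significantly improves ROUGE.",
    ]
    pick_longest = lambda sents, k: sorted(sents, key=len, reverse=True)[:k]
    echo_lm = lambda text, n: text[:n]  # placeholder for Transformer LM decoding
    print(summarize("We study long-document summarization.", doc,
                    pick_longest, echo_lm, num_extract=2))
```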
Results
  • Experimental setup and datasets: the authors experiment with four large-scale, long-document summarization datasets: arXiv and PubMed (Cohan et al. 2018), BigPatent (Sharma, Li, and Wang 2019), and Newsroom (Grusky, Naaman, and Artzi 2018).
  • Dataset statistics for arXiv, PubMed, Newsroom, and BigPatent are given in Table 1.
  • All ROUGE numbers reported in this work have a 95% confidence interval of at most 0.24 (a minimal ROUGE computation sketch follows this list).
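For reference, ROUGE scores of the kind reported here can be computed with the rouge_score package; this is an assumption about tooling, not the authors' evaluation script.

```python
# Minimal ROUGE-1/2/L computation sketch using the rouge_score package
# (assumed tooling; the exact evaluation script is not specified here).
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)

reference = "we present a method to produce abstractive summaries of long documents"
candidate = "a method for abstractive summarization of long documents is presented"

scores = scorer.score(reference, candidate)
for name, result in scores.items():
    print(f"{name}: P={result.precision:.3f} R={result.recall:.3f} F1={result.fmeasure:.3f}")
```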
Conclusion
  • The authors present the main results on summarizing arXiv and PubMed papers in Tables 2 and 4.
  • The authors' TLM conditioned on the extractive summary produced by the best extractive model (TLM-I+E (G,M)) outperforms prior abstractive/mixed results on the arXiv, PubMed, and BigPatent datasets, except on ROUGE-L.
  • In Tables 7 and 8, the authors present qualitative results: generated abstracts of notable papers in the field, and output of the TLM conditioned on the introduction and extracted summary of a random example from the arXiv test set.
  • The authors show a performance upper bound by conditioning the Transformer LM on oracle (ground-truth) extracted sentences at both train and test time (TLM-I+E (G,G)).
Tables
  • Table 1: Statistics from (Sharma, Li, and Wang 2019) for the datasets used in this work: the number of document/summary pairs, the ratio of words in the document to words in the abstract, and the number of words in the summary and the document.
  • Table 2: Summarization results on the arXiv dataset. Previous work results are from (Cohan et al. 2018). The following lines are a simple Lead-10 extractive baseline and the pointer and classifier models. Our Transformer LMs (TLM) are conditioned either on the introduction (I) alone or together with extracted sentences (E), taken either from ground-truth (G) or model (M) extracts.
  • Table 3: Qualitative results: news articles and our model-generated summaries on the Newsroom dataset.
  • Table 4: Summarization results on the PubMed dataset. Previous work results are from (Cohan et al. 2018). The following lines are a simple Lead-10 extractive baseline and the pointer and classifier models. Our Transformer LMs (TLM) are conditioned either on the introduction (I) alone or together with extracted sentences (E), taken either from ground-truth (G) or model (M) extracts.
  • Table 5: Summarization results on the BigPatent dataset. Previous work results are from (Sharma, Li, and Wang 2019). Our Transformer LMs (TLM) are conditioned on the whole document or additionally on extracted sentences (E), taken either from ground-truth (G) or model (M) extracts.
  • Table 6: Summarization results on the Newsroom dataset. Previous work results are from (Grusky, Naaman, and Artzi 2018) and (Mendes et al. 2019).
  • Table 7: Qualitative results: generated abstracts of select papers using our intro-only TLM.
  • Table 8: Qualitative results: generated abstracts from our models on a random example from the test set of (Cohan et al. 2018).
Related work
  • Automatic summarization systems seek to condense a piece of text while preserving most of its important information content and meaning. The earliest attempts at automatic summarization focused on extractive techniques, which find words or sentences in a document that capture its most salient content. In the past, various similarity scores based on specific sentence features (keywords, position, length, frequency, linguistic) and metrics (structure-based, vector-based and graph-based) were employed to estimate the salience of a sentence in a document relative to its reference summary (Steinberger and Jezek 2004; Erkan and Radev 2004). More recently, with advances in distributed representations of words, phrases and sentences, researchers have proposed to use these to compute similarity scores. Such techniques were further refined by (Nallapati, Zhou, and Ma 2016; Cheng and Lapata 2016; Chen and Bansal 2018) with encoder-decoder architectures: the representations learned by the encoder are used to choose the most salient sentences. (Cheng and Lapata 2016) and (Nallapati, Zhou, and Ma 2016) trained encoder-decoder neural networks as binary classifiers to determine whether each sentence in a document should belong to the extractive summary. (Nallapati et al. 2016) also present an alternative that can pick an unordered set of sentences from the source document to assemble an extractive summary. (Chen and Bansal 2018) use a pointer network (Vinyals, Fortunato, and Jaitly 2015) to sequentially pick sentences from the document that comprise its extractive summary.

    Human summarizers have four common characteristics. They are able to (1) interpret a source document, (2) prioritize the most important parts of the input text, (3) paraphrase key concepts into coherent paragraphs, and (4) generate diverse output summaries. While extractive methods are arguably well suited for identifying the most relevant information, such techniques may lack the fluency and coherency of human-generated summaries. Abstractive summarization has shown the most promise towards addressing points (3) and (4) above, since abstractive generation may produce sentences not seen in the original input document. Motivated by the success of neural networks in machine translation, the attention-based encoder-decoder paradigm has been widely studied for abstractive summarization (Rush, Chopra, and Weston 2015; Nallapati et al. 2016; Chopra, Auli, and Rush 2016). By dynamically accessing relevant pieces of information based on the hidden states of the decoder during generation of the output sequence, the model revisits the input and attends to important information. The advantages of extractive, abstractive and attention-based models were first combined in (Gu et al. 2016) with a copy mechanism for out-of-vocabulary words present in the source document. Similarly, (See, Liu, and Manning 2017) used the attention scores to calculate the probability of generating vs. copying a word, as formalized below. A coverage mechanism was also added to penalize the attention scores of previously attended words, diminishing the model's tendency to repeat itself.
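The generate-vs-copy interpolation and coverage penalty of See, Liu, and Manning (2017) can be written as follows; this is the standard formulation restated for clarity (with $p_{\text{gen}}$ the generation probability, $a_i^t$ the attention weight on source token $i$ at decoder step $t$, and $c_i^t$ the accumulated coverage), not an equation reproduced from this paper.

```latex
% Pointer-generator mixture: generate from the vocabulary or copy from the source
P(w) = p_{\text{gen}} \, P_{\text{vocab}}(w) + (1 - p_{\text{gen}}) \sum_{i \,:\, w_i = w} a_i^{t}

% Coverage penalty discouraging repeated attention to the same source tokens
c_i^{t} = \sum_{t' < t} a_i^{t'}, \qquad
\text{covloss}_t = \sum_i \min\!\left(a_i^{t}, c_i^{t}\right)
```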
Funding
  • To deal with extremely long documents that exceed several thousand words, we first perform sentence extraction using two different hierarchical document models: one based on pointer networks (Vinyals, Fortunato, and Jaitly 2015), similar to the variant proposed in (Chen and Bansal 2018), and the other based on a sentence classifier (Nallapati, Zhai, and Zhou 2017). This extracts important sentences from the document (described in section ) that can be used to better condition the transformer LM on relevant information before it is tasked with generating a summary. We show that this extractive step significantly improves summarization results.
  • More than 10% of the 20-grams from the abstracts generated by the pointing model are also found in the article, showing that it tends to copy long sequences of words.
  • We show that the proposed model achieves significantly better translation performance than the conventional encoder-decoder neural network approach when the sentences in the training corpus are long.
  • We show that our architecture achieves state-of-the-art performance on a large suite of tasks, outperforming many systems with task-specific architectures.
  • In this paper, we conjecture that the use of a fixed-length vector is a bottleneck in improving the performance of this basic encoder-decoder architecture, and propose to extend it by allowing the model to automatically (soft-)search for parts of a source sentence that are relevant to predicting a target word, without having to form these parts as a hard segment explicitly. With this new approach, we achieve translation performance comparable to the existing state-of-the-art phrase-based system on English-to-French translation (the standard attention equations are restated after this list).
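The soft-search described in the last item is the additive attention of Bahdanau, Cho, and Bengio (2014); the standard equations, with encoder states $h_j$ and previous decoder state $s_{i-1}$, are:

```latex
e_{ij} = a(s_{i-1}, h_j), \qquad
\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k} \exp(e_{ik})}, \qquad
c_i = \sum_{j} \alpha_{ij} h_j
```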
Study subjects and analysis
Tested language modeling datasets: 8
The capacity of the language model is essential to the success of zero-shot task transfer, and increasing it improves performance in a log-linear fashion across tasks. Our largest model, GPT-2, is a 1.5B-parameter Transformer that achieves state-of-the-art results on 7 out of 8 tested language modeling datasets in a zero-shot setting but still underfits WebText. Samples from the model reflect these improvements and contain coherent paragraphs of text.
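The zero-shot behaviour described above can be probed with the public GPT-2 weights by appending a "TL;DR:" prompt to an article, as done in the GPT-2 paper. The snippet below uses the Hugging Face transformers library as an assumed tool; it is not this paper's model or training setup.

```python
# Hedged sketch: zero-shot "TL;DR:" summarization with public GPT-2 weights,
# using the Hugging Face transformers library (assumed tooling, not the
# authors' setup).
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

article = "Researchers propose an extract-then-abstract approach to summarize long documents."
prompt = article + "\nTL;DR:"

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=40,
    do_sample=True,
    top_k=50,
    pad_token_id=tokenizer.eos_token_id,
)
# Decode only the tokens generated after the prompt.
summary = tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
print(summary.strip())
```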

References
  • Bahdanau, D.; Cho, K.; and Bengio, Y. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
  • Bengio, Y.; Ducharme, R.; Vincent, P.; and Jauvin, C. 2003. A neural probabilistic language model. Journal of Machine Learning Research 3(Feb):1137–1155.
  • Chen, Y.-C., and Bansal, M. 2018. Fast abstractive summarization with reinforce-selected sentence rewriting. arXiv preprint arXiv:1805.11080.
  • Cheng, J., and Lapata, M. 2016. Neural summarization by extracting sentences and words. arXiv preprint arXiv:1603.07252.
  • Chopra, S.; Auli, M.; and Rush, A. M. 2016. Abstractive sentence summarization with attentive recurrent neural networks. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 93–98.
  • Cohan, A.; Dernoncourt, F.; Kim, D. S.; Bui, T.; Kim, S.; Chang, W.; and Goharian, N. 2018. A discourse-aware attention model for abstractive summarization of long documents. CoRR abs/1804.05685.
  • Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
  • Erkan, G., and Radev, D. R. 2004. LexRank: Graph-based lexical centrality as salience in text summarization. Journal of Artificial Intelligence Research 22:457–479.
  • Gehring, J.; Auli, M.; Grangier, D.; Yarats, D.; and Dauphin, Y. N. 2017. Convolutional sequence to sequence learning. 1243–1252.
  • Gehrmann, S.; Deng, Y.; and Rush, A. M. 2018. Bottom-up abstractive summarization. arXiv preprint arXiv:1808.10792.
  • Graham, Y. 2015. Re-evaluating automatic summarization with BLEU and 192 shades of ROUGE. 128–137.
  • Grusky, M.; Naaman, M.; and Artzi, Y. 2018. Newsroom: A dataset of 1.3 million summaries with diverse extractive strategies. arXiv preprint arXiv:1804.11283.
  • Gu, J.; Lu, Z.; Li, H.; and Li, V. O. K. 2016. Incorporating copying mechanism in sequence-to-sequence learning. CoRR abs/1603.06393.
  • Gulcehre, C.; Firat, O.; Xu, K.; Cho, K.; Barrault, L.; Lin, H.-C.; Bougares, F.; Schwenk, H.; and Bengio, Y. 2015. On using monolingual corpora in neural machine translation. arXiv preprint arXiv:1503.03535.
  • Hochreiter, S., and Schmidhuber, J. 1997. Long short-term memory. Neural Computation 9(8):1735–1780.
  • Hochreiter, S. 1998. The vanishing gradient problem during learning recurrent neural nets and problem solutions. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems 6(02):107–116.
  • Kalchbrenner, N.; Espeholt, L.; Simonyan, K.; Oord, A. v. d.; Graves, A.; and Kavukcuoglu, K. 2016. Neural machine translation in linear time. arXiv preprint arXiv:1610.10099.
  • Lin, C.-Y. 2004. Looking for a few good metrics: Automatic summarization evaluation - how many samples are enough? In NTCIR.
  • Luong, M.-T.; Pham, H.; and Manning, C. D. 2015. Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025.
  • Mendes, A.; Narayan, S.; Miranda, S.; Marinho, Z.; Martins, A. F.; and Cohen, S. B. 2019. Jointly extracting and compressing documents with summary state representations. arXiv preprint arXiv:1904.02020.
  • Micikevicius, P.; Narang, S.; Alben, J.; Diamos, G.; Elsen, E.; Garcia, D.; Ginsburg, B.; Houston, M.; Kuchaiev, O.; Venkatesh, G.; et al. 2017. Mixed precision training. arXiv preprint arXiv:1710.03740.
  • Mikolov, T.; Karafiát, M.; Burget, L.; Cernocky, J.; and Khudanpur, S. 2010. Recurrent neural network based language model.
  • Nallapati, R.; Zhou, B.; Gulcehre, C.; Xiang, B.; et al. 2016. Abstractive text summarization using sequence-to-sequence RNNs and beyond. arXiv preprint arXiv:1602.06023.
  • Nallapati, R.; Zhai, F.; and Zhou, B. 2017. SummaRuNNer: A recurrent neural network based sequence model for extractive summarization of documents.
  • Nallapati, R.; Zhou, B.; and Ma, M. 2016. Classify or select: Neural architectures for extractive document summarization. arXiv preprint arXiv:1611.04244.
  • Pascanu, R.; Mikolov, T.; and Bengio, Y. 2013. On the difficulty of training recurrent neural networks. 1310–1318.
  • Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; and Sutskever, I. 2019. Language models are unsupervised multitask learners.
  • Rush, A. M.; Chopra, S.; and Weston, J. 2015. A neural attention model for abstractive sentence summarization. CoRR abs/1509.00685.
  • See, A.; Liu, P. J.; and Manning, C. D. 2017. Get to the point: Summarization with pointer-generator networks. CoRR abs/1704.04368.
  • Sennrich, R.; Haddow, B.; and Birch, A. 2015. Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909.
  • Sharma, E.; Li, C.; and Wang, L. 2019. BIGPATENT: A large-scale dataset for abstractive and coherent summarization. arXiv preprint arXiv:1906.03741.
  • Steinberger, J., and Jezek, K. 2004. Using latent semantic analysis in text summarization and summary evaluation. Proc. ISIM 4:93–100.
  • Sutskever, I.; Vinyals, O.; and Le, Q. V. 2014. Sequence to sequence learning with neural networks. 3104–3112.
  • Van Den Oord, A.; Dieleman, S.; Zen, H.; Simonyan, K.; Vinyals, O.; Graves, A.; Kalchbrenner, N.; Senior, A. W.; and Kavukcuoglu, K. 2016. WaveNet: A generative model for raw audio. SSW 125.
  • Vanderwende, L.; Suzuki, H.; Brockett, C.; and Nenkova, A. 2007. Beyond SumBasic: Task-focused summarization with sentence simplification and lexical expansion. Information Processing & Management 43(6):1606–1618.
  • Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, L.; and Polosukhin, I. 2017. Attention is all you need. CoRR abs/1706.03762.
  • Vinyals, O.; Fortunato, M.; and Jaitly, N. 2015. Pointer networks. 2692–2700.
  • Weber, N.; Shekhar, L.; Balasubramanian, N.; and Cho, K. 2018. Controlling decoding for more abstractive summaries with copy-based networks. arXiv preprint arXiv:1803.07038.
Authors
Jonathan Pilault
Raymond Li
Sandeep Subramanian