Multi-Fact Correction in Abstractive Text Summarization

EMNLP 2020.

Other Links: arxiv.org
Keywords:
abstractive summarization system, abstractive text summarization, abstractive summarization model, abstractive strategy

Abstract:

Pre-trained neural abstractive summarization systems have dominated extractive strategies on news summarization performance, at least in terms of ROUGE. However, system-generated abstractive summaries often face the pitfall of factual inconsistency: generating incorrect facts with respect to the source text. To address this challenge, we propose SpanFact, a suite of two factual correction models that use span selection mechanisms to replace one or multiple entity masks at a time.

Introduction
  • Informative text summarization aims to shorten a long piece of text while preserving its main message.
  • Despite the fact that extractive strategies are simpler and less expensive, and can generate summaries that are more grammatically and semantically correct, abstractive strategies are becoming increasingly popular thanks to their flexibility, coherence and vocabulary diversity (Zhang et al, 2020a).
  • Motivating example (from Table 1, CNN/DailyMail): an abstractive system writes that "a quarter of a million australian homes and businesses have no power after a decade"; SpanFact corrects this to "About a quarter of a million Australian homes and businesses have no power after a 'once in a decade' storm battered Sydney and nearby areas."
Highlights
  • Informative text summarization aims to shorten a long piece of text while preserving its main message
  • Our contributions are summarized as follows. (i) We propose SpanFact, a new factual correction framework that focuses on correcting erroneous facts in generated summaries, generalizable to any summarization system. (ii) We propose two methods to solve multi-fact correction problem with single or multi-span selection in an iterative or auto-regressive manner, respectively. (iii) Experimental results on multiple summarization benchmarks demonstrate that our approach can significantly improve multiple factuality measurements without a huge drop on ROUGE scores
  • Since the question generation and question answering (QGQA) model and FactCC compare the system-generated summary with the source text, a high score indicates high semantic similarity between the system summary and the source
  • We present SpanFact, a suite of two factual correction models that use span selection mechanisms to replace one or multiple entity masks at a time (a minimal sketch of this mask-then-fill loop follows this list)
  • SpanFact can be used for fact correction on the output of any abstractive summarization system
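The following is a minimal sketch of the mask-then-fill loop, assuming an off-the-shelf spaCy NER model and a generic extractive QA model from HuggingFace Transformers in place of SpanFact's trained span selector; it illustrates the iterative single-span idea, not the authors' implementation.

    # Illustrative sketch: a generic QA model stands in for SpanFact's span selector.
    import spacy
    from transformers import pipeline

    nlp = spacy.load("en_core_web_sm")  # NER used to locate entity slots
    qa = pipeline("question-answering",
                  model="distilbert-base-cased-distilled-squad")

    def correct_summary(summary: str, source: str) -> str:
        """Iteratively mask one entity at a time and fill the mask with the
        span the QA model selects from the source document."""
        corrected = summary
        for ent in nlp(summary).ents:
            # The summary with one entity masked acts as the query.
            query = corrected.replace(ent.text, "[MASK]", 1)
            if "[MASK]" not in query:  # entity text already rewritten away
                continue
            span = qa(question=query, context=source)["answer"]
            corrected = query.replace("[MASK]", span, 1)
        return corrected

The auto-regressive variant instead masks all entities at once and fills them left to right, conditioning each fill on the spans already chosen.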
Methods
  • Training data for the fact correction models are generated as described in Section 3.2 on CNN/DailyMail (Hermann et al, 2015), XSum (Narayan et al, 2018) and Gigaword (Graff et al, 2003; Rush et al, 2015); see the sketch after these bullets
  • The statistics of these three datasets are provided in Table 2.
  • The Transformer decoder has 1,024 hidden units, and the feed-forward intermediate size for all layers is 4,096
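A sketch of how such training pairs can be built by masking named entities in the reference summary; the spaCy model and the [MASK] token are illustrative choices, not necessarily the authors'.

    import spacy

    nlp = spacy.load("en_core_web_sm")

    def make_training_example(source: str, reference: str) -> dict:
        """Mask every named entity in the reference summary; the correction
        model is trained to recover the original spans from the source."""
        doc = nlp(reference)
        masked, targets = reference, []
        # Replace from the end so earlier character offsets stay valid.
        for ent in reversed(doc.ents):
            targets.insert(0, ent.text)
            masked = masked[:ent.start_char] + "[MASK]" + masked[ent.end_char:]
        return {"source": source, "masked_summary": masked, "targets": targets}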
Results
  • On CNN/DailyMail (Table 3), the correction models significantly boost factual consistency measures (QGQA and FactCC) by large margins, with only small drops on ROUGE.
  • The iterative procedure of the QA-Span model is more robust, with high precision, as each query provides more correct context, leaving only minimal negative influence from other concurrent errors
  • This is reflected in the high scores of QGQA and FactCC across all the models the authors tested.
  • Since QGQA and FactCC compare the system-generated summary with the source text, a high score indicates high semantic similarity between the system summary and the source (a sketch of the QGQA check follows)
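As a rough sketch of the QGQA idea: questions generated from the summary are answered against both the summary and the source, and agreement is scored. Here generate_questions is a placeholder for any question-generation component, and exact-match comparison is a simplification of the metric's actual answer matching.

    from transformers import pipeline

    qa = pipeline("question-answering",
                  model="distilbert-base-cased-distilled-squad")

    def qgqa_score(summary: str, source: str, generate_questions) -> float:
        """Fraction of questions whose answers agree between summary and source."""
        questions = generate_questions(summary)  # placeholder QG component
        if not questions:
            return 0.0
        agree = sum(
            qa(question=q, context=summary)["answer"].strip().lower()
            == qa(question=q, context=source)["answer"].strip().lower()
            for q in questions
        )
        return agree / len(questions)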
Conclusion
  • The authors present SpanFact, a suite of two factual correction models that use span selection mechanisms to replace one or multiple entity masks at a time.
  • SpanFact can be used for fact correction on the output of any abstractive summarization system.
  • Empirical results show that the models improve the factuality of summaries generated by state-of-the-art abstractive summarization systems without a huge drop on ROUGE scores.
  • The authors plan to apply the method to other types of spans, such as noun phrases, verbs, and clauses
Tables
  • Table 1: Examples of factual error correction on different summarization datasets. Factual errors are marked in red. Corrections made by the proposed SpanFact models are marked in orange
  • Table 2: Comparison of summarization datasets on train/validation/test set splits and average document and summary length (number of words). We also report the average number of entity masks on the reference summary for each dataset
  • Table 3: Factual correctness scores and ROUGE scores on the CNN/DailyMail test set
  • Table 4: Factual correctness scores and ROUGE scores on the XSum test set
  • Table 5: Factual correctness scores and ROUGE scores on the Gigaword test set
  • Table 6: Human evaluation results on pairwise comparison of factual correctness on 450 (9 × 50) randomly sampled articles
  • Table 7: Test results on the human-annotated dataset provided by FactCC (Kryscinski et al, 2019). We show performance comparisons of the original summaries and the summaries corrected by SpanFact
  • Table 8: Examples of factual error correction on the FactCC dataset (a human-annotated subset of CNN/DailyMail obtained by Kryscinski et al (2019)). Factual errors made by abstractive summarization systems are marked in red. Corrections made by the proposed SpanFact models are marked in orange
Related work
  • The general neural encoder-decoder structure for abstractive summarization was first proposed by Rush et al (2015). Later work improved this structure with better encoders, such as LSTMs (Chopra et al, 2016) and GRUs (Nallapati et al, 2016), that are able to capture long-range dependencies, as well as with reinforcement learning methods that directly optimize summarization evaluation scores (Paulus et al, 2018). One drawback of the earlier neural summarization models is the inability to produce out-of-vocabulary words, as the model can only generate whole words from a fixed vocabulary. See et al (2017) propose a pointer-generator framework that can copy words directly from the source through a pointer network (Vinyals et al, 2015), in addition to the traditional sequence-to-sequence generation model.
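For concreteness, the final output distribution of the pointer-generator mixes generating from the vocabulary with copying from the source attention; in the notation of See et al (2017):

    $$P(w) = p_{\mathrm{gen}}\, P_{\mathrm{vocab}}(w) + (1 - p_{\mathrm{gen}}) \sum_{i:\, w_i = w} a_i^t$$

where $p_{\mathrm{gen}}$ is a learned generation probability and $a_i^t$ is the attention weight on source token $w_i$ at decoding step $t$.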

    Abstractive summarization began to shine with the advent of self-supervised pre-training algorithms, which allow deeper and more complicated neural networks such as Transformers (Vaswani et al, 2017) to learn diverse language priors from large-scale corpora. Models such as BERT (Devlin et al, 2019), GPT (Radford et al, 2018) and BART (Lewis et al, 2020) have achieved new state-of-the-art performance on abstractive summarization (Liu and Lapata, 2019; Lewis et al, 2020; Zhang et al, 2020a; Shi et al, 2019; Fabbri et al, 2019). These models typically fine-tune pre-trained Transformers on supervised summarization datasets that contain source-summary pairs.
Funding
  • This research was supported in part by Microsoft Dynamics 365 AI Research and the Canada CIFAR AI Chair program
Study subjects and analysis
randomly selected samples: 50
We select three state-of-the-art abstractive summarization models as the backbones, and collect three sets of pairwise summaries for each setting: (i) Original vs. QA-Span corrected; (ii) Original vs. Auto-regressive corrected; (iii) QA-Span corrected vs. Auto-regressive corrected. Nine sets of 50 randomly selected samples (450 samples in total) are labeled by AMT turkers. For each pair (in anonymized order), three annotators from Amazon Mechanical Turk (AMT) are asked to judge which is more factually correct based on the source article (a sketch of the vote aggregation follows).
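A small sketch of how such pairwise judgments can be aggregated: majority vote among the three annotators decides each pair, and a per-system win rate follows. This is an illustration, not the authors' exact analysis script.

    from collections import Counter

    def majority_winner(votes):
        """votes: the three annotator labels for one pair, e.g. ["A", "B", "A"]."""
        label, _count = Counter(votes).most_common(1)[0]
        return label

    def win_rate(all_votes, system):
        """Fraction of pairs whose majority vote favors `system`."""
        wins = sum(majority_winner(v) == system for v in all_votes)
        return wins / len(all_votes)

    # e.g. win_rate([["A", "A", "B"], ["B", "B", "B"]], "A") == 0.5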

Reference
  • Ziqiang Cao, Furu Wei, Wenjie Li, and Sujian Li. 2018. Faithful to the original: Fact aware neural abstractive summarization. In Thirty-Second AAAI Conference on Artificial Intelligence.
  • Meng Cao, Yue Dong, Jiapeng Wu, and Jackie Chi Kit Cheung. 2020. Factual error correction for abstractive summarization models. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP).
  • Sumit Chopra, Michael Auli, and Alexander M Rush. 2016. Abstractive sentence summarization with attentive recurrent neural networks. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 93–98.
  • Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186.
  • Esin Durmus, He He, and Mona Diab. 2020. FEQA: A question answering evaluation framework for faithfulness assessment in abstractive summarization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics.
  • Alexander Fabbri, Irene Li, Tianwei She, Suyi Li, and Dragomir Radev. 2019. Multi-news: A large-scale multi-document summarization dataset and abstractive hierarchical model. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1074–1084, Florence, Italy. Association for Computational Linguistics.
  • Tobias Falke, Leonardo FR Ribeiro, Prasetya Ajie Utama, Ido Dagan, and Iryna Gurevych. 2019. Ranking generated summaries by correctness: An interesting but challenging application for natural language inference. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2214–2220.
  • Sebastian Gehrmann, Yuntian Deng, and Alexander M Rush. 2018. Bottom-up abstractive summarization. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4098–4109.
  • Ben Goodrich, Vinay Rao, Peter J Liu, and Mohammad Saleh. 2019. Assessing the factual accuracy of generated text. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 166–175.
  • David Graff, Junbo Kong, Ke Chen, and Kazuaki Maeda. 2003. English gigaword. Linguistic Data Consortium, Philadelphia, 4(1):34.
  • Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching machines to read and comprehend. In Advances in Neural Information Processing Systems, pages 1693–1701.
  • Matthew Honnibal and Ines Montani. 2017. spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing. To appear, 7(1).
  • Philipp Koehn and Rebecca Knowles. 2017. Six challenges for neural machine translation. In Proceedings of the First Workshop on Neural Machine Translation, pages 28–39.
  • Wojciech Kryscinski, Bryan McCann, Caiming Xiong, and Richard Socher. 2019. Evaluating the factual consistency of abstractive text summarization. arXiv preprint arXiv:1910.12840.
  • Katherine Lee, Orhan Firat, Ashish Agarwal, Clara Fannjiang, and David Sussillo. 2018. Hallucinations in neural machine translation.
  • Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics.
  • Haoran Li, Junnan Zhu, Jiajun Zhang, and Chengqing Zong. 2018. Ensure the correctness of the summary: Incorporate entailment knowledge into abstractive sentence summarization. In Proceedings of the 27th International Conference on Computational Linguistics, pages 1430–1441.
  • Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81.
  • Yang Liu and Mirella Lapata. 2019. Text summarization with pretrained encoders. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3721–3731.
  • Ilya Loshchilov and Frank Hutter. 2017. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101.
  • Joshua Maynez, Shashi Narayan, Bernd Bohnet, and Ryan McDonald. 2020. On faithfulness and factuality in abstractive summarization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics.
  • Ramesh Nallapati, Bowen Zhou, Cicero dos Santos, Caglar Gulcehre, and Bing Xiang. 2016. Abstractive text summarization using sequence-to-sequence RNNs and beyond. In Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning, pages 280–290.
  • Shashi Narayan, Shay B Cohen, and Mirella Lapata. 2018. Don't give me the details, just the summary! Topic-aware convolutional neural networks for extreme summarization. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1797–1807.
  • Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. 2017. Automatic differentiation in PyTorch.
  • Romain Paulus, Caiming Xiong, and Richard Socher. 2018. A deep reinforced model for abstractive summarization. In International Conference on Learning Representations.
  • Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training.
  • Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67.
  • Pranav Rajpurkar, Robin Jia, and Percy Liang. 2018. Know what you don't know: Unanswerable questions for SQuAD. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 784–789.
  • Alexander M Rush, Sumit Chopra, and Jason Weston. 2015. A neural attention model for abstractive sentence summarization. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 379–389.
  • Abigail See, Peter J Liu, and Christopher D Manning. 2017. Get to the point: Summarization with pointer-generator networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1073–1083.
  • Darsh J Shah, Tal Schuster, and Regina Barzilay. 2020. Automatic fact-guided sentence modification. In Proceedings of The Thirty-Fourth AAAI Conference on Artificial Intelligence.
  • Tian Shi, Ping Wang, and Chandan K. Reddy. 2019. LeafNATS: An open-source toolkit and live demo system for neural abstractive text summarization. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), pages 66–71, Minneapolis, Minnesota. Association for Computational Linguistics.
  • Kaiqiang Song, Logan Lebanoff, Qipeng Guo, Xipeng Qiu, Xiangyang Xue, Chen Li, Dong Yu, and Fei Liu. 2020. Joint parsing and generation for abstractive summarization. In Proceedings of The Thirty-Fourth AAAI Conference on Artificial Intelligence, pages 4098–4109.
  • Adam Trischler, Tong Wang, Xingdi Yuan, Justin Harris, Alessandro Sordoni, Philip Bachman, and Kaheer Suleman. 2017. NewsQA: A machine comprehension dataset. In Proceedings of the 2nd Workshop on Representation Learning for NLP, pages 191–200.
  • Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.
  • Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly. 2015. Pointer networks. In Advances in Neural Information Processing Systems, pages 2692–2700.
  • Oriol Vinyals and Quoc Le. 2015. A neural conversational model. arXiv preprint arXiv:1506.05869.
  • Alex Wang, Kyunghyun Cho, and Mike Lewis. 2020. Asking and answering questions to evaluate the factual consistency of summaries. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics.
  • Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. 2019. HuggingFace's Transformers: State-of-the-art natural language processing. ArXiv, abs/1910.03771.
  • Jingqing Zhang, Yao Zhao, Mohammad Saleh, and Peter J Liu. 2020a. PEGASUS: Pre-training with extracted gap-sentences for abstractive summarization. In Thirty-seventh International Conference on Machine Learning (ICML 2020).
  • Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. 2020b. BERTScore: Evaluating text generation with BERT. In International Conference on Learning Representations.
  • Chenguang Zhu, William Hinthorn, Ruochen Xu, Qingkai Zeng, Michael Zeng, Xuedong Huang, and Meng Jiang. 2020. Boosting factual correctness of abstractive summarization with knowledge graph. arXiv preprint arXiv:2003.08612.