
What Have We Achieved on Text Summarization?

EMNLP 2020, pp. 446–469.


Abstract

Deep learning has led to significant improvement in text summarization, with various methods investigated and improved ROUGE scores reported over the years. However, gaps still exist between summaries produced by automatic summarizers and human professionals. Aiming to gain more understanding of summarization systems with respect to their strengths and limits on a fine-grained level, the authors consult the Multidimensional Quality Metric (MQM) (Mariana, 2014) and manually quantify the errors of 10 representative summarization models.

Code: https://github.com/hddbang/PolyTope

Data: CNN/DM (Hermann et al., 2015)
Introduction
  • Automatic text summarization has received constant research attention due to its practical importance.
  • While improved ROUGE scores have been reported over the years on standard benchmarks such as Gigaword (Graff et al., 2003), Newsroom (Grusky et al., 2018) and CNN/DM (Hermann et al., 2015), it is commonly accepted that the quality of machine-generated summaries still falls far behind that of human-written ones.
  • Human evaluation reports that accompany ROUGE scores are limited in scope and coverage.
  • On a fine-grained level, it remains uncertain what has been achieved overall and what fundamental changes each milestone technique has brought.
Highlights
  • Automatic text summarization has received constant research attention due to its practical importance
  • Human evaluation reports that accompany ROUGE scores are limited in scope and coverage
  • PolyTope is an error-oriented fine-grained human evaluation method based on Multidimensional Quality Metric (MQM) (Mariana, 2014)
  • The main goal of this paper is to investigate the differences between summarization systems, rather than to promote a human evaluation metric
  • We empirically compared 10 representative text summarizers using a fine-grained set of human evaluation metrics designed according to MQM for human writing, aiming to achieve a better understanding of neural text summarization systems and of the effects of recently investigated milestone techniques
  • Our observations suggest that extractive summarizers generally outperform abstractive summarizers by human evaluation, and more details are found about the unique advantages gained by copy, coverage, hybrid and especially pre-training technologies
Methods
  • BertSumExt demonstrates advantages only in Duplication, likely because the contextualized BERT representations of the same phrase can differ across occurrences.
  • This coincides with previous findings (Kedzie et al., 2018) demonstrating that more complicated architectures for producing sentence representations do not lead to better performance in extractive summarization.
  • The Pointer-Generator model reduces the error count to 14, demonstrating the effectiveness of the copy mechanism in faithfully reproducing details (see the sketch after this list).
  • This is also observed by Gehrmann et al. (2018) and Balachandran et al. (2020).
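The copy mechanism can be made concrete with a short sketch. The following is not the authors' evaluation code, but a minimal NumPy illustration of the pointer-generator idea (See et al., 2017): a learned gate p_gen mixes the decoder's vocabulary distribution with copy mass routed through the attention weights, so source tokens can be reproduced even when the softmax assigns them no probability. All tensors and numbers are illustrative.

```python
import numpy as np

def pointer_generator_step(p_vocab, attention, src_token_ids, p_gen):
    """One decoding step of a pointer-generator (See et al., 2017), illustrative only.

    p_vocab:       (V,) generation distribution from the decoder softmax
    attention:     (S,) attention weights over the source tokens (sums to 1)
    src_token_ids: (S,) vocabulary id of each source token
    p_gen:         scalar gate in [0, 1] predicted from the decoder state
    """
    final = p_gen * p_vocab                                # generation share
    # Route the remaining (1 - p_gen) mass to the ids of the source tokens,
    # which is what lets the model faithfully reproduce details such as names.
    np.add.at(final, src_token_ids, (1.0 - p_gen) * attention)
    return final                                           # still sums to 1

# Toy check: source tokens with ids 4 and 5 receive copied probability mass
# even though the softmax assigned them zero.
p_vocab = np.array([0.4, 0.3, 0.2, 0.1, 0.0, 0.0])
attention = np.array([0.7, 0.3])
dist = pointer_generator_step(p_vocab, attention, np.array([4, 5]), p_gen=0.8)
print(dist)                      # [0.32 0.24 0.16 0.08 0.14 0.06]
assert abs(dist.sum() - 1.0) < 1e-9
```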
Results
  • Evaluation Method

    The authors analyze system performance by using ROUGE (Lin, 2004) for automatic scoring and PolyTope for human scoring.
  • PolyTope is an error-oriented fine-grained human evaluation method based on Multidimensional Quality Metric (MQM) (Mariana, 2014).
  • It consists of 8 issue types (Section 4.1), 8 syntactic labels (Section 4.2) and a set of severity rules (Section 4.3) to locate errors and to automatically calculate an overall score for the tested document.
  • As Figure 3 illustrates, PolyTope is more fine-grained than ROUGE, offering detailed and diagnostic views of overall quality (a scoring sketch follows this list).
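PolyTope's actual severity rules are predefined in the released toolkit (https://github.com/hddbang/PolyTope) and are not spelled out in this summary, so the following is only a hypothetical sketch of how an error-oriented, severity-weighted score could be computed from annotations of the kind Table 2 describes. The severity weights and the length normalization are invented for illustration.

```python
from dataclasses import dataclass

# Hypothetical severity weights -- the real rules live in the PolyTope toolkit.
SEVERITY_WEIGHT = {"minor": 1.0, "major": 3.0, "critical": 5.0}

@dataclass
class Error:
    issue_type: str       # e.g. "Omission", "Duplication", "Inaccuracy"
    syntactic_label: str  # e.g. "NP", "VP": where in the sentence the error sits
    severity: str         # "minor" | "major" | "critical"

def polytope_style_score(errors, n_words):
    """Severity-weighted quality score in [0, 100], normalized by summary
    length so long and short outputs are comparable (illustrative only)."""
    penalty = sum(SEVERITY_WEIGHT[e.severity] for e in errors)
    return max(0.0, 100.0 * (1.0 - penalty / n_words))

annotated = [Error("Omission", "NP", "major"),
             Error("Duplication", "VP", "minor")]
print(polytope_style_score(annotated, n_words=80))  # 95.0
```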
Conclusion
  • The authors empirically compared 10 representative text summarizers using a fine-grained set of human evaluation metrics designed according to MQM for human writing, aiming to achieve a better understanding of neural text summarization systems and of the effects of recently investigated milestone techniques.
  • The authors' observations suggest that extractive summarizers generally outperform abstractive summarizers by human evaluation, and more details are found about the unique advantages gained by copy, coverage, hybrid and especially pre-training technologies.
  • The overall conclusions are largely in line with existing research, while the authors provide more detail from an error-diagnostics perspective.
Tables
  • Table 1: ROUGE scores of 10 summarizers on the CNN/DM dataset (non-anonymized version). The scores of Lead-3 and TextRank are taken from Nallapati et al. (2017) and Zhou et al. (2018), respectively
  • Table 2: PolyTope for summarization diagnostics. This error matrix avoids subjectivity: human judges only annotate the issue type and syntactic label of each mistake, while severity rules and scores are predefined and calculated automatically, without annotators supplying their own preferences or scores
  • Table 3: ROUGE and PolyTope results on 150 instances from the CNN/DM dataset. ROUGE is the F1 score computed with stemming and without stopword removal, the setting that agrees best with human evaluation (a minimal ROUGE-1 sketch follows this list)
  • Table 4: Pearson correlation coefficients between ROUGE scores and human annotations, at the instance and the system level respectively
  • Table 5: Pearson correlation coefficients between ROUGE-P (precision) and PolyTope
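Since Table 3 reports ROUGE F1, a reference point may help: the sketch below computes ROUGE-1 F1 (Lin, 2004) as clipped unigram overlap. It is a simplification of the paper's setting, which applies stemming; this version does not, and it assumes a single reference.

```python
from collections import Counter

def rouge_1_f1(candidate: str, reference: str) -> float:
    """ROUGE-1 F1 between a candidate summary and one reference (Lin, 2004).
    Simplified: whitespace tokenization, no stemming, stopwords kept."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())       # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

print(rouge_1_f1("the cat sat on the mat", "the cat lay on the mat"))  # ~0.833
```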
Related work
  • Extractive Summarization Early efforts based on statistical methods (Neto et al., 2002; Mihalcea and Tarau, 2004) make use of expert knowledge to manually design features or rules. Recent work based on neural architectures treats summarization as a word- or sentence-level classification problem and addresses it by computing sentence representations (Cheng and Lapata, 2016; Nallapati et al., 2017; Xu and Durrett, 2019). Most recently, Zhong et al. (2020) adopt document-level features to rerank extractive summaries. (A minimal TextRank-style sketch follows.)
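As a concrete instance of the statistical extractive line above, here is a minimal TextRank-style sketch: sentences are scored by a damped power iteration (PageRank; Page et al., 1998) over a word-overlap similarity graph, and the top-ranked sentences are returned in document order. The similarity function is simplified for brevity; this is not the Summa implementation evaluated in the paper.

```python
import numpy as np

def textrank_extract(sentences, k=3, d=0.85, n_iters=50):
    """Extract the k top-scoring sentences using a TextRank-style ranking."""
    words = [set(s.lower().split()) for s in sentences]
    n = len(sentences)
    sim = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):          # symmetric word-overlap similarity
            sim[i, j] = sim[j, i] = len(words[i] & words[j]) / (1 + len(words[i] | words[j]))
    col = sim.sum(axis=0)
    col[col == 0] = 1.0                    # isolated sentences distribute nothing
    trans = sim / col                      # column-normalized transition matrix
    scores = np.full(n, 1.0 / n)
    for _ in range(n_iters):               # damped power iteration (PageRank)
        scores = (1 - d) / n + d * trans @ scores
    top = sorted(np.argsort(scores)[-k:])  # top-k, restored to document order
    return [sentences[i] for i in top]

doc = ["The cat sat on the mat.",
       "A dog chased the cat around the mat.",
       "Stocks closed higher on Friday.",
       "The cat and the dog are friends."]
print(textrank_extract(doc, k=2))
```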

    Abstractive Summarization Jing and McKeown (2000) presented a cut-and-paste abstractive summarizer, which edited and merged extracted snippets into coherent sentences. Rush et al. (2015) proposed a sequence-to-sequence architecture for abstractive summarization. Subsequently, the Transformer was adopted and outperformed traditional abstractive summarizers in ROUGE scores (Duan et al., 2019). Techniques such as AMR parsing (Liu et al., 2015), copy (Gu et al., 2016), coverage (Tu et al., 2016; See et al., 2017), label smoothing (Muller et al., 2019) and pre-training (Lewis et al., 2019; Liu and Lapata, 2019) have also been examined to enhance summarization. Hybrid abstractive-extractive methods adopt a two-step approach of content selection followed by text generation (Gehrmann et al., 2018; Hsu et al., 2018; Celikyilmaz et al., 2018), achieving higher ROUGE scores than end-to-end models.
Funding
  • This work is supported by NSFC grant 61976180 and a research grant from Tencent Inc.
Study subjects and analysis
articles: 500

Previous human evaluations typically consist of 2 to 3 aspects such as informativeness, fluency and succinctness. Recently, Maynez et al. (2020) conducted a human evaluation of 5 neural abstractive models on 500 articles; their main goal was to verify faithfulness and factuality in abstractive models. The systems compared in this paper include extractive methods (e.g., TextRank via Summa) and abstractive methods (e.g., PG-Coverage, Bottom-Up).

documents: 20
After PolyTope evaluation, three-dimensional error points show the overall quality of the tested model (Figure 1). The inter-annotator agreement over 20 documents is 0.8621 in terms of the Pearson correlation coefficient, which shows that PolyTope can significantly reduce the subjective bias of annotators. More human annotation details are given in Appendix B.
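Both the inter-annotator agreement above and the ROUGE-versus-human comparisons in Table 4 come down to the Pearson correlation coefficient. A minimal SciPy sketch follows; the scores are made-up placeholders, not the paper's data.

```python
from scipy.stats import pearsonr

# Hypothetical scores for the same five summaries -- not the paper's data.
annotator_a = [92.0, 85.5, 78.0, 95.0, 60.0]   # PolyTope scores from judge A
annotator_b = [90.0, 88.0, 75.0, 96.0, 58.0]   # PolyTope scores from judge B
rouge_1     = [0.41, 0.38, 0.33, 0.44, 0.25]   # automatic metric per summary

# Inter-annotator agreement (the paper reports r = 0.8621 over 20 documents).
r_judges, _ = pearsonr(annotator_a, annotator_b)

# Instance-level metric-human correlation; system-level correlation would
# instead compare per-system averages across the 10 summarizers.
r_metric, _ = pearsonr(rouge_1, annotator_a)

print(f"judge-judge r = {r_judges:.3f}, ROUGE-human r = {r_metric:.3f}")
```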

References
  • Vidhisha Balachandran, Artidoro Pagnoni, Jay Yoon Lee, Dheeraj Rajagopal, Jaime Carbonell, and Yulia Tsvetkov. 2020. StructSum: Incorporating latent and explicit sentence dependencies for single document summarization. arXiv preprint arXiv:2003.00576.
  • Florian Bohm, Yang Gao, Christian M. Meyer, Ori Shapira, Ido Dagan, and Iryna Gurevych. 2019. Better rewards yield better summaries: Learning to summarise without references. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3108–3118, Hong Kong, China. Association for Computational Linguistics.
  • Asli Celikyilmaz, Antoine Bosselut, Xiaodong He, and Yejin Choi. 2018. Deep communicating agents for abstractive summarization. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1662–1675, New Orleans, Louisiana. Association for Computational Linguistics.
  • Jianpeng Cheng and Mirella Lapata. 2016. Neural summarization by extracting sentences and words. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016, August 7-12, 2016, Berlin, Germany, Volume 1: Long Papers.
  • James Clarke and Mirella Lapata. 2010. Discourse constraints for document compression. Computational Linguistics, 36(3):411–441.
  • Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), pages 4171–4186.
  • Bonnie Dorr, David Zajic, and Richard Schwartz. 2003. Hedge trimmer: A parse-and-trim approach to headline generation. In Proceedings of the HLT-NAACL 03 Text Summarization Workshop, pages 1–8.
  • Xiangyu Duan, Hongfei Yu, Mingming Yin, Min Zhang, Weihua Luo, and Yue Zhang. 2019. Contrastive attention mechanism for abstractive sentence summarization. CoRR, abs/1910.13114.
  • Sebastian Gehrmann, Yuntian Deng, and Alexander Rush. 2018. Bottom-up abstractive summarization. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4098–4109, Brussels, Belgium. Association for Computational Linguistics.
  • David Graff, Junbo Kong, Ke Chen, and Kazuaki Maeda. 2003. English gigaword. Linguistic Data Consortium, Philadelphia, 4(1):34.
  • Max Grusky, Mor Naaman, and Yoav Artzi. 2018. Newsroom: A dataset of 1.3 million summaries with diverse extractive strategies. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2018, New Orleans, Louisiana, USA, June 1-6, 2018, Volume 1 (Long Papers), pages 708–719.
  • Jiatao Gu, Zhengdong Lu, Hang Li, and Victor O.K. Li. 2016. Incorporating copying mechanism in sequence-to-sequence learning. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1631–1640, Berlin, Germany. Association for Computational Linguistics.
  • Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching machines to read and comprehend. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 1693–1701. Curran Associates, Inc.
  • Wan-Ting Hsu, Chieh-Kai Lin, Ming-Ying Lee, Kerui Min, Jing Tang, and Min Sun. 2018. A unified model for extractive and abstractive summarization using inconsistency loss. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 132–141, Melbourne, Australia. Association for Computational Linguistics.
  • Hongyan Jing and Kathleen R. McKeown. 2000. Cut and paste based text summarization. In 1st Meeting of the North American Chapter of the Association for Computational Linguistics.
  • Chris Kedzie, Kathleen R. McKeown, and Hal Daume III. 2018. Content selection in deep learning models of summarization. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018, pages 1818–1828.
  • Svetlana Kiritchenko and Saif Mohammad. 2017. Best-worst scaling more reliable than rating scales: A case study on sentiment intensity annotation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 465–470, Vancouver, Canada. Association for Computational Linguistics.
  • Wojciech Kryscinski, Nitish Shirish Keskar, Bryan McCann, Caiming Xiong, and Richard Socher. 2019. Neural text summarization: A critical evaluation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 540–551, Hong Kong, China. Association for Computational Linguistics.
  • Mirella Lapata and Regina Barzilay. 2005. Automatic evaluation of text coherence: Models and representations. In IJCAI, volume 5, pages 1085–1090.
  • Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, and Luke Zettlemoyer. 2019. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461.
  • Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81, Barcelona, Spain. Association for Computational Linguistics.
  • Fei Liu, Jeffrey Flanigan, Sam Thomson, Norman Sadeh, and Noah A. Smith. 2015. Toward abstractive summarization using semantic representations. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1077–1086, Denver, Colorado. Association for Computational Linguistics.
  • Feifan Liu and Yang Liu. 2008. Correlation between ROUGE and human evaluation of extractive meeting summaries. In Proceedings of ACL-08: HLT, Short Papers, pages 201–204, Columbus, Ohio. Association for Computational Linguistics.
  • Yang Liu and Mirella Lapata. 2019. Text summarization with pretrained encoders. CoRR, abs/1908.08345.
  • Valerie Ruth Mariana. 2014. The Multidimensional Quality Metric (MQM) framework: A new framework for translation quality assessment.
  • Joshua Maynez, Shashi Narayan, Bernd Bohnet, and Ryan McDonald. 2020. On faithfulness and factuality in abstractive summarization. arXiv preprint arXiv:2005.00661.
  • Rada Mihalcea and Paul Tarau. 2004. TextRank: Bringing order into text. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, pages 404–411, Barcelona, Spain. Association for Computational Linguistics.
  • Rafael Muller, Simon Kornblith, and Geoffrey Hinton. 2019. When does label smoothing help? arXiv preprint arXiv:1906.02629.
  • Ramesh Nallapati, Feifei Zhai, and Bowen Zhou. 2017. SummaRuNNer: A recurrent neural network based sequence model for extractive summarization of documents. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, February 4-9, 2017, San Francisco, California, USA, pages 3075–3081.
  • Ramesh Nallapati, Bowen Zhou, Cicero dos Santos, Caglar Gulcehre, and Bing Xiang. 2016. Abstractive text summarization using sequence-to-sequence RNNs and beyond. In Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning, pages 280–290, Berlin, Germany. Association for Computational Linguistics.
  • Shashi Narayan, Shay B. Cohen, and Mirella Lapata. 2018. Ranking sentences for extractive summarization with reinforcement learning. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1747–1759, New Orleans, Louisiana. Association for Computational Linguistics.
  • Ani Nenkova and Rebecca Passonneau. 2004. Evaluating content selection in summarization: The pyramid method. In Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics: HLT-NAACL 2004, pages 145–152, Boston, Massachusetts, USA. Association for Computational Linguistics.
  • Joel Larocca Neto, Alex Alves Freitas, and Celso A. A. Kaestner. 2002. Automatic text summarization using a machine learning approach. In Advances in Artificial Intelligence, 16th Brazilian Symposium on Artificial Intelligence, SBIA 2002, Porto de Galinhas/Recife, Brazil, November 11-14, 2002, Proceedings, pages 205–215.
  • Jekaterina Novikova, Ondrej Dusek, Amanda Cercas Curry, and Verena Rieser. 2017. Why we need new evaluation metrics for NLG. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, EMNLP 2017, Copenhagen, Denmark, September 9-11, 2017, pages 2241–2252.
  • L. Page, S. Brin, R. Motwani, and T. Winograd. 1998. The pagerank citation ranking: Bringing order to the web. In Proceedings of the 7th International World Wide Web Conference, pages 161–172, Brisbane, Australia.
  • Maxime Peyrard, Teresa Botschen, and Iryna Gurevych. 2017. Learning to score system summaries for better content selection evaluation. In Proceedings of the Workshop on New Frontiers in Summarization, pages 74–84, Copenhagen, Denmark. Association for Computational Linguistics.
  • Alexander M. Rush, Sumit Chopra, and Jason Weston. 2015. A neural attention model for abstractive sentence summarization. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 379–389, Lisbon, Portugal. Association for Computational Linguistics.
  • Abigail See, Peter J. Liu, and Christopher D. Manning. 2017. Get to the point: Summarization with pointer-generator networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017, Vancouver, Canada, July 30 - August 4, Volume 1: Long Papers, pages 1073–1083.
  • Nisan Stiennon, Long Ouyang, Jeff Wu, Daniel M Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and Paul Christiano. 2020. Learning to summarize from human feedback. arXiv preprint arXiv:2009.01325.
  • Simeng Sun, Ori Shapira, Ido Dagan, and Ani Nenkova. 2019. How to compare summarizers without target length? pitfalls, solutions and re-examination of the neural summarization literature. In Proceedings of the Workshop on Methods for Optimizing and Evaluating Neural Language Generation, pages 21–29, Minneapolis, Minnesota. Association for Computational Linguistics.
  • Markus Zopf. 2018. Estimating summary quality with pairwise preferences. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1687–1696, New Orleans, Louisiana. Association for Computational Linguistics.
  • Zhaopeng Tu, Zhengdong Lu, Yang Liu, Xiaohua Liu, and Hang Li. 2016. Modeling coverage for neural machine translation. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 76– 85, Berlin, Germany. Association for Computational Linguistics.
  • Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly. 2015. Pointer networks. In Advances in Neural Information Processing Systems, pages 2692–2700.
  • Jiacheng Xu and Greg Durrett. 2019. Neural extractive text summarization with syntactic compression. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3290–3301, Hong Kong, China. Association for Computational Linguistics.
  • Fangfang Zhang, Jin-ge Yao, and Rui Yan. 2018. On the abstractiveness of neural document summarization. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 785–790, Brussels, Belgium. Association for Computational Linguistics.
  • Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. 2019. BERTScore: Evaluating text generation with BERT. arXiv preprint arXiv:1904.09675.
  • Ming Zhong, Pengfei Liu, Yiran Chen, Danqing Wang, Xipeng Qiu, and Xuanjing Huang. 2020. Extractive summarization as text matching. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 6197–6208, Online. Association for Computational Linguistics.
  • Ming Zhong, Pengfei Liu, Danqing Wang, Xipeng Qiu, and Xuanjing Huang. 2019. Searching for effective neural extractive summarization: What works and what’s next. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1049–1058, Florence, Italy. Association for Computational Linguistics.
  • Qingyu Zhou, Nan Yang, Furu Wei, Shaohan Huang, Ming Zhou, and Tiejun Zhao. 2018. Neural document summarization by jointly learning to score and select sentences. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 654–663, Melbourne, Australia. Association for Computational Linguistics.
Authors
Sen Yang
Guangsheng Bao
Jun Xie