Re-evaluating Evaluation in Text Summarization

EMNLP 2020, pp. 9347–9359

Abstract

Automated evaluation metrics as a stand-in for manual evaluation are an essential part of the development of text-generation tasks such as text summarization. However, while the field has progressed, our standard metrics have not – for nearly 20 years ROUGE has been the standard evaluation in most summarization papers. In this paper, we m...

Introduction
  • Manual evaluation, as exemplified by the Pyramid method (Nenkova and Passonneau, 2004), is the gold-standard in evaluation.
  • Due to the time required and the relatively high cost of annotation, the great majority of research papers on summarization use exclusively automatic evaluation metrics, such as ROUGE (Lin, 2004), JS-2 (Louis and Nenkova, 2013), S3 (Peyrard et al., 2017), BERTScore (Zhang et al., 2020), MoverScore (Zhao et al., 2019), etc.
  • Among these metrics, ROUGE is by far the most popular, and there is relatively little discussion of how ROUGE may deviate from human judgment, or of the potential for this deviation to change conclusions drawn about the relative merit of baseline and proposed methods.
  • Two earlier works exemplify this disconnect: (1) Peyrard (2019) observed that the human-annotated summaries in the TAC dataset are mostly of lower quality than those produced by modern systems, and that various automated evaluation metrics strongly disagree in the higher-scoring range in which current systems operate. (2) Rankel et al. (2013) observed that the correlation between ROUGE and human judgments in the TAC dataset decreases when looking at the best systems only, even for systems from eight years ago, which are far from today’s state-of-the-art.
Highlights
  • In text summarization, manual evaluation, as exemplified by the Pyramid method (Nenkova and Passonneau, 2004), is the gold-standard in evaluation
  • We find that many of the previously attested properties of metrics found on the TAC dataset demonstrate different trends on our newly collected CNNDM dataset, as shown in Tab. 1
  • We examine eight metrics that measure the agreement between two texts, in our case, between the system summary and reference summary
  • Correlation is calculated for each document, among the different system outputs of that document, and the mean value is reported
  • We focus on shorter, more fine-grained Semantic Content Units (SCUs) that contain at most 2-3 entities
  • Motivated by the central research question, “does the rapid progress of model development in summarization models require us to re-evaluate the evaluation process used for text summarization?”, we use the collected human judgments to meta-evaluate current metrics from four diverse viewpoints, measuring the ability of metrics to: (1) evaluate all systems; (2) evaluate the top-k strongest systems; (3) compare two systems; (4) evaluate individual summaries
Methods
  • Motivated by the central research question, “does the rapid progress of model development in summarization models require us to re-evaluate the evaluation process used for text summarization?”, the authors use the collected human judgments to meta-evaluate current metrics from four diverse viewpoints, measuring the ability of metrics to: (1) evaluate all systems; (2) evaluate the top-k strongest systems; (3) compare two systems; (4) evaluate individual summaries.
  • In meta-evaluation studies, calculating the correlation of automatic metrics with human judgments at the system level is a commonly-used setting (Novikova et al., 2017; Bojar et al., 2016; Graham, 2015).
  • The authors follow this setting and ask, among other questions: can metrics reliably compare different systems? (A minimal sketch of such a pairwise comparison follows below.)
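As a rough illustration of this pairwise comparison setting, the sketch below checks, for every pair of systems, whether a metric and human judgments agree on which system is better when scores are averaged over documents. The array shapes, function name, and variable names are illustrative assumptions, not taken from the paper's released code.

```python
import itertools
import numpy as np

def pairwise_agreement(human: np.ndarray, metric: np.ndarray) -> float:
    """Fraction of system pairs on which the metric ranks the two systems the same
    way as human judgments do (system level, i.e. scores averaged over documents).
    Both arrays have shape (n_documents, n_systems); the names are illustrative."""
    human_sys = human.mean(axis=0)    # system-level human score
    metric_sys = metric.mean(axis=0)  # system-level metric score
    pairs = list(itertools.combinations(range(human_sys.size), 2))
    agree = sum(
        np.sign(human_sys[a] - human_sys[b]) == np.sign(metric_sys[a] - metric_sys[b])
        for a, b in pairs
    )
    return agree / len(pairs)
```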
Results
  • Evaluation Metrics

    The authors examine eight metrics that measure the agreement between two texts, in this case between the system summary and the reference summary.
  • BERTScore (BScore) measures soft overlap between contextual BERT embeddings of tokens in the two texts (Zhang et al., 2020).
  • MoverScore (MScore) applies a distance measure to contextualized BERT and ELMo word embeddings (Zhao et al., 2019).
  • Sentence Mover Similarity (SMS) applies minimum-distance matching between texts based on sentence embeddings (Clark et al., 2019).
  • Word Mover Similarity (WMS) measures similarity using minimum-distance matching between texts represented as bags of word embeddings (Kusner et al., 2015).
  • The authors use the recall variant of all metrics except MScore, which has no specific recall variant (an illustrative computation of recall variants follows this list).
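To make the recall variants concrete, here is a minimal sketch using the third-party rouge-score and bert-score packages; treating these packages as stand-ins is an assumption for illustration, and they are not necessarily the exact implementations or model checkpoints used in the paper.

```python
# pip install rouge-score bert-score   (third-party packages, used here for illustration)
from rouge_score import rouge_scorer
from bert_score import score as bert_score

reference = "Police arrested five anti-nuclear protesters on Thursday."      # toy example
system = "Five protesters were arrested by the police on Thursday."

# ROUGE: keep only the recall component of each score.
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
rouge_recall = {name: s.recall for name, s in scorer.score(reference, system).items()}

# BERTScore: the package returns precision, recall, and F1; keep the recall tensor.
P, R, F1 = bert_score([system], [reference], lang="en")

print(rouge_recall, float(R[0]))
```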
Conclusion
  • Summary-level correlation is calculated as $K_{\mathrm{summ}}^{m_1 m_2} = \frac{1}{n}\sum_{i=1}^{n} K\big([m_1(s_i^1), \ldots, m_1(s_i^N)],\, [m_2(s_i^1), \ldots, m_2(s_i^N)]\big)$, where $s_i^j$ is the summary of document $i$ produced by system $j$, $m_1$ and $m_2$ are two scoring methods (e.g., an automatic metric and human judgment), and $K$ is a correlation measure.
  • Correlation is calculated for each document, among the different system outputs of that document, and the mean value is reported.
  • System-level correlation is calculated as $K_{\mathrm{sys}}^{m_1 m_2} = K\big(\big[\frac{1}{n}\sum_{i=1}^{n} m_1(s_i^1), \ldots, \frac{1}{n}\sum_{i=1}^{n} m_1(s_i^N)\big],\, \big[\frac{1}{n}\sum_{i=1}^{n} m_2(s_i^1), \ldots, \frac{1}{n}\sum_{i=1}^{n} m_2(s_i^N)\big]\big)$, i.e., the correlation between scores that are first averaged over all documents (a sketch of both computations follows).
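Below is a minimal sketch of the two correlation settings, assuming score matrices of shape (n_documents, n_systems) for a metric and for human judgments; Pearson's r stands in for the correlation measure K, and all function and variable names are illustrative rather than taken from the paper's code.

```python
import numpy as np
from scipy.stats import pearsonr  # K could equally be Spearman's rho or Kendall's tau

def summary_level_correlation(metric: np.ndarray, human: np.ndarray) -> float:
    """Correlation across systems, computed per document and averaged over documents."""
    return float(np.mean([pearsonr(metric[i], human[i])[0] for i in range(metric.shape[0])]))

def system_level_correlation(metric: np.ndarray, human: np.ndarray) -> float:
    """Correlation between document-averaged (system-level) scores."""
    return float(pearsonr(metric.mean(axis=0), human.mean(axis=0))[0])
```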
Tables
  • Table 1: Summary of our experiments, observations on existing human judgments on the TAC dataset, and contrasting observations on newly obtained human judgments on the CNNDM dataset. Please refer to Sec. 4 for more details
  • Table 2: Example of a summary and corresponding annotation. (a) shows a reference summary from the representative sample of the CNNDM test set. (b) shows the corresponding system summary generated by BART, one of the abstractive systems used in the study. (c) shows the SCUs (Semantic Content Units) extracted from (a) and the “Present (✓)”/“Not Present (✗)” labels marked by crowd workers when evaluating (b)
Related Work
  • This work is connected to the following threads of research in text summarization. Human judgment collection: despite many approaches to the acquisition of human judgments (Chaganty et al., 2018; Nenkova and Passonneau, 2004; Shapira et al., 2019; Fan et al., 2018), Pyramid (Nenkova and Passonneau, 2004) has been the mainstream method for meta-evaluating automatic metrics. Specifically, Pyramid provides a robust technique for evaluating content selection by exhaustively obtaining a set of Semantic Content Units (SCUs) from a set of references, and then scoring system summaries on how many SCUs can be inferred from them. Recently, Shapira et al. (2019) proposed a lightweight and crowdsourceable version of the original Pyramid, and demonstrated it on the DUC 2005 (Dang, 2005) and 2006 (Dang, 2006) multi-document summarization datasets. In this paper, our human evaluation methodology is based on the Pyramid (Nenkova and Passonneau, 2004) and LitePyramids (Shapira et al., 2019) techniques. Chaganty et al. (2018) also obtained human evaluations of system summaries on the CNNDM dataset, but with a focus on the language quality of summaries. In comparison, our work focuses on evaluating content selection and covers more systems than their study (11 extractive + 14 abstractive vs. 4 abstractive).
Funding
  • We calculate the weighted macro F1 score for all metrics and report them in Fig. 4
Study Subjects and Analysis
documents: 100
3.2 Representative Sample Selection. Since collecting human annotations is costly, we sample 100 documents from the CNNDM test set (11,490 samples) and evaluate the system-generated summaries of these 100 documents. We aim to include documents of varying difficulty in the representative sample

documents: 4
As a proxy for the difficulty of summarizing a document, we use the mean score received by the system-generated summaries for that document. Based on this, we partition the CNNDM test set into 5 equal-sized bins and sample 4 documents from each bin. We repeat this process for 5 metrics (BERTScore, MoverScore, R-1, R-2, R-L), obtaining a sample of 100 documents

documents: 100
Based on this, we partition the CNNDM test set into 5 equal-sized bins and sample 4 documents from each bin. We repeat this process for 5 metrics (BERTScore, MoverScore, R-1, R-2, R-L), obtaining a sample of 100 documents. This methodology is detailed in Alg. 1 of the paper; a simplified sketch of the binning step follows
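Here is a minimal sketch of the binning step for a single metric, assuming doc_scores maps each document id to the mean metric score of its system summaries; the function and parameter names are illustrative and not taken from the paper's Alg. 1.

```python
import numpy as np

def sample_per_metric(doc_scores: dict, n_bins: int = 5, per_bin: int = 4, seed: int = 0):
    """Partition documents into equal-sized bins by mean system score (a proxy
    for difficulty) and draw `per_bin` documents from each bin."""
    rng = np.random.default_rng(seed)
    docs = sorted(doc_scores, key=doc_scores.get)   # order documents by the difficulty proxy
    bins = np.array_split(np.array(docs), n_bins)   # 5 roughly equal-sized bins
    return [doc for b in bins for doc in rng.choice(b, per_bin, replace=False)]

# Repeating this for the 5 metrics (BERTScore, MoverScore, R-1, R-2, R-L)
# and pooling the draws yields the 100-document representative sample described above.
```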

crowd workers: 4
Tab. 2 depicts an example reference summary, system summary, SCUs extracted from the reference summary, and the annotations obtained when evaluating the system summary. Annotation scoring: for robustness (Shapira et al., 2019), each system summary is evaluated by 4 crowd workers. Each worker annotates up to 16 SCUs by marking an SCU “present” if it can be

documents: 100
inferred from the system summary, or “not present” otherwise. We obtain a total of 10,000 human annotations (100 documents × 25 systems × 4 workers). For each document, we identify a “noisy” worker as the one who disagrees with the majority (i.e., marks an SCU as “present” when the majority thinks “not present”, or vice versa) on the largest number of SCUs
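Here is a minimal sketch of the majority-based scoring idea, simplified to a single system summary; the boolean annotation matrix layout and the step of dropping the noisiest worker before averaging are assumptions made for illustration.

```python
import numpy as np

def scu_score(annotations: np.ndarray) -> float:
    """annotations: boolean matrix of shape (n_workers, n_scus), True where a worker
    marked an SCU as "present" in the system summary (illustrative layout)."""
    majority = annotations.mean(axis=0) > 0.5               # majority label for each SCU
    disagreements = (annotations != majority).sum(axis=1)   # per-worker disagreement count
    kept = np.delete(annotations, disagreements.argmax(), axis=0)  # drop the noisiest worker
    return float(kept.mean())   # fraction of remaining marks that are "present"

# Example: 4 workers x 5 SCUs; the last worker disagrees with the majority most often.
ann = np.array([[1, 1, 0, 0, 1],
                [1, 1, 0, 0, 1],
                [1, 0, 0, 0, 1],
                [0, 0, 1, 1, 0]], dtype=bool)
print(scu_score(ann))
```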

References
  • Ondrej Bojar, Yvette Graham, Amir Kamran, and Milos Stanojevic. 2016. Results of the WMT16 metrics shared task. In Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers, pages 199–231, Berlin, Germany. Association for Computational Linguistics.
  • Florian Bohm, Yang Gao, Christian M. Meyer, Ori Shapira, Ido Dagan, and Iryna Gurevych. 2019. Better rewards yield better summaries: Learning to summarise without references.
  • Arun Tejasvi Chaganty, Stephen Mussman, and Percy Liang. 2018. The price of debiasing automatic metrics in natural language evaluation.
  • Yen-Chun Chen and Mohit Bansal. 2018. Fast abstractive summarization with reinforce-selected sentence rewriting. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 675–686, Melbourne, Australia. Association for Computational Linguistics.
  • Elizabeth Clark, Asli Celikyilmaz, and Noah A Smith. 2019. Sentence mover’s similarity: Automatic evaluation for multi-sentence texts. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2748–2760.
  • Hoa Dang and Karolina Owczarzak. 2008. Overview of the TAC 2008 update summarization task. In Proceedings of the First Text Analysis Conference (TAC 2008), pages 1–16.
  • Hoa Dang and Karolina Owczarzak. 2009. Overview of the TAC 2009 summarization track. In Proceedings of the First Text Analysis Conference (TAC 2009), pages 1–16.
  • Hoa Trang Dang. 2005. Overview of DUC 2005. In Proceedings of the Document Understanding Conference Workshop 2005 (DUC 2005) at the Human Language Technology Conference / Conference on Empirical Methods in Natural Language Processing (HLT/EMNLP).
  • Hoa Trang Dang. 2006. Overview of DUC 2006. In Proceedings of HLT-NAACL 2006.
  • Li Dong, Nan Yang, Wenhui Wang, Furu Wei, Xiaodong Liu, Yu Wang, Jianfeng Gao, Ming Zhou, and Hsiao-Wuen Hon. 2019. Unified language model pre-training for natural language understanding and generation. In Advances in Neural Information Processing Systems, pages 13042–13054.
  • Yue Dong, Yikang Shen, Eric Crawford, Herke van Hoof, and Jackie Chi Kit Cheung. 2018. BanditSum: Extractive summarization as a contextual bandit. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3739–3748, Brussels, Belgium. Association for Computational Linguistics.
  • Rotem Dror, Gili Baumer, Segev Shlomov, and Roi Reichart. 2018. The hitchhiker’s guide to testing statistical significance in natural language processing. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1383–1392, Melbourne, Australia. Association for Computational Linguistics.
  • Angela Fan, David Grangier, and Michael Auli. 2018. Controllable abstractive summarization. In Proceedings of the 2nd Workshop on Neural Machine Translation and Generation, pages 45–54, Melbourne, Australia. Association for Computational Linguistics.
  • Sebastian Gehrmann, Yuntian Deng, and Alexander Rush. 2018. Bottom-up abstractive summarization. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4098–4109.
  • Cyril Goutte and Eric Gaussier. 2005. A probabilistic interpretation of precision, recall and f-score, with implication for evaluation. In European Conference on Information Retrieval, pages 345–359. Springer.
  • Yvette Graham. 2015. Re-evaluating automatic summarization with BLEU and 192 shades of ROUGE. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 128–137, Lisbon, Portugal. Association for Computational Linguistics.
  • Yvette Graham and Timothy Baldwin. 2014. Testing for significance of increased correlation with human judgment. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 172–176, Doha, Qatar. Association for Computational Linguistics.
  • Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching machines to read and comprehend. In Advances in Neural Information Processing Systems, pages 1684–1692.
  • Chris Kedzie, Kathleen McKeown, and Hal Daume III. 2018. Content selection in deep learning models of summarization. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1818–1828.
  • Philipp Koehn. 2004. Statistical significance tests for machine translation evaluation. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, pages 388– 395, Barcelona, Spain. Association for Computational Linguistics.
  • Klaus Krippendorff. 2011. Computing krippendorff’s alpha-reliability.
  • Matt Kusner, Yu Sun, Nicholas Kolkin, and Kilian Weinberger. 2015. From word embeddings to document distances. In International conference on machine learning, pages 957–966.
  • Joseph Lee Rodgers and W. Alan Nicewander. 1988. Thirteen ways to look at the correlation coefficient. The American Statistician, 42(1):59–66.
  • Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, and Luke Zettlemoyer. 2019. Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461.
  • Chin-Yew Lin. 2004. Rouge: A package for automatic evaluation of summaries. Text Summarization Branches Out.
  • Chin-Yew Lin, Guihong Cao, Jianfeng Gao, and JianYun Nie. 2006. An information-theoretic approach to automatic evaluation of summaries. In Proceedings of the Human Language Technology Conference of the NAACL, Main Conference, pages 463– 470, New York City, USA. Association for Computational Linguistics.
  • Chin-Yew Lin and Franz Josef Och. 2004. ORANGE: a method for evaluating automatic evaluation metrics for machine translation. In COLING 2004: Proceedings of the 20th International Conference on Computational Linguistics, pages 501–507, Geneva, Switzerland. COLING.
  • Yang Liu and Mirella Lapata. 2019a. Text summarization with pretrained encoders. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3730–3740, Hong Kong, China. Association for Computational Linguistics.
  • Yang Liu and Mirella Lapata. 2019b. Text summarization with pretrained encoders.
  • Annie Louis and Ani Nenkova. 2013. Automatically assessing machine summary content without a gold standard. Computational Linguistics, 39(2):267– 300.
  • Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In Association for Computational Linguistics (ACL) System Demonstrations, pages 55–60.
  • Ramesh Nallapati, Bowen Zhou, Cicero dos Santos, Caglar Gulcehre, and Bing Xiang. 2016. Abstractive text summarization using sequence-to-sequence RNNs and beyond. CoNLL 2016, page 280.
  • Shashi Narayan, Shay B. Cohen, and Mirella Lapata. 2018. Ranking sentences for extractive summarization with reinforcement learning. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1747–1759, New Orleans, Louisiana. Association for Computational Linguistics.
  • Ani Nenkova and Rebecca Passonneau. 2004. Evaluating content selection in summarization: The pyramid method. In Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics: HLT-NAACL 2004, pages 145–152, Boston, Massachusetts, USA. Association for Computational Linguistics.
  • Jun-Ping Ng and Viktoria Abrecht. 2015. Better summarization evaluation with word embeddings for ROUGE. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1925–1930, Lisbon, Portugal. Association for Computational Linguistics.
  • Jekaterina Novikova, Ondrej Dusek, Amanda Cercas Curry, and Verena Rieser. 2017. Why we need new evaluation metrics for NLG. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2241–2252, Copenhagen, Denmark. Association for Computational Linguistics.
  • Maxime Peyrard. 2019. Studying summarization evaluation metrics in the appropriate scoring range. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5093–5100, Florence, Italy. Association for Computational Linguistics.
  • Maxime Peyrard, Teresa Botschen, and Iryna Gurevych. 2017. Learning to score system summaries for better content selection evaluation. In Proceedings of the Workshop on New Frontiers in Summarization, pages 74–84, Copenhagen, Denmark. Association for Computational Linguistics.
  • Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2019. Exploring the limits of transfer learning with a unified text-to-text transformer.
  • Peter A. Rankel, John M. Conroy, Hoa Trang Dang, and Ani Nenkova. 2013. A decade of automatic content evaluation of news summaries: Reassessing the state of the art. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 131–136, Sofia, Bulgaria. Association for Computational Linguistics.
  • Abigail See, Peter J. Liu, and Christopher D. Manning. 2017. Get to the point: Summarization with pointergenerator networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1073– 1083, Vancouver, Canada. Association for Computational Linguistics.
  • Thibault Sellam, Dipanjan Das, and Ankur P Parikh. 2020. Bleurt: Learning robust metrics for text generation. arXiv preprint arXiv:2004.04696.
  • Ori Shapira, David Gabay, Yang Gao, Hadar Ronen, Ramakanth Pasunuru, Mohit Bansal, Yael Amsterdamer, and Ido Dagan. 2019. Crowdsourcing lightweight pyramids for manual summary evaluation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 682– 687, Minneapolis, Minnesota. Association for Computational Linguistics.
  • Wei Zhao, Maxime Peyrard, Fei Liu, Yang Gao, Christian M. Meyer, and Steffen Eger. 2019. MoverScore: Text generation evaluating with contextualized embeddings and earth mover distance. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 563–578, Hong Kong, China. Association for Computational Linguistics.
  • Pauli Virtanen, Ralf Gommers, Travis E. Oliphant, et al. 2020. SciPy 1.0: Fundamental Algorithms for Scientific Computing in Python. Nature Methods, 17:261–272.
  • Danqing Wang, Pengfei Liu, Yining Zheng, Xipeng Qiu, and Xuanjing Huang. 2020. Heterogeneous graph neural networks for extractive document summarization. arXiv preprint arXiv:2004.12393.
  • Ming Zhong, Pengfei Liu, Yiran Chen, Danqing Wang, Xipeng Qiu, and Xuanjing Huang. 2020. Extractive summarization as text matching. arXiv preprint arXiv:2004.08795.
  • Ming Zhong, Pengfei Liu, Danqing Wang, Xipeng Qiu, and Xuan-Jing Huang. 2019. Searching for effective neural extractive summarization: What works and what’s next. In Proceedings of the 57th Conference of the Association for Computational Linguistics, pages 1049–1058.
  • Qingyu Zhou, Nan Yang, Furu Wei, Shaohan Huang, Ming Zhou, and Tiejun Zhao. 2018. Neural document summarization by jointly learning to score and select sentences. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 654– 663, Melbourne, Australia. Association for Computational Linguistics.
  • Evan J. Williams. 1959. Regression Analysis, volume 14. Wiley, New York.
  • Wonjin Yoon, Yoon Sun Yeo, Minbyul Jeong, BongJun Yi, and Jaewoo Kang. 2020. Learning by semantic similarity makes abstractive summarization better. arXiv preprint arXiv:2002.07767.
  • Haoyu Zhang, Yeyun Gong, Yu Yan, Nan Duan, Jianjun Xu, Ji Wang, Ming Gong, and Ming Zhou. 2019a. Pretraining-based natural language generation for text summarization. arXiv preprint arXiv:1902.09243.
  • Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. 2020. Bertscore: Evaluating text generation with bert. In International Conference on Learning Representations.
  • Xingxing Zhang, Mirella Lapata, Furu Wei, and Ming Zhou. 2018. Neural latent extractive document summarization. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 779–784, Brussels, Belgium. Association for Computational Linguistics.
  • Xingxing Zhang, Furu Wei, and Ming Zhou. 2019b. HIBERT: Document level pre-training of hierarchical bidirectional transformers for document summarization. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5059–5069, Florence, Italy. Association for Computational Linguistics.
Authors
Manik Bhandari
Pranav Narayan Gour
Atabak Ashfaq
Pengfei Liu