COMET: A Neural Framework for MT Evaluation

EMNLP 2020


Abstract

We present COMET, a neural framework for training multilingual machine translation evaluation models which obtains new state-of-the-art levels of correlation with human judgements. Our framework leverages recent breakthroughs in cross-lingual pretrained language modeling resulting in highly multilingual and adaptable MT evaluation models...

Introduction
  • Metrics for evaluating the quality of machine translation (MT) have relied on assessing the similarity between an MT-generated hypothesis and a human-generated reference translation in the target language.
  • Modern neural approaches to MT produce much higher translation quality, which often deviates from monotonic lexical transfer between languages
  • For this reason, it has become increasingly evident that the authors can no longer rely on metrics such as BLEU to provide an accurate estimate of the quality of MT (Barrault et al., 2019).
  • The Metrics Shared Task of that year (WMT 2019) saw only 24 submissions, almost half of which were entrants to the Quality Estimation Shared Task, adapted as metrics (Ma et al., 2019)
Highlights
  • Metrics for evaluating the quality of machine translation (MT) have relied on assessing the similarity between an MT-generated hypothesis and a human-generated reference translation in the target language
  • We present COMET, a PyTorch-based framework for training highly multilingual and adaptable MT evaluation models that can function as metrics
  • To evaluate whether our new MT evaluation models better address this issue, we followed the evaluation setup used in the analysis presented in (Ma et al., 2019), where correlation levels are examined on portions of the DARR data (direct assessments mapped into relative rankings) that include only the top 10, 8, 6 and 4 MT systems (a sketch of this Kendall's Tau computation follows this list)
  • Even though the MQM Estimator is trained on only 12K annotated segments, it performs roughly on par with the Human-mediated Translation Edit Rate (HTER) Estimator for most language-pairs, and outperforms all the other metrics in en-ru
  • In this paper we present COMET, a novel neural framework for training MT evaluation models that can serve as automatic metrics and be adapted and optimized to different types of human judgements of MT quality
  • We trained three distinct models which achieve new state-of-the-art results for segment-level correlation with human judgments, and show promising ability to better differentiate high-performing systems
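The segment-level evaluation referenced above uses the WMT Kendall's Tau-like formulation over DARR relative-ranking pairs. Below is a minimal sketch of that computation, assuming the metric scores and the human relative rankings are available as plain Python structures; the names (darr_kendall_tau, metric_scores, darr_pairs) are illustrative and not part of the COMET codebase, and ties are counted against the metric here.

```python
def darr_kendall_tau(metric_scores, darr_pairs):
    """Kendall's Tau-like correlation used in the WMT DARR evaluation.

    metric_scores: dict mapping (segment_id, system) -> metric score.
    darr_pairs: iterable of (segment_id, better_system, worse_system)
                tuples derived from human Direct Assessments.
    """
    concordant = discordant = 0
    for seg, better, worse in darr_pairs:
        if metric_scores[(seg, better)] > metric_scores[(seg, worse)]:
            concordant += 1
        else:  # ties and reversals both count against the metric
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)
```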
Methods
  • The authors train two versions of the Estimator model described in section 2.3: one that regresses on HTER (COMET-HTER) trained with the QT21 corpus, and another that regresses on the proprietary implementation of MQM (COMET-MQM) trained with the internal MQM corpus.
  • The authors load the pretrained encoder and initialize both the pooling layer and the feed-forward regressor.
  • The entire model is fine-tuned, but the learning rate for the encoder parameters is set to 1e−5 in order to avoid catastrophic forgetting (a minimal fine-tuning sketch follows this list)
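As a rough illustration of this setup (not the actual COMET implementation), the sketch below loads a pretrained cross-lingual encoder, attaches a freshly initialised feed-forward head, and assigns the encoder a much smaller learning rate. The encoder name, the head sizes, the simplified feature dimension, and the head learning rate are assumptions for the example.

```python
import torch
from torch import nn
from transformers import AutoModel  # stand-in for the XLM-R encoder used in the paper

# Pretrained cross-lingual encoder plus a newly initialised regression head.
encoder = AutoModel.from_pretrained("xlm-roberta-base")
regressor = nn.Sequential(
    nn.Linear(4 * encoder.config.hidden_size, 2048),  # simplified feature size
    nn.Tanh(),
    nn.Linear(2048, 1),
)

# Fine-tune everything, but keep the encoder learning rate at 1e-5 to avoid
# catastrophic forgetting; the new head can use a larger (illustrative) rate.
optimizer = torch.optim.Adam([
    {"params": encoder.parameters(), "lr": 1e-5},
    {"params": regressor.parameters(), "lr": 3e-5},
])
```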
Results
  • 5.1 From English into X

    Table 1 shows results for all eight language pairs with English as source.
  • The authors contrast the three COMET models against baseline metrics such as BLEU and CHRF, the 2019 task winning metric YISI-1, as well as the more recent BERTSCORE.
  • For the seven into-English language pairs (Table 2), the authors contrast the three COMET models against baseline metrics such as BLEU and CHRF, the 2019 task winning metric YISI-1, as well as the recently published metrics BERTSCORE and BLEURT.
  • As in Table 1, the DARR model shows strong correlations with human judgements, outperforming the recently proposed English-specific BLEURT metric in five out of seven language pairs.
  • While the encoder used in the trained models is highly multilingual, the authors hypothesise that this powerful “zero-shot” result is due to the inclusion of the source in the models
Conclusion
  • In this paper the authors present COMET, a novel neural framework for training MT evaluation models that can serve as automatic metrics and be adapted and optimized to different types of human judgements of MT quality.
  • Whilst the authors outline the potential importance of the source text above, they note that the COMET-RANK model weighs source and reference differently during inference but equally in its training loss function (a sketch of such a ranking loss follows this list).
  • Future work will investigate the optimality of this formulation and further examine the interdependence of the different inputs
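Since the summary does not reproduce section 2.4, the following is only a plausible sketch of a translation-ranking objective in the spirit of COMET-RANK: a triplet margin loss that pulls the embedding of the better hypothesis closer to both the source and the reference than that of the worse hypothesis, with the two anchors weighted equally. The margin value and function names are illustrative.

```python
import torch
from torch import nn

triplet = nn.TripletMarginLoss(margin=1.0, p=2)  # margin chosen for illustration

def ranking_loss(src, ref, hyp_better, hyp_worse):
    """src, ref, hyp_*: pooled sentence embeddings of shape (batch, emb_dim).
    Source and reference act as anchors and contribute equally to the loss."""
    return triplet(src, hyp_better, hyp_worse) + triplet(ref, hyp_better, hyp_worse)

# Toy usage with random embeddings.
b, d = 4, 768
loss = ranking_loss(torch.randn(b, d), torch.randn(b, d),
                    torch.randn(b, d), torch.randn(b, d))
print(loss.item())  # real training would backpropagate this into the encoder
```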
Tables
  • Table1: Kendall’s Tau (τ ) correlations on language pairs with English as source for the WMT19 Metrics DARR corpus. For BERTSCORE we report results with the default encoder model for a complete comparison, but also with XLM-RoBERTa (base) for fairness with our models. The values reported for YiSi-1 are taken directly from the shared task paper (Ma et al., 2019)
  • Table2: Kendall’s Tau (τ ) correlations on language pairs with English as a target for the WMT19 Metrics DARR corpus. As for BERTSCORE, for BLEURT we report results for two models: the base model, which is comparable in size with the encoder we used, and the large model, which is twice the size
  • Table3: Kendall’s Tau (τ ) correlations on language pairs not involving English for the WMT19 Metrics DARR corpus
  • Table4: Comparison between COMET-RANK (section 2.4) and a reference-only version thereof on WMT18 data. Both models were trained with WMT17 which means that the reference-only model is never exposed to English during training
  • Table5: Hyper-parameters used in our COMET framework to train the presented models
  • Table6: Statistics for the QT21 corpus
  • Table7: Statistics for the WMT 2017 DARR corpus
  • Table8: Statistics for the WMT 2019 DARR into-English language pairs
  • Table9: Statistics for the WMT 2019 DARR from-English and no-English language pairs
  • Table10: MQM corpus (section 3.3) statistics
  • Table11: Statistics for the WMT 2018 DARR language pairs
  • Table12: Metrics performance over all and the top (10, 8, 6, and 4) MT systems for all from-English language pairs. The color scheme is as follows: COMET-RANK, COMET-HTER, COMET-MQM, BLEU, BERTSCORE
  • Table13: Metrics performance over all and the top (10, 8, 6, and 4) MT systems for all into-English language pairs. The color scheme is as follows: COMET-RANK, COMET-HTER, COMET-MQM, BLEU, BERTSCORE, BLEURT
Related Work
  • Classic MT evaluation metrics are commonly characterized as n-gram matching metrics because, using hand-crafted features, they estimate MT quality by counting the number and fraction of n-grams that appear simultaneously in a candidate translation hypothesis and one or more human references. Metrics such as BLEU (Papineni et al., 2002), METEOR (Lavie and Denkowski, 2009), and CHRF (Popovic, 2015) have been widely studied and improved (Koehn et al., 2007; Popovic, 2017; Denkowski and Lavie, 2011; Guo and Hu, 2019), but, by design, they usually fail to recognize and capture semantic similarity beyond the lexical level (see the illustrative snippet at the end of this section).

    In recent years, word embeddings (Mikolov et al., 2013; Pennington et al., 2014; Peters et al., 2018; Devlin et al., 2019) have emerged as a commonly used alternative to n-gram matching for capturing word-level semantic similarity. Embedding-based metrics like METEOR-VECTOR (Servan et al., 2016), BLEU2VEC (Tattar and Fishel, 2017), YISI-1 (Lo, 2019), MOVERSCORE (Zhao et al., 2019), and BERTSCORE (Zhang et al., 2020) create soft alignments between reference and hypothesis in an embedding space and then compute a score that reflects the semantic similarity between those segments. However, human judgements such as DA and MQM capture much more than just semantic similarity, resulting in a correlation upper bound between human judgements and the scores produced by such metrics.

    Learnable metrics (Shimanaka et al., 2018; Mathur et al., 2019; Shimanaka et al., 2019) attempt to directly optimize the correlation with human judgments, and have recently shown promising results. BLEURT (Sellam et al., 2020), a learnable metric based on BERT (Devlin et al., 2019), claims state-of-the-art performance for the last 3 years of the WMT Metrics Shared Task. Because BLEURT builds on top of English-BERT (Devlin et al., 2019), it can only be used when English is the target language, which limits its applicability. Also, to the best of our knowledge, all the previously proposed learnable metrics have focused on optimizing DA which, due to a scarcity of annotators, can prove inherently noisy (Ma et al., 2019).
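For concreteness, here is a small, purely illustrative snippet computing the two classic n-gram/character-overlap baselines mentioned above with the sacrebleu package; sacrebleu is a common public implementation of BLEU and CHRF and is not part of the COMET framework itself.

```python
# Illustrative only: scoring a hypothesis against a reference with BLEU and chrF.
import sacrebleu

hypotheses = ["The cat sat on the mat."]
references = [["There is a cat on the mat."]]  # one reference stream

bleu = sacrebleu.corpus_bleu(hypotheses, references)
chrf = sacrebleu.corpus_chrf(hypotheses, references)
print(f"BLEU = {bleu.score:.1f}  chrF = {chrf.score:.1f}")
```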
Funding
  • This work was supported in part by the P2020 Program through projects MAIA and Unbabel4EU, supervised by ANI under contract numbers 045909 and 042671, respectively
Study Subjects and Analysis
language pairs: 18
Early experimentation revealed that the value added by the source embedding as extra input features to our regressor was negligible at best. A variation on our HTER estimator model trained with the vector x = [h; s; r; h ⊙ s; h ⊙ r; |h − s|; |h − r|] as input to the feed-forward only succeeded in boosting segment-level performance in 8 of the 18 language pairs outlined in section 5 below, and the average improvement in Kendall's Tau in those settings was +0.0009. As noted in Zhao et al. (2020), while cross-lingual pretrained models are adaptive to multiple languages, the feature space between languages is poorly aligned.
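A minimal sketch of assembling that combined feature vector from pooled sentence embeddings (purely illustrative; the element-wise product ⊙ is written as *):

```python
import torch

def combine_features(h, s, r):
    """h, s, r: pooled hypothesis, source and reference embeddings, shape (batch, dim).
    Returns x = [h; s; r; h*s; h*r; |h-s|; |h-r|] as described above."""
    return torch.cat([h, s, r, h * s, h * r, (h - s).abs(), (h - r).abs()], dim=-1)

# With XLM-R base embeddings (dim = 768) the regressor input has 7 * 768 = 5376 features.
h, s, r = (torch.randn(2, 768) for _ in range(3))
print(combine_features(h, s, r).shape)  # torch.Size([2, 5376])
```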

language pairs: 12
chat messages that were annotated according to the guidelines set out in Burchardt and Lommel (2014). This data contains a total of 12K tuples, covering 12 language pairs from English to: German (en-de), Spanish (en-es), Latin-American Spanish (en-es-latam), French (en-fr), Italian (en-it), Japanese (en-ja), Dutch (en-nl), Portuguese (en-pt), Brazilian Portuguese (en-pt-br), Russian (en-ru), Swedish (en-sv), and Turkish (en-tr). Note that in this corpus English is always seen as the source language, but never as the target language

language pairs with English as source: 8
5.1 From English into X. Table 1 shows results for all eight language pairs with English as source. We contrast our three COMET models against baseline metrics such as BLEU and CHRF, the 2019 task winning metric YISI-1, as well as the more recent BERTSCORE

to-English language pairs: 7
5.2 From X into English. Table 2 shows results for the seven to-English language pairs. Again, we contrast our three COMET models against baseline metrics such as BLEU and CHRF, the 2019 task winning metric YISI-1, as well as the recently published metrics BERTSCORE and BLEURT

References
  • Mikel Artetxe and Holger Schwenk. 2019. Massively multilingual sentence embeddings for zeroshot cross-lingual transfer and beyond. Transactions of the Association for Computational Linguistics, 7:597–610.
  • Loic Barrault, Ondrej Bojar, Marta R. Costa-jussa, Christian Federmann, Mark Fishel, Yvette Graham, Barry Haddow, Matthias Huck, Philipp Koehn, Shervin Malmasi, Christof Monz, Mathias Muller, Santanu Pal, Matt Post, and Marcos Zampieri. 2019. Findings of the 2019 conference on machine translation (WMT19). In Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1), pages 1–61, Florence, Italy. Association for Computational Linguistics.
  • Lucia Specia. 2013. Findings of the 2013 Workshop on Statistical Machine Translation. In Proceedings of the Eighth Workshop on Statistical Machine Translation, pages 1–44, Sofia, Bulgaria. Association for Computational Linguistics.
  • Ondrej Bojar, Christian Buck, Christian Federmann, Barry Haddow, Philipp Koehn, Johannes Leveling, Christof Monz, Pavel Pecina, Matt Post, Herve Saint-Amand, Radu Soricut, Lucia Specia, and Ales Tamchyna. 2014. Findings of the 2014 workshop on statistical machine translation. In Proceedings of the Ninth Workshop on Statistical Machine Translation, pages 12–58, Baltimore, Maryland, USA. Association for Computational Linguistics.
  • Ondrej Bojar, Rajen Chatterjee, Christian Federmann, Yvette Graham, Barry Haddow, Shujian Huang, Matthias Huck, Philipp Koehn, Qun Liu, Varvara Logacheva, Christof Monz, Matteo Negri, Matt Post, Raphael Rubino, Lucia Specia, and Marco Turchi. 2017a. Findings of the 2017 conference on machine translation (WMT17). In Proceedings of the Second Conference on Machine Translation, pages 169– 214, Copenhagen, Denmark. Association for Computational Linguistics.
  • Ondrej Bojar, Rajen Chatterjee, Christian Federmann, Yvette Graham, Barry Haddow, Matthias Huck, Antonio Jimeno Yepes, Philipp Koehn, Varvara Logacheva, Christof Monz, Matteo Negri, Aurelie Neveol, Mariana Neves, Martin Popel, Matt Post, Raphael Rubino, Carolina Scarton, Lucia Specia, Marco Turchi, Karin Verspoor, and Marcos Zampieri. 2016. Findings of the 2016 conference on machine translation. In Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers, pages 131–198, Berlin, Germany. Association for Computational Linguistics.
  • Ondrej Bojar, Rajen Chatterjee, Christian Federmann, Barry Haddow, Matthias Huck, Chris Hokamp, Philipp Koehn, Varvara Logacheva, Christof Monz, Matteo Negri, Matt Post, Carolina Scarton, Lucia Specia, and Marco Turchi. 2015. Findings of the 2015 workshop on statistical machine translation. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 1–46, Lisbon, Portugal. Association for Computational Linguistics.
  • Ondrej Bojar, Yvette Graham, and Amir Kamran. 2017b. Results of the WMT17 metrics shared task. In Proceedings of the Second Conference on Machine Translation, pages 489–513, Copenhagen, Denmark. Association for Computational Linguistics.
  • Aljoscha Burchardt and Arle Lommel. 2014. Practical Guidelines for the Use of MQM in Scientific Research on Translation Quality. (access date: 2020-05-26).
  • Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzman, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Unsupervised cross-lingual representation learning at scale. arXiv preprint arXiv:1911.02116.
  • Alexis Conneau and Guillaume Lample. 2019. Crosslingual language model pretraining. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d‘Alche Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32, pages 7059– 7069. Curran Associates, Inc.
  • Jeremy Howard and Sebastian Ruder. 2018. Universal language model fine-tuning for text classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 328–339, Melbourne, Australia. Association for Computational Linguistics.
  • Michael Denkowski and Alon Lavie. 2011. Meteor 1.3: Automatic metric for reliable optimization and evaluation of machine translation systems. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 85–91, Edinburgh, Scotland. Association for Computational Linguistics.
  • Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.
  • WA Falcon. 2019. PyTorch Lightning: The lightweight PyTorch wrapper for high-performance AI research. GitHub.
  • Erick Fonseca, Lisa Yankovskaya, Andre F. T. Martins, Mark Fishel, and Christian Federmann. 2019. Findings of the WMT 2019 shared tasks on quality estimation. In Proceedings of the Fourth Conference on Machine Translation (Volume 3: Shared Task Papers, Day 2), pages 1–10, Florence, Italy. Association for Computational Linguistics.
  • Yvette Graham, Timothy Baldwin, Alistair Moffat, and Justin Zobel. 2013. Continuous measurement scales in human evaluation of machine translation. In Proceedings of the 7th Linguistic Annotation Workshop and Interoperability with Discourse, pages 33–41, Sofia, Bulgaria. Association for Computational Linguistics.
  • Yvette Graham, Timothy Baldwin, Alistair Moffat, and Justin Zobel. 2014. Is machine translation getting better over time? In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, pages 443–451, Gothenburg, Sweden. Association for Computational Linguistics.
  • Yvette Graham, Timothy Baldwin, Alistair Moffat, and Justin Zobel. 2017. Can machine translation systems be evaluated by the crowd alone? Natural Language Engineering, 23(1):3–30.
  • Yinuo Guo and Junfeng Hu. 2019. Meteor++ 2.0: Adopt syntactic level paraphrase knowledge into machine translation evaluation. In Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1), pages 501–506, Florence, Italy. Association for Computational Linguistics.
  • Fabio Kepler, Jonay Trenous, Marcos Treviso, Miguel Vera, Antonio Gois, M. Amin Farajian, Antonio V. Lopes, and Andre F. T. Martins. 2019a. Unbabel’s participation in the WMT19 translation quality estimation shared task. In Proceedings of the Fourth Conference on Machine Translation (Volume 3: Shared Task Papers, Day 2), pages 78–84, Florence, Italy. Association for Computational Linguistics.
  • Fabio Kepler, Jonay Trenous, Marcos Treviso, Miguel Vera, and Andre F. T. Martins. 2019b. OpenKiwi: An open source framework for quality estimation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 117–122, Florence, Italy. Association for Computational Linguistics.
  • Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics Companion Volume Proceedings of the Demo and Poster Sessions, pages 177–180, Prague, Czech Republic. Association for Computational Linguistics.
  • Dan Kondratyuk and Milan Straka. 2019. 75 languages, 1 model: Parsing universal dependencies universally. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLPIJCNLP), pages 2779–2795, Hong Kong, China. Association for Computational Linguistics.
  • Alon Lavie and Michael Denkowski. 2009. The meteor metric for automatic evaluation of machine translation. Machine Translation, 23:105–115.
  • Chi-kiu Lo. 2019. YiSi - a unified semantic MT quality evaluation and estimation metric for languages with different levels of available resources. In Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1), pages 507–513, Florence, Italy. Association for Computational Linguistics.
  • Arle Lommel, Aljoscha Burchardt, and Hans Uszkoreit. 2014. Multidimensional quality metrics (MQM): A framework for declaring and describing translation quality metrics. Tradumàtica, 12:455–463.
  • Qingsong Ma, Ondrej Bojar, and Yvette Graham. 2018. Results of the WMT18 metrics shared task: Both characters and embeddings achieve good performance. In Proceedings of the Third Conference on Machine Translation: Shared Task Papers, pages 671–688, Belgium, Brussels. Association for Computational Linguistics.
  • Qingsong Ma, Johnny Wei, Ondrej Bojar, and Yvette Graham. 2019. Results of the WMT19 metrics shared task: Segment-level and strong MT systems pose big challenges. In Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1), pages 62–90, Florence, Italy. Association for Computational Linguistics.
  • Nitika Mathur, Timothy Baldwin, and Trevor Cohn. 2019. Putting evaluation in context: Contextual embeddings improve machine translation evaluation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2799–2808, Florence, Italy. Association for Computational Linguistics.
  • Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems 26, pages 3111–3119. Curran Associates, Inc.
  • Kishore Papineni, Salim Roukos, Todd Ward, and WeiJing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics.
  • Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. Glove: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, Doha, Qatar. Association for Computational Linguistics.
  • Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237, New Orleans, Louisiana. Association for Computational Linguistics.
  • Maja Popovic. 2015. chrF: character n-gram f-score for automatic MT evaluation. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 392–395, Lisbon, Portugal. Association for Computational Linguistics.
  • Maja Popovic. 2017. chrF++: words helping character n-grams. In Proceedings of the Second Conference on Machine Translation, pages 612–618, Copenhagen, Denmark. Association for Computational Linguistics.
  • Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3982–3992, Hong Kong, China. Association for Computational Linguistics.
  • Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2019. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108.
  • F. Schroff, D. Kalenichenko, and J. Philbin. 2015. Facenet: A unified embedding for face recognition and clustering. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 815–823.
  • Thibault Sellam, Dipanjan Das, and Ankur Parikh. 2020. BLEURT: Learning robust metrics for text generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7881–7892, Online. Association for Computational Linguistics.
  • Christophe Servan, Alexandre Berard, Zied Elloumi, Herve Blanchon, and Laurent Besacier. 2016. Word2Vec vs DBnary: Augmenting METEOR using vector representations or lexical resources? In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 1159–1168, Osaka, Japan. The COLING 2016 Organizing Committee.
  • Hiroki Shimanaka, Tomoyuki Kajiwara, and Mamoru Komachi. 2018. RUSE: Regressor using sentence embeddings for automatic machine translation evaluation. In Proceedings of the Third Conference on Machine Translation: Shared Task Papers, pages 751–758, Belgium, Brussels. Association for Computational Linguistics.
  • Telmo Pires, Eva Schlinger, and Dan Garrette. 2019. How multilingual is multilingual BERT? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4996–5001, Florence, Italy. Association for Computational Linguistics.
  • Hiroki Shimanaka, Tomoyuki Kajiwara, and Mamoru Komachi. 2019. Machine Translation Evaluation with BERT Regressor. arXiv preprint arXiv:1907.12679.
  • Matthew Snover, Bonnie Dorr, Richard Schwartz, Linnea Micciulla, and John Makhoul. 2006. A study of translation edit rate with targeted human annotation. In Proceedings of Association for Machine Translation in the Americas, pages 223–231.
  • Lucia Specia, Frederic Blain, Varvara Logacheva, Ramon Astudillo, and Andre F. T. Martins. 2018. Findings of the WMT 2018 shared task on quality estimation. In Proceedings of the Third Conference on Machine Translation: Shared Task Papers, pages 689–709, Belgium, Brussels. Association for Computational Linguistics.
  • Lucia Specia, Kim Harris, Frederic Blain, Aljoscha Burchardt, Vivien Macketanz, Inguna Skadina, Matteo Negri, and Marco Turchi. 2017. Translation quality and productivity: A study on rich morphology languages. In Machine Translation Summit XVI, pages 55–71, Nagoya, Japan.
  • Kosuke Takahashi, Katsuhito Sudoh, and Satoshi Nakamura. 2020. Automatic machine translation evaluation using source language inputs and cross-lingual language model. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 3553–3558, Online. Association for Computational Linguistics.
  • Andre Tattar and Mark Fishel. 2017. bleu2vec: the painfully familiar metric on continuous vector space steroids. In Proceedings of the Second Conference on Machine Translation, pages 619–622, Copenhagen, Denmark. Association for Computational Linguistics.
  • Ian Tenney, Dipanjan Das, and Ellie Pavlick. 2019. BERT rediscovers the classical NLP pipeline. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4593– 4601, Florence, Italy. Association for Computational Linguistics.
  • Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. 2020. BERTScore: Evaluating text generation with BERT. In International Conference on Learning Representations.
  • Wei Zhao, Goran Glavas, Maxime Peyrard, Yang Gao, Robert West, and Steffen Eger. 2020. On the limitations of cross-lingual encoders as exposed by reference-free machine translation evaluation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1656– 1671, Online. Association for Computational Linguistics.
  • Wei Zhao, Maxime Peyrard, Fei Liu, Yang Gao, Christian M. Meyer, and Steffen Eger. 2019. MoverScore: Text generation evaluating with contextualized embeddings and earth mover distance. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 563–578, Hong Kong, China. Association for Computational Linguistics.
Authors
Ricardo Rei
Craig Stewart
Ana C Farinha
Alon Lavie