Machine Translation Evaluation with BERT Regressor

Abstract

We introduce a metric using BERT (Bidirectional Encoder Representations from Transformers) (Devlin et al., 2019) for automatic machine translation evaluation. The experimental results on the WMT-2017 Metrics Shared Task dataset show that our metric achieves state-of-the-art performance in the segment-level metrics task for all to-English language pairs.

Introduction
  • This study describes a segment-level metric for automatic machine translation evaluation (MTE).
  • In the WMT-2018 Metrics Shared Task (Ma et al., 2018), RUSE was the best segment-level metric for all to-English language pairs.
  • This result indicates that pre-trained sentence embeddings are effective features for the automatic evaluation of machine translation.
  • BERT is pre-trained with two objectives: a “masked language model” (MLM) and “next sentence prediction” (NSP).
Highlights
  • This study describes a segment-level metric for automatic machine translation evaluation (MTE)
  • In our previous study (Shimanaka et al., 2018), we proposed RUSE (Regressor Using Sentence Embeddings), a segment-level MTE metric that uses pre-trained sentence embeddings to capture global information that local features based on character or word N-grams cannot
  • BERT for MTE achieved the best performance in all to-English language pairs
  • We proposed a metric for automatic machine translation evaluation with BERT (a minimal sketch of this setup follows this list)
  • Our segment-level MTE metric with BERT achieved the best performance in the segment-level metrics task on the WMT-2017 dataset for all to-English language pairs
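
The setup described above is sentence-pair regression: the MT hypothesis and its reference translation are encoded together by BERT, and a regression head on top is fine-tuned to predict the DA human score. The paper's code is not reproduced on this page, so the following is only a minimal sketch using the Hugging Face transformers library; the model checkpoint, example sentences, and score are illustrative assumptions.

```python
# Minimal sketch of a BERT-based MTE regressor (not the authors' code):
# encode (hypothesis, reference) as a sentence pair and fine-tune BERT
# with a single regression output that predicts the DA human score.
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=1  # num_labels=1 -> regression (MSE loss)
)

# Hypothetical training instance: (MT hypothesis, reference, DA score).
hyps = ["the cat sat in the mat"]
refs = ["the cat sat on the mat"]
scores = torch.tensor([0.35])  # illustrative standardized DA score

batch = tokenizer(hyps, refs, padding=True, truncation=True, return_tensors="pt")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
out = model(**batch, labels=scores)  # with num_labels=1, loss is MSELoss
out.loss.backward()
optimizer.step()
```
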
Methods
  • The authors performed experiments using the WMT-2017 Metrics Shared Task dataset to verify the performance of BERT for MTE.
  • Table 1 shows the number of instances in the WMT Metrics Shared Task datasets for to-English language pairs used in this study.
  • A total of 5,360 instances from the WMT-2015 and WMT-2016 Metrics Shared Task datasets were divided randomly, with 90% used for training and 10% for development.
  • A total of 3,920 instances (560 per language pair) from the WMT-2017 Metrics Shared Task dataset were used for evaluation.
  • The hyperparameters for fine-tuning BERT were determined through grid search on the development data (a sketch of this protocol follows this list)
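
As a rough illustration of this data protocol, the sketch below performs the random 90/10 split and a grid search on the development set. The instances are synthetic placeholders, and both the candidate grid and the train_and_eval stub are hypothetical, since the page does not reproduce the actual parameter values.

```python
# Sketch of the split-and-search protocol (assumed details marked below).
import itertools
import random

random.seed(0)

# 5,360 instances from the WMT-2015/2016 datasets (synthetic placeholders here).
instances = [{"hyp": f"hyp {i}", "ref": f"ref {i}", "da": random.random()}
             for i in range(5360)]

random.shuffle(instances)
cut = int(len(instances) * 0.9)
train, dev = instances[:cut], instances[cut:]  # 90% train / 10% dev

# Hypothetical candidate grid; the summary does not list the actual values.
grid = {"lr": [2e-5, 3e-5, 5e-5], "batch_size": [16, 32], "epochs": [3, 4]}

def train_and_eval(train, dev, lr, batch_size, epochs):
    """Stand-in for fine-tuning BERT (see the earlier sketch) and returning
    the dev-set Pearson correlation; here it just returns a dummy value."""
    return random.random()

best = max(itertools.product(*grid.values()),
           key=lambda cfg: train_and_eval(train, dev, *cfg))
print("selected (lr, batch_size, epochs):", best)
```
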
Results
  • Table 2 presents the experimental results of the WMT-2017 Metrics Shared Task dataset.
  • In order to analyze the three main points of difference between RUSE and BERT (the pre-training method, the sentence-pair encoding, and the fine-tuning of the pre-trained encoder), the authors conduct an experiment with the following settings.
  • In one setting, the fixed BERT sentence-pair embedding serves as the input of the MLP regressor in Figure 1(b); in this case, the BERT encoder is not fine-tuned (see the sketch after this list)
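
A minimal sketch of this frozen-encoder setting is given below. It is not the authors' code; using the [CLS] vector as the sentence-pair embedding and the MLP layer sizes are assumptions made for illustration.

```python
# Sketch of the "no fine-tuning" ablation: use a frozen BERT encoder's
# [CLS] vector for the (hypothesis, reference) pair as a fixed feature,
# and train only a small MLP regressor on top of it.
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
encoder = BertModel.from_pretrained("bert-base-uncased")
encoder.eval()  # frozen: the encoder receives no gradient updates

mlp = nn.Sequential(nn.Linear(768, 256), nn.Tanh(), nn.Linear(256, 1))

batch = tokenizer(["the cat sat in the mat"], ["the cat sat on the mat"],
                  padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():                               # encoder is not fine-tuned
    cls = encoder(**batch).last_hidden_state[:, 0]  # [CLS] embedding

pred = mlp(cls)                                     # only the MLP is trained
loss = nn.functional.mse_loss(pred, torch.tensor([[0.35]]))
loss.backward()  # gradients reach the MLP parameters only
```
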
Conclusion
  • The authors proposed a metric for automatic machine translation evaluation with BERT.
  • Their segment-level MTE metric with BERT achieved the best performance in the segment-level metrics task on the WMT-2017 dataset for all to-English language pairs.
  • An analysis comparing BERT with RUSE, the previous work, shows that all three differences (the pre-training method, the sentence-pair encoding, and the fine-tuning of the pre-trained encoder) contributed to the performance improvement
Tables
  • Table 1: Number of segment-level DA human evaluation instances for to-English language pairs in the WMT-2015 (Stanojević et al., 2015), WMT-2016 (Bojar et al., 2016), and WMT-2017 (Bojar et al., 2017) Metrics Shared Tasks
  • Table 2: Segment-level Pearson correlation of metric scores and DA human evaluation scores for to-English language pairs in the WMT-2017 Metrics Shared Task
  • Table 3: Comparison of RUSE and BERT in the WMT-2017 Metrics Shared Task (segment-level, to-English language pairs)
Related work
  • In this section, we describe the MTE metrics that achieved the best performance in the WMT-2017 (Bojar et al., 2017) and WMT-2018 (Ma et al., 2018) Metrics Shared Tasks. In this task, we use direct assessment (DA) datasets of human evaluation data. DA datasets provide absolute quality scores for hypotheses by measuring to what extent a hypothesis adequately expresses the meaning of the reference translation. Each metric estimates a quality score with the translation and reference sentence pair as input, and is evaluated by its Pearson correlation with human evaluation (a small example follows this paragraph). In this paper, we discuss the segment-level metrics task for to-English language pairs.
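
As a concrete example of this evaluation protocol, the segment-level Pearson correlation between a metric's scores and the DA human scores can be computed as follows (the numbers are made up, not results from the paper):

```python
# Segment-level evaluation: Pearson correlation between metric scores
# and DA human scores (illustrative values only).
from scipy.stats import pearsonr

metric_scores = [0.81, 0.42, 0.65, 0.23, 0.90]  # hypothetical metric output
human_da      = [0.75, 0.38, 0.70, 0.20, 0.88]  # hypothetical DA scores

r, _ = pearsonr(metric_scores, human_da)
print(f"segment-level Pearson r = {r:.3f}")
```
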

    Figure 1: (a) MTE with RUSE. (b) MTE with BERT.

    2.1 Blend: a metric based on local features

    Blend, which achieved the best performance in WMT-2017, is an ensemble metric that incorporates 25 lexical metrics provided by the Asiya MT evaluation toolkit, as well as four other metrics. Although Blend uses many features, it relies only on local information, such as character-based edit distances and features based on word N-grams, which cannot consider the whole sentence simultaneously.
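
To make “local information” concrete, here is a toy sketch (not Blend's actual feature set) of two such features; each inspects only characters or adjacent word pairs, so neither can take the whole sentence into account at once:

```python
# Toy examples of local features: character edit distance and word
# bigram precision between a hypothesis and a reference.
def edit_distance(a: str, b: str) -> int:
    """Character-level Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def bigram_precision(hyp: str, ref: str) -> float:
    """Fraction of hypothesis word bigrams that also occur in the reference."""
    h, r = hyp.split(), ref.split()
    hb, rb = list(zip(h, h[1:])), set(zip(r, r[1:]))
    return sum(b in rb for b in hb) / max(len(hb), 1)

print(edit_distance("the cat sat in the mat", "the cat sat on the mat"))     # 1
print(bigram_precision("the cat sat in the mat", "the cat sat on the mat"))  # 0.6
```
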
Funding
  • Part of this research was funded by a JSPS Grant-in-Aid for Scientific Research (Grant-in-Aid for Research Activity Start-up, Grant Number 18H06465)
References
  • Ondřej Bojar, Yvette Graham, and Amir Kamran. 2017. Results of the WMT17 Metrics Shared Task. In Proceedings of the Second Conference on Machine Translation, pages 489–513.
  • Ondřej Bojar, Yvette Graham, Amir Kamran, and Miloš Stanojević. 2016. Results of the WMT16 Metrics Shared Task. In Proceedings of the First Conference on Machine Translation, pages 199–231.
  • Alexis Conneau, Douwe Kiela, Holger Schwenk, Loïc Barrault, and Antoine Bordes. 2017. Supervised Learning of Universal Sentence Representations from Natural Language Inference Data. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 670–680.
  • Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186.
  • Rohit Gupta, Constantin Orasan, and Josef van Genabith. 2015. ReVal: A Simple and Effective Machine Translation Evaluation Metric Based on Recurrent Neural Networks. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1066–1072.
  • Lushan Han, Abhay L. Kashyap, Tim Finin, James Mayfield, and Jonathan Weese. 2013. UMBC_EBIQUITY-CORE: Semantic Textual Similarity Systems. In Second Joint Conference on Lexical and Computational Semantics, Volume 1: Proceedings of the Main Conference and the Shared Task: Semantic Textual Similarity, pages 44–52.
  • Lajanugen Logeswaran and Honglak Lee. 2018. An Efficient Framework for Learning Sentence Representations. In International Conference on Learning Representations, pages 1–16.
  • Qingsong Ma, Ondřej Bojar, and Yvette Graham. 2018. Results of the WMT18 Metrics Shared Task: Both Characters and Embeddings Achieve Good Performance. In Proceedings of the Third Conference on Machine Translation, Volume 2: Shared Task Papers, pages 682–701.
  • Qingsong Ma, Yvette Graham, Shugen Wang, and Qun Liu. 2017. Blend: a Novel Combined MT Metric Based on Direct Assessment - CASICT-DCU submission to WMT17 Metrics Task. In Proceedings of the Second Conference on Machine Translation, pages 598–603.
  • Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global Vectors for Word Representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, pages 1532–1543.
  • Hiroki Shimanaka, Tomoyuki Kajiwara, and Mamoru Komachi. 2018. RUSE: Regressor Using Sentence Embeddings for Automatic Machine Translation Evaluation. In Proceedings of the Third Conference on Machine Translation, Volume 2: Shared Task Papers, pages 764–771.
  • Miloš Stanojević, Philipp Koehn, and Ondřej Bojar. 2015. Results of the WMT15 Metrics Shared Task. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 256–273.
  • Yukun Zhu, Ryan Kiros, Richard S. Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Aligning Books and Movies: Towards Story-Like Visual Explanations by Watching Movies and Reading Books. In 2015 IEEE International Conference on Computer Vision, pages 19–27.
Author
Hiroki Shimanaka