
Statistical Power and Translationese in Machine Translation Evaluation

EMNLP 2020


Abstract

The term translationese has been used to describe the presence of unusual features of translated text. In this paper, we provide a detailed analysis of the adverse effects of translationese on machine translation evaluation results. Our analysis shows evidence to support differences in text originally written in a given language relative...

Introduction
  • Human-translated text is thought to display features that deviate to some degree from those of text originally composed in that language. Baker et al (1993) report that translated text can be more explicit than the original source, less ambiguous, and simplified; display a preference for conventional grammaticality; avoid repetition; exaggerate target-language features; and display features of the source language.
  • Table 1 shows results reproduced from the Hassan et al (2018) data set, where the authors report both the number of human judgments collected, N, and the number of distinct test sentences included, n, in addition to separate results for forward- and reverse-created test data.
Highlights
  • Human-translated text is thought to display features that deviate to some degree from those of text originally composed in that language. Baker et al (1993) report that translated text can be more explicit than the original source, less ambiguous, and simplified; display a preference for conventional grammaticality; avoid repetition; exaggerate target-language features; and display features of the source language
  • As mentioned previously in Section 2, past re-evaluations of human-parity claims were hampered by low inter-annotator agreement levels, the use of older human evaluation methodologies than the original, the treatment of TrueSkill clusters as a basis for conclusions about statistical significance, and the lack of a statistical power analysis for the planned sample size, while the original evaluation itself suffered severely from the inclusion of reverse-created data, which we have shown to be problematic, as well as from a very low number of distinct translations included in the evaluation
  • We explore issues relating to the reliability of machine translation evaluations
  • In terms of the legitimacy of machine translation evaluation results, our analysis provides sufficient evidence that translationese is a problem for the evaluation of systems, in particular when comparing system performance with automatic metrics such as BLEU
  • This leads to our first recommendation: future Machine Translation (MT) evaluations should avoid the use of test data that was created via human translation from another language
  • Our analysis includes a statistical power analysis that will be useful as a reference for future MT evaluations, reducing the likelihood of future claims of human parity resulting from statistical ties produced by tests with low statistical power
Results
  • When interpreting the results in Table 1, it is important to remember that the reliability of even the conclusions drawn from forward-created test data alone is still uncertain due to the small n, as only 92 distinct translations were included in the evaluation claiming human parity.
  • The scatter plot in Figure 7 shows relative differences in BLEU scores when the authors change from forward to reverse test data for all pairs of systems participating in WMT-15–WMT-18.
  • The absence of systems in the upper-left and lower-right quadrants shows a similar trend for human evaluation, where relative differences in DA scores for pairs of systems correspond very closely when the authors change from reverse to forward-created test data.
  • The correspondence of relative differences for pairs of systems was extremely close for human evaluation, providing evidence for the validity of conclusions made in past human evaluations of MT that included reverse-created test data.
  • The correspondence between forward and reverse rank correlation of systems according to BLEU varies considerably across different evaluation test sets, from as low as a τ of 0.2, where BLEU score rankings are extremely different depending on test data creation direction, up to a τ of 1.0, where rank correlation is identical.
  • The authors' analysis of differences in both BLEU and human evaluation scores reveals differences in system rankings when tested on reverse- and forward-created test data, differences that are substantial in some cases (a minimal correlation sketch follows this list).
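
For readers who want to reproduce this kind of ranking comparison (as summarized in Tables 5 and 6), the following is a minimal sketch, not the authors' code: it correlates forward and reverse scores for the same set of systems using Pearson's r, Spearman's ρ and Kendall's τ. The system names and score values are hypothetical placeholders.

```python
# Minimal sketch (not the authors' code): correlate scores obtained on
# forward-created vs. reverse-created test data for the same MT systems.
# The system names and scores below are hypothetical placeholders.
from scipy.stats import pearsonr, spearmanr, kendalltau

forward = {"sysA": 28.1, "sysB": 27.4, "sysC": 25.9, "sysD": 24.2}  # e.g. BLEU
reverse = {"sysA": 33.0, "sysB": 33.6, "sysC": 30.1, "sysD": 29.5}

systems = sorted(forward)              # fixed system order for both lists
f = [forward[s] for s in systems]
r = [reverse[s] for s in systems]

print(f"Pearson r:    {pearsonr(f, r)[0]:.3f}")
print(f"Spearman rho: {spearmanr(f, r)[0]:.3f}")
print(f"Kendall tau:  {kendalltau(f, r)[0]:.3f}")
```

A τ near 1.0 means the system ranking is preserved across test-set creation directions, while values toward the low end of Table 5 indicate rankings that change substantially.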
Conclusion
  • As mentioned previously in Section 2, past re-evaluations of human-parity claims were hampered by low inter-annotator agreement levels, the use of older human evaluation methodologies than the original, the treatment of TrueSkill clusters as a basis for conclusions about statistical significance, and the lack of a statistical power analysis for the planned sample size, while the original evaluation itself suffered severely from the inclusion of reverse-created data, which the authors have shown to be problematic, as well as from a very low number of distinct translations included in the evaluation.
  • Since the test set used in Hassan et al (2018) included a far lower number of test documents, basing the evaluation on document ratings would lead to low statistical power and would likely result in Type II errors caused by this small sample size.
Tables
  • Table 1: Results of Hassan et al (2018) for forward, reverse and both test set creation directions, reproduced from the published data set; N is the number of human judgments collected for that system, while n is the number of distinct translations assessed for that system; Reference-HT are human translations created by Hassan et al (2018), Reference-PE are the outputs of an online MT system after human correction, and Reference-WMT are the original WMT reference translations
  • Table 2: Effect size for all systems included in Hassan et al (2018)
  • Table 3: Comparison of human evaluation scores of MT systems participating in WMT-17 and WMT-18 for test data created in the same/forward direction (F) and reverse (R) direction, where R>F (%) = the proportion of systems with a reverse DA score greater than its forward score for precisely the same test scenario; R−F μ = mean of the difference in reverse and forward DA scores; R−F σ = standard deviation of the difference in reverse and forward DA scores; n = number of MT systems
  • Table 4: Comparison of BLEU scores of MT systems participating in WMT-15 – WMT-18 for test data created in the same/forward (F) and reverse (R) direction, where R>F (%) = the proportion of systems with a reverse BLEU score greater than its forward score for precisely the same test scenario; R−F μ = mean of the difference in reverse and forward BLEU scores; R−F σ = standard deviation of the difference in reverse and forward BLEU scores; n = number of MT systems
  • Table 5: Pearson (r), Spearman (ρ) and Kendall’s τ correlation of forward and reverse BLEU scores of all systems participating in the WMT-15 – WMT-18 news translation task; language pairs ordered from lowest to highest Pearson correlation
  • Table 6: Pearson (r), Spearman (ρ) and Kendall’s τ correlation of forward and reverse human DA scores of all systems participating in the WMT-17 – WMT-18 news translation task; language pairs ordered from lowest to highest Pearson correlation
  • Table 7: Statistical power of the two-sided Wilcoxon rank-sum test for a range of sample and effect sizes; power ≥ 0.8 highlighted in bold
  • Table 8: Effect size, i.e. the probability of a translation produced by the system in a given row receiving a lower DA score than that of the system in a given column; systems and data taken from the Hassan et al (2018) human evaluation (a sketch of this computation follows this list)
  • Table 9: Re-evaluation of the human-parity-claimed Chinese-to-English system of Hassan et al (2018); ∗ denotes a system that significantly outperforms all lower-ranked systems according to a two-sided Wilcoxon rank-sum test (p < 0.05); the human translations against which human parity of MT was claimed were produced by Hassan et al (2018) themselves, while REF-PE is machine-translated output that has been post-edited by humans, and Combo-6 is the best-performing system in Hassan et al (2018)
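
As a companion to Tables 8 and 9, here is a hedged sketch of how the effect size used in the paper (the probability that a translation from the row system receives a lower DA score than one from the column system) and the two-sided Wilcoxon rank-sum test could be computed from per-translation DA scores. The score arrays are randomly generated placeholders, not data from Hassan et al (2018).

```python
# Sketch with placeholder data: probability-of-superiority effect size and a
# two-sided Wilcoxon rank-sum test over per-translation DA scores.
import numpy as np
from scipy.stats import ranksums

rng = np.random.default_rng(0)
da_sys_a = rng.normal(68.0, 25.0, size=1500).clip(0, 100)  # hypothetical DA scores
da_sys_b = rng.normal(70.0, 25.0, size=1500).clip(0, 100)

# Effect size as in Table 8: P(score of a translation from A < score of one
# from B), estimated over all pairs; 0.5 corresponds to a statistical tie.
effect = np.mean(da_sys_a[:, None] < da_sys_b[None, :])

stat, p = ranksums(da_sys_a, da_sys_b)  # two-sided by default
print(f"P(A < B) = {effect:.3f}, Wilcoxon rank-sum p = {p:.4f}")
```

In a Table 9-style ranking, a system is marked with ∗ only if this test rejects the null hypothesis (p < 0.05) against every lower-ranked system.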
Related work
  • Hassan et al (2018) provide one of the earliest claims in MT of systems achieving human parity in terms of translation quality. Läubli et al (2018) and Toral et al (2018) both question the reliability of this conclusion because the evaluation followed the 50/50 set-up of test data creation (shown in Figure 1), highlighting the inclusion of reverse-created test data as a likely confound. Läubli et al (2018) and Toral et al (2018) repeat the human evaluation of translations produced by Hassan et al (2018), but only for test data that originated in the source language and with some additional distinctions.

    Firstly, and making a positive change, both Läubli et al (2018) and Toral et al (2018) include more context than the original sentence-level evaluation, the former asking human judges to assess entire documents and the latter involving assessment of MT output sentences in the order in which they appeared in the original documents. Secondly, both reassessments move away from the evaluation method employed in Hassan et al (2018), Direct Assessment (Graham et al, 2016), and revert to an older method of human evaluation, relative ranking, which is no longer used at WMT for the evaluation of systems.

    In addition, besides the use of older evaluation methodologies, another concern is that both re-evaluations were limited to only a small number of human judges with low levels of inter-annotator agreement. Therefore, although both re-evaluations improved the methodology employed in two respects, by eliminating reverse-created test data and including more context, both potentially introduce other sources of inaccuracy, such as the limited reliability of human judges when human evaluation takes the form of relative ranking (Callison-Burch et al, 2007, 2008, 2009, 2010, 2011, 2012; Bojar et al, 2013, 2014, 2015, 2016).
Study subjects and analysis
cases: 5
Kendall’s τ ranges from little correspondence for tr-en newstest2018, at 0.4 in the worst case, to identical system rankings (τ of 1.0) in five cases (cs-en; fin-en; en-tr newstest2017; en-ru; en-cs newstest2018). In overall summary, our analysis of differences in both BLEU and human evaluation scores reveals differences in system rankings when tested on reverse- and forward-created test data, differences that are substantial in some cases.

documents: 55
Statistical power is of particular importance when considering document-level evaluation because gathering ratings of documents, as opposed to sentences, requires substantially more annotator time and is therefore likely to result in a reduction in the number of assessments collected in any evaluation. For example, Läubli et al (2018) included as few as 55 documents in their re-evaluation of Hassan et al (2018). Our concern about a potential substantial reduction in sample size in future document-level evaluations is therefore well-founded, especially considering that standard segment-level MT human evaluations commonly include a sample of 1,500 segments.

documents: 55
As shown in Table 7, for the usual sample size employed in WMT evaluations, 1,500, statistical power is still above 0.8 even for closely performing systems, where the probability of a translation of system A being scored lower than one of system B is 0.47. For such pairs of systems, however, if we were to employ the smaller sample size of 55 documents, as in Läubli et al (2018), the power of the test to identify a significant difference falls as low as 0.081, approaching one tenth of acceptable statistical power levels. In order to further put into context the closeness in human performance of systems we can expect to encounter in our planned re-evaluation, we examine the effect size for pairs of systems in the original Hassan et al (2018) evaluation (Table 8).
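
To make the contrast between 1,500 segments and 55 documents concrete, below is a minimal simulation sketch of the statistical power of the two-sided Wilcoxon rank-sum test at α = 0.05 for a near-tie effect size, P(A < B) ≈ 0.47. It assumes, purely for illustration, that scores are normally distributed with a standard deviation of 25; this is not the authors' exact power-analysis procedure, only an illustration of the kind of analysis reported in Table 7.

```python
# Simulation sketch (illustrative assumptions only): power of the two-sided
# Wilcoxon rank-sum test for a near-tie effect size, P(A < B) ~= 0.47,
# at the sample sizes discussed in the text (55 documents vs. 1,500 segments).
import numpy as np
from scipy.stats import norm, ranksums

def estimated_power(n, p_superiority=0.47, sd=25.0, alpha=0.05,
                    trials=2000, seed=1):
    # For two normal samples with equal sd, P(A < B) = Phi(shift / (sd * sqrt(2)))
    # where shift = mean(B) - mean(A); invert that to pick the mean shift.
    shift = norm.ppf(p_superiority) * sd * np.sqrt(2.0)
    rng = np.random.default_rng(seed)
    rejections = 0
    for _ in range(trials):
        a = rng.normal(70.0, sd, size=n)          # hypothetical scores, system A
        b = rng.normal(70.0 + shift, sd, size=n)  # system B, very slightly lower mean
        _, p_value = ranksums(a, b)               # two-sided by default
        if p_value < alpha:
            rejections += 1
    return rejections / trials

for n in (55, 1500):
    print(f"n = {n:5d}  estimated power ~= {estimated_power(n):.3f}")
```

Under these assumptions the estimate comes out at roughly 0.08 for n = 55 and around 0.8 for n = 1,500, in line with the figures discussed above; the exact values depend on the distributional assumptions.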

References
  • Mona Baker, Gill Francis, and Elena Tognini-Bonelli. 1993. Corpus linguistics and translation studies: Implications and applications. In Text and Technology: In Honour of John Sinclair, Netherlands. John Benjamins Publishing Company.
  • Ondrej Bojar, Christian Buck, Christian Federmann, Barry Haddow, Philipp Koehn, Johannes Leveling, Christof Monz, Pavel Pecina, Matt Post, Herve Saint-Amand, Radu Soricut, Lucia Specia, and Aleš Tamchyna. 2014. Findings of the 2014 workshop on statistical machine translation. In Proceedings of the Ninth Workshop on Statistical Machine Translation, pages 12–58, Baltimore, Maryland, USA. Association for Computational Linguistics.
  • Ondrej Bojar, Christian Buck, Chris Callison-Burch, Christian Federmann, Barry Haddow, Philipp Koehn, Christof Monz, Matt Post, Radu Soricut, and Lucia Specia. 2013. Findings of the 2013 Workshop on Statistical Machine Translation. In Proceedings of the Eighth Workshop on Statistical Machine Translation, pages 1–44, Sofia, Bulgaria. Association for Computational Linguistics.
  • Ondrej Bojar, Rajen Chatterjee, Christian Federmann, Yvette Graham, Barry Haddow, Matthias Huck, Antonio Jimeno Yepes, Philipp Koehn, Varvara Logacheva, Christof Monz, Matteo Negri, Aurelie Neveol, Mariana Neves, Martin Popel, Matt Post, Raphael Rubino, Carolina Scarton, Lucia Specia, Marco Turchi, Karin Verspoor, and Marcos Zampieri. 2016. Findings of the 2016 conference on machine translation. In Proceedings of the First Conference on Machine Translation, pages 131–198, Berlin, Germany. Association for Computational Linguistics.
  • Ondrej Bojar, Rajen Chatterjee, Christian Federmann, Barry Haddow, Matthias Huck, Chris Hokamp, Philipp Koehn, Varvara Logacheva, Christof Monz, Matteo Negri, Matt Post, Carolina Scarton, Lucia Specia, and Marco Turchi. 2015. Findings of the 2015 workshop on statistical machine translation. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 1–46, Lisbon, Portugal. Association for Computational Linguistics.
  • Ondrej Bojar, Christian Federmann, Mark Fishel, Yvette Graham, Barry Haddow, Matthias Huck, Philipp Koehn, and Christof Monz. 2018. Findings of the 2018 conference on machine translation (WMT18). In Proceedings of the Third Conference on Machine Translation, Volume 2: Shared Task Papers, pages 272–307, Belgium, Brussels. Association for Computational Linguistics.
  • Chris Callison-Burch, Cameron Fordyce, Philipp Koehn, Christof Monz, and Josh Schroeder. 2007. (Meta-) evaluation of machine translation. In Proceedings of the Second Workshop on Statistical Machine Translation, pages 136–158, Prague, Czech Republic. Association for Computational Linguistics.
  • Chris Callison-Burch, Cameron Fordyce, Philipp Koehn, Christof Monz, and Josh Schroeder. 2008. Further meta-evaluation of machine translation. In Proceedings of the Third Workshop on Statistical Machine Translation, pages 70–106, Columbus, Ohio. Association for Computational Linguistics.
  • Chris Callison-Burch, Philipp Koehn, Christof Monz, Kay Peterson, Mark Przybocki, and Omar Zaidan. 2010. Findings of the 2010 joint workshop on statistical machine translation and metrics for machine translation. In Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR, pages 17–53, Uppsala, Sweden. Association for Computational Linguistics. Revised August 2010.
  • Chris Callison-Burch, Philipp Koehn, Christof Monz, Matt Post, Radu Soricut, and Lucia Specia. 2012. Findings of the 2012 workshop on statistical machine translation. In Proceedings of the Seventh Workshop on Statistical Machine Translation, pages 10–51, Montréal, Canada. Association for Computational Linguistics.
  • Chris Callison-Burch, Philipp Koehn, Christof Monz, and Josh Schroeder. 2009. Findings of the 2009 Workshop on Statistical Machine Translation. In Proceedings of the Fourth Workshop on Statistical Machine Translation, pages 1–28, Athens, Greece. Association for Computational Linguistics.
  • Chris Callison-Burch, Philipp Koehn, Christof Monz, and Omar Zaidan. 2011. Findings of the 2011 workshop on statistical machine translation. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 22–64, Edinburgh, Scotland. Association for Computational Linguistics.
  • Jacob Cohen. 1988. Statistical power analysis for the behavioral sciences. Hillsdale, NJ: Erlbaum.
  • Yvette Graham, Timothy Baldwin, Alistair Moffat, and Justin Zobel. 2016. Can machine translation systems be evaluated by the crowd alone? Natural Language Engineering, FirstView:1–28.
  • Hany Hassan, Anthony Aue, Chang Chen, Vishal Chowdhary, Jonathan Clark, Christian Federmann, Xuedong Huang, Marcin Junczys-Dowmunt, William Lewis, Mu Li, Shujie Liu, Tie-Yan Liu, Renqian Luo, Arul Menezes, Tao Qin, Frank Seide, Xu Tan, Fei Tian, Lijun Wu, Shuangzhi Wu, Yingce Xia, Dongdong Zhang, Zhirui Zhang, and Ming Zhou. 2018. Achieving human parity on automatic Chinese to English news translation. CoRR, abs/1803.05567.
  • Gennadi Lembersky, Noam Ordan, and Shuly Wintner. 2012. Language models for machine translation: Original vs. translated texts. Computational Linguistics, 38(4).
  • Samuel Läubli, Rico Sennrich, and Martin Volk. 2018. Has neural machine translation achieved human parity? A case for document-level evaluation. In EMNLP 2018, Brussels, Belgium. Association for Computational Linguistics.
  • Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, ACL ’02, pages 311–318, Philadelphia, Pennsylvania.
  • Antonio Toral, Sheila Castilho, Ke Hu, and Andy Way. 2018. Attaining the unattainable? Reassessing claims of human parity in neural machine translation. CoRR, abs/1808.10432.