Accurate Evaluation of Segment-level Machine Translation Metrics.

HLT-NAACL (2015)

Abstract
Evaluation of segment-level machine translation metrics is currently hampered by: (1) low inter-annotator agreement levels in human assessments; (2) lack of an effective mechanism for evaluation of translations of equal quality; and (3) lack of methods of significance testing improvements over a baseline. In this paper, we provide solutions to each of these challenges and outline a new human evaluation methodology aimed specifically at assessment of segment-level metrics. We replicate the human evaluation component of WMT-13 and reveal that the current state-of-the-art performance of segment-level metrics is better than previously believed. Three segment-level metrics (METEOR, NLEPOR and SENTBLEU-MOSES) are found to correlate with human assessment at a level not significantly outperformed by any other metric, both in the individual language pair assessment for Spanish-to-English and in the aggregated set of 9 language pairs.
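As a rough illustration of the kind of computation the abstract refers to (segment-level correlation with human assessment plus significance testing against a baseline metric), the sketch below uses hypothetical data and a generic paired-bootstrap test. It is a minimal sketch under those assumptions, not the paper's own human-evaluation methodology or its specific significance-testing procedure.

```python
# A minimal sketch, assuming hypothetical data: correlate per-segment metric
# scores with human judgments and use a paired bootstrap to check whether a
# candidate metric significantly improves on a baseline. This is NOT the
# paper's exact procedure.
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)

# Hypothetical scores for 500 translated segments.
human = rng.normal(size=500)                         # human assessment scores
baseline = human + rng.normal(scale=1.0, size=500)   # e.g. a sentBLEU-like metric
candidate = human + rng.normal(scale=0.8, size=500)  # e.g. a METEOR-like metric

def corr(metric, gold):
    """Pearson correlation between metric scores and human judgments."""
    r, _ = pearsonr(metric, gold)
    return r

def paired_bootstrap(metric_a, metric_b, gold, n_resamples=1000):
    """Proportion of bootstrap resamples in which metric_a fails to beat
    metric_b; a small value suggests the improvement is significant."""
    n = len(gold)
    losses = 0
    for _ in range(n_resamples):
        idx = rng.integers(0, n, size=n)
        if corr(metric_a[idx], gold[idx]) <= corr(metric_b[idx], gold[idx]):
            losses += 1
    return losses / n_resamples

print("baseline  r = %.3f" % corr(baseline, human))
print("candidate r = %.3f" % corr(candidate, human))
print("bootstrap p ~ %.3f" % paired_bootstrap(candidate, baseline, human))
```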