Cache-based Document-level Statistical Machine Translation.

EMNLP '11: Proceedings of the Conference on Empirical Methods in Natural Language Processing(2011)

引用 46|浏览58
暂无评分
摘要
Statistical machine translation systems are usually trained on a large amount of bilingual sentence pairs and translate one sentence at a time, ignoring document-level information. In this paper, we propose a cache-based approach to document-level translation. Since caches mainly depend on relevant data to supervise subsequent decisions, it is critical to fill the caches with highly-relevant data of a reasonable size. In this paper, we present three kinds of caches to store relevant document-level information: 1) a dynamic cache, which stores bilingual phrase pairs from the best translation hypotheses of previous sentences in the test document; 2) a static cache, which stores relevant bilingual phrase pairs extracted from similar bilingual document pairs (i.e. source documents similar to the test document and their corresponding target documents) in the training parallel corpus; 3) a topic cache, which stores the target-side topic words related with the test document in the source-side. In particular, three new features are designed to explore various kinds of document-level information in above three kinds of caches. Evaluation shows the effectiveness of our cache-based approach to document-level translation with the performance improvement of 0.81 in BLUE score over Moses. Especially, detailed analysis and discussion are presented to give new insights to document-level translation.
更多
查看译文
关键词
test document,document-level information,cache-based approach,relevant document-level information,Statistical machine translation system,best translation hypothesis,bilingual phrase pair,bilingual sentence pair,corresponding target document,relevant bilingual phrase pair,Cache-based document-level statistical machine
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要