Extracting translations from comparable corpora for Cross-Language Information Retrieval using the language modeling framework

Information Processing and Management: an International Journal (2016)

Abstract

Proposing a language modeling method to extract translations from comparable corpora. Comparing two similarity functions for deriving bilingual word correlations. Improving translation quality by integrating co-occurrence relations into word models. Comparing different estimations of translation probabilities from word correlations. Showing the significant impact of probability estimation methods on CLIR performance.

A main challenge in Cross-Language Information Retrieval (CLIR) is to estimate a proper translation model from the available translation resources, since translation quality directly affects retrieval performance. Among the different translation resources, we focus on obtaining translation models from comparable corpora, because they provide appropriate translations for both languages and domains with limited linguistic resources. In this paper, we employ a two-step approach to build an effective translation model from comparable corpora for the CLIR task, without requiring any additional linguistic resources. In the first step, translations are extracted by deriving correlations between source-target word pairs. In the second step, these correlations are used to estimate word translation probabilities. We propose a language modeling approach for the first step, where modeling based on probability distributions provides two key advantages. First, our approach can be tuned more easily than previous work, which relies on heuristic adjustment. Second, it provides a principled basis for integrating additional lexical and translational relations to improve the accuracy of translations from comparable corpora. As an illustration, we integrate monolingual word co-occurrence relations into the translation extraction process, which helps to extract more reliable translations for low-frequency words in a comparable corpus.

Experimental results on an English-Persian comparable corpus show that our method outperforms previous approaches in terms of both translation quality and CLIR performance. Moreover, the proposed method is naturally applicable to any comparable corpus, regardless of its languages. In addition, we demonstrate the significant impact of the word translation probabilities, estimated in the second step of our approach, on CLIR performance.
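
The two-step structure described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's method: the abstract does not give the language modeling formulation, so the Dice coefficient is used here as a stand-in similarity function for step one, and simple per-source-word normalisation is assumed for step two. The function names, the document-level alignment of the corpus, and the toy tokens (the `fa_` prefix marks invented placeholder target words) are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def word_correlations(aligned_pairs):
    """Step 1 (stand-in): Dice coefficient over aligned document pairs.

    aligned_pairs is a list of (source_tokens, target_tokens) tuples,
    one per pair of comparable documents. This alignment is an assumption.
    """
    src_df, tgt_df, joint_df = Counter(), Counter(), Counter()
    for src_doc, tgt_doc in aligned_pairs:
        src_words, tgt_words = set(src_doc), set(tgt_doc)
        src_df.update(src_words)   # document frequency of each source word
        tgt_df.update(tgt_words)   # document frequency of each target word
        # count document pairs in which both words appear
        joint_df.update((s, t) for s in src_words for t in tgt_words)
    return {
        (s, t): 2.0 * n / (src_df[s] + tgt_df[t])
        for (s, t), n in joint_df.items()
    }


def translation_probabilities(correlations):
    """Step 2 (stand-in): normalise correlations into p(target | source)."""
    totals = defaultdict(float)
    for (s, _), score in correlations.items():
        totals[s] += score
    probs = defaultdict(dict)
    for (s, t), score in correlations.items():
        probs[s][t] = score / totals[s]
    return dict(probs)


# Toy comparable corpus with invented placeholder target tokens.
aligned_pairs = [
    (["cat", "sleeps"], ["fa_cat", "fa_sleeps"]),
    (["cat", "eats"], ["fa_cat", "fa_eats"]),
    (["dog", "eats"], ["fa_dog", "fa_eats"]),
]

correlations = word_correlations(aligned_pairs)
probs = translation_probabilities(correlations)
best_for_cat = max(probs["cat"], key=probs["cat"].get)
print(best_for_cat, probs["cat"])
```

Swapping the normalisation in `translation_probabilities` for a different estimator changes the weights that a CLIR system would use for query translation, which is the kind of effect the abstract reports as significant.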

Keywords

Translation model, Bilingual lexicon, Comparable corpora, Cross-Language Information Retrieval, Language modeling framework
