Parallel Text Identification Using Lexical And Corpus Features For The English-Maori Language Pair
2016 15th IEEE International Conference on Machine Learning and Applications (ICMLA), 2016
Abstract
Comparable corpora contain significant quantities of useful data for Natural Language Processing tasks, especially Machine Translation, where they serve mainly as a source of parallel text fragments. This paper investigates how to extract bilingual texts from comparable corpora effectively when only a small parallel training corpus is available. We propose a new technique, based on the Zipfian frequency distribution, for filtering out non-parallel articles in Wikipedia. We also use a support vector machine (SVM) approach to find parallel chunks of text within a candidate comparable document, with the features required for the training step generated from a parallel corpus. Evaluations of the extracted bilingual texts are promising.
Keywords
natural language processing,machine translation,bilingual corpora,Zipfian frequency distribution