Graph-Based Bilingual Word Embedding for Statistical Machine Translation.

ACM Trans. Asian & Low-Resource Lang. Inf. Process.(2018)

引用 21|浏览61
暂无评分
摘要
Bilingual word embedding has been shown to be helpful for Statistical Machine Translation (SMT). However, most existing methods suffer from two obvious drawbacks. First, they only focus on simple contexts such as an entire document or a fixed-sized sliding window to build word embedding and ignore latent useful information from the selected context. Second, the word sense but not the word should be the minimal semantic unit; however, most existing methods still use word representation. To overcome these drawbacks, this article presents a novel Graph-Based Bilingual Word Embedding (GBWE) method that projects bilingual word senses into a multidimensional semantic space. First, a bilingual word co-occurrence graph is constructed using the co-occurrence and pointwise mutual information between the words. Then, maximum complete subgraphs (cliques), which play the role of a minimal unit for bilingual sense representation, are dynamically extracted according to the contextual information. Consequently, correspondence analysis, principal component analyses, and neural networks are used to summarize the clique-word matrix into lower dimensions to build the embedding model. Without contextual information, the proposed GBWE can be applied to lexical translation. In addition, given contextual information, GBWE is able to give a dynamic solution for bilingual word representations, which can be applied to phrase translation and generation. Empirical results show that GBWE can enhance the performance of lexical translation, as well as Chinese/French-to-English and Chinese-to-Japanese phrase-based SMT tasks (IWSLT, NTCIR, NIST, and WAT).
更多
查看译文
关键词
Statistical machine translation, bilingual word embedding, lexical translation
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要