Training Effective Neural CLIR by Bridging the Translation Gap

SIGIR '20: The 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, Virtual Event, China, July 2020 (2020)

Citations: 28 | Views: 102
Abstract
We introduce Smart Shuffling, a cross-lingual embedding (CLE) method that draws from statistical word alignment approaches to leverage dictionaries, producing dense representations that are significantly more effective for cross-language information retrieval (CLIR) than prior CLE methods. This work is motivated by the observation that although neural approaches are successful for monolingual IR, they are less effective in the cross-lingual setting. We hypothesize that neural CLIR fails because typical cross-lingual embeddings "translate" query terms into related terms -- i.e., terms that appear in a similar context -- in addition to or sometimes rather than synonyms in the target language. Adding related terms to a query (i.e., query expansion) can be valuable for retrieval, but must be mitigated by also focusing on the starting query. We find that prior neural CLIR models are unable to bridge the translation gap, apparently producing queries that drift from the intent of the source query. We conduct extrinsic evaluations of a range of CLE methods using CLIR performance, compare them to neural and statistical machine translation systems trained on the same translation data, and show a significant gap in effectiveness. Our experiments on standard CLIR collections across four languages indicate that Smart Shuffling fills the translation gap and provides significantly improved semantic matching quality. Having such a representation allows us to exploit deep neural (re-)ranking methods for the CLIR task, leading to substantial improvement with up to 21% gain in MAP, approaching human translation performance. Evaluations on bilingual lexicon induction show a comparable improvement.
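To make the "translation gap" concrete, here is a minimal, hypothetical sketch (not the paper's Smart Shuffling method) of how a generic cross-lingual embedding "translates" a query term by nearest-neighbor lookup in a shared vector space. The toy vocabularies and vectors are invented for illustration; in practice the embeddings would come from a trained CLE model.

```python
# Hedged sketch: generic CLE term "translation" via nearest neighbors.
# Toy 4-dimensional vectors; not real trained embeddings.
import numpy as np

# Hypothetical source-language (English) and target-language (German)
# vocabularies embedded in one shared space.
src_vocab = {"dog": np.array([0.9, 0.1, 0.0, 0.2])}
tgt_vocab = {
    "Hund":  np.array([0.88, 0.12, 0.05, 0.18]),  # true translation ("dog")
    "Katze": np.array([0.80, 0.25, 0.10, 0.30]),  # related term ("cat")
    "Auto":  np.array([0.05, 0.90, 0.40, 0.10]),  # unrelated ("car")
}

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def translate(term, k=2):
    """Return the k nearest target-language terms for a source-language term."""
    q = src_vocab[term]
    ranked = sorted(tgt_vocab.items(), key=lambda kv: cosine(q, kv[1]), reverse=True)
    return ranked[:k]

# The top neighbors typically mix true translations with merely related terms,
# which is the behavior the abstract argues causes query drift in neural CLIR.
for word, vec in translate("dog"):
    print(word, round(cosine(src_vocab["dog"], vec), 3))
```

In this toy example the related term ("Katze") scores nearly as high as the true translation ("Hund"), illustrating why naive CLE-based query translation can drift from the source query's intent unless the translation signal is anchored, as the paper proposes, by dictionary-informed alignment.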