Enhancing Cross-lingual Sentence Embedding for Low-resource Languages with Word Alignment
arXiv (2024)
Abstract
The field of cross-lingual sentence embeddings has recently experienced
significant advancements, but research concerning low-resource languages has
lagged due to the scarcity of parallel corpora. This paper shows that
cross-lingual word representation in low-resource languages is notably
under-aligned with that in high-resource languages in current models. To
address this, we introduce a novel framework that explicitly aligns words
between English and eight low-resource languages, utilizing off-the-shelf word
alignment models. This framework incorporates three primary training
objectives: aligned word prediction and word translation ranking, along with
the widely used translation ranking. We evaluate our approach through
experiments on the bitext retrieval task, which demonstrate substantial
improvements in sentence embeddings for low-resource languages. In addition, the
competitive performance of the proposed model across a broader range of tasks
in high-resource languages underscores its practicality.
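
To make the sentence-level objective and the evaluation concrete, the following is a minimal sketch of translation ranking in its common in-batch softmax form, together with a nearest-neighbor bitext retrieval accuracy. It is an illustration under stated assumptions, not the paper's implementation: the function names, the `scale` factor, and the toy 2-dimensional embeddings are hypothetical, and the two word-level objectives (aligned word prediction and word translation ranking) are not sketched because the abstract does not specify them.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def translation_ranking_loss(src, tgt, scale=10.0):
    """Mean in-batch softmax cross-entropy over a batch of sentence pairs.

    src[i] and tgt[i] are embeddings of a translation pair; every other
    tgt[j] in the batch acts as a negative for src[i].
    `scale` is an assumed similarity multiplier, not a value from the paper.
    """
    total = 0.0
    for i, s in enumerate(src):
        logits = [scale * cosine(s, t) for t in tgt]
        # Numerically stable log-sum-exp.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_z - logits[i]
    return total / len(src)


def retrieval_accuracy(src, tgt):
    """Fraction of source sentences whose most similar target is the true pair."""
    hits = 0
    for i, s in enumerate(src):
        sims = [cosine(s, t) for t in tgt]
        best = max(range(len(tgt)), key=sims.__getitem__)
        if best == i:
            hits += 1
    return hits / len(src)


# Toy embeddings (hypothetical): well-aligned pairs vs. swapped targets.
aligned_src = [[1.0, 0.0], [0.0, 1.0]]
aligned_tgt = [[0.9, 0.1], [0.1, 0.9]]
shuffled_tgt = [[0.1, 0.9], [0.9, 0.1]]

loss_aligned = translation_ranking_loss(aligned_src, aligned_tgt)
loss_shuffled = translation_ranking_loss(aligned_src, shuffled_tgt)
acc_aligned = retrieval_accuracy(aligned_src, aligned_tgt)
acc_shuffled = retrieval_accuracy(aligned_src, shuffled_tgt)
```

In this toy setup the loss is lower and the retrieval accuracy higher when each source embedding lies closest to its own translation, which is the behavior the ranking objective is meant to encourage.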