When your Cousin has the Right Connections: Unsupervised Bilingual Lexicon Induction for Related Data-Imbalanced Languages
International Conference on Language Resources and Evaluation (2023)
Abstract
Most existing approaches for unsupervised bilingual lexicon induction (BLI)
depend on good quality static or contextual embeddings requiring large
monolingual corpora for both languages. However, unsupervised BLI is most
likely to be useful for low-resource languages (LRLs), where large datasets are
not available. Often we are interested in building bilingual resources for LRLs
against related high-resource languages (HRLs), resulting in severely
imbalanced data settings for BLI. We first show that state-of-the-art BLI
methods in the literature exhibit near-zero performance for severely
data-imbalanced language pairs, indicating that these settings require more
robust techniques. We then present a new method for unsupervised BLI between a
related LRL and HRL that only requires inference on a masked language model of
the HRL, and demonstrate its effectiveness on truly low-resource languages
Bhojpuri and Magahi (with <5M monolingual tokens each), against Hindi. We
further present experiments on (mid-resource) Marathi and Nepali to compare
how the approaches perform across resource ranges, and release our resulting lexicons for
five low-resource Indic languages: Bhojpuri, Magahi, Awadhi, Braj, and
Maithili, against Hindi.
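The abstract's central technical idea is that translation candidates can be induced using only inference on a masked language model (MLM) of the high-resource language. Below is a minimal, hypothetical sketch of one way such MLM inference could be used to propose Hindi candidates for a word from a related low-resource language: each LRL context containing the word is masked and passed to the HRL MLM, and the predicted fill-ins are aggregated across contexts. The checkpoint name (google/muril-base-cased), the single-token masking, and the score-voting scheme are illustrative assumptions, not the paper's exact procedure.

```python
# Hypothetical sketch: propose HRL (Hindi) translation candidates for an
# LRL (e.g. Bhojpuri) word by masking it in its LRL contexts and letting
# an MLM that covers the HRL fill in the blank. Assumes the HuggingFace
# transformers library; the model name is an assumed placeholder.
from collections import Counter
from transformers import pipeline

# MuRIL is an MLM covering several Indic languages, including Hindi (assumption:
# any Hindi-capable masked LM checkpoint could stand in here).
fill_mask = pipeline("fill-mask", model="google/muril-base-cased")


def induce_candidates(lrl_sentences, source_word, top_k=5):
    """Aggregate MLM fill-in candidates for `source_word` across LRL contexts."""
    votes = Counter()
    for sent in lrl_sentences:
        if source_word not in sent:
            continue
        # Mask the first occurrence of the LRL word in its own context.
        masked = sent.replace(source_word, fill_mask.tokenizer.mask_token, 1)
        # Let the HRL MLM propose fill-ins; accumulate their scores as votes.
        for pred in fill_mask(masked, top_k=top_k):
            votes[pred["token_str"]] += pred["score"]
    # Highest-voted tokens serve as translation hypotheses for the lexicon.
    return votes.most_common(top_k)


# Example usage (sentences and word are illustrative placeholders):
# candidates = induce_candidates(bhojpuri_sentences, bhojpuri_word)
```

This sketch relies on the lexical and syntactic closeness of the related languages: because the LRL context is largely intelligible to the HRL model, its mask predictions tend to surface plausible HRL equivalents without any parallel data or LRL embeddings.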
Keywords
unsupervised bilingual lexicon induction, related language pairs, data-imbalanced