The Johns Hopkins University Bible Corpus - 1600+ Tongues for Typological Exploration.

LREC (2020)

Abstract
We present findings from the creation of a massively parallel corpus in over 1600 languages, the Johns Hopkins University Bible Corpus (JHUBC). The corpus consists of over 4000 unique translations of the Christian Bible and counting. Our data is derived from scraping several online resources and merging them with existing corpora, combining them under a common scheme that is verse-parallel across all translations. We detail our effort to scrape, clean, align, and utilize this ripe multilingual dataset. The corpus captures the great typological variety of the world's languages. We catalog this by showing highly similar proportions of representation of Ethnologue's typological features in our corpus. We also give an example application: projecting pronoun features like clusivity across alignments to richly annotate languages which do not mark the distinction.
Keywords
low-resource NLP, parallel corpus, typology, function words, Bible