AI4Bharat-IndicNLP Corpus: Monolingual Corpora and Word Embeddings for Indic Languages

Kunchukuttan Anoop, Kakwani Divyanshu, Golla Satish, C. Gokul N., Bhattacharyya Avik, Khapra Mitesh M., Kumar Pratyush

arXiv (2020)

Abstract

We present the IndicNLP corpus, a large-scale, general-domain corpus containing 2.7 billion words for 10 Indian languages from two language families. We share pre-trained word embeddings trained on these corpora. We create news article category classification datasets for 9 languages to evaluate the embeddings. We show that the IndicNLP embeddings significantly outperform publicly available pre-trained embeddings on multiple evaluation tasks. We hope that the availability of the corpus will accelerate Indic NLP research. The resources are available at https://github.com/ai4bharat-indicnlp/indicnlp_corpus.

Keywords

monolingual corpora, ai4bharat-indicnlp languages, word embeddings
