BanglaLM: Data Mining based Bangla Corpus for Language Model Research

Md. Kowsher,Md. Jashim Uddin,Anik Tahabilder,Md Ruhul Amin,Md. Fahim Shahriar,Md. Shohanur Islam Sobuj

2021 Third International Conference on Inventive Research in Computing Applications (ICIRCA)（2021）

引用 3|浏览7

暂无评分

摘要

Natural language processing (NLP) is an area of machine learning that has garnered a lot of attention in recent days due to the revolution in artificial intelligence, robotics, and smart devices. NLP focuses on training machines to understand and analyze various languages, extract meaningful information from those, translate from one language to another, correct grammar, predict the next word, complete a sentence, or even generate a completely new sentence from an existing corpus. A major challenge in NLP lies in training the model for obtaining high prediction accuracy since training needs a vast dataset. For widely used languages like English, there are many datasets available that can be used for NLP tasks like training a model and summarization but for languages like Bengali, which is only spoken primarily in South Asia, there is a dearth of big datasets which can be used to build a robust machine learning model. Therefore, NLP researchers who mainly work with the Bengali language will find an extensive, robust dataset incredibly useful for their NLP tasks involving the Bengali language. With this pressing issue in mind, this research work has prepared a dataset whose content is curated from social media, blogs, newspapers, wiki pages, and other similar resources. The amount of samples in this dataset is 19132010, and the length varies from 3 to 512 words. This dataset can easily be used to build any unsupervised machine learning model with an aim to performing necessary NLP tasks involving the Bengali language. Also, this research work is releasing two preprocessed version of this dataset that is especially suited for training both core machine learning-based and statistical-based model. As very few attempts have been made in this domain, keeping Bengali language researchers in mind, it is believed that the proposed dataset will significantly contribute to the Bengali machine learning and NLP community.

查看译文

关键词

Bangla NLP,Bangla Dataset,Bangla Corpus,BanglaLM,Bangla Language Model,Data Mining,Web Scraping

AI 理解论文

溯源树

样例

生成溯源树，研究论文发展脉络

Chat Paper

正在生成论文摘要