No Data to Crawl? Monolingual Corpus Creation from PDF Files of Truly Low-Resource Languages in Peru

Gina Bustamante, Arturo Oncevay, Roberto Zariquiey

LREC (2020)

Abstract

We introduce new monolingual corpora for four indigenous and endangered languages from Peru: Shipibo-Konibo, Ashaninka, Yanesha and Yine. Given the total absence of these languages on the web, the extraction and processing of texts from PDF files is relevant in a truly low-resource language scenario. Our procedure for monolingual corpus creation considers language-specific and language-agnostic steps, and focuses on educational PDF files with multilingual sentences, noisy pages and low-structured content. Through an evaluation based on language modelling and character-level perplexity on a subset of manually extracted sentences, we determine that our method allows the creation of clean corpora for the four languages, a key resource for natural language processing tasks nowadays.
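
The character-level perplexity evaluation mentioned above can be illustrated with a minimal sketch. The abstract does not say which language model the authors used, so the add-one-smoothed character bigram model below is an assumption chosen for brevity, and the function names `train_char_bigram` and `char_perplexity` are hypothetical. The idea is only that a model trained on the extracted corpus should assign lower perplexity to clean held-out sentences than to noisy text.

```python
import math
from collections import Counter

def train_char_bigram(sentences):
    """Count character bigrams and their left contexts (assumed model, not the paper's)."""
    bigrams, contexts, vocab = Counter(), Counter(), set()
    for s in sentences:
        chars = ["<s>"] + list(s) + ["</s>"]  # sentence boundary markers
        vocab.update(chars)
        for prev, cur in zip(chars, chars[1:]):
            bigrams[(prev, cur)] += 1
            contexts[prev] += 1
    return bigrams, contexts, vocab

def char_perplexity(model, sentences):
    """Per-character perplexity of held-out sentences under add-one smoothing."""
    bigrams, contexts, vocab = model
    v = len(vocab) + 1  # one extra slot shared by all unseen characters
    log_prob, n = 0.0, 0
    for s in sentences:
        chars = ["<s>"] + list(s) + ["</s>"]
        for prev, cur in zip(chars, chars[1:]):
            p = (bigrams[(prev, cur)] + 1) / (contexts[prev] + v)
            log_prob += math.log(p)
            n += 1
    return math.exp(-log_prob / n)

# Toy strings standing in for extracted corpus text and manually checked sentences.
model = train_char_bigram(["abab", "abba", "baba"])
print(char_perplexity(model, ["abab"]))  # in-domain text
print(char_perplexity(model, ["xyzq"]))  # out-of-domain noise, expected to score higher
```

In practice a stronger model and real held-out sentences would replace the toy strings; the comparison between clean and noisy text is what matters here.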

Keywords

Shipibo-Konibo, Ashaninka, Yanesha, Yine, endangered languages, indigenous languages, low-resource languages, PDF processing, monolingual corpus, corpus creation
