The BDCamoes Collection of Portuguese Literary Documents: a Research Resource for Language Technology and Digital Humanities

LREC 2020

Abstract
This paper presents the BDCamoes Collection of Portuguese Literary Documents, a new corpus of literary texts written in Portuguese. Its inaugural version includes close to 4 million words from over 200 complete documents by 83 authors in 14 genres; the texts span the 16th to the 21st century and follow different orthographic conventions. Many of the texts in the corpus have also been automatically parsed with state-of-the-art language processing tools, forming the BDCamoes Treebank subcorpus. These characteristics make BDCamoes an invaluable resource for research in language technology (e.g. authorship detection, genre classification) and in language science and digital humanities (e.g. comparative literature, diachronic linguistics).
Keywords
Portuguese, corpus, language technology, digital humanities, literary studies, history, diachronic linguistics, cultural landmarks, cultural heritage