150 years of written Dutch

Nederlandse Taalkunde (2021)

Abstract
In this article, we present a new corpus spanning 163 years of written Dutch. This corpus, the Dutch Corpus of Contemporary and late Modern Periodicals (Dutch C-CLAMP), comprises 47,738 part-of-speech tagged articles published in Dutch periodicals from 1837 to 1999, totaling approximately 200 million tokens. We explain the measures we took to overcome the shortcomings of existing corpora of historical Dutch covering the same period, and we describe in detail how the corpus was compiled and enriched. The description covers text markup; preprocessing of the data, including foreign-language recognition and spelling normalization; and the enrichment of both the textual data and the metadata on the authors of the corpus files. We also carry out two case studies to illustrate the reliability of the corpus.