CCOHA - Clean Corpus of Historical American English.
LREC (2020)
Abstract
Modelling language change is an increasingly important area of interest within the fields of sociolinguistics and historical linguistics. In recent years, there has been a growing number of publications whose main concern is studying changes that have occurred within the past centuries. The Corpus of Historical American English (COHA) is one of the most commonly used large corpora in diachronic studies in English. This paper describes methods applied to the downloadable version of the COHA corpus in order to overcome its main limitations, such as inconsistent lemmas and malformed tokens, without compromising its qualitative and distributional properties. The resulting clean corpus of historical American English (CCOHA) contains a larger number of cleaned word tokens which can offer better insights into language change and allow for a larger variety of tasks to be performed.
Keywords
COHA, Corpora, Historical Linguistics, Language Change