Regionalized models for Spanish language variations based on Twitter

Language Resources and Evaluation (2021)

Abstract
Spanish is one of the most widely spoken languages in the world, but it is not written and spoken in the same way in every country. Understanding local language variations can help improve model performance on regional tasks, both by capturing local structures and by better interpreting message content. For instance, consider a machine learning engineer who automates a language classification task for a particular region, or a social scientist trying to understand a regional event that echoes on social media; both can take advantage of dialect-based language models to understand what is happening with more contextual information and hence more precision. This manuscript presents and describes a set of regionalized resources for the Spanish language built from four years of public Twitter messages geotagged in 26 Spanish-speaking countries. We introduce word embeddings based on FastText, language models based on BERT, and per-region sample corpora. We also provide a broad comparison among regions covering lexical and semantic similarities, as well as examples of using the regional resources in message classification tasks.
Keywords
Linguistic resources, Semantic space, Spanish Twitter