Dialectones: Finding statistically significant dialectal boundaries using Twitter data

EasyChair Preprints (2018)

Abstract
MOTIVATION Most NLP applications assume that a language is homogeneous across the regions where it is spoken. However, every language varies considerably throughout its geographical distribution. When dialectal variation is large, it can noticeably reduce the effectiveness of oral and written communication. To make NLP sensitive to dialects, a reliable, representative and up-to-date source of information that quantifies such variation is necessary. PROBLEM Some current approaches have disadvantages: the regions they find are subjective, they require parameters, they ignore geographical coordinates in the analysis, and they lack a statistical test for the existence of the identified dialectal regions. METHOD The detection of ecotones is an analogous problem in ecology that focuses on detecting boundaries in ecosystems rather than regions, which facilitates the construction of statistical tests. We adapted a popular ecotone detection technique called "wombling" to the detection of dialectal boundaries, using the Hilbert-Schmidt independence criterion (HSIC) as the underlying non-parametric statistical test. In addition to addressing the aforementioned drawbacks, the use of HSIC provides robustness against non-linearities present in the linguistic and geographical variables. The proposed method was applied to a large corpus of Spanish tweets produced in 250 locations in Colombia, through the analysis of unigram features. RESULTS The resulting dialectal boundaries (i.e. dialectones) proved to be meaningful and spatially correlated with regions identified by other authors using classical dialectology. CONCLUSION We conclude that the automatic detection of dialectones is a convenient alternative to classical methods in dialectology.
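To give a feel for the statistical test named in the abstract, the following is a minimal sketch of an HSIC independence test with a permutation-based p-value. It is not the authors' implementation, and the abstract does not say how HSIC is combined with wombling, so that step is left out. The function names (`rbf_gram`, `hsic`, `hsic_permutation_test`), the Gaussian kernel, the median bandwidth heuristic and the permutation scheme are all assumptions made for illustration.

```python
import numpy as np


def rbf_gram(x, sigma=None):
    """Gaussian (RBF) Gram matrix of the rows of x (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    if x.ndim == 1:
        x = x[:, None]
    sq = np.sum(x ** 2, axis=1)
    # Squared Euclidean distances, clipped at 0 to absorb rounding error.
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * (x @ x.T), 0.0)
    if sigma is None:
        # Assumption: a common median heuristic for the bandwidth.
        positive = d2[d2 > 0]
        sigma = np.sqrt(0.5 * np.median(positive)) if positive.size else 1.0
    return np.exp(-d2 / (2.0 * sigma ** 2))


def hsic(K, L):
    """Biased empirical HSIC, trace(K H L H) / (n - 1)^2, for Gram matrices K, L."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n  # centering matrix
    Kc = H @ K @ H
    # trace(Kc @ L) equals the elementwise sum because L is symmetric.
    return float(np.sum(Kc * L)) / (n - 1) ** 2


def hsic_permutation_test(x, y, n_perm=500, seed=0):
    """Return (statistic, p_value) for the null hypothesis that x and y are independent."""
    rng = np.random.default_rng(seed)
    K, L = rbf_gram(x), rbf_gram(y)
    observed = hsic(K, L)
    exceed = 0
    for _ in range(n_perm):
        p = rng.permutation(L.shape[0])
        # Shuffling the pairing of x and y breaks any dependence between them.
        if hsic(K, L[np.ix_(p, p)]) >= observed:
            exceed += 1
    return observed, (exceed + 1) / (n_perm + 1)


if __name__ == "__main__":
    # Synthetic stand-ins: a spatial coordinate and a nonlinearly related feature.
    rng = np.random.default_rng(42)
    coord = rng.normal(size=80)
    feature = coord ** 2
    print(hsic_permutation_test(coord, feature, n_perm=200, seed=1))
```

Because the test is kernel-based, it can pick up dependence that a linear correlation would miss, such as the quadratic relation in the synthetic data above. In a dialectological setting the two inputs would presumably be geographical coordinates and linguistic features such as unigram frequencies, but that mapping is an assumption here.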