Offensive Language Detection in Spanish Social Media: Testing From Bag-of-Words to Transformers Models.

IEEE Access (2023)

Abstract
Social networks allow us to communicate with people around the world. However, some users take advantage of anonymity to write offensive comments, which can harm those who receive them or discourage the use of these networks. Because it is impossible to check every message manually, several automatic detection systems have been proposed. Current state-of-the-art systems are based on the transformer architecture, and most of the work has focused on the English language. These systems pay little attention to the unbalanced nature of the data: in a real environment there are fewer offensive comments than non-offensive ones. Moreover, previous works have not studied how pre-processing, or the corpora used for pre-training the models, affect the final results. In this work, we propose and evaluate a series of automatic methods for detecting offensive language in Spanish texts that address the unbalanced nature of the data. We test different learning models, from those based on classical Machine Learning algorithms using Bag-of-Words as the data representation to those based on large language models and neural networks such as transformers, paying particular attention to the minority classes and to the corpora used for pre-training the transformer-based models. We show that transformer-based models continue to obtain the best results, and we improve on previous results by 6.2% by adding new pre-processing steps and using models pre-trained on Spanish social-media data, setting new state-of-the-art results.
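As an illustration of the two ideas the abstract leans on, the following minimal sketch shows a Bag-of-Words representation and why unbalanced data calls for attention to the minority class. It is not the authors' pipeline: the whitespace tokenizer, the toy labels `NO`/`OFF`, and the choice of macro-averaged F1 as the metric are all assumptions made here for illustration.

```python
from collections import Counter


def bag_of_words(text):
    """Hypothetical minimal tokenizer: lowercase, split on whitespace, count tokens."""
    return Counter(text.lower().split())


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the gold labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_for_class(y_true, y_pred, label):
    """F1 score for a single class, treating `label` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    if tp == 0:
        return 0.0  # no true positives: F1 is zero by convention
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, so the minority class counts equally."""
    labels = sorted(set(y_true))
    return sum(f1_for_class(y_true, y_pred, lab) for lab in labels) / len(labels)


# Toy, made-up labels: 9 non-offensive ("NO") and 1 offensive ("OFF") message.
y_true = ["NO"] * 9 + ["OFF"]
# A degenerate classifier that always predicts the majority class.
y_majority = ["NO"] * 10

features = bag_of_words("Eres un tonto tonto")
acc = accuracy(y_true, y_majority)
mf1 = macro_f1(y_true, y_majority)
```

In this toy setting the majority-class predictor reaches high accuracy while its macro-averaged F1 stays low, because it never detects the offensive class. This is the kind of gap that motivates evaluating with the minority class in mind.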
Keywords
Task analysis, Social networking (online), Transformers, Data models, Blogs, Hate speech, Feature extraction, Natural language processing, Offensive language, transformers-based models