Anatomy of Hate Speech Datasets: Composition Analysis and Cross-dataset Classification

Samuel Guimaraes, Gabriel Kakizaki, Philipe Melo, Marcio Silva, Fabricio Murai, Julio C. S. Reis, Fabricio Benevenuto

34th ACM Conference on Hypertext and Social Media, HT 2023 (2023)

Abstract

Manifestations of hate speech in different scenarios are increasingly frequent on social platforms, and many works propose solutions for identifying this type of content in these environments. Most efforts to automatically detect hate speech follow the same supervised learning process: annotators label a predefined set of messages, which are in turn used to train classifiers. However, annotators may create labels for different classification tasks, with divergent definitions of hate speech, binary or multi-label schemes, and various data collection methodologies. In this context, we examine the principal publicly available datasets for hate speech research. We investigate the types of hate speech (e.g., ethnicity, religion, sexual orientation) present in their composition, explore their content beyond the labels, and use cross-dataset classification to examine how the labeled data can be used beyond the work that originally produced it. Our results reveal insights toward a better understanding of the hate speech phenomenon and toward improving its detection on social platforms. Warning: this paper contains offensive words and tweet examples.
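As a rough illustration of what cross-dataset classification means in practice, the sketch below trains a classifier on each dataset and scores it on every dataset, including the one it was trained on. It is not the authors' pipeline: the tiny Naive Bayes model is a stand-in chosen only to keep the example dependency-free, accuracy is used as a placeholder metric, and the dataset names `A` and `B`, their neutral placeholder texts, and all function names are hypothetical.

```python
import math
from collections import Counter


def train_nb(examples):
    """Fit a tiny multinomial Naive Bayes model on (text, label) pairs, label in {0, 1}."""
    word_counts = {0: Counter(), 1: Counter()}
    class_counts = Counter()
    for text, label in examples:
        class_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocab = set(word_counts[0]) | set(word_counts[1])
    return word_counts, class_counts, vocab


def predict_nb(model, text):
    """Return the label (0 or 1) with the higher smoothed log-probability."""
    word_counts, class_counts, vocab = model
    total = sum(class_counts.values())
    scores = {}
    for label in (0, 1):
        # Smoothed log prior, so a class with no examples still gets a finite score.
        score = math.log((class_counts[label] + 1) / (total + 2))
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in text.lower().split():
            if word in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[label][word] + 1) / denom)
        scores[label] = score
    return max(scores, key=scores.get)


def accuracy(model, examples):
    """Fraction of examples whose predicted label matches the annotated label."""
    hits = sum(predict_nb(model, text) == label for text, label in examples)
    return hits / len(examples)


def cross_dataset_matrix(datasets):
    """Train on each dataset and evaluate on every dataset.

    Returns a dict mapping (train_name, test_name) to accuracy.
    """
    matrix = {}
    for train_name, train_examples in datasets.items():
        model = train_nb(train_examples)
        for test_name, test_examples in datasets.items():
            matrix[(train_name, test_name)] = accuracy(model, test_examples)
    return matrix


# Hypothetical placeholder corpora: label 1 marks the "positive" class.
# Real hate speech datasets differ in definitions, label schemes, and sources.
datasets = {
    "A": [("you are awful", 1), ("have a nice day", 0),
          ("awful people", 1), ("nice weather today", 0)],
    "B": [("such a terrible person", 1), ("what a nice day", 0),
          ("terrible awful group", 1), ("lovely weather", 0)],
}

matrix = cross_dataset_matrix(datasets)
for (train_name, test_name), score in sorted(matrix.items()):
    print(f"train={train_name} test={test_name} accuracy={score:.2f}")
```

The off-diagonal entries (train on one dataset, test on another) are the ones of interest: a gap between them and the diagonal suggests that labels or vocabulary do not transfer well between datasets. In a real study, a stronger classifier and a metric such as F1 would likely replace the placeholders used here.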

Keywords

Hate Speech, Classification, HateBase, Datasets, Toxicity, Offensive Speech, Abusive Speech
