A Large Labeled Corpus for Online Harassment Research.

Jennifer Golbeck, Zahra Ashktorab, Rashad O. Banjo, Alexandra Berlinger, Siddharth Bhagwan, Cody Buntain, Paul Cheakalos, Alicia A. Geller, Quint Gergory, Rajesh Kumar Gnanasekaran, Raja Rajan Gunasekaran, Kelly M. Hoffman, Jenny Hottle, Vichita Jienjitlert, Shivika Khare, Ryan Lau, Marianna J. Martindale, Shalmali Naik, Heather L. Nixon, Piyush Ramachandran, Kristine M. Rogers, Lisa Rogers, Meghna Sardana Sarin, Gaurav Shahane, Jayanee Thanki, Priyanka Vengataraman, Zijian Wan, Derek Michael Wu

WebSci (2017)

Abstract
A fundamental part of conducting cross-disciplinary web science research is having useful, high-quality datasets that provide value to studies across disciplines. In this paper, we introduce a large, hand-coded corpus of online harassment data. A team of researchers collaboratively developed a codebook using grounded theory and labeled 35,000 tweets. Our resulting dataset has roughly 15% positive harassment examples and 85% negative examples. This data is useful for training machine learning models, identifying textual and linguistic features of online harassment, and studying the nature of harassing comments and the culture of trolling.
Keywords
online harassment, datasets