LSHWE: Improving Similarity-Based Word Embedding with Locality Sensitive Hashing for Cyberbullying Detection

2020 International Joint Conference on Neural Networks (IJCNN), 2020

Abstract
Word embedding methods use low-dimensional vectors to represent the words in a corpus. Such low-dimensional vectors can capture lexical semantics and greatly improve cyberbullying detection performance. However, existing word embedding methods have a major limitation in the cyberbullying detection task: they cannot represent "deliberately obfuscated words" well, i.e., words that users substitute for bullying words in order to evade detection. These deliberately obfuscated words are often treated as "rare words" with little contextual information and are removed during preprocessing. In this paper, we propose a word embedding method called LSHWE to address this limitation, based on the idea that deliberately obfuscated words have high context similarity with their corresponding bullying words. LSHWE has two steps: first, it generates the nearest neighbor matrix from the co-occurrence matrix and the nearest neighbor list obtained by Locality Sensitive Hashing (LSH); second, it uses an LSH-based autoencoder to learn word representations from these two matrices. In particular, the reconstructed nearest neighbor matrix generated by the LSH-based autoencoder is used to bring the representations of deliberately obfuscated words close to those of their corresponding bullying words. To improve efficiency, LSHWE uses LSH to generate both the nearest neighbor list and the reconstructed nearest neighbor list. Empirical experiments demonstrate the effectiveness of LSHWE in cyberbullying detection, particularly on the "deliberately obfuscated words" problem. Moreover, LSHWE is highly efficient: it can represent tens of thousands of words in a few minutes on a typical single machine.
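The following is a minimal sketch, not the authors' implementation, of the first step described above under simplified assumptions: it builds a word-context co-occurrence matrix with a sliding window, buckets words via random-hyperplane LSH (cosine similarity), and derives a binary nearest-neighbor indicator matrix. The function names, window size, and number of hash bits are illustrative choices only.

```python
# Hypothetical sketch of the co-occurrence + LSH nearest-neighbor step (not the paper's code).
import numpy as np
from collections import defaultdict

def cooccurrence_matrix(corpus, window=2):
    """corpus: list of tokenized sentences -> (vocab, |V| x |V| count matrix)."""
    vocab = sorted({w for sent in corpus for w in sent})
    idx = {w: i for i, w in enumerate(vocab)}
    C = np.zeros((len(vocab), len(vocab)))
    for sent in corpus:
        for i, w in enumerate(sent):
            for j in range(max(0, i - window), min(len(sent), i + window + 1)):
                if j != i:
                    C[idx[w], idx[sent[j]]] += 1.0
    return vocab, C

def lsh_buckets(C, n_bits=16, seed=0):
    """Random-hyperplane (sign) hashing of the rows of C: words whose context
    vectors fall in the same bucket are candidate nearest neighbors."""
    rng = np.random.default_rng(seed)
    planes = rng.standard_normal((C.shape[1], n_bits))
    codes = (C @ planes > 0).astype(int)  # |V| x n_bits sign bits
    buckets = defaultdict(list)
    for word_id, bits in enumerate(codes):
        buckets[tuple(bits)].append(word_id)
    return buckets

def neighbor_matrix(C, buckets):
    """Binary indicator matrix S: S[i, j] = 1 iff words i and j share a bucket
    (a stand-in for the paper's nearest neighbor matrix)."""
    S = np.zeros_like(C)
    for members in buckets.values():
        for i in members:
            for j in members:
                S[i, j] = 1.0
    return S

if __name__ == "__main__":
    # Toy example: an obfuscated spelling ("stup1d") shares the context of "stupid".
    corpus = [["you", "are", "stupid"], ["you", "are", "stup1d"]]
    vocab, C = cooccurrence_matrix(corpus)
    S = neighbor_matrix(C, lsh_buckets(C, n_bits=4))
    print(vocab)
    print(S)
```

With so few hash bits, unrelated words can also collide; in practice one would use more bits or multiple hash tables, and the resulting matrices would then feed the LSH-based autoencoder described in the abstract.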
Keywords
Word Embedding, Locality Sensitive Hashing, Cyberbullying Detection