Phishing URL Detection with Oversampling based on Text Generative Adversarial Networks

2018 IEEE International Conference on Big Data (Big Data), 2018

Abstract
The problem of imbalanced classes arises frequently in binary classification tasks. If one class outnumbers another, trained classifiers become heavily biased towards the majority class. For phishing URL detection, it is very natural that the number of collected benign URLs (i.e., the majority class) is much larger than the number of collected phishy URLs (i.e., the minority class). Oversampling the minority class can be a powerful tool to overcome this situation. However, existing methods perform the oversampling task in the feature space where the original data format is removed and URLs are succinctly represented by vectors. These methods are successful only if feature definitions are correct and the dataset is diverse and not too sparse. In this paper, we propose an oversampling technique in the data space. We train text generative adversarial networks (text-GANs) with URLs in the minority class and generate synthetic URLs that can be made part of the training set. We crawl a crowd-sourced URL repository to collect recently discovered phishy and benign URLs. Our experiments demonstrate significant performance improvements after using the proposed oversampling technique. Interestingly, some of the original test URLs are exactly regenerated by the proposed text generative model.
Keywords
phishing, text-GANs, generative adversarial networks, oversampling
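
The data-space oversampling idea in the abstract can be illustrated with a minimal sketch. This is not the paper's method: a character-level Markov model stands in for the text-GAN so the example stays short and dependency-free. The function names (train_char_model, generate_url, oversample), the order parameter, and the sample URLs are all hypothetical and chosen only for illustration. The sketch fits a generator on the minority (phishing) URLs only, samples synthetic URL strings, and appends them to the training set with the minority label until the classes are balanced or a retry budget runs out.

```python
import random
from collections import defaultdict

# Sentinel characters marking URL boundaries (assumed absent from real URLs).
START, END = "\x02", "\x03"


def train_char_model(urls, order=3):
    """Record which character follows each context of `order` characters."""
    model = defaultdict(list)
    for url in urls:
        padded = START * order + url + END
        for i in range(len(padded) - order):
            model[padded[i:i + order]].append(padded[i + order])
    return model


def generate_url(model, order=3, max_len=200, rng=random):
    """Sample one synthetic URL, one character at a time."""
    context, out = START * order, []
    while len(out) < max_len:
        nxt = rng.choice(model[context])
        if nxt == END:
            break
        out.append(nxt)
        context = context[1:] + nxt  # slide the context window forward
    return "".join(out)


def oversample(benign, phishing, order=3, seed=0, max_tries=10000):
    """Add synthetic minority-class URLs in the data space.

    Stand-in for the text-GAN generator: a Markov model trained only on
    the minority class. Returns (urls, labels) with 0 = benign, 1 = phishing.
    """
    rng = random.Random(seed)
    model = train_char_model(phishing, order)
    known, synthetic = set(phishing), set()
    needed = len(benign) - len(phishing)
    tries = 0
    while len(synthetic) < needed and tries < max_tries:
        tries += 1
        candidate = generate_url(model, order, rng=rng)
        # Keep only non-empty strings that are not copies of training URLs.
        if candidate and candidate not in known:
            synthetic.add(candidate)
    urls = benign + phishing + sorted(synthetic)
    labels = [0] * len(benign) + [1] * (len(phishing) + len(synthetic))
    return urls, labels


if __name__ == "__main__":
    # Made-up URLs on reserved example domains, for illustration only.
    benign = [
        "http://example.com/a",
        "http://example.org/b",
        "http://example.net/c",
        "http://example.com/d",
    ]
    phishing = [
        "http://paypa1.example.test/login",
        "http://paypa1.example.test/verify",
    ]
    urls, labels = oversample(benign, phishing, order=2)
    print(labels.count(0), "benign,", labels.count(1), "phishing after oversampling")
```

Because the synthetic samples are raw URL strings rather than feature vectors, any downstream feature extractor or classifier can consume them unchanged, which is the practical difference from feature-space oversampling. With a very small minority set the stand-in generator may not produce enough novel strings, so the resulting training set can remain partly imbalanced.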