Detecting malware using text documents extracted from spam email through machine learning

Document Engineering(2022)

引用 0|浏览4
暂无评分
摘要
ABSTRACTSpam has become an effective way for cybercriminals to spread malware. Although cybersecurity agencies and companies develop products and organise courses for people to detect malicious spam email patterns, spam attacks are not totally avoided yet. In this work, we present and make publicly available "Spam Email Malware Detection - 600" (SEMD-600), a new dataset, based on Bruce Guenter's, for malware detection in spam using only the text of the email. We also introduce a pipeline for malware detection based on traditional Natural Language Processing (NLP) techniques. Using SEMD-600, we compare the text representation techniques Bag of Words and Term Frequency-Inverse Document Frequency (TF-IDF), in combination with three different supervised classifiers: Support Vector Machine, Naive Bayes and Logistic Regression, to detect malware in plain text documents. We found that combining TF-IDF with Logistic Regression achieved the best performance, with a macro F1 score of 0.763.
更多
查看译文
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要