Static Analysis through Topic Modeling and its Application to Malware Programs Classification

2019 IEEE National Aerospace and Electronics Conference (NAECON)

Abstract
We perform static analysis of the malware programs in the BIG 2015 dataset, a repository containing nine families of malware, with the main goal of providing a framework for classifying the programs in the dataset. The analysis is static in the sense that the contents of the programs are examined, and their representations constructed, without executing them. More precisely, assembly language opcodes are extracted from each program and concatenated to form a document representing that program. Treating opcodes as words, we then apply Natural Language Processing tools and techniques to analyze the documents. Specifically, the Latent Dirichlet Allocation (LDA) algorithm is used to model each document as a weighted mixture of a fixed number of topics, where a topic is a collection of words grouped together for their ability to capture meaningful attributes of the documents. We observe that the distribution of topic weights across documents of the same family shows, on visual inspection, a common pattern that appears to vary from one family to another. This helps justify the use of LDA as a feature extraction method, with the features being the topic weights that represent each document. After training a fine k-nearest neighbors classifier that takes the topic weights as inputs, testing yields a 97.2% classification accuracy, attesting to the efficacy of the overall approach.
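The pipeline described in the abstract (opcode documents, LDA topic weights as features, nearest-neighbor classification) can be illustrated with a minimal, standard-library-only sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the toy opcode sequences, the family labels, the function names `lda_topic_weights` and `knn_predict`, the use of collapsed Gibbs sampling to fit LDA, the hyperparameters `alpha` and `beta`, the choice of two topics, and the reading of "fine" k-nearest neighbors as k = 1.

```python
import math
import random
from collections import Counter


def lda_topic_weights(docs, n_topics, n_iter=50, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling; return one topic-weight vector per document."""
    rng = random.Random(seed)
    vocab = {w: i for i, w in enumerate(sorted({w for d in docs for w in d}))}
    n_words = len(vocab)
    doc_topic = [[0] * n_topics for _ in docs]             # tokens of doc d assigned to topic t
    topic_word = [[0] * n_words for _ in range(n_topics)]  # tokens of word w assigned to topic t
    topic_total = [0] * n_topics                           # tokens assigned to topic t overall
    assign = []

    def update(d, w, t, delta):
        doc_topic[d][t] += delta
        topic_word[t][w] += delta
        topic_total[t] += delta

    # Random initial topic assignment for every token.
    for d, doc in enumerate(docs):
        assign.append([rng.randrange(n_topics) for _ in doc])
        for word, t in zip(doc, assign[d]):
            update(d, vocab[word], t, +1)

    # Resample each token's topic given all other assignments.
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, word in enumerate(doc):
                w = vocab[word]
                update(d, w, assign[d][i], -1)
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta) / (topic_total[t] + n_words * beta)
                    for t in range(n_topics)
                ]
                assign[d][i] = rng.choices(range(n_topics), weights=weights)[0]
                update(d, w, assign[d][i], +1)

    # Smoothed, normalized topic counts: each row sums to 1.
    return [
        [(doc_topic[d][t] + alpha) / (len(doc) + n_topics * alpha) for t in range(n_topics)]
        for d, doc in enumerate(docs)
    ]


def knn_predict(train_x, train_y, x, k=1):
    """Majority vote among the k training vectors closest to x (Euclidean distance)."""
    nearest = sorted(zip((math.dist(a, x) for a in train_x), train_y))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


# Toy opcode documents (made up); labels are hypothetical family names.
docs = [
    "push mov call pop ret push mov call".split(),
    "mov push call ret pop call mov push".split(),
    "xor add sub jmp xor add jmp sub".split(),
    "add xor jmp sub add sub xor jmp".split(),
]
labels = ["family_a", "family_a", "family_b", "family_b"]

theta = lda_topic_weights(docs, n_topics=2)
# Classify the last document from the topic weights of the others.
predicted = knn_predict(theta[:-1], labels[:-1], theta[-1], k=1)
```

For brevity the sketch fits LDA on all four toy documents, including the one that is then classified; a real evaluation would need a proper train/test protocol, which the abstract does not detail.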
Keywords
Cybersecurity, Natural Language Processing, Topic modeling, Latent Dirichlet Allocation, Machine Learning