Using Multi-Features And Ensemble Learning Method For Imbalanced Malware Classification
2016 IEEE Trustcom/BigDataSE/ISPA (2016)
Abstract
The ever-growing malware threat in cyberspace calls for techniques that are more effective than the widely deployed signature-based detection systems. To counter large volumes of malware variants, machine learning techniques have been applied to automated malware classification. Although these efforts have achieved some success, their accuracy and efficiency remain inadequate, especially for multi-class classification with imbalanced training data. Against this backdrop, the goal of this paper is to build a malware classification system that improves on this situation. Our system is based on multiple categories of static features and an ensemble learning method. Compared with some traditional systems, it has the following advantages. First, with multiple categories of features, our system can classify malware into their corresponding families effectively and efficiently while limiting, to a certain extent, the influence of evasion. Our method does not need any unpacking process and extracts features directly from the bytes file and the disassembled asm file. Second, the system employs two efficient ensemble learning models, namely XGBoost and ExtraTreesClassifier, and combines them with a stacking method to construct the final classifier. Finally, we evaluated our system on the dataset provided by Microsoft and hosted on Kaggle for the malware classification competition; the results show that our method classifies malware into families effectively and efficiently, with an accuracy of 0.9972 on the training set and a logloss of 0.00395 on the test set. Our work not only offers insights into how to use multiple features for classification, but also sheds light on how to develop scalable techniques for automated malware classification in practice.
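To make two ideas from the abstract concrete, the sketch below shows one plausible static feature taken directly from a bytes file (a byte-value histogram) and the multi-class logloss metric used to report the test result. It is an illustration under stated assumptions, not the authors' feature set or pipeline: the function names `byte_histogram` and `multiclass_logloss` are hypothetical, the bytes file is assumed to hold one address token followed by hexadecimal byte tokens per line with `??` marking unreadable bytes, and the logloss uses simple probability clipping.

```python
import math


def byte_histogram(bytes_text):
    """Count byte values in a hex dump given as text.

    Returns a list of 257 counts: indices 0..255 are byte values,
    index 256 counts '??' tokens (assumed to mark unreadable bytes).
    """
    counts = [0] * 257
    for line in bytes_text.splitlines():
        tokens = line.split()
        # Assumed layout: the first token is an address, the rest are bytes.
        for tok in tokens[1:]:
            if tok == "??":
                counts[256] += 1
            else:
                counts[int(tok, 16)] += 1
    return counts


def multiclass_logloss(y_true, probs, eps=1e-15):
    """Mean negative log-probability assigned to the true class.

    y_true: list of integer class indices.
    probs:  list of per-sample probability lists, one entry per class.
    Probabilities are clipped to [eps, 1 - eps] to avoid log(0).
    """
    total = 0.0
    for label, row in zip(y_true, probs):
        p = min(max(row[label], eps), 1.0 - eps)
        total -= math.log(p)
    return total / len(y_true)


if __name__ == "__main__":
    sample = "00401000 56 8D 44 24\n00401010 ?? ?? 00 56\n"
    hist = byte_histogram(sample)
    print(hist[0x56], hist[256])
    print(multiclass_logloss([0, 1], [[0.9, 0.1], [0.2, 0.8]]))
```

A fixed-length vector such as this histogram can be fed to any tree ensemble without unpacking the sample. Because logloss penalizes confident wrong predictions heavily, it is a stricter measure than accuracy when some families have few samples.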
Keywords
ensemble learning method, multifeatures method, imbalanced malware classification, malware threats, cyber space calls, signature-based detection system, machine learning, automated malware classification, multiple class classification, imbalanced training data, static features, feature extraction, disassembled asm file, bytes file, XGBoost models, ExtraTreesClassifier models, Kaggle