Parallelization of Machine Learning Applied to Call Graphs of Binaries for Malware Detection

Robert Searles,Lifan Xu,William Killian,Tristan Vanderbruggen, Teague Forren, John Howe,Zachary Pearson, Corey Shannon, Joshua Simmons,John Cavazos

2017 25th Euromicro International Conference on Parallel, Distributed and Network-based Processing (PDP)(2017)

引用 14|浏览55
暂无评分
摘要
Malicious applications have become increasingly numerous. This demands adaptive, learning-based techniques for constructing malware detection engines, instead of the traditional manual-based strategies. Prior work in learning-based malware detection engines primarily focuses on dynamic trace analysis and byte-level n-grams. Our approach in this paper differs in that we use compiler intermediate representations, i.e., the callgraph representation of binaries. Using graph-based program representations for learning provides structure of the program, which can be used to learn more advanced patterns. We use the Shortest Path Graph Kernel (SPGK) to identify similarities between call graphs extracted from binaries. The output similarity matrix is fed into a Support Vector Machine (SVM) algorithm to construct highly-accurate models to predict whether a binary is malicious or not. However, SPGK is computationally expensive due to the size of the input graphs. Therefore, we evaluate different parallelization methods for CPUs and GPUs to speed up this kernel, allowing us to continuously construct up-to-date models in a timely manner. Our hybrid implementation, which leverages both CPU and GPU, yields the best performance, achieving up to a 14.2x improvement over our already optimized OpenMP version. We compared our generated graph-based models to previously state-of-the-art feature vector 2-gram and 3-gram models on a dataset consisting of over 22,000 binaries. We show that our classification accuracy using graphs is over 19% higher than either n-gram model and gives a false positive rate (FPR) of less than 0.1%. We are also able to consider large call graphs and dataset sizes because of the reduced execution time of our parallelized SPGK implementation.
更多
查看译文
关键词
Binary Analysis,Parallelization,Call Graphs,Malware,Compiler Techniques,Heterogeneous Computing
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要