A Software Vulnerability Dataset of Large Open Source C/C++ Projects

José D'Abruzzo Pereira, João Henggeler Antunes, Marco Vieira

2022 IEEE 27th Pacific Rim International Symposium on Dependable Computing (PRDC) (2022)

Abstract

Automated tools, namely Static Analysis Tools (SATs) and Penetration Testing Tools, are frequently used by developers to detect vulnerabilities. However, research and practice show that these tools are of limited effectiveness in large-scale projects, where they are prone to both false positives and false negatives. Thus, there is an urgent need for more effective techniques, which ultimately require representative field data to drive their design and testing. In this paper, we present a dataset of vulnerabilities from five large open-source C/C++ projects: Mozilla, Linux Kernel, Xen, httpd, and Glibc. To collect the data, we designed an automated process grounded in vulnerabilities collected from the Common Vulnerabilities and Exposures (CVE) Details website. For each vulnerability, we retrieve the corresponding source code units from the project repository (including both vulnerable and fixed versions). We then compute a large set of Software Metrics (SMs) for those code units and run two SATs to collect security alerts (i.e., potential vulnerabilities and/or weaknesses). The dataset currently includes 5214 vulnerabilities. To demonstrate its usefulness, we explore the use of the dataset to train machine learning models to detect vulnerable C/C++ functions. Results clearly show that the dataset can be used in practice and is a key contribution for researchers working in software security.
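The step of pairing vulnerable and fixed versions of a code unit can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the record layout (`CodeUnitRecord`), the helper names, the placeholder CVE identifier, and the use of non-blank line count as a stand-in software metric are all assumptions made for illustration. The sketch diffs the two versions with the standard-library `difflib` module and records which lines of the vulnerable version were touched by the fix.

```python
import difflib
from dataclasses import dataclass
from typing import List


@dataclass
class CodeUnitRecord:
    """Hypothetical record for one code unit; not the dataset's real schema."""
    cve_id: str
    project: str
    path: str
    changed_lines: List[int]  # 1-based lines of the vulnerable version touched by the fix
    loc: int                  # non-blank lines in the vulnerable version (toy metric)


def changed_line_numbers(vulnerable_src: str, fixed_src: str) -> List[int]:
    """Return 1-based line numbers of the vulnerable version that the fix replaced or deleted.

    Pure insertions in the fixed version have no line in the vulnerable
    version, so they are not reported here.
    """
    a = vulnerable_src.splitlines()
    b = fixed_src.splitlines()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    changed: List[int] = []
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if tag in ("replace", "delete"):
            changed.extend(range(i1 + 1, i2 + 1))
    return changed


def build_record(cve_id: str, project: str, path: str,
                 vulnerable_src: str, fixed_src: str) -> CodeUnitRecord:
    """Combine the diff result and the toy metric into one record."""
    loc = sum(1 for line in vulnerable_src.splitlines() if line.strip())
    return CodeUnitRecord(cve_id, project, path,
                          changed_line_numbers(vulnerable_src, fixed_src), loc)


# Illustrative input only: a made-up C function and its made-up fix.
VULNERABLE = (
    "int f(char *s) {\n"
    "  char buf[8];\n"
    "  strcpy(buf, s);\n"
    "  return 0;\n"
    "}\n"
)
FIXED = VULNERABLE.replace(
    "  strcpy(buf, s);",
    "  strncpy(buf, s, sizeof(buf) - 1);",
)

record = build_record("CVE-0000-0000", "example", "src/f.c", VULNERABLE, FIXED)
```

A real pipeline would work at function granularity, compute many more metrics, and attach SAT alerts to each unit; the sketch only shows how a vulnerable/fixed pair yields a labeled region of code.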

Keywords

Software security, vulnerability dataset, static code analysis, software metrics
