Secrets in Source Code: Reducing False Positives using Machine Learning

2020 International Conference on COMmunication Systems & NETworkS (COMSNETS), 2020

Abstract
Private and public git repositories often contain sensitive information that was never meant to be in the source code. Many tools have been developed to scan repositories for secrets and credentials committed to the code base, inadvertently or intentionally, so that corrective action can be taken once they are found. However, most existing tools either target a specific type of secret or generate a large number of false positives. Our research aims to create a generalized framework that detects all kinds of secrets, including API keys, asymmetric private keys, client secrets, and generic passwords, using an extensive list of regular expressions. We then apply machine learning models to distinguish real secrets from false positives. Combining the regular-expression-based approach with machine learning allows us to identify different types of secrets, in particular the generic passwords that existing works ignore, and then reduce the number of false positives. We also evaluate our machine learning model using a precision-recall curve, which an operator can use to find the optimal trade-off between false positives and false negatives for their specific application. Using a Voting Classifier (a combination of Logistic Regression, Naïve Bayes and SVM), we are able to reduce the number of false positives considerably.
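The two-stage idea described in the abstract (broad regular-expression matching, then scoring each candidate and picking an operating threshold from precision and recall) can be illustrated with a minimal Python sketch. Everything concrete in it is an assumption made for illustration: the two patterns in `CANDIDATE_PATTERNS` are hypothetical and much smaller than the paper's regular expression list, and `shannon_entropy` is only a placeholder score standing in for the probability that a trained classifier, such as the Voting Classifier named above, would output. It is not the authors' implementation.

```python
import math
import re
from collections import Counter

# Hypothetical patterns for illustration only; the paper's actual
# regular expression list is not given in the abstract.
CANDIDATE_PATTERNS = {
    "private_key": re.compile(r"-----BEGIN (?:RSA |EC |DSA )?PRIVATE KEY-----"),
    "generic_assignment": re.compile(
        r"""(?i)\b(?:password|passwd|secret|api_key|token)\b\s*[:=]\s*["']([^"']+)["']"""
    ),
}


def find_candidates(source):
    """Stage 1: return (kind, value) pairs for every regex match."""
    hits = []
    for kind, pattern in CANDIDATE_PATTERNS.items():
        for match in pattern.finditer(source):
            # Use the captured value when the pattern has a group,
            # otherwise the whole match.
            value = match.group(1) if match.groups() else match.group(0)
            hits.append((kind, value))
    return hits


def shannon_entropy(text):
    """Placeholder score (bits per character).

    Assumption: this stands in for a trained classifier's probability.
    """
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    return -sum(c / total * math.log2(c / total) for c in counts.values())


def precision_recall(scores, labels, threshold):
    """Stage 2 evaluation: precision and recall at one score threshold."""
    predicted = [score >= threshold for score in scores]
    tp = sum(p and y for p, y in zip(predicted, labels))
    fp = sum(p and not y for p, y in zip(predicted, labels))
    fn = sum((not p) and y for p, y in zip(predicted, labels))
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall
```

Sweeping `threshold` over the candidate scores and recording each `(precision, recall)` pair traces out a precision-recall curve of the kind the abstract mentions. A low-entropy generic password such as "changeme" shows why a simple score like this placeholder is not sufficient on its own and a learned model is used instead.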
Keywords
Automated software tool, hard-coded secrets, source code, security