Generate and Pray: Using SALLMS to Evaluate the Security of LLM Generated Code.
CoRR (2023)
Abstract
With the growing popularity of Large Language Models (e.g. GitHub Copilot,
ChatGPT, etc.) in software engineers' daily practices, it is important to
ensure that the code generated by these tools is not only functionally correct
but also free of vulnerabilities. Although LLMs can help developers be more
productive, prior empirical studies have shown that LLMs can generate insecure
code. Two factors contribute to this insecure code generation.
First, existing datasets used to evaluate Large Language Models (LLMs) do not
adequately represent genuine software engineering tasks sensitive to security.
Instead, they are often based on competitive programming challenges or
classroom-type coding tasks. In real-world applications, the code produced is
integrated into larger codebases, introducing potential security risks. There
is a clear absence of benchmarks that focus on evaluating the security of the
generated code. Second, existing evaluation metrics primarily focus on the
functional correctness of the generated code while ignoring security
considerations. Metrics such as pass@k gauge the probability of obtaining the
correct code in the top k suggestions. Other popular metrics like BLEU,
CodeBLEU, ROUGE, and METEOR similarly emphasize functional accuracy, neglecting
security implications. In light of these research gaps, in this paper, we
describe SALLM, a framework to systematically benchmark LLMs' ability to
generate secure code. This framework has three major components: a novel dataset
of security-centric Python prompts, an evaluation environment to test the
generated code, and novel metrics to evaluate the models' performance from the
perspective of secure code generation.
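As context for the metrics discussion above: the pass@k metric the abstract refers to is commonly computed with the unbiased estimator introduced alongside HumanEval, which estimates the probability that at least one of k sampled completions passes the tests, given n generated samples of which c pass. A minimal sketch (not SALLM's own implementation, which additionally scores security):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator.

    n: total completions sampled per problem
    c: number of those completions that pass the tests
    k: evaluation budget (top-k suggestions)

    Returns 1 - C(n-c, k) / C(n, k): the probability that a random
    subset of k samples contains at least one passing completion.
    """
    if n - c < k:
        # Fewer failures than k: every k-subset contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with 2 samples of which 1 passes, pass@1 is 0.5. A security-focused framework like SALLM evaluates the same generations against vulnerability checks rather than only functional tests.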
Keywords
sallms, security, code