Between Lines of Code: Unraveling the Distinct Patterns of Machine and Human Programmers
CoRR(2024)
摘要
Large language models have catalyzed an unprecedented wave in code
generation. While achieving significant advances, they blur the distinctions
between machine-and human-authored source code, causing integrity and
authenticity issues of software artifacts. Previous methods such as DetectGPT
have proven effective in discerning machine-generated texts, but they do not
identify and harness the unique patterns of machine-generated code. Thus, its
applicability falters when applied to code. In this paper, we carefully study
the specific patterns that characterize machine and human-authored code.
Through a rigorous analysis of code attributes such as length, lexical
diversity, and naturalness, we expose unique pat-terns inherent to each source.
We particularly notice that the structural segmentation of code is a critical
factor in identifying its provenance. Based on our findings, we propose a novel
machine-generated code detection method called DetectCodeGPT, which improves
DetectGPT by capturing the distinct structural patterns of code. Diverging from
conventional techniques that depend on external LLMs for perturbations,
DetectCodeGPT perturbs the code corpus by strategically inserting spaces and
newlines, ensuring both efficacy and efficiency. Experiment results show that
our approach significantly outperforms state-of-the-art techniques in detecting
machine-generated code.
更多查看译文
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要