
Is Self-Repair a Silver Bullet for Code Generation?

ICLR(2024)

Abstract
Large language models have shown remarkable aptitude in code generation, but still struggle on challenging tasks. Self-repair---in which the model debugs and fixes mistakes in its own code---has recently become a popular way to boost performance in these settings. However, only very limited studies on how and when self-repair works effectively exist in the literature, and one might wonder to what extent a model is really capable of repairing mistakes in code which was originally generated by that very same model. In this paper, we analyze Code Llama, GPT-3.5 and GPT-4's ability to perform self-repair on problems taken from HumanEval or APPS, finding that when the cost of carrying out repair is taken into account gains are often modest, vary a lot between subsets of the data, and are sometimes not present at all. We hypothesize that this is because self-repair is bottlenecked by the model's ability to provide feedback on its own code; boosting the feedback with stronger models, we observe performance gains even in settings where the model does not benefit from self-repair. Furthermore, we observe that providing the model with feedback from human participants greatly benefits repair even for GPT-4, and we provide a brief qualitative analysis as to why.
Keywords
program synthesis,code generation,large language models,machine learning for code
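The self-repair setting the abstract describes (generate a program, test it, produce feedback on the failure, then attempt a fix) can be sketched as a simple loop. This is a minimal illustration, not the paper's actual pipeline: `generate`, `feedback`, and `repair` are hypothetical stand-ins for calls to a model such as Code Llama, GPT-3.5, or GPT-4, and `solve` is an assumed entry-point name for the candidate program.

```python
def run_tests(program: str, tests) -> bool:
    """Return True if the candidate program passes every (input, expected) test."""
    env = {}
    try:
        exec(program, env)  # load the candidate's definitions
        return all(env["solve"](inp) == expected for inp, expected in tests)
    except Exception:
        return False  # crashes or missing entry point count as failures

def self_repair(generate, feedback, repair, tests, max_repairs: int = 3):
    """One initial sample, then up to `max_repairs` feedback/repair rounds.

    `feedback(program)` returns a textual critique of the failing program;
    the paper argues this feedback step is the bottleneck of self-repair.
    """
    program = generate()
    for _ in range(max_repairs):
        if run_tests(program, tests):
            return program, True
        critique = feedback(program)
        program = repair(program, critique)
    return program, run_tests(program, tests)
```

Note that each repair round costs extra model calls, which is why the paper insists on counting the cost of repair when comparing against simply drawing more independent samples.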