Studying LLM Performance on Closed- and Open-source Data
CoRR (2024)
Abstract
Large language models (LLMs) are finding wide use in software engineering
practice. These models are extremely data-hungry, and are largely trained on
open-source (OSS) code distributed with permissive licenses. In terms of actual
use, however, a great deal of software development still occurs in the
for-profit/proprietary sphere, where the code under development is not, and
never has been, in the public domain; thus, many developers do their work, and
use LLMs, in settings where the models may not be as familiar with the code
under development. In such settings, do LLMs work as well as they do for OSS
code? If not, what are the differences? When performance differs, what are the
possible causes, and are there work-arounds? In this paper, we examine these
questions using proprietary, closed-source software data from Microsoft, where
most proprietary code is in C# and C++. We find that performance for C# changes
little from OSS to proprietary code, but drops significantly for C++; we find
that this difference is attributable to differences in identifiers. We also
find that, in some cases, this performance degradation can be ameliorated
efficiently by in-context learning.