Empirical Analysis on CI/CD Pipeline Evolution in Machine Learning Projects
arXiv (2024)
Abstract
The growing popularity of machine learning (ML) and the integration of ML
components with other software artifacts has led to the use of continuous
integration and delivery (CI/CD) tools, such as Travis CI and GitHub Actions,
that enable faster integration and testing for ML projects. Such CI/CD
configurations and services require synchronization during the life cycle of
the projects. Several works discussed how CI/CD configuration and services
change during their usage in traditional software systems. However, there is
very limited knowledge of how CI/CD configuration and services change in ML
projects.
To fill this knowledge gap, this work presents the first empirical analysis
of how CI/CD configuration evolves for ML software systems. We manually
analyzed 343 commits collected from 508 open-source ML projects to identify
common CI/CD configuration change categories in ML projects and devised a
taxonomy of 14 co-changes in CI/CD and ML components. Moreover, we developed a
CI/CD configuration change clustering tool that identified frequent CI/CD
configuration change patterns in 15,634 commits. Furthermore, we measured the
expertise of ML developers who modify CI/CD configurations. Based on this
analysis, we found that 61.8
and minimal changes related to performance and maintainability compared to
general open-source projects. Additionally, the co-evolution analysis
identified that CI/CD configurations, in many cases, changed unnecessarily due
to bad practices such as the direct inclusion of dependencies and a lack of
usage of standardized testing frameworks. The change pattern analysis revealed
further bad practices, including the use of deprecated settings and reliance
on a generic build language. Finally, our developer expertise analysis
suggests that experienced developers are more inclined to modify CI/CD
configurations.