
The Ripple Effect of Dataset Reuse: Contextualising the Data Lifecycle for Machine Learning Data Sets and Social Impact

Journal of Information Science (2023)

Abstract
Although there is a rich literature on the data lifecycle, a common framework depicts reuse as the last stage. However, this framework fails to explain the complex lifecycle of machine learning (ML) data sets, which can have many different afterlives. Data sets for ML can be expanded to supplement previous research, and researchers can concatenate multiple data sets to develop new models. This study discusses ML dataset reuse through the lens of the data-information-knowledge-wisdom pyramid. In social science research, researchers might reuse data to analyse a new research question that still falls within the context of the data domain. By contrast, research practices in ML, where researchers layer multiple data sets for training purposes, require us to ask whether the existing data lifecycle model, ending with 'reuse', is appropriate for explaining such an iterative and layered lifecycle. This study introduces one case of merging a computer vision data set with a natural language processing data set and two cases of applying ML models outside the ML community (hate speech detection and politeness detection) to justify a framework for an ML dataset lifecycle. Finally, this study proposes an ML dataset lifecycle and provides case examples to describe each stage.
Keywords
Data curation, data lifecycle, data management, machine learning, responsible data science