Navigating Dataset Documentations in AI: A Large-Scale Analysis of Dataset Cards on Hugging Face
CoRR (2024)
Abstract
Advances in machine learning are closely tied to the creation of datasets.
While data documentation is widely recognized as essential to the reliability,
reproducibility, and transparency of ML, we lack a systematic empirical
understanding of current dataset documentation practices. To shed light on this
question, here we take Hugging Face – one of the largest platforms for sharing
and collaborating on ML models and datasets – as a prominent case study. By
analyzing all 7,433 dataset cards on Hugging Face, our investigation
provides an overview of the Hugging Face dataset ecosystem and insights into
dataset documentation practices, yielding 5 main findings: (1) The dataset card
completion rate shows marked heterogeneity correlated with dataset popularity.
(2) A granular examination of each section within the dataset card reveals that
practitioners seem to prioritize the Dataset Description and Dataset Structure
sections, while the Considerations for Using the Data section receives the
lowest proportion of content. (3) By analyzing the subsections within each
section and utilizing topic modeling to identify key topics, we uncover what is
discussed in each section, and underscore significant themes encompassing both
technical and social impacts, as well as limitations within the Considerations
for Using the Data section. (4) Our findings also highlight the need for
improved accessibility and reproducibility of datasets in the Usage sections.
(5) In addition, our human annotation evaluation emphasizes the pivotal role of
comprehensive dataset content in shaping individuals' perceptions of a dataset
card's overall quality. Overall, our study offers a unique perspective on
analyzing dataset documentation through large-scale data science analysis and
underlines the need for more thorough dataset documentation in machine learning
research.
Keywords
dataset documentation, data-centric AI, computational social science