
A Survey on Data Selection for Language Models

Transactions on Machine Learning Research (2024)

Abstract
A major factor in the recent success of large language models is the use of enormous and ever-growing text datasets for unsupervised pre-training. However, naively training a model on all available data may not be optimal (or feasible), as the quality of available text data can vary. Filtering out data can also decrease the carbon footprint and financial costs of training models by reducing the amount of training required.

Data selection methods aim to determine which candidate data points to include in the training dataset and how to appropriately sample from the selected data points. The promise of improved data selection methods has caused the volume of research in the area to rapidly expand. However, because deep learning is mostly driven by empirical evidence and experimentation on large-scale data is expensive, few organizations have the resources for extensive data selection research. Consequently, knowledge of effective data selection practices has become concentrated within a few organizations, many of which do not openly share their findings and methodologies.

To narrow this gap in knowledge, we present a comprehensive review of existing literature on data selection methods and related research areas, providing a taxonomy of existing approaches. By describing the current landscape of research, this work aims to accelerate progress in data selection by establishing an entry point for new and established researchers. Additionally, throughout this review we draw attention to noticeable holes in the literature and conclude the paper by proposing promising avenues for future research.