A Survey on Data Selection for Language Models

TMLR 2024

UC Santa Barbara | Allen Institute for AI | Stanford University | Massachusetts Institute of Technology | Allen Institute for AI Xinyi Wang | University of Toronto

Abstract
A major factor in the recent success of large language models is the use of enormous and ever-growing text datasets for unsupervised pre-training. However, naively training a model on all available data may not be optimal (or feasible), as the quality of available text data can vary. Filtering out data can also decrease the carbon footprint and financial costs of training models by reducing the amount of training required. Data selection methods aim to determine which candidate data points to include in the training dataset and how to appropriately sample from the selected data points. The promise of improved data selection methods has caused the volume of research in the area to rapidly expand. However, because deep learning is mostly driven by empirical evidence and experimentation on large-scale data is expensive, few organizations have the resources for extensive data selection research. Consequently, knowledge of effective data selection practices has become concentrated within a few organizations, many of which do not openly share their findings and methodologies. To narrow this gap in knowledge, we present a comprehensive review of existing literature on data selection methods and related research areas, providing a taxonomy of existing approaches. By describing the current landscape of research, this work aims to accelerate progress in data selection by establishing an entry point for new and established researchers. Additionally, throughout this review we draw attention to noticeable holes in the literature and conclude the paper by proposing promising avenues for future research.
Key words
Language Modeling, Topic Modeling, Part-of-Speech Tagging
Chat Paper

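[Key points]: This paper surveys data selection methods for language models, makes the case for the importance of data selection in language model training, identifies gaps in current research, and proposes directions for future work.

[Method]: By reviewing the existing literature, the paper presents a taxonomy of data selection methods and related research areas.

[Experiments]: The paper reports no experiments of its own; it summarizes the state of the field through a review of the existing literature and points out directions for future research.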