A Survey on Data Selection for Language Models
TMLR 2024
UC Santa Barbara | Allen Institute for AI | Stanford University | Massachusetts Institute of Technology | University of Toronto
The authors of this paper are Alon Albalak, Yanai Elazar, Sang Michael Xie, Shayne Longpre, Nathan Lambert, Xinyi Wang, Niklas Muennighoff, Bairu Hou, Liangming Pan, Haewon Jeong, Colin Raffel, Shiyu Chang, Tatsunori Hashimoto, and Yang Wang. Their affiliations include the University of California, Santa Barbara, the Allen Institute for AI, the MIT Media Lab, and the Department of Computer Science at the University of Toronto, among others. Their research areas span natural language processing, machine learning, large language models, multi-armed bandits, data-driven decision-making, neural machine translation, and visual question answering.
1. Introduction
- The importance of data selection in machine learning
- Objectives and challenges of data selection
- Data selection in language model training
2. Taxonomy of Data Selection
- Definition of data points and datasets
- A unified framework for data selection methods
- Dimensions for classifying data selection methods
3. Data Selection Methods in Language Model Pre-training
- Language filtering
- Heuristic methods
- Data quality
- Domain-specific selection
- Data deduplication
- Toxicity and explicit content filtering
- Multilingual model data selection
- Data mixing
4. Data Selection in Other Stages of Language Model Training
- Multitask training and instruction tuning
- Alignment
- In-context learning
- Task-specific fine-tuning
5. Data Selection in Non-Language Domains
- Computer vision
- Vision-language model pre-training
- Task-specific fine-tuning
6. Related Topics
- Data cleaning
- Dataset distillation and coreset selection
- Data attribution and valuation
- Data augmentation
- Data curation
- Curriculum learning
7. Insights from Data Selection
- Test set decontamination
- Trade-offs between memorization and generalization
- There is no free lunch
- Data selection tools
- Considerations when applying data selection
8. Future Directions: Challenges and Opportunities
- Accelerating data selection research
- Better understanding of the nature of target distributions
- Shifting computational time from model training to data processing
Q: What research methods were specifically used in the paper?
- Literature Review: Through extensive reading and analysis of existing literature, the paper summarizes and synthesizes the applications and research outcomes of data selection methods in different fields.
- Conceptual Framework Construction: Proposes a unified conceptual framework that classifies and compares data selection methods, and defines the components of data selection methods, including utility functions and selection mechanisms.
- Case Studies: Analyzes in depth the data selection methods used at each stage, including language filtering, heuristic methods, data quality, domain-specific selection, data deduplication, toxicity filtering, data mixing, multitask training, instruction tuning, alignment, in-context learning, and task-specific fine-tuning.
- Comparative Analysis: Compares different data selection methods, evaluating their advantages, disadvantages, and suitable scenarios.
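The two components named in the framework above, a utility function and a selection mechanism, can be illustrated with a minimal sketch. The function names, the letter-ratio heuristic, and the 0.8 cutoff are all assumptions chosen for illustration; they are not taken from the paper.

```python
from typing import Callable, List


def alpha_ratio(text: str) -> float:
    """Hypothetical utility function: fraction of characters that are letters or spaces."""
    if not text:
        return 0.0
    return sum(ch.isalpha() or ch.isspace() for ch in text) / len(text)


def threshold_mechanism(cutoff: float) -> Callable[[float], bool]:
    """Hypothetical selection mechanism: keep a data point when its utility >= cutoff."""
    return lambda score: score >= cutoff


def select_data(
    points: List[str],
    utility: Callable[[str], float],
    mechanism: Callable[[float], bool],
) -> List[str]:
    """Score every data point with the utility function, then let the mechanism decide."""
    return [p for p in points if mechanism(utility(p))]


docs = [
    "The cat sat on the mat.",
    "$$$ 1234 #### 5678 !!!",
    "Data selection matters.",
]
# The 0.8 cutoff is an arbitrary illustrative value.
kept = select_data(docs, alpha_ratio, threshold_mechanism(0.8))
```

Keeping the utility function separate from the selection mechanism means either one can be swapped, for example replacing the heuristic with a classifier score or the threshold with top-k selection, without touching the other.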
Q: What are the main research findings and outcomes?
- Classification of Data Selection Methods: Proposes a unified conceptual framework that groups data selection methods into types such as distribution matching, data deduplication, and data mixing.
- Application of Data Selection Methods: Summarizes how data selection is applied across training stages, such as language model pre-training, multitask learning, instruction tuning, alignment, in-context learning, and task-specific fine-tuning.
- Evaluation of Data Selection Methods: Conducts comparative analysis of different data selection methods, assessing their advantages, disadvantages, and suitable scenarios.
- Future Research Directions: Proposes future research directions for data selection methods, such as developing direct data evaluation metrics, constructing data selection benchmarks and challenges, open-sourcing tools and best practices.
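Data mixing, one of the method types named above, can be pictured as weighted sampling across data domains. The sketch below is only an illustration under assumed names: `mix_domains`, the domain labels, and the weights are hypothetical and do not reproduce any specific method from the paper.

```python
import random
from typing import Dict, List


def mix_domains(
    domains: Dict[str, List[str]],
    weights: Dict[str, float],
    n: int,
    seed: int = 0,
) -> List[str]:
    """Draw n documents, choosing each document's domain in proportion to its weight."""
    rng = random.Random(seed)
    names = list(domains)
    # Pick a domain for every slot according to the mixing weights.
    picks = rng.choices(names, weights=[weights[name] for name in names], k=n)
    # Within the chosen domain, draw one document uniformly at random.
    return [rng.choice(domains[name]) for name in picks]


# Hypothetical domains and weights, for illustration only.
corpus = {"web": ["w1", "w2", "w3"], "code": ["c1", "c2"]}
batch = mix_domains(corpus, {"web": 0.7, "code": 0.3}, n=10)
```

Changing the weights changes how often each domain appears in the batch without changing the underlying data, which is the lever that data mixing methods adjust.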
Q: What are the current limitations of this research?
- Complexity of Data Selection Methods: There is a wide variety of data selection methods, each with its own specific scenarios and limitations, making it challenging to conduct comprehensive evaluation and comparison.
- Difficulty in Evaluating Data Selection Methods: The evaluation of data selection methods requires a large number of experiments and computational resources, and the results may be influenced by factors such as datasets, model architectures, and training strategies.
- Interpretability of Data Selection Methods: Some data selection methods, such as model-based filtering and deduplication methods, have decision-making processes that are difficult to interpret, making it hard to understand the reasons behind data selection.
- Fairness of Data Selection Methods: Data selection methods may introduce new biases; for example, filters based on text content may discriminate against certain groups.
