SHED: Shapley-Based Automated Dataset Refinement for Instruction Fine-Tuning
arXiv (2024)
Abstract
Pre-trained large language models (LLMs) can be adapted to many
downstream tasks and aligned with human preferences through
fine-tuning. Recent studies have found that LLMs can achieve strong
performance with only a small amount of high-quality data, suggesting that
much of the data in large fine-tuning datasets is redundant or even
harmful. Identifying high-quality data in vast datasets to curate small yet
effective datasets has emerged as a critical challenge. In this paper, we
introduce SHED, an automated dataset refinement framework based on Shapley
value for instruction fine-tuning. SHED eliminates the need for human
intervention or the use of commercial LLMs. Moreover, the datasets curated
through SHED exhibit transferability, indicating they can be reused across
different LLMs with consistently high performance. We conduct extensive
experiments to evaluate the datasets curated by SHED. The results demonstrate
SHED's superiority over state-of-the-art methods across various tasks and LLMs;
notably, datasets comprising only 10% of the original data selected by SHED
achieve performance comparable to or surpassing that of the full datasets.
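
The abstract does not describe how SHED computes or uses Shapley values, so the following is only a minimal sketch of the general idea: estimate each sample's Shapley value by averaging its marginal contribution over random orderings, then keep the highest-valued samples. The names `monte_carlo_shapley`, `toy_utility`, and `select_top_k` are hypothetical, and the additive toy utility stands in for what would in practice be a model-based score (for example, validation performance after fine-tuning on a subset). It is not the authors' implementation.

```python
import random
from typing import Callable, List, Sequence


def monte_carlo_shapley(
    n_items: int,
    utility: Callable[[Sequence[int]], float],
    n_permutations: int = 200,
    seed: int = 0,
) -> List[float]:
    """Estimate Shapley values by permutation sampling (generic technique)."""
    rng = random.Random(seed)
    totals = [0.0] * n_items
    for _ in range(n_permutations):
        order = list(range(n_items))
        rng.shuffle(order)
        subset: List[int] = []
        prev = utility(subset)  # utility of the empty set
        for i in order:
            subset.append(i)
            cur = utility(subset)
            totals[i] += cur - prev  # marginal contribution of item i
            prev = cur
    return [t / n_permutations for t in totals]


def toy_utility(subset: Sequence[int]) -> float:
    """Assumed stand-in utility: sum of made-up per-sample quality scores."""
    quality = [3.0, 1.0, 0.0, -2.0]  # hypothetical: helpful, mild, redundant, harmful
    return sum(quality[i] for i in subset)


def select_top_k(values: Sequence[float], k: int) -> List[int]:
    """Return indices of the k samples with the largest estimated values."""
    return sorted(range(len(values)), key=lambda i: values[i], reverse=True)[:k]


if __name__ == "__main__":
    values = monte_carlo_shapley(4, toy_utility, n_permutations=50, seed=1)
    print("estimated values:", values)
    print("kept samples:", select_top_k(values, k=2))
```

Because the toy utility is additive, each sample's marginal contribution is the same in every ordering, so the estimate recovers the per-sample scores directly. With a real model-based utility each evaluation is expensive, which is why practical methods need approximations beyond this naive loop.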