Self-Supervised Alignment with Mutual Information: Learning to Follow Principles without Preference Labels
arXiv (2024)
Abstract
When prompting a language model (LM), users frequently expect the model to
adhere to a set of behavioral principles across diverse tasks, such as
producing insightful content while avoiding harmful or biased language.
Instilling such principles into a model can be resource-intensive and
technically challenging, generally requiring human preference labels or
examples. We introduce SAMI, a method for teaching a pretrained LM to follow
behavioral principles that does not require any preference labels or
demonstrations. SAMI is an iterative algorithm that finetunes a pretrained LM
to increase the conditional mutual information between constitutions and
self-generated responses given queries from a dataset. On single-turn dialogue
and summarization, a SAMI-trained mistral-7b outperforms the initial pretrained
model, with win rates between 66% and 77%. Strikingly, it also exceeds an
instruction-finetuned baseline (mistral-7b-instruct) with win rates between 55%
and 57% on single-turn dialogue. Next, to
avoid dependence on stronger models, we further evaluate aligning a strong
pretrained model (mixtral-8x7b) using constitutions written by a weak
instruction-finetuned model (mistral-7b-instruct). The SAMI-trained
mixtral-8x7b outperforms both the initial model and the instruction-finetuned
model, achieving a 65% win rate on summarization. Our results indicate that a
pretrained LM can learn to follow constitutions without using preference
labels, demonstrations, or human oversight.
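The core objective described above, increasing the conditional mutual information between constitutions and self-generated responses given queries, can be lower-bounded with an InfoNCE-style contrastive loss over a batch of (constitution, response) pairs. The following is a minimal NumPy sketch of that idea, not the paper's implementation: it assumes you have already computed a matrix `logp` where `logp[i, j]` is the model's log-probability of response `j` under constitution `i` (for a fixed query), with matched pairs on the diagonal.

```python
import numpy as np

def logsumexp(a: np.ndarray, axis: int) -> np.ndarray:
    """Numerically stable log-sum-exp along the given axis."""
    m = a.max(axis=axis, keepdims=True)
    return m + np.log(np.exp(a - m).sum(axis=axis, keepdims=True))

def contrastive_mi_loss(logp: np.ndarray) -> float:
    """Symmetric InfoNCE-style loss whose negation lower-bounds the
    conditional mutual information I(constitution; response | query).

    logp[i, j] = log p(response_j | constitution_i, query), where
    response_i was sampled under constitution_i, so the diagonal holds
    the matched pairs. Minimizing this loss pushes each response to be
    most likely under the constitution that produced it.
    """
    n = logp.shape[0]
    # Row-wise softmax: identify the matched response for each constitution.
    row = logp - logsumexp(logp, axis=1)
    # Column-wise softmax: identify the matched constitution for each response.
    col = logp - logsumexp(logp, axis=0)
    diag = np.arange(n)
    return float(-(row[diag, diag].mean() + col[diag, diag].mean()) / 2)
```

With an uninformative model (all pairs equally likely) the loss is `log(n)`, the bound's ceiling for a batch of size `n`; as the diagonal entries come to dominate their rows and columns, the loss approaches 0, corresponding to a tighter mutual-information bound.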