Potential and Challenges of Model Editing for Social Debiasing
CoRR(2024)
Abstract
Large language models (LLMs) trained on vast corpora inevitably acquire
stereotype biases. Mitigating these biases with fine-tuning can be both costly
and data-hungry. Model editing methods, which modify LLMs in a post-hoc manner,
hold great potential for debiasing. However, the field lacks a comprehensive
study that covers both internal and external model editing methods, supports
various bias types, and examines the pros and cons of applying editing methods
to stereotypical debiasing. To bridge this gap, we carefully formulate social
debiasing as an editing problem and benchmark seven existing model editing
algorithms on stereotypical debiasing, i.e., debias editing. Our findings in
three scenarios reveal both the potential and challenges of debias editing:
(1) Existing model editing methods can effectively preserve knowledge and
mitigate biases, but the generalization of the debias effect from edited
sentences to semantically equivalent sentences is limited. (2) Sequential
editing highlights the robustness of SERAC (Mitchell et al. 2022b), while
internal editing methods degrade as the number of edits grows. (3) Model
editing algorithms generalize to unseen biases, both within the same bias type
and across different types. In light of these findings, we further propose two
simple but effective methods to improve debias editing and experimentally
demonstrate their effectiveness.