MiMiC: Minimally Modified Counterfactuals in the Representation Space
CoRR (2024)
Abstract
Language models often exhibit undesirable behaviors, such as gender bias or
toxic language. Interventions in the representation space have been shown to be
effective in mitigating such issues by altering the LM's behavior. We first show that two
prominent intervention techniques, Linear Erasure and Steering Vectors, do not
enable a high degree of control and are limited in expressivity.
We then propose a novel intervention methodology for generating expressive
counterfactuals in the representation space, aiming to make representations of
a source class (e.g., “toxic”) resemble those of a target class (e.g.,
“non-toxic”). This approach, generalizing previous linear intervention
techniques, utilizes a closed-form solution for the Earth Mover's problem under
Gaussian assumptions and provides theoretical guarantees on the representation
space's geometric organization. We further build on this technique and derive a
nonlinear intervention that enables controlled generation. We demonstrate the
effectiveness of the proposed approaches in mitigating bias in multiclass
classification and in reducing the generation of toxic language, outperforming
strong baselines.
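
The closed-form solution mentioned above can be illustrated with the standard optimal transport map between two Gaussians, which matches both the mean and the covariance of the source class to those of the target class. The following is a minimal sketch under stated assumptions, not the authors' implementation: the function names (`sqrtm_psd`, `gaussian_ot_map`, `steering_shift`), the `eps` regularizer, and the use of empirical means and covariances are all illustrative choices not taken from the abstract. A mean-difference shift is included for comparison as one common form of a steering vector; it moves the mean but leaves the covariance unchanged.

```python
import numpy as np


def sqrtm_psd(m):
    """Square root of a symmetric positive semi-definite matrix."""
    vals, vecs = np.linalg.eigh(m)
    vals = np.clip(vals, 0.0, None)  # guard against tiny negative eigenvalues
    return (vecs * np.sqrt(vals)) @ vecs.T


def gaussian_ot_map(source, target, eps=1e-6):
    """Fit the closed-form Gaussian transport map x -> mu_t + A (x - mu_s).

    `source` and `target` are (n_samples, dim) arrays of representations.
    `eps` is a hypothetical ridge term added for numerical stability.
    """
    dim = source.shape[1]
    mu_s, mu_t = source.mean(axis=0), target.mean(axis=0)
    cov_s = np.cov(source, rowvar=False) + eps * np.eye(dim)
    cov_t = np.cov(target, rowvar=False) + eps * np.eye(dim)

    s_half = sqrtm_psd(cov_s)
    s_half_inv = np.linalg.inv(s_half)
    # A = cov_s^{-1/2} (cov_s^{1/2} cov_t cov_s^{1/2})^{1/2} cov_s^{-1/2}
    a = s_half_inv @ sqrtm_psd(s_half @ cov_t @ s_half) @ s_half_inv

    def transform(x):
        # A is symmetric, so right-multiplying row vectors by A is equivalent.
        return mu_t + (x - mu_s) @ a

    return transform, a


def steering_shift(source, target):
    """Mean-difference baseline: translate by the difference of class means."""
    delta = target.mean(axis=0) - source.mean(axis=0)
    return lambda x: x + delta


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic stand-ins for "source" and "target" class representations.
    src = rng.normal(size=(500, 4)) * np.array([1.0, 2.0, 0.5, 1.5]) + 1.0
    tgt = rng.normal(size=(500, 4)) @ (np.eye(4) + 0.3 * rng.normal(size=(4, 4))) - 2.0

    transform, _ = gaussian_ot_map(src, tgt)
    mapped = transform(src)
    shifted = steering_shift(src, tgt)(src)

    cov = lambda x: np.cov(x, rowvar=False)
    print("OT map covariance gap:  ", np.abs(cov(mapped) - cov(tgt)).max())
    print("Mean shift covariance gap:", np.abs(cov(shifted) - cov(tgt)).max())
```

In this sketch both interventions reproduce the target mean, but only the affine map also reproduces the target covariance, which is one way to read the claim that the proposed approach generalizes earlier linear interventions.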