Causal Explainable Guardrails for Large Language Models
CoRR (2024)
Abstract
Large Language Models (LLMs) have shown impressive performance in natural
language tasks, but their outputs can exhibit undesirable attributes or biases.
Existing methods for steering LLMs towards desired attributes often assume
unbiased representations and rely solely on steering prompts. However, the
representations learned from pre-training can introduce semantic biases that
influence the steering process, leading to suboptimal results. We propose
LLMGuardaril, a novel framework that incorporates causal analysis and
adversarial learning to obtain unbiased steering representations in LLMs.
LLMGuardaril systematically identifies and blocks the confounding effects of
biases, enabling the extraction of unbiased steering representations.
Additionally, it includes an explainable component that provides insights into
the alignment between the generated output and the desired direction.
Experiments demonstrate LLMGuardaril's effectiveness in steering LLMs towards
desired attributes while mitigating biases. Our work contributes to the
development of safe and reliable LLMs that align with desired attributes. We
discuss the limitations and future research directions, highlighting the need
for ongoing research to address the ethical implications of large language
models.
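
The abstract names two mechanisms: adversarial learning that blocks the confounding effect of pre-training biases on the steering representation, and an explainable component that scores how well generated output aligns with the desired direction. Below is a minimal sketch of those two ideas, not the authors' implementation: it uses a gradient-reversal adversary (one standard way to realize adversarial debiasing) and cosine similarity as the alignment score, and every module name, dimension, and the synthetic data are illustrative assumptions.

```python
# Minimal sketch (NOT the paper's code): adversarially debias a steering
# representation, then score output alignment with the steering direction.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GradReverse(torch.autograd.Function):
    """Identity on the forward pass; reversed, scaled gradient on backward."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x

    @staticmethod
    def backward(ctx, grad_out):
        return -ctx.lam * grad_out, None


class SteeringExtractor(nn.Module):
    """Maps LLM hidden states to a steering representation z. An adversary
    tries to recover a confounding bias label from z; the reversed gradient
    pushes the encoder to discard that information, leaving an
    (approximately) unbiased steering representation."""
    def __init__(self, d_hidden, d_steer, n_bias_classes):
        super().__init__()
        self.encoder = nn.Linear(d_hidden, d_steer)
        self.attr_head = nn.Linear(d_steer, 1)               # desired-attribute probe
        self.bias_head = nn.Linear(d_steer, n_bias_classes)  # adversary

    def forward(self, h, lam=1.0):
        z = torch.tanh(self.encoder(h))
        attr_logit = self.attr_head(z).squeeze(-1)
        bias_logit = self.bias_head(GradReverse.apply(z, lam))
        return z, attr_logit, bias_logit


def alignment_score(z_out, steer_dir):
    """Explainability proxy: cosine similarity between an output
    representation and the learned steering direction."""
    return F.cosine_similarity(z_out, steer_dir.expand_as(z_out), dim=-1)


if __name__ == "__main__":
    torch.manual_seed(0)
    d_hidden, d_steer, n_bias = 64, 16, 3
    model = SteeringExtractor(d_hidden, d_steer, n_bias)
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)

    # Synthetic stand-ins for pooled LLM hidden states and labels.
    h = torch.randn(256, d_hidden)
    y_attr = torch.randint(0, 2, (256,)).float()  # desired attribute
    y_bias = torch.randint(0, n_bias, (256,))     # confounding bias label

    for step in range(200):
        z, attr_logit, bias_logit = model(h, lam=1.0)
        # Predict the attribute well while the reversed gradient makes the
        # bias label unpredictable from z.
        loss = (F.binary_cross_entropy_with_logits(attr_logit, y_attr)
                + F.cross_entropy(bias_logit, y_bias))
        opt.zero_grad()
        loss.backward()
        opt.step()

    steer_dir = model.attr_head.weight.detach()  # direction in steering space
    z, _, _ = model(h)
    print("mean alignment:", alignment_score(z.detach(), steer_dir).mean().item())
```

The paper's actual method rests on a causal analysis of how pre-training biases confound steering; the gradient-reversal setup above is only a common adversarial-learning stand-in for "blocking the confounding effect," and the cosine score is one simple way to make the output/direction alignment inspectable.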