Enhancing Large Language Models Against Inductive Instructions with Dual-critique Prompting
arXiv (2023)
Abstract
Numerous works have been proposed to align large language models (LLMs) with human
intent so that they better fulfill instructions, ensuring they are truthful and helpful.
Nevertheless, human instructions are sometimes malicious or misleading, and
following them can lead to untruthful and unsafe responses. Previous work
has rarely examined how LLMs handle instructions built on
counterfactual premises, referred to here as inductive instructions,
which may stem from users' false beliefs or malicious intent. In this paper,
we aim to reveal how LLMs behave when given inductive instructions
and to enhance their truthfulness and helpfulness accordingly. Specifically, we
first introduce a benchmark of Inductive
Instructions (INDust), in which false
knowledge is incorporated into instructions in several distinct styles. After
extensive human and automatic evaluations, we uncover a universal
vulnerability among LLMs in processing inductive instructions. We also
find that different inductive styles affect the models' ability to
identify the same underlying errors, and that the complexity of the underlying
assumptions further influences model performance. Motivated by these
results, we propose Dual-critique prompting to improve LLM robustness
against inductive instructions. Our experiments demonstrate that
Dual-critique prompting significantly bolsters the robustness of a
diverse array of LLMs, even when confronted with varying degrees of inductive
instruction complexity and differing inductive styles.
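The abstract names Dual-critique prompting but does not spell out the procedure here. Below is a minimal sketch of how such a scheme could be wired up, assuming "dual" means a user-critique pass over the instruction followed by a self-critique pass over the draft answer; the llm callable, template wording, and function names are illustrative placeholders, not the authors' exact prompts.

    from typing import Callable

    # Hypothetical stand-in for any chat-completion backend: maps a
    # prompt string to a model completion.
    LLM = Callable[[str], str]

    # Pass 1 (user-critique): ask the model to vet the instruction for
    # false or counterfactual premises before following it.
    USER_CRITIQUE = (
        "Before answering, examine the instruction below. If it rests on a "
        "false or counterfactual premise, point out the error instead of "
        "following it.\n\nInstruction: {instruction}"
    )

    # Pass 2 (self-critique): ask the model to inspect and revise its own
    # draft so the final answer stays truthful and helpful.
    SELF_CRITIQUE = (
        "Here is your draft response:\n{draft}\n\n"
        "Critique the draft: does it accept any false premise from the "
        "instruction? Revise it so the final answer is truthful and helpful."
    )

    def dual_critique(llm: LLM, instruction: str) -> str:
        """Run a user-critique pass, then a self-critique pass."""
        draft = llm(USER_CRITIQUE.format(instruction=instruction))
        return llm(SELF_CRITIQUE.format(draft=draft))

In practice, llm would wrap whichever model is under test; the two passes could also be collapsed into a single prompt as a cheaper zero-shot variant of the same idea.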