Forbidden Facts: An Investigation of Competing Objectives in Llama-2
CoRR(2023)
摘要
LLMs often face competing pressures (for example helpfulness vs.
harmlessness). To understand how models resolve such conflicts, we study
Llama-2-chat models on the forbidden fact task. Specifically, we instruct
Llama-2 to truthfully complete a factual recall statement while forbidding it
from saying the correct answer. This often makes the model give incorrect
answers. We decompose Llama-2 into 1000+ components, and rank each one with
respect to how useful it is for forbidding the correct answer. We find that in
aggregate, around 35 components are enough to reliably implement the full
suppression behavior. However, these components are fairly heterogeneous and
many operate using faulty heuristics. We discover that one of these heuristics
can be exploited via a manually designed adversarial attack which we call The
California Attack. Our results highlight some roadblocks standing in the way of
being able to successfully interpret advanced ML systems. Project website
available at https://forbiddenfacts.github.io .
更多查看译文
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要