Detecting Edited Knowledge in Language Models
arXiv (2024)
Abstract
Knowledge editing techniques (KEs) can update language models' obsolete or
inaccurate knowledge learned from pre-training. However, KEs also face
potential malicious applications, e.g., inserting misinformation and toxic
content. Moreover, in the context of responsible AI, it is instructive for
end-users to know whether a generated output is driven by edited knowledge or
first-hand knowledge from pre-training. To this end, we study detecting edited
knowledge in language models by introducing a novel task: given an edited model
and a specific piece of knowledge the model generates, our objective is to
classify the knowledge as either "non-edited" (based on pre-training) or
"edited" (based on subsequent editing). We initiate the task with two
state-of-the-art KEs, two language models, and two datasets. We further propose
a simple classifier, RepReg, a logistic regression model that takes hidden
state representations as input features. Our results reveal that RepReg
establishes a strong baseline, achieving a peak accuracy of 99.81%
in out-of-domain settings. Second, RepReg achieves near-optimal performance
with a limited training set (200 training samples), and it maintains its
performance even in out-of-domain settings. Last, we find it more challenging
to separate edited and non-edited knowledge when they contain the same subject
or object.
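The core of RepReg, as described in the abstract, is a logistic regression classifier over hidden-state representations. The following is a minimal illustrative sketch, not the authors' implementation: the hidden states here are synthetic stand-ins (random vectors with an assumed mean shift for edited knowledge), and the dimension and sample counts are arbitrary choices for demonstration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
HIDDEN_DIM = 64  # stand-in for the language model's hidden-state width

# Synthetic stand-ins for hidden-state features: purely for illustration,
# edited knowledge is assumed to shift the representation mean slightly.
non_edited = rng.normal(0.0, 1.0, size=(200, HIDDEN_DIM))
edited = rng.normal(0.8, 1.0, size=(200, HIDDEN_DIM))

X = np.vstack([non_edited, edited])
y = np.array([0] * 200 + [1] * 200)  # 0 = non-edited, 1 = edited

# A plain logistic regression over the representations, as in RepReg.
clf = LogisticRegression(max_iter=1000).fit(X, y)
train_acc = clf.score(X, y)
```

In practice the features would be hidden states extracted from the edited model when it generates the knowledge in question; the sketch only shows the classification step.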