GradSafe: Detecting Unsafe Prompts for LLMs via Safety-Critical Gradient Analysis
CoRR(2024)
摘要
Large Language Models (LLMs) face threats from unsafe prompts. Existing
methods for detecting unsafe prompts are primarily online moderation APIs or
finetuned LLMs. These strategies, however, often require extensive and
resource-intensive data collection and training processes. In this study, we
propose GradSafe, which effectively detects unsafe prompts by scrutinizing the
gradients of safety-critical parameters in LLMs. Our methodology is grounded in
a pivotal observation: the gradients of an LLM's loss for unsafe prompts paired
with compliance response exhibit similar patterns on certain safety-critical
parameters. In contrast, safe prompts lead to markedly different gradient
patterns. Building on this observation, GradSafe analyzes the gradients from
prompts (paired with compliance responses) to accurately detect unsafe prompts.
We show that GradSafe, applied to Llama-2 without further training, outperforms
Llama Guard, despite its extensive finetuning with a large dataset, in
detecting unsafe prompts. This superior performance is consistent across both
zero-shot and adaptation scenarios, as evidenced by our evaluations on the
ToxicChat and XSTest. The source code is available at
https://github.com/xyq7/GradSafe.
更多查看译文
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要