Protecting Your LLMs with Information Bottleneck
arXiv (2024)
Abstract
The advent of large language models (LLMs) has revolutionized the field of
natural language processing, yet these models can be manipulated into producing
harmful content. Despite efforts to ethically align LLMs, such alignments are
often fragile and can be circumvented by jailbreaking attacks built on
optimized or manually crafted adversarial prompts. To address this, we
introduce the Information Bottleneck Protector (IBProtector), a defense
mechanism grounded in the information bottleneck principle, with the objective
modified to avoid trivial solutions. The IBProtector selectively compresses and
perturbs prompts via a lightweight, trainable extractor, preserving only the
information essential for the target LLM to respond with the expected answer.
Moreover, we also consider the setting where the target LLM's gradients are not
accessible, so that IBProtector remains compatible with any LLM. Our empirical
evaluations show that IBProtector outperforms current defense methods in
mitigating jailbreak attempts without unduly affecting response quality or
inference speed. Its effectiveness and adaptability across various attack
methods and target LLMs underscore the potential of IBProtector as a novel,
transferable defense that bolsters the security of LLMs without requiring
modifications to the underlying models.
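To make the mechanism concrete: in the standard information bottleneck formulation that the abstract invokes, an extractor maps the original prompt X to a compressed prompt X̃ by trading off compression against predictiveness, roughly min I(X; X̃) − β I(X̃; Y), where Y is the expected answer; the paper states that it modifies this objective to avoid trivial solutions. Below is a minimal sketch of what a "lightweight and trainable extractor" could look like as a per-token selector. The class name PromptExtractor, the scorer architecture, and the binary-concrete relaxation are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class PromptExtractor(nn.Module):
    """Illustrative lightweight extractor (an assumption, not the paper's
    model): scores each prompt token and keeps only the tokens judged
    essential for the target LLM's expected answer."""

    def __init__(self, embed_dim: int, hidden_dim: int = 128):
        super().__init__()
        # Small trainable scorer over per-token embeddings.
        self.scorer = nn.Sequential(
            nn.Linear(embed_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, token_embeds: torch.Tensor, temperature: float = 0.5):
        # token_embeds: (batch, seq_len, embed_dim), e.g. taken from the
        # target LLM's embedding layer or a frozen encoder.
        logits = self.scorer(token_embeds).squeeze(-1)  # (batch, seq_len)
        if self.training:
            # Binary-concrete relaxation so keep/drop decisions stay
            # differentiable during training (an assumed design choice).
            u = torch.rand_like(logits).clamp(1e-6, 1 - 1e-6)
            noise = torch.log(u) - torch.log(1 - u)
            mask = torch.sigmoid((logits + noise) / temperature)
        else:
            # Hard keep/drop decisions at inference time.
            mask = (torch.sigmoid(logits) > 0.5).float()
        # Compress and perturb: suppress embeddings of non-essential tokens
        # before the (possibly adversarial) prompt reaches the target LLM.
        compressed = token_embeds * mask.unsqueeze(-1)
        return compressed, mask

# Usage sketch with random embeddings standing in for a tokenized prompt.
extractor = PromptExtractor(embed_dim=768)
prompt_embeds = torch.randn(1, 32, 768)
compressed, mask = extractor(prompt_embeds)
print(mask.shape)  # torch.Size([1, 32])
```

In a setup like this, the expected fraction of kept tokens can serve as the compression term (standing in for I(X; X̃)), while the target LLM's likelihood of the expected answer on the compressed prompt stands in for I(X̃; Y); when the target LLM's gradients are unavailable, that predictiveness term would have to be estimated from model outputs alone, consistent with the black-box setting the abstract mentions.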