AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models
arxiv(2023)
摘要
The aligned Large Language Models (LLMs) are powerful language understanding
and decision-making tools that are created through extensive alignment with
human feedback. However, these large models remain susceptible to jailbreak
attacks, where adversaries manipulate prompts to elicit malicious outputs that
should not be given by aligned LLMs. Investigating jailbreak prompts can lead
us to delve into the limitations of LLMs and further guide us to secure them.
Unfortunately, existing jailbreak techniques suffer from either (1) scalability
issues, where attacks heavily rely on manual crafting of prompts, or (2)
stealthiness problems, as attacks depend on token-based algorithms to generate
prompts that are often semantically meaningless, making them susceptible to
detection through basic perplexity testing. In light of these challenges, we
intend to answer this question: Can we develop an approach that can
automatically generate stealthy jailbreak prompts? In this paper, we introduce
AutoDAN, a novel jailbreak attack against aligned LLMs. AutoDAN can
automatically generate stealthy jailbreak prompts by the carefully designed
hierarchical genetic algorithm. Extensive evaluations demonstrate that AutoDAN
not only automates the process while preserving semantic meaningfulness, but
also demonstrates superior attack strength in cross-model transferability, and
cross-sample universality compared with the baseline. Moreover, we also compare
AutoDAN with perplexity-based defense methods and show that AutoDAN can bypass
them effectively.
更多查看译文
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要