DrAttack: Prompt Decomposition and Reconstruction Makes Powerful LLM Jailbreakers
CoRR (2024)
Abstract
The safety alignment of Large Language Models (LLMs) is vulnerable to both
manual and automated jailbreak attacks, which adversarially trigger LLMs to
output harmful content. However, current methods for jailbreaking LLMs, which
nest entire harmful prompts, are not effective at concealing malicious intent
and can be easily identified and rejected by well-aligned LLMs. This paper
discovers that decomposing a malicious prompt into separate sub-prompts can
effectively obscure its underlying malicious intent by presenting it in a
fragmented, less detectable form, thereby addressing these limitations. We
introduce an automatic prompt Decomposition and
Reconstruction framework for jailbreak Attack (DrAttack).
DrAttack includes three key components: (a) 'Decomposition' of the original
prompt into sub-prompts, (b) 'Reconstruction' of these sub-prompts implicitly
through in-context learning with a semantically similar but harmless
reassembling demo, and (c) a 'Synonym Search' of sub-prompts, aiming to find
sub-prompts' synonyms that maintain the original intent while jailbreaking LLMs. An
extensive empirical study across multiple open-source and closed-source LLMs
demonstrates that, with a significantly reduced number of queries, DrAttack
obtains a substantial gain in success rate over prior SOTA prompt-only
attackers. Notably, the success rate of 78.0% on GPT-4 with merely 15 queries
surpasses prior art by 33.1%.