Automatic and Universal Prompt Injection Attacks against Large Language Models
CoRR (2024)
Abstract
Large Language Models (LLMs) excel in processing and generating human
language, powered by their ability to interpret and follow instructions.
However, their capabilities can be exploited through prompt injection attacks.
These attacks manipulate LLM-integrated applications into producing responses
aligned with the attacker's injected content, deviating from the user's actual
requests. The substantial risks posed by these attacks underscore the need for
a thorough understanding of the threats. Yet, research in this area faces
challenges due to the lack of a unified goal for such attacks and their
reliance on manually crafted prompts, complicating comprehensive assessments of
prompt injection robustness. We introduce a unified framework for understanding
the objectives of prompt injection attacks and present an automated
gradient-based method for generating highly effective and universal prompt
injection data, even in the face of defensive measures. With only five training
samples (0.3% of the test data), our attack can achieve superior
performance compared with baselines. Our findings emphasize the importance of
gradient-based testing, which can avoid overestimation of robustness,
especially for defense mechanisms.