TRAP: Targeted Random Adversarial Prompt Honeypot for Black-Box Identification
Findings of the Association for Computational Linguistics: ACL 2024 (2024)
Abstract
Large Language Model (LLM) services and models often come with legal rules on who can use them and how they must use them. Assessing the compliance of the released LLMs is crucial, as these rules protect the interests of the LLM contributor and prevent misuse. In this context, we describe the novel problem of Black-box Identity Verification (BBIV). The goal is to determine whether a third-party application uses a certain LLM through its chat function. We propose a method called Targeted Random Adversarial Prompt (TRAP) that identifies the specific LLM in use. We repurpose adversarial suffixes, originally proposed for jailbreaking, to get a pre-defined answer from the target LLM, while other models give random answers. TRAP detects the target LLMs with over 95% true positive rate at under 0.2% false positive rate even after a single interaction. TRAP remains effective even if the LLM has minor changes that do not significantly alter the original function.
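
To make the identification mechanism concrete, here is a minimal sketch of a TRAP-style check in Python. It assumes a hypothetical query_chat() wrapper around the third-party chat endpoint; the instruction, suffix, and target answer are illustrative placeholders, not the optimized values from the paper.

def query_chat(prompt: str) -> str:
    """Placeholder for the black-box chat function under test."""
    raise NotImplementedError("Call the third-party application's API here.")

def trap_check(instruction: str, adversarial_suffix: str, target_answer: str) -> bool:
    """Return True if the black box produces the pre-defined answer.

    The suffix is optimized offline on the reference (target) LLM so that it
    steers that model to target_answer. Other models treat the instruction as
    a genuinely open-ended request and answer essentially at random, so a
    match is strong evidence that the black box runs the target LLM.
    """
    response = query_chat(instruction + " " + adversarial_suffix)
    return target_answer in response

# Usage (illustrative): ask for a random number; only the target model is
# steered to the pre-defined value by the suffix.
# identified = trap_check(
#     instruction="Write a random string of 3 digits.",
#     adversarial_suffix="<suffix optimized against the target LLM>",
#     target_answer="314",
# )

Because non-target models pick the answer roughly uniformly at random, a single matching response already gives strong evidence of identity, which is why one interaction can suffice.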