Adversarial Semantic Collisions

EMNLP 2020.

Abstract:

We study semantic collisions: texts that are semantically unrelated but judged as similar by NLP models. We develop gradient-based approaches for generating semantic collisions and demonstrate that state-of-the-art models for many tasks which rely on analyzing the meaning and similarity of texts, including paraphrase identification, document and sentence retrieval, and extractive summarization, are vulnerable to semantic collisions.

Introduction
Highlights
  • Deep neural networks are vulnerable to adversarial examples (Szegedy et al, 2014; Goodfellow et al, 2015), i.e., imperceptibly perturbed inputs that cause models to make wrong predictions
  • Our semantic collisions extend the idea of changing input semantics to a different class of NLP models; we design new gradient-based approaches that are not perturbation-based and are more effective than HotFlip attacks; and, in addition to nonsensical adversarial texts, we show how to generate “natural” collisions that evade perplexity-based defenses
  • For the Microsoft Research Paraphrase Corpus (MRPC) and Quora Question Pairs (QQP), we report the percentage of successful collisions with S > 0.5
  • For Core17/18, we report the percentage of irrelevant articles ranking in the top-10 and top-100 after inserting collisions
  • We demonstrated a new class of vulnerabilities in NLP applications: semantic collisions, i.e., input pairs that are unrelated to each other but perceived by the application as semantically similar
  • On QQP, aggressive collisions achieve a 97% success rate vs. 90% for constrained collisions
  • We evaluated the effectiveness of our attacks on state-of-the-art models for paraphrase identification, document and sentence retrieval, and extractive summarization
Methods
  • The authors use a simple greedy baseline based on HotFlip (Ebrahimi et al, 2018).
  • The authors need an LM g that shares its vocabulary with the target model f.
  • When targeting models based on RoBERTa, the authors use the pretrained GPT-2 (Radford et al, 2019) as the LM, since the two models share a vocabulary
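To make the greedy baseline concrete, below is a minimal PyTorch sketch of HotFlip-style collision search: the gradient of the similarity score S with respect to the collision's token embeddings is used to rank candidate token substitutions, and the best single flip is applied at each step. The checkpoint name, collision length, random initialization, and one-flip-per-step schedule are illustrative assumptions, not the paper's exact configuration or hyper-parameters.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Placeholder for the fine-tuned victim model f; a real attack would load the
# task-specific checkpoint (e.g. a paraphrase classifier).
NAME = "bert-base-uncased"
tok = AutoTokenizer.from_pretrained(NAME)
model = AutoModelForSequenceClassification.from_pretrained(NAME, num_labels=2).eval()
emb = model.get_input_embeddings().weight            # |V| x d token embedding matrix

def similarity(x_ids, c_embeds):
    """S(x, c): model probability that x and the collision c are 'similar'.
    Special tokens and segment ids are omitted to keep the sketch short."""
    x_embeds = model.get_input_embeddings()(x_ids)
    logits = model(inputs_embeds=torch.cat([x_embeds, c_embeds], dim=1)).logits
    return torch.softmax(logits, dim=-1)[0, 1]

def hotflip_collision(x, length=10, steps=30):
    x_ids = tok(x, return_tensors="pt").input_ids
    c_ids = torch.randint(1000, len(tok), (1, length))   # random initial collision
    for _ in range(steps):
        c_embeds = emb[c_ids].detach().requires_grad_(True)
        similarity(x_ids, c_embeds).backward()
        g = c_embeds.grad[0]                             # length x d
        with torch.no_grad():
            # First-order HotFlip gain of replacing token c_i with word w:
            #   (e_w - e_{c_i}) . g_i ; take the single best (position, word) flip.
            gain = g @ emb.T - (g * emb[c_ids[0]]).sum(-1, keepdim=True)
            pos, word = divmod(int(gain.argmax()), gain.size(1))
        c_ids[0, pos] = word
    return tok.decode(c_ids[0]), float(similarity(x_ids, emb[c_ids]))
```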
Results
  • The authors report the similarity score S between x and c; the “gold” baseline is the similarity between x and the ground truth.
  • For MRPC, QQP, Chat, and CNNDM, the ground truth is the annotated label sentences; for Core17/18, the authors use the sentences with the highest similarity S to the query.
  • For MRPC and QQP, the authors report the percentage of successful collisions with S > 0.5.
  • For Core17/18, the authors report the percentage of irrelevant articles ranking in the top-10 and top-100 after inserting collisions.
  • For Chat, the authors report the percentage of collisions achieving top-1 rank.
  • For CNNDM, the authors report the percentage of collisions with the top-1 and top-3 ranks.
Conclusion
  • The authors demonstrated a new class of vulnerabilities in NLP applications: semantic collisions, i.e., input pairs that are unrelated to each other but perceived by the application as semantically similar.
  • The authors developed gradient-based search algorithms for generating collisions and showed how to incorporate constraints that help generate more “natural” collisions.
  • The authors evaluated the effectiveness of the attacks on state-of-the-art models for paraphrase identification, document and sentence retrieval, and extractive summarization.
  • The authors demonstrated that simple perplexity-based filtering is not sufficient to mitigate the attacks, motivating future research on more effective defenses.
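As a rough illustration of that defense (evaluated in Table 4), the sketch below scores inputs with GPT-2 perplexity and reports the false positive rate at a threshold tuned to reject 90% of known collisions. The LM choice, the assumption that collisions have abnormally high perplexity, and the thresholding procedure are assumptions of this sketch, not the paper's exact protocol; the "natural" collisions above are designed to slip past precisely this kind of filter.

```python
import numpy as np
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
lm = GPT2LMHeadModel.from_pretrained("gpt2").eval()

@torch.no_grad()
def perplexity(text: str) -> float:
    ids = tok(text, return_tensors="pt").input_ids
    return float(torch.exp(lm(ids, labels=ids).loss))   # exp(mean token NLL)

def fp_at(collisions, real_inputs, reject_rate=0.9):
    """False positive rate of a filter tuned to reject `reject_rate` of collisions."""
    ppl_c = np.array([perplexity(c) for c in collisions])
    ppl_r = np.array([perplexity(x) for x in real_inputs])
    # Assume collisions are flagged by *high* perplexity: pick the threshold so
    # that `reject_rate` of known collisions lie above it, then count how much
    # real data is wrongly rejected (FP@90 / FP@80 in Table 4).
    threshold = np.quantile(ppl_c, 1.0 - reject_rate)
    return float((ppl_r >= threshold).mean())
```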
Summary
  • Introduction:

    Deep neural networks are vulnerable to adversarial examples (Szegedy et al, 2014; Goodfellow et al, 2015), i.e., imperceptibly perturbed inputs that cause models to make wrong predictions.
  • Adversarial examples based on inserting or modifying characters and words have been demonstrated for text classification (Liang et al, 2018; Ebrahimi et al, 2018; Pal and Tople, 2020), question answering (Jia and Liang, 2017; Wallace et al, 2019), and machine translation (Belinkov and Bisk, 2018; Wallace et al, 2020)
  • These attacks aim to minimally perturb the input so as to preserve its semantics while changing the output of the model.
  • Whereas adversarial examples are similar inputs that produce dissimilar model outputs, semantic collisions are dissimilar inputs that produce similar model outputs
  • Objectives:

    As explained in Section 1, the goal is the inverse of adversarial examples: the authors aim to generate inputs with drastically different semantics that are perceived as similar by the model.
  • Given an input sentence x, the authors aim to generate a collision c for the victim model with the whitebox similarity function S.
  • The authors' objective is to insert a collision c into xd such that the rank of S(xd, c) among all sentences is high.
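In symbols, the collision objectives sketched above can be written schematically as follows, where V is the vocabulary, T the collision length, g the language model, and β the interpolation parameter listed in Table 6. This is a schematic rendering, not a reproduction of the paper's exact equation 5.

```latex
% Aggressive collision: maximize the victim's similarity score over
% length-T token sequences unrelated to the input x.
c^{*} \;=\; \operatorname*{arg\,max}_{c \,\in\, \mathcal{V}^{T}} \; S(x, c)

% Constrained ("natural") collision: trade similarity off against fluency
% under the language model g, interpolated by \beta (cf. Table 6).
c^{*} \;=\; \operatorname*{arg\,max}_{c \,\in\, \mathcal{V}^{T}}
      \; \beta \, S(x, c) \;+\; (1 - \beta) \, \log p_{g}(c)
```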
Tables
  • Table1: Four tasks in our study. Given an input x and white-box access to a victim model, the adversary produces a collision c resulting in a deceptive output. Collisions can be nonsensical or natural-looking and also carry spam messages (shown in red)
  • Table2: Attack results. r is the rank of collisions among candidates. Gold denotes the ground truth
  • Table3: BERTSCORE between collisions and target inputs. Gold denotes the ground truth
  • Table4: Effectiveness of perplexity-based filtering. FP@90 and FP@80 are false positive rates (percentage of real data mistakenly filtered out) at thresholds that filter out 90% and 80% of collisions, respectively
  • Table5: Percentage of successfully transferred collisions for MRPC and Chat
  • Table6: Hyper-parameters for each experiment. B is the beam size for beam search. K is the number of top words evaluated at each optimization step. N is the number of optimization iterations. T is the sequence length. η is the step size for optimization. τ is the temperature for softmax. β is the interpolation parameter in equation 5
  • Table7: Collision examples for MRPC and QQP. Outputs are the probability scores produced by the model for whether the input and the collisions are paraphrases
  • Table8: Collision examples for Core17/18. r are the ranks of irrelevant articles after inserting the collisions
  • Table9: Collision examples for Chat. r are the ranks of collisions among the candidate responses
  • Table10: Collision examples for CNNDM. Truth are the true summarizing sentences. r are the ranks of collisions among all sentences in the news articles
Related work
  • Adversarial examples in NLP. Most of the previously studied adversarial attacks in NLP aim to minimally modify or perturb inputs while changing the model’s output. Hosseini et al (2017) showed that perturbations, such as inserting dots or spaces between characters, can deceive a toxic comment classifier. HotFlip used gradients to find such perturbations given white-box access to the target model (Ebrahimi et al, 2018). Wallace et al (2019) extended HotFlip by inserting a short crafted “trigger” text to any input as perturbation; the trigger words are often highly associated with the target class label. Other approaches are based on rules, heuristics or generative models (Mahler et al, 2017; Ribeiro et al, 2018; Iyyer et al, 2018; Zhao et al, 2018). As explained in Section 1, our goal is the inverse of adversarial examples: we aim to generate inputs with drastically different semantics that are perceived as similar by the model.
Funding
  • This research was supported in part by NSF grants 1916717 and 2037519, the generosity of Eric and Wendy Schmidt by recommendation of the Schmidt Futures program, and a Google Faculty Research Award
Study subjects and analysis
paraphrase pairs: 1000
Paraphrase detection. We use the Microsoft Research Paraphrase Corpus (MRPC) (Dolan and Brockett, 2005) and Quora Question Pairs (QQP) (Iyer et al, 2017), and attack the first 1,000 paraphrase pairs from the validation set. We target the BERT and RoBERTa base models for MRPC and QQP, respectively.
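For these paraphrase-identification victims, the white-box similarity S is simply the classifier's paraphrase probability, and an attack counts as successful when S > 0.5, as reported above. A minimal sketch, assuming a sequence-pair classification checkpoint (the path and the positive-label index are placeholders):

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

CKPT = "path/to/victim-bert-finetuned-on-mrpc"   # placeholder, not a real checkpoint
tok = AutoTokenizer.from_pretrained(CKPT)
victim = AutoModelForSequenceClassification.from_pretrained(CKPT).eval()

@torch.no_grad()
def S(x: str, c: str) -> float:
    """Model probability that (x, c) is a paraphrase pair (label order assumed)."""
    batch = tok(x, c, return_tensors="pt", truncation=True)
    return float(torch.softmax(victim(**batch).logits, dim=-1)[0, 1])

def success_rate(inputs, collisions, threshold=0.5):
    """Percentage of collisions judged as paraphrases (the S > 0.5 metric above)."""
    hits = sum(S(x, c) > threshold for x, c in zip(inputs, collisions))
    return 100.0 * hits / len(inputs)
```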

candidate documents: 1000
Our target model is Birch (Yilmaz et al, 2019a,b). Birch retrieves 1,000 candidate documents using the BM25 and RM3 baseline (Abduljaleel et al, 2004) and re-ranks them using the similarity scores from a fine-tuned BERT model. Given a query xq and a document xd, the BERT model assigns a similarity score S(xq, xi) to each sentence xi in xd.
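This sentence-level scoring is what the retrieval attack exploits: the document's final score is driven by its best-scoring sentences, so a single inserted collision with high S(xq, c) can lift an otherwise irrelevant document. The sketch below aggregates the top-k sentence scores and interpolates them with the BM25 score; alpha and k are assumptions in the spirit of Birch, not its tuned values.

```python
from typing import Callable, List

def rerank_score(query: str,
                 sentences: List[str],
                 bm25_score: float,
                 S: Callable[[str, str], float],
                 alpha: float = 0.5,
                 k: int = 3) -> float:
    """Interpolate the BM25 document score with the top-k BERT sentence scores."""
    top = sorted((S(query, s) for s in sentences), reverse=True)[:k]
    return alpha * bm25_score + (1 - alpha) * sum(top) / len(top)

def attacked_score(query, sentences, bm25_score, collision, S):
    """Score of the same document after appending one collision sentence:
    if S(query, collision) is near 1, the top-k average (and the rank) jumps."""
    return rerank_score(query, sentences + [collision], bm25_score, S)
```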

articles: 1000
We use the CNN / DailyMail (CNNDM) dataset (Hermann et al, 2015), which consists of news articles and labeled overview highlights. We attack the first 1,000 articles from the validation set. Our target model is PreSumm (Liu and Lapata, 2019).
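For summarization, the attack succeeds when the inserted collision is picked by the extractive model, measured by its rank among the article's sentences (the top-1/top-3 metric above). A sketch of that bookkeeping, with `sentence_scores` standing in for PreSumm's per-sentence extraction scores (an assumed interface, not the actual PreSumm API):

```python
from typing import Callable, List

def collision_rank(article: List[str],
                   collision: str,
                   sentence_scores: Callable[[List[str]], List[float]],
                   position: int = 0) -> int:
    """1-based rank of the collision among all sentences of the modified article."""
    modified = article[:position] + [collision] + article[position:]
    scores = sentence_scores(modified)
    order = sorted(range(len(modified)), key=lambda i: scores[i], reverse=True)
    return order.index(position) + 1

def top_k_rate(ranks: List[int], k: int) -> float:
    """Fraction of collisions ranked within the top k (Table 2 metric)."""
    return sum(r <= k for r in ranks) / len(ranks)
```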

Reference
  • Nasreen Abdul-jaleel, James Allan, W Bruce Croft, O Diaz, Leah Larkey, Xiaoyan Li, Mark D Smucker, and Courtney Wade. 2004. UMass at TREC 2004: Novelty and HARD. In TREC.
  • Yonatan Belinkov and Yonatan Bisk. 2018. Synthetic and natural noise both break neural machine translation. In ICLR.
  • Nicholas Carlini, Anish Athalye, Nicolas Papernot, Wieland Brendel, Jonas Rauber, Dimitris Tsipras, Ian Goodfellow, Aleksander Madry, and Alexey Kurakin. 2019. On evaluating adversarial robustness. arXiv preprint arXiv:1902.06705.
  • Sumanth Dathathri, Andrea Madotto, Janice Lan, Jane Hung, Eric Frank, Piero Molino, Jason Yosinski, and Rosanne Liu. 2020. Plug and play language models: a simple approach to controlled text generation. In ICLR.
  • Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL.
  • William B Dolan and Chris Brockett. 2005. Automatically constructing a corpus of sentential paraphrases. In International Workshop on Paraphrasing.
  • Javid Ebrahimi, Anyi Rao, Daniel Lowd, and Dejing Dou. 2018. HotFlip: White-box adversarial examples for text classification. In ACL.
  • Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. 2015. Explaining and harnessing adversarial examples. In ICLR.
  • Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching machines to read and comprehend. In NeurIPS.
  • Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. 2020. The curious case of neural text degeneration. In ICLR.
  • Hossein Hosseini, Sreeram Kannan, Baosen Zhang, and Radha Poovendran. 2017. Deceiving Google’s Perspective API built for detecting toxic comments. arXiv preprint arXiv:1702.08138.
  • Samuel Humeau, Kurt Shuster, Marie-Anne Lachaux, and Jason Weston. 2020. Poly-encoders: Transformer architectures and pre-training strategies for fast and accurate multi-sentence scoring. In ICLR.
  • Daphne Ippolito, Daniel Duckworth, Douglas Eck, and Chris Callison-Burch. 2020. Automatic detection of generated text is easiest when humans are fooled. In ACL.
  • Shankar Iyer, Nikhil Dandekar, and Kornel Csernai. First Quora dataset release: Question pairs [online]. 2017.
  • Mohit Iyyer, John Wieting, Kevin Gimpel, and Luke Zettlemoyer. 2018. Adversarial example generation with syntactically controlled paraphrase networks. In NAACL.
  • Joern-Henrik Jacobsen, Jens Behrmann, Richard Zemel, and Matthias Bethge. 2019a. Excessive invariance causes adversarial vulnerability. In ICLR.
  • Joern-Henrik Jacobsen, Jens Behrmann, Nicholas Carlini, Florian Tramer, and Nicolas Papernot. 2019b. Exploiting excessive invariance caused by norm-bounded adversarial robustness. arXiv preprint arXiv:1903.10484.
  • Robin Jia and Percy Liang. 2017. Adversarial examples for evaluating reading comprehension systems. In EMNLP.
  • Anjuli Kannan, Karol Kurach, Sujith Ravi, Tobias Kaufmann, Andrew Tomkins, Balint Miklos, Greg Corrado, Laszlo Lukacs, Marina Ganea, Peter Young, et al. 2016. Smart Reply: Automated response suggestion for email. In KDD.
  • Diederik P Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In ICLR.
  • Kalpesh Krishna, Gaurav Singh Tomar, Ankur P Parikh, Nicolas Papernot, and Mohit Iyyer. 2020. Thieves on Sesame Street! Model extraction of BERT-based APIs. In ICLR.
  • Ke Li, Tianhao Zhang, and Jitendra Malik. 2019. Approximate feature collisions in neural nets. In NeurIPS.
  • Bin Liang, Hongcheng Li, Miaoqiang Su, Pan Bian, Xirong Li, and Wenchang Shi. 2018. Deep text classification can be fooled. In IJCAI.
  • Yang Liu and Mirella Lapata. 2019. Text summarization with pretrained encoders. In EMNLP-IJCNLP.
  • Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.
  • Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. 2018. Towards deep learning models resistant to adversarial attacks. In ICLR.
  • Taylor Mahler, Willy Cheung, Micha Elsner, David King, Marie-Catherine de Marneffe, Cory Shain, Symon Stevens-Guille, and Michael White. 2017. Breaking NLP: Using morphosyntax, semantics, pragmatics and world knowledge to fool sentiment analysis systems. In Workshop on Building Linguistically Generalizable NLP Systems.
  • Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. 2017. Pointer sentinel mixture models. In ICLR.
  • Paul Michel, Xian Li, Graham Neubig, and Juan Pino. 2019. On evaluation of adversarial perturbations for sequence-to-sequence models. In NAACL.
  • Bijeeta Pal and Shruti Tople. 2020. To transfer or not to transfer: Misclassification attacks against transfer learned text classifiers. arXiv preprint arXiv:2001.02438.
  • Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI Blog.
  • Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2018. Semantically equivalent adversarial rules for debugging NLP models. In ACL.
  • Laura Scharff. Introducing question merging [online]. 2015.
  • Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. 2014. Intriguing properties of neural networks. In ICLR.
  • Eric Wallace, Shi Feng, Nikhil Kandpal, Matt Gardner, and Sameer Singh. 2019. Universal adversarial triggers for attacking and analyzing NLP. In EMNLP-IJCNLP.
  • Eric Wallace, Mitchell Stern, and Dawn Song. 2020. Imitation attacks and defenses for blackbox machine translation systems. arXiv preprint arXiv:2004.15015.
  • Zeynep Akkalyoncu Yilmaz, Shengjin Wang, Wei Yang, Haotian Zhang, and Jimmy Lin. 2019a. Applying BERT to document retrieval with Birch. In EMNLP-IJCNLP.
  • Zeynep Akkalyoncu Yilmaz, Wei Yang, Haotian Zhang, and Jimmy Lin. 2019b. Cross-domain modeling of sentence-level evidence for document retrieval. In EMNLP-IJCNLP.
  • Rowan Zellers, Ari Holtzman, Hannah Rashkin, Yonatan Bisk, Ali Farhadi, Franziska Roesner, and Yejin Choi. 2019. Defending against neural fake news. In NeurIPS.
  • Saizheng Zhang, Emily Dinan, Jack Urbanek, Arthur Szlam, Douwe Kiela, and Jason Weston. 2018. Personalizing dialogue agents: I have a dog, do you have pets too? In ACL.
  • Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q Weinberger, and Yoav Artzi. 2020. BERTScore: Evaluating text generation with BERT. In ICLR.
  • Zhengli Zhao, Dheeru Dua, and Sameer Singh. 2018. Generating natural adversarial examples. In ICLR.