Contextualized Perturbation for Textual Adversarial Attack
NAACL-HLT 2021, pp. 5053–5069
Abstract
Adversarial examples expose the vulnerabilities of natural language processing (NLP) models, and can be used to evaluate and improve their robustness. Existing techniques of generating such examples are typically driven by local heuristic rules that are agnostic to the context, often resulting in unnatural and ungrammatical outputs. …
Code: https://github.com/cookielee77/CLARE
Introduction
- Adversarial example generation for natural language processing (NLP) tasks aims to perturb input text to trigger errors in machine learning models, while keeping the output close to the original.
- Generating adversarial examples for NLP tasks can be challenging, in part due to the discrete nature of natural language text.
- Adversarial example generation centers around a victim model f, which the authors assume is a text classifier.
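The setup described in these bullets can be captured by a small search skeleton: query the victim classifier, propose perturbations, and accept the first one that flips the prediction while staying close to the original. The sketch below is a generic illustration, not the authors' algorithm; `victim`, `propose` and `similarity` are hypothetical callables, and the threshold and step budget are arbitrary.

```python
# Generic black-box attack skeleton (illustrative only, not CLARE itself).
from typing import Callable, Iterable, Optional

def greedy_attack(
    text: str,
    victim: Callable[[str], int],             # victim classifier f: text -> label
    propose: Callable[[str], Iterable[str]],  # candidate perturbations of a text
    similarity: Callable[[str, str], float],  # textual similarity in [0, 1]
    sim_threshold: float = 0.7,               # arbitrary cutoff for this sketch
    max_steps: int = 20,
) -> Optional[str]:
    """Return an adversarial example for `text`, or None if the attack fails."""
    original_label = victim(text)
    current = text
    for _ in range(max_steps):
        fallback = None
        for candidate in propose(current):
            if similarity(text, candidate) < sim_threshold:
                continue                      # drifted too far from the original
            if victim(candidate) != original_label:
                return candidate              # prediction flipped: attack succeeded
            if fallback is None:
                fallback = candidate          # keep a valid edit to continue from
        if fallback is None:
            return None                       # no admissible perturbation left
        current = fallback
    return None
```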
Highlights
- Adversarial example generation for natural language processing (NLP) tasks aims to perturb input text to trigger errors in machine learning models, while keeping the output close to the original
- We evaluate CLARE on text classification, natural language inference, and sentence paraphrase tasks, by attacking fine-tuned BERT models (Devlin et al., 2019)
- Our analysis further suggests that CLARE can be used to improve the robustness of downstream models, and to improve their accuracy when the available training data is limited
- As we will show in the experiments (§3.3), contextualized infilling produces more fluent and grammatical outputs compared to the context-agnostic counterparts, especially when using masked language models trained on large-scale data (see the fill-mask sketch after this list)
- In preliminary experiments, we found that it is more difficult to use other models to attack a victim model trained with the adversarial examples generated by CLARE than to use CLARE itself
- We have presented CLARE, a contextualized adversarial example generation model for text
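The contextualized infilling highlighted above can be sketched with an off-the-shelf masked language model from Hugging Face transformers; the model name (`distilroberta-base`) and `top_k` value here are illustrative choices for the sketch, not the paper's exact configuration.

```python
# Sketch of contextualized mask infilling with a pretrained masked LM.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="distilroberta-base")

sentence = "The movie was <mask> and I would watch it again."
for candidate in fill_mask(sentence, top_k=5):
    # Each candidate is scored by the masked LM conditioned on the full
    # sentence, which is what keeps the infilled text fluent and grammatical.
    print(f"{candidate['token_str'].strip():>12s}  p={candidate['score']:.3f}")
```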
Methods
- The authors evaluate CLARE on text classification, natural language inference, and sentence paraphrase tasks.
- Merge perturbation can only merge noun phrases, extracted with the NLTK toolkit. The authors find that this helps produce more grammatical outputs
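The Methods note that Merge only applies inside noun phrases extracted with NLTK. A rough way to obtain such spans is POS tagging plus a regexp chunker, as sketched below; the chunk grammar is an assumption for illustration and not necessarily the one used in the paper's implementation.

```python
# Sketch: extract noun-phrase spans with NLTK so that only adjacent tokens
# inside a noun phrase are considered for the Merge perturbation.
# The chunk grammar is an illustrative choice, not the authors' exact one.
import nltk

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

GRAMMAR = "NP: {<DT>?<JJ.*>*<NN.*>+}"   # optional determiner, adjectives, nouns
chunker = nltk.RegexpParser(GRAMMAR)

def noun_phrase_spans(text):
    """Return the tokens and the (start, end) token indices of noun phrases."""
    tokens = nltk.word_tokenize(text)
    tree = chunker.parse(nltk.pos_tag(tokens))
    spans, i = [], 0
    for node in tree:
        if isinstance(node, nltk.Tree) and node.label() == "NP":
            spans.append((i, i + len(node)))
            i += len(node)
        else:
            i += 1
    return tokens, spans

tokens, spans = noun_phrase_spans("The federal cybersecurity agency issued a new warning.")
print([tokens[s:e] for s, e in spans])
```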
Results
- With a very limited set of synonym candidates from WordNet, PWWS fails to attack a BERT model on most inputs.
- Using word embeddings to find synonyms, TextFooler achieves a higher success rate, but tends to produce less grammatical and less natural outputs.
- Equipped with a language model, TextFooler+LM does better in terms of perplexity, yet this brings little grammaticality improvement and comes at a cost to attack success rate.
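The perplexity (PPL) comparisons above come from scoring outputs with a language model. One common recipe, sketched below with the small GPT-2, is to exponentiate the mean token-level cross-entropy; this page does not state which LM produced the reported PPL numbers, so the model choice here is an assumption.

```python
# Sketch: sentence-level perplexity under GPT-2, one way to quantify the
# fluency of adversarial outputs. The choice of "gpt2" (the small model) is
# illustrative, not necessarily the evaluation LM used in the paper.
import math
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

@torch.no_grad()
def perplexity(text: str) -> float:
    ids = tokenizer(text, return_tensors="pt").input_ids
    # With labels == input_ids, the model returns the mean token-level
    # cross-entropy loss; exponentiating it gives perplexity.
    loss = model(ids, labels=ids).loss
    return math.exp(loss.item())

print(perplexity("The movie was great and I would watch it again."))
print(perplexity("The movie was groovy and I would watch it again."))
```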
Conclusion
- A key technique of CLARE is the local mask-then-infill perturbation (see the sketch after this list). This comes with several advantages.
- Generating adversarial examples with masked language models is explored by a concurrent work (Li et al., 2020)
- Their method is similar to CLARE except that it uses only the Replace action.
- As shown in the ablation study (§4.1), using all three actions helps CLARE achieve better attack performance. The authors have presented CLARE, a contextualized adversarial example generation model for text.
- The authors release the code and models at https://github.com/cookielee77/CLARE
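The mask-then-infill perturbation mentioned above comes in three flavors. Following the descriptions on this page (Table 7's caption: insert a token between a and b; merge a and b into a token), the masked sequences for Replace, Insert and Merge can be constructed roughly as below; this is a reconstruction for illustration, not the authors' code.

```python
# Sketch of the three CLARE actions framed as mask constructions,
# reconstructed from the descriptions on this page (not the authors' code).
# Each action yields a sequence containing one mask, which a masked
# language model then fills in (see the fill-mask sketch above).
MASK = "<mask>"  # RoBERTa-style mask token

def replace_action(tokens, i):
    """Replace the token at position i with a mask."""
    return tokens[:i] + [MASK] + tokens[i + 1:]

def insert_action(tokens, i):
    """Insert a mask after position i (a new token between tokens i and i+1)."""
    return tokens[:i + 1] + [MASK] + tokens[i + 1:]

def merge_action(tokens, i):
    """Merge tokens i and i+1 into a single mask (output is one token shorter)."""
    return tokens[:i] + [MASK] + tokens[i + 2:]

tokens = "the federal cybersecurity agency issued a warning".split()
print(" ".join(replace_action(tokens, 2)))
print(" ".join(insert_action(tokens, 2)))
print(" ".join(merge_action(tokens, 2)))
```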
Tables
- Table 1: Some statistics of the datasets. The last column indicates the victim model's accuracy on the original test set without adversarial attack
- Table 2: Adversarial example generation performance in attack success rate (A-rate), modification rate (Mod), perplexity (PPL), number of increased grammar errors (GErr), and textual similarity (Sim). The perplexity of the original inputs is indicated in parentheses for each dataset. Bold font indicates the best performance for each metric
- Table 3: Human evaluation performance in percentage on the AG News dataset. ± indicates confidence intervals with a 95% confidence level
- Table 4: Ablation study results. "w/o sim" ablates the textual similarity constraint used when constructing the candidate sets, while "w/o pMLM > k" ablates the masked language model probability constraint
- Table 5: Results of CLARE implemented with different masked language models (MLM)
- Table 6: Adversarial training results on the AG News test set. "Acc" indicates accuracy
- Table 7: Top: Top-3 POS tags (or POS tag bigrams) and their percentages for each perturbation type. (a, b): insert a token between a and b. a-b: merge a and b into a token. Bottom: An AG News sample, where CLARE perturbs the token "cybersecurity." Neither PWWS nor TextFooler is able to attack this token since it is out of their vocabularies
- Table 8: Adversarial examples produced by different models. The gold label of the original is shown below the (bolded) dataset name. Replace, Insert and Merge are highlighted in italic red, bold blue and sans-serif yellow, respectively. (Best viewed in color)
- Table 10: Adversarial example generation performance in attack success rate (A-rate), modification rate (Mod), perplexity (PPL), number of increased grammar errors (GErr), and textual similarity (Sim). The perplexity of the original inputs is indicated in parentheses for each dataset. Bold indicates the best performance on each metric. Compared with all baselines, CLARE achieves the best performance on attack success rate, perplexity, grammaticality, and similarity, consistent with the observations in §3.3
- Table 11: Speed experiment with different attack models
Related work
- Textual adversarial attack. An increasing amount of effort is being devoted to generating better textual adversarial examples with various attack models. Character-based models (Liang et al., 2019; Ebrahimi et al., 2018; Li et al., 2018; Gao et al., 2018, inter alia) use misspellings to attack the victim systems; however, these attacks can often be defended by a spell checker (Pruthi et al., 2019; Vijayaraghavan and Roy, 2019; Zhou et al., 2019b; Jones et al., 2020). Many sentence-level models (Iyyer et al., 2018; Wang et al., 2019b; Zou et al., 2020, inter alia) have been developed to introduce more sophisticated token/phrase perturbations. These, however, generally have difficulty maintaining semantic similarity with the original inputs (Zhang et al., 2020a). Recent word-level models explore …
Funding
- The trend is similar for fluency & grammaticality (42% vs. 9%)
- As shown in Table 6, when the full training data is available, adversarial training slightly decreases the test accuracy by 0.2% and 0.4% respectively
- As shown in Table 6, in 3 out of the 4 cases, adversarial training helps to decrease the attack success rate by more than 10.2%, and to increase the number of modifications needed by more than 0.7
Study subjects and analysis
datasets: 4
Table 1 summarizes some statistics of the datasets. In addition to the above four datasets, we experiment with the DBpedia ontology dataset (Zhang et al., 2015), the Stanford Sentiment Treebank (Socher et al., 2013), the Microsoft Research Paraphrase Corpus (Dolan and Brockett, 2005), and Quora Question Pairs from the GLUE benchmark. The results on these datasets are summarized in Appendix A.2
datasets: 4
Table 2 summarizes the results. Although PWWS achieves the best modification rate on 3 out of the 4 datasets, it underperforms CLARE in terms of other metrics. With a very limited set of synonym candidates from WordNet, PWWS fails to attack a BERT model on most inputs
cases: 4
A higher success rate and fewer modifications indicate a victim classifier is more vulnerable to adversarial attacks. As shown in Table 6, in 3 out of the 4 cases, adversarial training helps to decrease the attack success rate by more than 10.2%, and to increase the number of modifications needed by more than 0.7. The only exception is the TextCNN model trained with 10% data
Reference
- Moustafa Alzantot, Yash Sharma, Ahmed Elgohary, Bo-Jhang Ho, Mani Srivastava, and Kai-Wei Chang. 2018. Generating natural language adversarial examples. In Proc. of EMNLP.
- Daniel Cer, Yinfei Yang, Sheng-yi Kong, Nan Hua, Nicole Limtiaco, Rhomni St John, Noah Constant, Mario Guajardo-Cespedes, Steve Yuan, Chris Tar, et al. 2018. Universal sentence encoder. arXiv preprint arXiv:1803.11175.
- Yong Cheng, Lu Jiang, and Wolfgang Macherey. 2019. Robust neural machine translation with doubly adversarial inputs. In Proc. of ACL.
- Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proc. of NAACL.
- William B. Dolan and Chris Brockett. 2005. Automatically constructing a corpus of sentential paraphrases. In Proceedings of the Third International Workshop on Paraphrasing.
- Javid Ebrahimi, Anyi Rao, Daniel Lowd, and Dejing Dou. 2018. Hotflip: White-box adversarial examples for text classification. In Proc. of ACL.
- Ji Gao, Jack Lanchantin, Mary Lou Soffa, and Yanjun Qi. 2018. Black-box generation of adversarial text sequences to evade deep learning classifiers. In IEEE Security and Privacy Workshops (SPW).
- Marjan Ghazvininejad, Omer Levy, Yinhan Liu, and Luke Zettlemoyer. 2019. Mask-predict: Parallel decoding of conditional masked language models. In Proc. of EMNLP.
- Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. 2015. Explaining and harnessing adversarial examples. In Proc. of ICLR.
- Jiatao Gu, James Bradbury, Caiming Xiong, Victor OK Li, and Richard Socher. 2018. Non-autoregressive neural machine translation. In Proc. of ICLR.
- Mohit Iyyer, John Wieting, Kevin Gimpel, and Luke Zettlemoyer. 2018. Adversarial example generation with syntactically controlled paraphrase networks. In Proc. of NAACL.
- Robin Jia and Percy Liang. 2017. Adversarial examples for evaluating reading comprehension systems. In Proc. of EMNLP.
- Robin Jia, Aditi Raghunathan, Kerem Goksel, and Percy Liang. 2019. Certified robustness to adversarial word substitutions. In Proc. of EMNLP.
- Di Jin, Zhijing Jin, Joey Tianyi Zhou, and Peter Szolovits. 2020. Is bert really robust? natural language attack on text classification and entailment. In Proc. of AAAI.
- Erik Jones, Robin Jia, Aditi Raghunathan, and Percy Liang. 2020. Robust encodings: A framework for combating adversarial typos. In Proc. of ACL.
- Katharina Kann, Sascha Rothe, and Katja Filippova. 2018. Sentence-level fluency evaluation: References help, but can be spared! In Proc. of CoNLL, pages 313–323.
- Yoon Kim. 2014. Convolutional neural networks for sentence classification. In Proc. of EMNLP.
- Keita Kurita, Paul Michel, and Graham Neubig. 2020. Weight poisoning attacks on pre-trained models. arXiv preprint arXiv:2004.06660.
- Jason Lee, Elman Mansimov, and Kyunghyun Cho. 2018. Deterministic non-autoregressive neural sequence modeling by iterative refinement. In Proc. of EMNLP.
- Jinfeng Li, Shouling Ji, Tianyu Du, Bo Li, and Ting Wang. 2018. Textbugger: Generating adversarial text against real-world applications. arXiv preprint arXiv:1812.05271.
- Jiwei Li, Xinlei Chen, Eduard Hovy, and Dan Jurafsky. 2016. Visualizing and understanding neural models in nlp. In Proc. of NAACL, pages 681–691.
- Linyang Li, Ruotian Ma, Qipeng Guo, Xiangyang Xue, and Xipeng Qiu. 2020. Bert-attack: Adversarial attack against bert using bert. arXiv preprint arXiv:2004.09984.
- Bin Liang, Hongcheng Li, Miaoqiang Su, Pan Bian, Xirong Li, and Wenchang Shi. 2019. Deep text classification can be fooled. In Proc. of IJCAI.
- Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692.
- Xuezhe Ma, Chunting Zhou, Xian Li, Graham Neubig, and Eduard Hovy. 2019. Flowseq: Nonautoregressive conditional sequence generation with generative flow. In Proc. of EMNLP.
- George A. Miller. 1995. WordNet: A lexical database for English. Communications of the ACM, 38(11):39–41.
- John X Morris, Eli Lifland, Jack Lanchantin, Yangfeng Ji, and Yanjun Qi. 2020. Reevaluating adversarial examples in natural language. arXiv preprint arXiv:2004.14174.
- Nikola Mrksic, Diarmuid O Seaghdha, Blaise Thomson, Milica Gasic, Lina M Rojas Barahona, Pei-Hao Su, David Vandyke, Tsung-Hsien Wen, and Steve Young. 2016. Counter-fitting word vectors to linguistic constraints. In Proc. of NAACL.
- Daniel Naber et al. 2003. A rule-based style and grammar checker. Citeseer.
- Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. 2019. Pytorch: An imperative style, high-performance deep learning library. In Proc. of NeurIPS.
- Danish Pruthi, Bhuwan Dhingra, and Zachary C Lipton. 2019. Combating adversarial misspellings with robust word recognition. In Proc. of ACL.
- Jipeng Qiang, Yun Li, Yi Zhu, and Yunhao Yuan. 2020. A simple bert-based approach for lexical simplification. In Proc. of AAAI.
- Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI Blog, 1(8):9.
- Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. Squad: 100,000+ questions for machine comprehension of text. In Proc. of EMNLP.
- Shuhuai Ren, Yihe Deng, Kun He, and Wanxiang Che. 2019. Generating natural language adversarial examples through probability weighted word saliency. In Proc. of ACL.
- Yi Ren, Jinglin Liu, Xu Tan, Sheng Zhao, Zhou Zhao, and Tie-Yan Liu. 2020. A study of non-autoregressive model for sequence generation. arXiv preprint arXiv:2004.10454.
- Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D Manning, Andrew Y Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proc. of EMNLP.
- Zhiqing Sun, Zhuohan Li, Haoqing Wang, Di He, Zi Lin, and Zhihong Deng. 2019. Fast structured decoding for sequence models. In Proc. of NeurIPS.
- Dimitris Tsipras, Shibani Santurkar, Logan Engstrom, Alexander Turner, and Aleksander Madry. 2018. Robustness may be at odds with accuracy. In Proc. of ICLR.
- Prashanth Vijayaraghavan and Deb Roy. 2019. Generating black-box adversarial examples for text classifiers using a deep reinforced model. arXiv preprint arXiv:1909.07873.
- Eric Wallace, Shi Feng, Nikhil Kandpal, Matt Gardner, and Sameer Singh. 2019. Universal adversarial triggers for attacking and analyzing nlp. In Proc. of EMNLP.
- Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2019a. Glue: A multi-task benchmark and analysis platform for natural language understanding. In Proc. of ICLR.
- Boxin Wang, Hengzhi Pei, Han Liu, and Bo Li. 2019b. Advcodec: Towards a unified framework for adversarial text generation. arXiv preprint arXiv:1912.10375.
- Xiaosen Wang, Hao Jin, and Kun He. 2019c. Natural language adversarial attacks and defenses in word level. arXiv preprint arXiv:1909.06723.
- Adina Williams, Nikita Nangia, and Samuel Bowman. 2018. A broad-coverage challenge corpus for sentence understanding through inference. In Proc. of NAACL.
- Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. 2019. Huggingface's transformers: State-of-the-art natural language processing. ArXiv, abs/1910.03771.
- Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2018. Semantically equivalent adversarial rules for debugging nlp models. In Proc. of ACL.
- Xing Wu, Shangwen Lv, Liangjun Zang, Jizhong Han, and Songlin Hu. 2019a. Conditional bert contextual augmentation. In Proc. of ICCS.
- Suranjana Samanta and Sameep Mehta. 2017. Towards crafting text adversarial samples. arXiv preprint arXiv:1707.02812.
- Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2019. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108.
- Xing Wu, Tao Zhang, Liangjun Zang, Jizhong Han, and Songlin Hu. 2019b. ”mask and infill”: Applying masked language model to sentiment transfer. In Proc. of IJCAI.
- Yuan Zang, Fanchao Qi, Chenghao Yang, Zhiyuan Liu, Meng Zhang, Qun Liu, and Maosong Sun. 2020. Word-level textual adversarial attacking as combinatorial optimization. In Proc. of ACL.
- Huangzhao Zhang, Hao Zhou, Ning Miao, and Lei Li. 2019. Generating fluent adversarial examples for natural languages. In Proc. of ACL.
- Wei Emma Zhang, Quan Z Sheng, Ahoud Alhazmi, and Chenliang Li. 2020a. Adversarial attacks on deep-learning models in natural language processing: A survey. ACM Transactions on Intelligent Systems and Technology (TIST), 11(3):1–41.
- Xiang Zhang, Junbo Zhao, and Yann LeCun. 2015. Character-level convolutional networks for text classification. In Proc. of NeurIPS.
- Yizhe Zhang, Guoyin Wang, Chunyuan Li, Zhe Gan, Chris Brockett, and Bill Dolan. 2020b. Pointer: Constrained text generation via insertion-based generative pre-training. arXiv preprint arXiv:2005.00558.
- Zhengli Zhao, Dheeru Dua, and Sameer Singh. 2018. Generating natural adversarial examples. In Proc. of ICLR.
- Wangchunshu Zhou, Tao Ge, Ke Xu, Furu Wei, and Ming Zhou. 2019a. Bert-based lexical substitution. In Proc. of ACL.
- Yichao Zhou, Jyun-Yu Jiang, Kai-Wei Chang, and Wei Wang. 2019b. Learning to discriminate perturbations for blocking adversarial attacks in text classification. In Proc. of EMNLP.
- Wei Zou, Shujian Huang, Jun Xie, Xinyu Dai, and Jiajun Chen. 2020. A reinforced generation of adversarial samples for neural machine translation. In Proc. of ACL.
- Model Implementation. All pretrained models and victim models based on RoBERTa and BERT_base are implemented with Hugging Face transformers (Wolf et al., 2019) on top of PyTorch (Paszke et al., 2019). The RoBERTa_distill, RoBERTa_base and uncased BERT_base models have 82M, 125M and 110M parameters, respectively. We use RoBERTa_distill as our main backbone for fast inference. PWWS and TextFooler are built with the open-source implementations provided by their authors. In the implementation of TextFooler+LM, we use the small GPT-2 language model (Radford et al., 2019) to further select those candidate tokens that have the top 20% perplexity in the candidate token set. In the adversarial training (§4.2), the small TextCNN victim model (Kim, 2014) uses 128-dimensional embeddings and 100 filters for each of the window sizes 3, 4 and 5, with 0.5 dropout, resulting in 7M parameters.
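The TextCNN victim model described above (128-dimensional embeddings, 100 filters for each of the window sizes 3, 4 and 5, dropout 0.5) can be sketched in PyTorch roughly as follows; the vocabulary size and number of classes below are placeholders, not values taken from the paper.

```python
# Sketch of the small TextCNN victim model described above (Kim, 2014):
# 128-dim embeddings, 100 filters per window size (3/4/5), dropout 0.5.
# Vocabulary size and number of classes are placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextCNN(nn.Module):
    def __init__(self, vocab_size=30000, num_classes=4,
                 embed_dim=128, num_filters=100, window_sizes=(3, 4, 5)):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.convs = nn.ModuleList(
            [nn.Conv1d(embed_dim, num_filters, kernel_size=w) for w in window_sizes]
        )
        self.dropout = nn.Dropout(0.5)
        self.fc = nn.Linear(num_filters * len(window_sizes), num_classes)

    def forward(self, token_ids):                      # (batch, seq_len)
        x = self.embedding(token_ids).transpose(1, 2)  # (batch, embed_dim, seq_len)
        # One max-pooled feature vector per window size.
        pooled = [F.relu(conv(x)).max(dim=2).values for conv in self.convs]
        return self.fc(self.dropout(torch.cat(pooled, dim=1)))

logits = TextCNN()(torch.randint(0, 30000, (2, 40)))  # 2 dummy sequences of 40 tokens
print(logits.shape)                                    # torch.Size([2, 4])
```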
- Evaluation Metric. The similarity function sim builds on the Universal Sentence Encoder (USE; Cer et al., 2018) to measure a local similarity, within a window of size 15 around the perturbation position, between the original input and its adversary. All baselines are equipped with this sim when constructing the candidate vocabulary. The evaluation metric Sim uses USE to calculate a global similarity between the two texts. These procedures largely follow Jin et al. (2020). We mostly rely on human evaluation (§3.3) to establish CLARE's advantage over TextFooler in preserving textual similarity.
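The global Sim metric above embeds the two texts with the Universal Sentence Encoder and compares them. A minimal sketch using the USE model from TensorFlow Hub and cosine similarity follows; the 15-token local windowing used for candidate filtering is omitted, and the TF Hub module URL is the standard public one rather than a detail taken from the paper.

```python
# Sketch: cosine similarity between a text and its adversary using the
# Universal Sentence Encoder (Cer et al., 2018) from TensorFlow Hub.
# Only mirrors the global Sim metric; the local windowed `sim` is omitted.
import numpy as np
import tensorflow_hub as hub

encoder = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

def use_similarity(text_a: str, text_b: str) -> float:
    a, b = np.asarray(encoder([text_a, text_b]))
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(use_similarity(
    "The federal cybersecurity agency issued a warning.",
    "The federal security agency issued a warning.",
))
```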
- Data Processing. When processing the data, we keep all punctuation in the texts for both victim model training and attacking. Since the GLUE benchmark (Wang et al., 2019a) does not provide labels for its test sets, we instead use the dev sets as the test sets for the included datasets (MNLI, QNLI, QQP, MRPC, SST-2) in the evaluation. …