Adversarial NLI: A New Benchmark for Natural Language Understanding

ACL, pp. 4885-4901, 2020.


Abstract:

We introduce a new large-scale NLI benchmark dataset, collected via an iterative, adversarial human-and-model-in-the-loop procedure. We show that training models on this new dataset leads to state-of-the-art performance on a variety of popular NLI benchmarks, while posing a more difficult challenge with its new test set. Our analysis sheds light on the shortcomings of current state-of-the-art models, and shows that non-expert annotators are successful at finding their weaknesses. The data collection method can be applied in a never-ending learning scenario, becoming a moving target for NLU, rather than a static benchmark that will eventually saturate.

Introduction
Highlights
  • Progress in AI has been driven by, among other things, the development of challenging large-scale benchmarks like ImageNet (Russakovsky et al, 2015) in computer vision, and SNLI (Bowman et al, 2015), SQuAD (Rajpurkar et al, 2016), and others in natural language processing (NLP)
  • We show that training models on this new dataset leads to state-of-the-art performance on a variety of popular natural language inference benchmarks
  • We show test set performance on the Adversarial NLI test sets per round, the total Adversarial NLI test set, and the exclusive test subset
  • We used a human-and-model-in-the-loop training method to collect a new benchmark for natural language understanding
  • Annotators were employed to act as adversaries, and encouraged to find vulnerabilities that fool the model into misclassifying, but that another person would correctly classify (a minimal sketch of one collection round follows this list)
  • As the rounds progressed, the models became more robust and the test sets for each round became more difficult. Training on this new data yielded the state of the art on existing natural language inference benchmarks
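
The collection loop behind these highlights is straightforward to state in code. Below is a minimal sketch of one adversarial round; the `write_example`, `predict`, and `verify` callables are hypothetical placeholders standing in for the annotation interface, the current best model, and independent human verifiers. It illustrates the procedure described above and is not the authors' actual collection pipeline.

```python
# Sketch of one human-and-model-in-the-loop collection round (illustrative only).
# `write_example`, `predict`, and `verify` are hypothetical callables standing in
# for the annotation UI, the current best model, and independent human verifiers.
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class Example:
    context: str
    hypothesis: str
    label: str       # writer's intended label: "entailment", "neutral", or "contradiction"
    model_pred: str  # what the current model predicted

def collect_round(
    contexts: List[str],
    write_example: Callable[[str], Tuple[str, str]],  # context -> (hypothesis, gold label)
    predict: Callable[[str, str], str],               # (context, hypothesis) -> predicted label
    verify: Callable[[str, str], str],                # (context, hypothesis) -> verifier's label
    target_size: int,
    n_verifiers: int = 2,
) -> List[Example]:
    """Collect verified examples that fool the current model."""
    collected: List[Example] = []
    i = 0
    while len(collected) < target_size:
        context = contexts[i % len(contexts)]
        i += 1
        hypothesis, gold = write_example(context)
        pred = predict(context, hypothesis)
        if pred == gold:
            continue  # model was not fooled; the writer tries another hypothesis
        votes = [verify(context, hypothesis) for _ in range(n_verifiers)]
        if votes.count(gold) >= 2:  # other people agree with the writer's label
            collected.append(Example(context, hypothesis, gold, pred))
    return collected
```

In the paper, examples that fail to fool the model are still kept for training, and each round's verified examples are folded into the training data for a stronger base model in the next round; the sketch keeps only the verified model-fooling cases for brevity.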
Results
  • Notice that the base model for each round performs very poorly on that round’s test set.
  • This is the expected outcome: For round 1, the base model gets the entire test set wrong, by design.
  • For rounds 2 and 3, the authors used an ensemble, so performance is not necessarily zero (a per-round evaluation sketch follows this list).
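
A per-round breakdown like this can be reproduced in a few lines. The sketch below assumes the publicly released ANLI data as exposed by the Hugging Face `datasets` package under the name `anli`, with `premise`, `hypothesis`, and integer `label` columns; `predict_label` is a placeholder for whatever classifier is being evaluated, not the authors' model.

```python
# Per-round accuracy on the ANLI test sets (illustrative sketch).
# Assumes the "anli" dataset on the Hugging Face Hub with splits test_r1..test_r3
# and columns "premise", "hypothesis", "label" (0=entailment, 1=neutral,
# 2=contradiction); check these names against the release you are using.
from datasets import load_dataset

def predict_label(premise: str, hypothesis: str) -> int:
    # Placeholder: plug in any NLI model here (e.g., a fine-tuned RoBERTa).
    return 1  # always "neutral" -- a trivial baseline

anli = load_dataset("anli")
for split_name in ("test_r1", "test_r2", "test_r3"):
    split = anli[split_name]
    correct = sum(
        predict_label(ex["premise"], ex["hypothesis"]) == ex["label"] for ex in split
    )
    print(f"{split_name}: accuracy = {correct / len(split):.3f}")
```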
Conclusion
  • Discussion & Conclusion

    In this work, the authors used a human-and-model-in-the-loop training method to collect a new benchmark for natural language understanding.
  • Annotators were employed to act as adversaries, and encouraged to find vulnerabilities that fool the model into misclassifying, but that another person would correctly classify.
  • As the rounds progressed, the models became more robust and the test sets for each round became more difficult.
  • Training on this new data yielded the state of the art on existing NLI benchmarks
Summary
  • Introduction:

    Progress in AI has been driven by, among other things, the development of challenging large-scale benchmarks like ImageNet (Russakovsky et al, 2015) in computer vision, and SNLI (Bowman et al, 2015), SQuAD (Rajpurkar et al, 2016), and others in natural language processing (NLP).
  • With the rapid pace of advancement in AI, NLU benchmarks struggle to keep up with model improvement
  • Whereas it took around 15 years to achieve “near-human performance” on MNIST (LeCun et al, 1998; Ciresan et al, 2012; Wan et al, 2013) and approximately 7 years to surpass humans on ImageNet (Deng et al, 2009; Russakovsky et al, 2015; He et al, 2016), the GLUE benchmark did not last as long as the authors would have hoped after the advent of BERT (Devlin et al., 2018), and rapidly had to be extended into SuperGLUE (Wang et al, 2019).
  • Human annotators, be they seasoned NLP researchers or non-experts, might be able to construct examples that expose model brittleness
  • Results:

    Notice that the base model for each round performs very poorly on that round’s test set.
  • This is the expected outcome: For round 1, the base model gets the entire test set wrong, by design.
  • For rounds 2 and 3, the authors used an ensemble, so performance is not necessarily zero.
  • Conclusion:

    Discussion & Conclusion

    In this work, the authors used a human-and-model-in-the-loop training method to collect a new benchmark for natural language understanding.
  • Annotators were employed to act as adversaries, and encouraged to find vulnerabilities that fool the model into misclassifying, but that another person would correctly classify.
  • As the rounds progressed, the models became more robust and the test sets for each round became more difficult.
  • Training on this new data yielded the state of the art on existing NLI benchmarks
Tables
  • Table1: Examples from development set. ‘An’ refers to round number, ‘orig.’ is the original annotator’s gold label, ‘pred.’ is the model prediction, ‘valid.’ are the validator labels, ‘reason’ was provided by the original annotator, ‘Annotations’ are the tags determined by an expert linguist annotator
  • Table2: Dataset statistics: ‘Model error rate’ is the percentage of examples that the model got wrong; ‘unverified’ is the overall percentage, while ‘verified’ is the percentage that was verified by at least 2 human annotators
  • Table3: Model Performance. ‘S’ refers to SNLI, ‘M’ to MNLI dev (-m=matched, -mm=mismatched), and ‘F’ to FEVER
  • Table4: Model Performance on NLI stress tests (tuned on their respective dev. sets). All=S+M+F+ANLI. ‘AT’=Antonym; ‘NR’=Numerical Reasoning; ‘LN’=Length; ‘NG’=Negation; ‘WO’=Word Overlap; ‘SE’=Spell Error. Previous models refers to the Naik et al (2018) implementation of Conneau et al (2017, InferSent) for the Stress Tests, and to the Gururangan et al (2018) implementation of Gong et al (2018, DIIN) for SNLI-Hard
  • Table5: RoBERTa performance on dev set with different training data. S=SNLI, M=MNLI, A=A1+A2+A3. ‘SM’ refers to combined S and M training set. D1, D2, D3 means down-sampling SM s.t. |SMD2|=|A| and |SMD3|+|A|=|SM|. Therefore, training sizes are identical in every pair of rows
  • Table6: Performance of RoBERTa with different data combinations. ALL=S,M,F,ANLI. Hypothesis-only models are marked H where they are trained and tested with only hypothesis texts
  • Table7: Analysis of 500 development set examples per round and on average
  • Table8: Percentage of development set sentences with tags in several datasets: AdvNLI, SNLI, MultiNLI and FEVER. ‘%c’ refers to percentage in contexts, and ‘%h’ refers to percentage in hypotheses. Bolded values label linguistic phenomena that have higher incidence in adversarially created hypotheses than in hypotheses from other NLI datasets, and italicized values have roughly the same (within 5%) incidence
  • Table9: Label distribution in splits across rounds
  • Table10: Inter-annotator agreement (Fleiss’ kappa, stated below for reference) for writers and the first two verifiers
  • Table11: Percentage of agreement of verifiers (“validators” for SNLI and MNLI) with the author label
  • Table12: Extra examples from development sets. ‘An’ refers to round number, ‘orig.’ is the original annotator’s gold label, ‘pred.’ is the model prediction, ‘valid.’ are the validator labels, ‘reason’ was provided by the original annotator, ‘Annotations’ are the tags determined by an expert linguist annotator
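
Table 10 reports agreement as Fleiss’ kappa. For reference (this is the standard definition of the statistic, not a formula reproduced from the paper), with N items, n ratings per item, k categories, and n_{ij} the number of raters assigning item i to category j:

```latex
\kappa = \frac{\bar{P} - \bar{P}_e}{1 - \bar{P}_e},
\qquad
\bar{P} = \frac{1}{N}\sum_{i=1}^{N} \frac{1}{n(n-1)} \sum_{j=1}^{k} n_{ij}\,(n_{ij}-1),
\qquad
\bar{P}_e = \sum_{j=1}^{k} p_j^2
\quad\text{with}\quad
p_j = \frac{1}{Nn}\sum_{i=1}^{N} n_{ij}.
```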
Related work
  • Bias in datasets: Machine learning methods are well known to pick up on spurious statistical patterns. For instance, in the first visual question answering dataset (Antol et al, 2015), biases like “2” being the correct answer to 39% of the questions starting with “how many” allowed learning algorithms to perform well while ignoring the visual modality altogether (Jabri et al, 2016; Goyal et al, 2017). In NLI, Gururangan et al (2018), Poliak et al (2018) and Tsuchiya (2018) showed that hypothesis-only baselines often perform far better than chance (a minimal sketch of such a baseline appears at the end of this section). NLI systems can often be broken merely by performing simple lexical substitutions (Glockner et al, 2018), and struggle with quantifiers (Geiger et al, 2018) and certain superficial syntactic properties (McCoy et al, 2019).

    In question answering, Kaushik and Lipton (2018) showed that question- and passage-only models can perform surprisingly well, while Jia and Liang (2017) added adversarially constructed sentences to passages to cause a drastic drop in performance. Many tasks do not actually require sophisticated linguistic reasoning, as shown by the surprisingly good performance of random encoders (Wieting and Kiela, 2019). Similar observations were made in machine translation (Belinkov and Bisk, 2017) and dialogue (Sankar et al, 2019). Machine learning also has a tendency to overfit on static targets, even if that does not happen deliberately (Recht et al, 2018). In short, the field is rife with dataset bias and papers trying to address this important problem. This work presents a potential solution: if such biases exist, they will allow humans to fool the models, resulting in valuable training examples until the bias is mitigated.
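
As a concrete illustration of the hypothesis-only findings cited above, the sketch below trains a bag-of-words classifier on SNLI hypotheses alone; accuracy well above the roughly 33% chance level for balanced three-way NLI signals annotation artifacts. The use of scikit-learn and the `snli` dataset from the Hugging Face Hub is an assumption for illustration, not the setup used in the cited papers.

```python
# Hypothesis-only baseline on SNLI (illustrative; not the cited papers' setup).
# Assumes the "snli" dataset on the Hugging Face Hub and scikit-learn.
from datasets import load_dataset
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

snli = load_dataset("snli")
train = snli["train"].filter(lambda ex: ex["label"] != -1)  # drop examples without a gold label
test = snli["test"].filter(lambda ex: ex["label"] != -1)
train = train.shuffle(seed=0).select(range(100_000))  # subsample to keep the sketch fast

vectorizer = TfidfVectorizer(ngram_range=(1, 2), min_df=2)
X_train = vectorizer.fit_transform(train["hypothesis"])  # hypotheses only; premises are never seen
X_test = vectorizer.transform(test["hypothesis"])

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, train["label"])

accuracy = accuracy_score(test["label"], clf.predict(X_test))
print(f"hypothesis-only accuracy: {accuracy:.3f}  (chance is about 0.33)")
```

Variants of this simple baseline typically land well above chance on SNLI, which is exactly the kind of bias the adversarial collection procedure is designed to wash out over rounds.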
Funding
  • YN and MB were sponsored by DARPA MCS Grant #N66001-19-2-4031, ONR Grant #N00014-18-1-2871, and DARPA YFA17-D17AP00022
Reference
  • Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C Lawrence Zitnick, and Devi Parikh. 2015. VQA: Visual question answering. In Proceedings of the IEEE International Conference on Computer Vision, pages 2425–2433.
  • Yonatan Belinkov and Yonatan Bisk. 2017. Synthetic and natural noise both break neural machine translation. arXiv preprint arXiv:1711.02173.
  • Luisa Bentivogli, Ido Dagan, Hoa Trang Dang, Danilo Giampiccolo, and Bernardo Magnini. 2009. The Fifth PASCAL Recognizing Textual Entailment Challenge. TAC.
  • Chandra Bhagavatula, Ronan Le Bras, Chaitanya Malaviya, Keisuke Sakaguchi, Ari Holtzman, Hannah Rashkin, Doug Downey, Scott Wen-tau Yih, and Yejin Choi. 2019. Abductive commonsense reasoning. arXiv preprint arXiv:1908.05739.
  • Samuel R Bowman, Gabor Angeli, Christopher Potts, and Christopher D Manning. 2015. A large annotated corpus for learning natural language inference. arXiv preprint arXiv:1508.05326.
  • Ronan Le Bras, Swabha Swayamdipta, Chandra Bhagavatula, Rowan Zellers, Matthew E Peters, Ashish Sabharwal, and Yejin Choi. 2020. Adversarial filters of dataset biases. arXiv preprint arXiv:2002.04108.
  • Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. 2017. Reading Wikipedia to answer open-domain questions. In Association for Computational Linguistics (ACL).
  • Michael Chen, Mike D’Arcy, Alisa Liu, Jared Fernandez, and Doug Downey. 2019. CODAH: an adversarially authored question-answer dataset for common sense. CoRR, abs/1904.04365.
  • Dan Ciresan, Ueli Meier, and Jurgen Schmidhuber. 2012. Multi-column deep neural networks for image classification. arXiv preprint arXiv:1202.2745.
  • Alexis Conneau and Douwe Kiela. 2018. SentEval: An evaluation toolkit for universal sentence representations. arXiv preprint arXiv:1803.05449.
  • Alexis Conneau, Douwe Kiela, Holger Schwenk, Loïc Barrault, and Antoine Bordes. 2017. Supervised learning of universal sentence representations from natural language inference data. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 670–680. Association for Computational Linguistics.
  • Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE.
  • Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
  • Emily Dinan, Samuel Humeau, Bharath Chintagunta, and Jason Weston. 2019. Build it Break it Fix it for Dialogue Safety: Robustness from Adversarial Human Attack. In Proceedings of EMNLP.
  • Allyson Ettinger, Sudha Rao, Hal Daumé III, and Emily M Bender. 2017. Towards linguistically generalizable NLP systems: A workshop and shared task. arXiv preprint arXiv:1711.01505.
  • Atticus Geiger, Ignacio Cases, Lauri Karttunen, and Christopher Potts. 2018. Stress-testing neural models of natural language inference with multiply-quantified sentences. arXiv preprint arXiv:1810.13033.
  • Mor Geva, Yoav Goldberg, and Jonathan Berant. 2019. Are we modeling the task or the annotator? An investigation of annotator bias in natural language understanding datasets. arXiv preprint arXiv:1908.07898.
  • Max Glockner, Vered Shwartz, and Yoav Goldberg. 2018. Breaking NLI systems with sentences that require simple lexical inferences. In Proceedings of ACL.
  • Yichen Gong, Heng Luo, and Jian Zhang. 2018. Natural language inference over interaction space. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings.
  • Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. 2017. Making the V in VQA matter: Elevating the role of image understanding in visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6904–6913.
  • Suchin Gururangan, Swabha Swayamdipta, Omer Levy, Roy Schwartz, Samuel R Bowman, and Noah A Smith. 2018. Annotation artifacts in natural language inference data. In Proceedings of NAACL.
  • Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778.
  • Felix Hill, Antoine Bordes, Sumit Chopra, and Jason Weston. 2015. The Goldilocks principle: Reading children’s books with explicit memory representations. arXiv preprint arXiv:1511.02301.
  • Allan Jabri, Armand Joulin, and Laurens Van Der Maaten. 2016. Revisiting visual question answering baselines. In European Conference on Computer Vision, pages 727–739. Springer.
  • Robin Jia and Percy Liang. 2017. Adversarial examples for evaluating reading comprehension systems. In Proceedings of EMNLP.
  • Divyansh Kaushik, Eduard Hovy, and Zachary C Lipton. 2019. Learning the difference that makes a difference with counterfactually-augmented data. arXiv preprint arXiv:1909.12434.
  • Divyansh Kaushik and Zachary C Lipton. 2018. How much reading does reading comprehension require? A critical investigation of popular benchmarks. arXiv preprint arXiv:1808.04926.
  • Bernhard Kratzwald and Stefan Feuerriegel. 2019. Learning from on-line user feedback in neural question answering on the web. In The World Wide Web Conference, pages 906–916. ACM.
  • Wuwei Lan, Siyu Qiu, Hua He, and Wei Xu. 2017. A continuously growing dataset of sentential paraphrases. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1224–1234, Copenhagen, Denmark. Association for Computational Linguistics.
  • Yann LeCun, Léon Bottou, Yoshua Bengio, Patrick Haffner, et al. 1998. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324.
  • Huan Ling and Sanja Fidler. 2017. Teaching machines to describe images via natural language feedback. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pages 5075–5085.
  • Xiaodong Liu, Pengcheng He, Weizhu Chen, and Jianfeng Gao. 2019a. Multi-task deep neural networks for natural language understanding. arXiv preprint arXiv:1901.11504.
  • Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019b. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.
  • Tom McCoy, Ellie Pavlick, and Tal Linzen. 2019. Right for the wrong reasons: Diagnosing syntactic heuristics in natural language inference. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3428–3448, Florence, Italy. Association for Computational Linguistics.
  • Tom Mitchell, William Cohen, Estevam Hruschka, Partha Talukdar, Bo Yang, Justin Betteridge, Andrew Carlson, B Dalvi, Matt Gardner, Bryan Kisiel, et al. 2018. Never-ending learning. Communications of the ACM, 61(5):103–115.
  • Nasrin Mostafazadeh, Nathanael Chambers, Xiaodong He, Devi Parikh, Dhruv Batra, Lucy Vanderwende, Pushmeet Kohli, and James Allen. 2016. A corpus and evaluation framework for deeper understanding of commonsense stories. arXiv preprint arXiv:1604.01696.
  • Aakanksha Naik, Abhilasha Ravichander, Norman Sadeh, Carolyn Rose, and Graham Neubig. 2018. Stress test evaluation for natural language inference. In Proceedings of the 27th International Conference on Computational Linguistics, pages 2340–2353, Santa Fe, New Mexico, USA. Association for Computational Linguistics.
  • Yixin Nie and Mohit Bansal. 2017. Shortcut-stacked sentence encoders for multi-domain inference. arXiv preprint arXiv:1708.02312.
  • Yixin Nie, Haonan Chen, and Mohit Bansal. 2019. Combining fact extraction and verification with neural semantic matching networks. In Association for the Advancement of Artificial Intelligence (AAAI).
  • Denis Paperno, German Kruszewski, Angeliki Lazaridou, Quan Ngoc Pham, Raffaella Bernardi, Sandro Pezzelle, Marco Baroni, Gemma Boleda, and Raquel Fernandez. 2016. The LAMBADA dataset: Word prediction requiring a broad discourse context. arXiv preprint arXiv:1606.06031.
  • Adam Poliak, Jason Naradowsky, Aparajita Haldar, Rachel Rudinger, and Benjamin Van Durme. 2018. Hypothesis only baselines in natural language inference. In Proceedings of the Seventh Joint Conference on Lexical and Computational Semantics.
  • Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250.
  • Benjamin Recht, Rebecca Roelofs, Ludwig Schmidt, and Vaishaal Shankar. 2018. Do CIFAR-10 classifiers generalize to CIFAR-10? arXiv preprint arXiv:1806.00451.
  • Andrew Ruef, Michael Hicks, James Parker, Dave Levin, Michelle L Mazurek, and Piotr Mardziel. 2016. Build it, break it, fix it: Contesting secure development. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, pages 690–703. ACM.
  • Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. 2015. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252.
  • Chinnadhurai Sankar, Sandeep Subramanian, Christopher Pal, Sarath Chandar, and Yoshua Bengio. 2019. Do neural dialog systems use the conversation history effectively? An empirical study. arXiv preprint arXiv:1906.01603.
  • James Thorne, Andreas Vlachos, Christos Christodoulopoulos, and Arpit Mittal. 2018. FEVER: a large-scale dataset for fact extraction and verification. arXiv preprint arXiv:1803.05355.
  • Masatoshi Tsuchiya. 2018. Performance impact caused by hidden bias of training data for recognizing textual entailment. In Proceedings of LREC.
  • Eric Wallace, Pedro Rodriguez, Shi Feng, Ikuya Yamada, and Jordan Boyd-Graber. 2019. Trick me if you can: Human-in-the-loop generation of adversarial question answering examples. In Transactions of the Association for Computational Linguistics.
  • Li Wan, Matthew Zeiler, Sixin Zhang, Yann Le Cun, and Rob Fergus. 2013. Regularization of neural networks using DropConnect. In International Conference on Machine Learning, pages 1058–1066.
  • Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R Bowman. 2019. SuperGLUE: A stickier benchmark for general-purpose language understanding systems. arXiv preprint arXiv:1905.00537.
  • Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R Bowman. 2018. GLUE: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461.
  • John Wieting and Douwe Kiela. 2019. No training required: Exploring random encoders for sentence classification. arXiv preprint arXiv:1901.10444.
  • Adina Williams, Nikita Nangia, and Samuel R Bowman. 2017. A broad-coverage challenge corpus for sentence understanding through inference. arXiv preprint arXiv:1704.05426.
  • Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. arXiv preprint arXiv:1906.08237.
  • Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W Cohen, Ruslan Salakhutdinov, and Christopher D Manning. 2018. HotpotQA: A dataset for diverse, explainable multi-hop question answering. arXiv preprint arXiv:1809.09600.
  • Zhilin Yang, Saizheng Zhang, Jack Urbanek, Will Feng, Alexander H Miller, Arthur Szlam, Douwe Kiela, and Jason Weston. 2017. Mastering the dungeon: Grounded language learning by mechanical turker descent. arXiv preprint arXiv:1711.07950.
  • Rowan Zellers, Yonatan Bisk, Roy Schwartz, and Yejin Choi. 2018. SWAG: A large-scale adversarial dataset for grounded commonsense inference. In Proceedings of EMNLP.
  • Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. 2019. HellaSwag: Can a machine really finish your sentence? In Proceedings of ACL.
  • Recently, several hard test sets have been made available for revealing the biases NLI models learn from their training datasets (Nie and Bansal, 2017; McCoy et al., 2019; Gururangan et al., 2018; Naik et al., 2018). We examine model performance on two of these: the SNLI-Hard (Gururangan et al., 2018) test set, which consists of examples that hypothesis-only models label incorrectly, and the NLI stress tests (Naik et al., 2018), in which sentences containing antonym pairs, negations, high word overlap, i.a., are heuristically constructed. We test our models on these stress tests after tuning on each test’s respective development set to account for potential domain mismatches. For comparison, we also report results from the original papers: for SNLI-Hard from Gururangan et al.’s implementation of the hierarchical tensor-based Densely Interactive Inference Network (Gong et al., 2018, DIIN) on MNLI, and for the NLI stress tests, Naik et al.’s implementation of InferSent (Conneau et al., 2017) trained on SNLI.