More Bang for Your Buck: Natural Perturbation for Robust Question Answering

In Proceedings of Empirical Methods in Natural Language Processing (EMNLP), pp. 163–170, 2020.


Abstract:

Deep learning models for linguistic tasks require large training datasets, which are expensive to create. As an alternative to the traditional approach of creating new instances by repeating the process of creating one instance, we propose doing so by first collecting a set of seed examples and then applying human-driven natural perturbations.

Introduction
  • While many datasets (Bowman et al., 2015; Rajpurkar et al., 2016) targeting different linguistic tasks have been proposed, most are created by repeating a fixed process used for writing a single example.
  • This approach results in many independent examples, each generated from scratch.
  • As an example of a passage underlying a BOOLQ question: "The Guinness Book of World Records no longer lists Yonge Street as the longest street in the world and has not chosen a replacement street, but cites the Pan-American Highway as the world's longest 'motorable road'."
Highlights
  • Creating large datasets to train NLP models has become increasingly expensive
  • We propose an often substantially cheaper training-set construction method where, after collecting a few seed examples, the set is expanded by applying human-authored minimal perturbations to the seeds (see the sketch after this list)
  • We proposed an alternative approach for constructing training sets by expanding seed examples via natural perturbations
  • Our results demonstrate that models trained on perturbations of BOOLQ questions are more robust to minor variations and generalize better, while preserving performance on the original BOOLQ test set as long as the natural perturbations are moderately cheap to create
  • While this is not a dataset paper, we provide the natural perturbations resource for BOOLQ constructed during the course of this study
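A minimal sketch of this construction scheme (the `Example`/`Cluster` types, `perturb_fn`, and the budgeted loop are illustrative assumptions, not the authors' code):

```python
from dataclasses import dataclass, field

@dataclass
class Example:
    question: str
    passage: str
    label: bool  # yes/no answer

@dataclass
class Cluster:
    seed: Example                                      # authored from scratch (cost 1.0)
    perturbations: list = field(default_factory=list)  # minimal edits of the seed (cost r each)

def expand_with_perturbations(seeds, perturb_fn, max_cluster_size, budget, r):
    """Spend a fixed annotation budget by perturbing seeds rather than
    writing every example from scratch. `perturb_fn` stands in for a human
    annotator who produces a minimally edited variant of a seed example."""
    clusters, spent = [], 0.0
    for seed in seeds:
        if spent + 1.0 > budget:
            break
        cluster = Cluster(seed=seed)
        spent += 1.0  # full cost of authoring the seed
        while len(cluster.perturbations) < max_cluster_size - 1 and spent + r <= budget:
            cluster.perturbations.append(perturb_fn(seed))
            spent += r  # cheaper cost of a natural perturbation
        clusters.append(cluster)
    return clusters
```

At r = 1 this degenerates to the traditional from-scratch pipeline; the paper's gains appear when r is moderately below 1 (e.g., 0.6).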
Methods
  • To assess the impact of the perturbation approach, the authors evaluate a standard RoBERTa-large model, which has been shown to achieve state-of-the-art results on many tasks (a minimal inference sketch follows this list).
  • Each experiment considers the effect of training on subsamples of BOOLQ obtained under different conditions.
  • The authors evaluate the QA model trained on various question sets on three test sets.
  • (a) For assessing robustness, BOOLQ-e, the perturbed evaluation set. (b) For assessing generalization, the authors use the subset of 260 questions from the training section of MULTIRC (Khashabi et al., 2018) that have binary answers. (c) The original BOOLQ test set, to ensure models trained on perturbed questions retain performance on the original task.
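A minimal inference sketch for this evaluation setup, assuming the Hugging Face `transformers` library (the paper does not specify an implementation, and the checkpoint below would first need fine-tuning on the BOOLQ-style training sets described above; the no=0 / yes=1 label convention is ours):

```python
import torch
from transformers import RobertaForSequenceClassification, RobertaTokenizer

# RoBERTa-large with a 2-way classification head for yes/no QA.
tokenizer = RobertaTokenizer.from_pretrained("roberta-large")
model = RobertaForSequenceClassification.from_pretrained("roberta-large", num_labels=2)
model.eval()

def predict_yes_no(question: str, passage: str) -> bool:
    """Encode the (question, passage) pair and return the predicted answer."""
    inputs = tokenizer(question, passage, truncation=True,
                       max_length=512, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return bool(logits.argmax(dim=-1).item())  # True = 'yes', False = 'no'
```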
Results
  • Even when the cost ratio is only moderately low (0.6), models trained on the perturbed datasets exhibit desirable advantages: they are 9% more robust to minor changes and generalize 4.5% better across datasets than models trained on BOOLQ.
Conclusion
  • A key question with respect to the premise of this work is whether the idea would generalize to other tasks.
  • Creating perturbed examples is often cheaper than creating new ones, and the authors empirically observed notable gains even at a moderate cost ratio of 0.6.
  • While this is not a dataset paper, the authors provide the natural perturbations resource for BOOLQ constructed during the course of this study.
  • While the authors leave a detailed study to future work, they expect the general trends regarding the value of perturbations to hold broadly.
Tables
  • Table 1: Statistics of BOOLQ
  • Table 2: Various systems trained and evaluated on different datasets. Best non-human scores are in bold; numbers are percentages.
Related work
  • Data augmentation. A handful of works study semi-automatic contextual augmentation (Kobayashi, 2018; Cheng et al., 2018), often with the goal of building better systems. We, however, study natural human-authored perturbations as an alternative dataset construction method. A related recent work is by Kaushik et al. (2020), who, unlike our goal here, study the value of natural perturbations in reducing artifacts.

    Adversarial perturbations. A closely related line of work uses adversarial perturbations to expose the weaknesses of systems under local changes and to criticize their lack of robustness (Ebrahimi et al., 2018; Glockner et al., 2018; Dinan et al., 2019). For instance, Khashabi et al. (2016) showed significant drops upon perturbing answer options for multiple-choice question answering. Such rule-based perturbations have simple definitions, which makes them easy for models to reverse-engineer (Jia and Liang, 2017), and they generally use label-preserving, shallow edits (Hu et al., 2019). In contrast, our natural human-authored perturbations are harder for models. More broadly, adversarial perturbation research seeks examples that stump existing models, while our focus is on expanding datasets in a cost-efficient way.
Funding
  • Even when the cost ratio is only moderately low (at 0.6), models trained on our perturbed datasets exhibit desirable advantages: they are 9% more robust to minor changes and generalize 4.5% better across datasets than models trained on BOOLQ
  • Given the same total budget b = 1500, we can thus infer from Fig. 3 that training on a dataset of perturbed questions would be about 10% and 5% more effective on BOOLQ-e and MULTIRC, respectively (see the budget sketch below)
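To make the budget arithmetic concrete, a back-of-the-envelope sketch under a simple linear cost model (an assumption for illustration; the paper's exact accounting may differ), where authoring a seed costs 1 unit and each perturbation costs r units:

```python
def affordable_examples(budget: float, r: float, cluster_size: int) -> int:
    """Total examples affordable when each cluster holds one seed (cost 1)
    plus (cluster_size - 1) perturbations (cost r each)."""
    cost_per_cluster = 1.0 + r * (cluster_size - 1)
    n_clusters = int(budget // cost_per_cluster)
    return n_clusters * cluster_size

# From-scratch baseline: every example is its own seed.
print(affordable_examples(budget=1500, r=1.0, cluster_size=1))  # 1500
# Perturbation-based set at the moderate ratio r = 0.6, clusters of size 4.
print(affordable_examples(budget=1500, r=0.6, cluster_size=4))  # 2140
```

Under these assumptions, the same budget buys roughly 40% more training instances.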
Study subjects and analysis
workers: 3
These annotations served to eliminate ambiguous questions as well as those that cannot be answered from the provided paragraph. The annotation was done in two steps: (i) in the first step, we asked 3 workers to answer each question with one of three options ("yes", "no", and "cannot be inferred from the paragraph"). We filtered out the subset of questions that were not agreed upon (i.e., had no consistent majority label) or were marked as "cannot be inferred from the paragraph" by a majority of the annotators. A sketch of this filter follows.
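A sketch of the described majority-vote filter (the label strings mirror the description above; the function name is hypothetical):

```python
from collections import Counter

VALID = {"yes", "no", "cannot be inferred from the paragraph"}

def filter_label(annotations):
    """Return 'yes' or 'no' if a strict majority of the 3 workers agree on it;
    return None to drop ambiguous or unanswerable questions."""
    assert all(a in VALID for a in annotations)
    label, count = Counter(annotations).most_common(1)[0]
    if count <= len(annotations) // 2:                    # no consistent majority
        return None
    if label == "cannot be inferred from the paragraph":  # unanswerable by majority
        return None
    return label

print(filter_label(["yes", "yes", "no"]))  # 'yes'
print(filter_label(["yes", "no", "cannot be inferred from the paragraph"]))  # None
```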

data: 3
For each case, we vary the max cluster size in the following range: [1, 2, 3, 4]. As a result, in (i), C varies from 3.7k to 951 (N = 3.7k), and in (ii), N varies from 1k to 4k (C = 1k). (The yes/no subset of dev was too small to use. In practice, we expect r to lie somewhere in between these two extremes, such as r = 0.3 as discussed in §4.2.) A sketch of the two regimes follows.
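A sketch of the two subsampling regimes, assuming each perturbation cluster is stored as a list of examples (names are illustrative, not the authors' code):

```python
import random

def subsample_fixed_n(clusters, max_cluster_size, n_total, seed=0):
    """Regime (i): fix the total number of instances N and vary the max
    cluster size, so the number of clusters C shrinks as clusters grow."""
    rng = random.Random(seed)
    pool = list(clusters)
    rng.shuffle(pool)
    picked, count = [], 0
    for cluster in pool:
        trimmed = cluster[:max_cluster_size][: n_total - count]
        picked.append(trimmed)
        count += len(trimmed)
        if count == n_total:
            break
    return picked

def subsample_fixed_c(clusters, max_cluster_size, n_clusters, seed=0):
    """Regime (ii): fix the number of clusters C and vary the max cluster
    size, so the total number of instances N grows with cluster size."""
    rng = random.Random(seed)
    chosen = rng.sample(list(clusters), n_clusters)
    return [cluster[:max_cluster_size] for cluster in chosen]
```

With max cluster size 4, regime (i) needs only about N/4 clusters (matching the drop from 3.7k to 951 above), while regime (ii) grows from 1k to roughly 4k instances.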

datasets: 3
Fig. 2 shows the accuracy of models trained on these subsets across our three evaluation sets. In scenario (i) with a fixed number of instances (r = 1), it is evident that the size of the clusters (the number of perturbations) does not affect model quality (on 2 out of 3 datasets). This shows that perturbation clusters are as informative as (traditional) independent instances.

Reference
  • S. Bowman, G. Angeli, C. Potts, and C. Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of EMNLP.
  • Y. Cheng, Z. Tu, F. Meng, J. Zhai, and Y. Liu. 2018. Towards robust neural machine translation. In Proceedings of ACL.
  • C. Clark, K. Lee, M.-W. Chang, T. Kwiatkowski, M. Collins, and K. Toutanova. 2019. BoolQ: Exploring the surprising difficulty of natural yes/no questions. In Proceedings of NAACL.
  • E. Dinan, S. Humeau, B. Chintagunta, and J. Weston. 2019. Build it break it fix it for dialogue safety: Robustness from adversarial human attack. In Proceedings of EMNLP-IJCNLP, pages 4529–4538.
  • J. Ebrahimi, D. Lowd, and D. Dou. 2018. On adversarial examples for character-level neural machine translation. In Proceedings of COLING.
  • M. Gardner et al. 2020. Evaluating models' local decision boundaries via contrast sets. In Proceedings of EMNLP.
  • M. Geva, Y. Goldberg, and J. Berant. 2019. Are we modeling the task or the annotator? An investigation of annotator bias in natural language understanding datasets. In Proceedings of EMNLP-IJCNLP.
  • M. Glockner, V. Shwartz, and Y. Goldberg. 2018. Breaking NLI systems with sentences that require simple lexical inferences. In Proceedings of ACL.
  • J. E. Hu, H. Khayrallah, R. Culkin, P. Xia, T. Chen, M. Post, and B. Van Durme. 2019. Improved lexically constrained decoding for translation and monolingual rewriting. In Proceedings of NAACL-HLT.
  • R. Jia and P. Liang. 2017. Adversarial examples for evaluating reading comprehension systems. In Proceedings of EMNLP.
  • D. Kaushik, E. Hovy, and Z. C. Lipton. 2020. Learning the difference that makes a difference with counterfactually-augmented data. In Proceedings of ICLR.
  • D. Khashabi, S. Chaturvedi, M. Roth, S. Upadhyay, and D. Roth. 2018. Looking beyond the surface: A challenge set for reading comprehension over multiple sentences. In Proceedings of NAACL.
  • D. Khashabi, T. Khot, A. Sabharwal, P. Clark, O. Etzioni, and D. Roth. 2016. Question answering via integer programming over semi-structured knowledge. In Proceedings of IJCAI.
  • S. Kobayashi. 2018. Contextual augmentation: Data augmentation by words with paradigmatic relations. In Proceedings of NAACL.
  • T. Kwiatkowski, J. Palomaki, O. Redfield, M. Collins, A. Parikh, C. Alberti, D. Epstein, I. Polosukhin, J. Devlin, K. Lee, et al. 2019. Natural Questions: A benchmark for question answering research. TACL, 7:453–466.
  • H. J. Levesque, E. Davis, and L. Morgenstern. 2011. The Winograd schema challenge. In AAAI Spring Symposium: Logical Formalizations of Commonsense Reasoning.
  • H. Peng, D. Khashabi, and D. Roth. 2015. Solving hard coreference problems. In Proceedings of NAACL.
  • P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of EMNLP.
  • K. Sakaguchi, R. L. Bras, C. Bhagavatula, and Y. Choi. 2020. WinoGrande: An adversarial Winograd schema challenge at scale. In Proceedings of AAAI.
  • M. Shah, X. Chen, M. Rohrbach, and D. Parikh. 2019. Cycle-consistency for robust visual question answering. In Proceedings of CVPR.
  • A. Wang, Y. Pruksachatkun, N. Nangia, A. Singh, J. Michael, F. Hill, O. Levy, and S. Bowman. 2019. SuperGLUE: A stickier benchmark for general-purpose language understanding systems. In Proceedings of NeurIPS.
  • A screencast of the relevant annotation interface: https://youtu.be/