Towards Debiasing NLU Models from Unknown Biases

EMNLP 2020, pp. 7597-7610

Abstract

NLU models often exploit biases to achieve high dataset-specific performance without properly learning the intended task. Recently proposed debiasing methods are shown to be effective in mitigating this tendency. However, these methods rely on a major assumption that the types of bias should be known a priori, which limits their application...

Introduction
Highlights
  • Proposed debiasing methods are effective in mitigating this tendency, and the resulting models are shown to perform better beyond the training distribution, with improved performance on challenge test sets that are designed such that relying on the spurious association leads to incorrect predictions
  • They first identify biased examples in the training data and down-weight their importance in the training loss so that models focus on learning from harder examples (a minimal sketch of this reweighting follows this list). These model-agnostic methods rely on an assumption that the types of the biased features are known a priori. This assumption is a limitation in various natural language understanding (NLU) tasks or datasets because it depends on researchers' intuition and task-specific insights to manually characterize the spurious biases, which may range from simple word/n-gram co-occurrences (Gururangan et al, 2018; Poliak et al, 2018; Tsuchiya, 2018; Schuster et al, 2019) to more complex stylistic and lexico-syntactic patterns (Zellers et al, 2019; Snow et al, 2006; Vanderwende and Dolan, 2006)
  • In very small data settings, models exhibit a distinctive response to synthetically biased examples, where they rapidly increase the accuracy (→ 100%) on the biased test set while performing poorly on other sets, indicating that they are mainly relying on biases
  • We evaluate models trained on Multi-Genre Natural Language Inference (MNLI) through the three debiasing setups: known-bias to target HANS-specific bias, self-debiasing, and self-debiasing augmented with the proposed annealing mechanism
  • We present a general self-debiasing framework to address the impact of unknown dataset biases by omitting the need for thorough task-specific analysis to discover the types of biases for each new dataset
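Below is a minimal sketch of the example-reweighting idea referenced in the list above, assuming PyTorch; the bias probabilities would come from a shallow model (e.g., the same architecture trained on only a small fraction of the data), and the function and variable names are illustrative rather than the authors' actual code.

```python
import torch.nn.functional as F

def reweighted_loss(main_logits, bias_probs, labels):
    """Down-weight examples that the shallow (bias) model already gets
    right with high confidence, so the main model focuses on harder ones.

    main_logits: (batch, num_classes) logits of the main model
    bias_probs:  (batch, num_classes) softmax output of the shallow model
    labels:      (batch,) gold label indices
    """
    # Probability the shallow model assigns to the gold label.
    p_bias = bias_probs.gather(1, labels.unsqueeze(1)).squeeze(1)
    # Weight 1 - p_bias: confidently "biased" examples contribute less.
    weights = (1.0 - p_bias).detach()
    per_example_ce = F.cross_entropy(main_logits, labels, reduction="none")
    return (weights * per_example_ce).mean()
```

In the self-debiasing setting, bias_probs is produced by such a weak learner instead of a hand-crafted bias feature, which is what removes the need to know the bias type in advance.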
Results
  • The existing debiasing methods can even be more effective when applied using the proposed framework, e.g., self-debias example reweighting obtains a 52.3-point F1 improvement over the baseline on the non-duplicate subset of PAWS
  • This indicates that the framework is effective in identifying biased examples without the previously needed prior knowledge; (2) most improvements on the challenge datasets come at the expense of the in-distribution performance, except for the confidence regularization models.
  • This indicates that self-debiasing may identify more potentially biased examples and effectively omit more training data; (3) the annealing mechanism is effective in mitigating this issue in most cases, e.g., improving PoE by 0.5pp on FEVER dev and 1.2pp on MNLI (an illustrative annealing schedule is sketched after this list)
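The paper's exact annealing formulation is not reproduced on this page; purely as an illustration of the idea of gradually strengthening the down-weighting over training, one possible schedule (an assumption, not the authors' rule) is:

```python
def annealed_weight(bias_prob_gold, step, total_steps):
    """Illustrative annealing schedule (an assumption, not the paper's exact
    formulation): the exponent alpha grows from 0 to 1, so early updates use
    weight ~1 (plain cross-entropy) and later updates use the fully debiased
    weight (1 - p_bias)."""
    alpha = min(1.0, step / total_steps)
    return (1.0 - bias_prob_gold) ** alpha

# Example: a confidently "biased" example (p_bias = 0.9) keeps weight 1.0 at
# step 0 and is down-weighted to 0.1 by the end of training.
print(annealed_weight(0.9, 0, 10_000))       # 1.0
print(annealed_weight(0.9, 10_000, 10_000))  # 0.1
```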
Conclusion
  • The authors present a general self-debiasing framework to address the impact of unknown dataset biases by omitting the need for thorough task-specific analysis to discover the types of biases for each new dataset.
  • The authors adapt the existing debiasing methods into the framework, enabling them to obtain substantial improvements on several challenge test sets without targeting a specific bias.
  • The evaluation suggests that the framework results in better overall robustness compared to the bias-specific counterparts.
  • Future work in the direction of automatic bias mitigation may include identifying potentially biased examples in an online fashion and discouraging models from exploiting them throughout training
Tables
  • Table 1: Models' performance when evaluated on MNLI, FEVER, QQP, and their corresponding challenge test sets. The known-bias results for MNLI and FEVER are reported in Utama et al. (2020)(♭), Clark et al. (2019)(‡), Mahabadi and Henderson (2019)(†), and Schuster et al. (2019)(♣). Results of the proposed framework are indicated by self-debias. (♠) indicates training with our proposed annealing mechanism. Boldface numbers indicate the highest challenge test set improvement for each debiasing setup on a particular task
  • Table 2: Accuracy results of self-debias confidence regularization on cross-dataset evaluation
  • Table 3: Final accuracy of models trained on synthetic bias datasets
  • Table 4: Models' performance on the HANS challenge test set (McCoy et al., 2019). Columns lex., con., and sub. stand for lexical overlap, constituent, and subsequence, respectively. The (¬) symbol indicates the non-entailment subset
Related Work
  • The artifacts of large scale dataset collections result in dataset biases that allow models to perform well without learning the intended reasoning skills. In NLI, models can perform better than chance by only using the partial input (Gururangan et al, 2018; Poliak et al, 2018; Tsuchiya, 2018), or by basing their predictions on whether the inputs are highly overlapped (McCoy et al, 2019; Dasgupta et al, 2018). Similar phenomena exist in various tasks including argumentation mining (Niven and Kao, 2019), reading comprehension (Kaushik and Lipton, 2018), or story cloze completion (Schwartz et al, 2017). To allow a better evaluation of models' reasoning capabilities, researchers constructed challenge test sets composed of "counterexamples" to the spurious shortcuts that models may adopt (Jia and Liang, 2017; Glockner et al, 2018; Zhang et al, 2019). Models evaluated on these sets often fall back to random baseline performance.

    [Figure: self-debiased loss vs. train step]

    There has been a flurry of work in dynamic dataset construction to systematically reduce dataset biases through adversarial filtering (Zellers et al, 2018; Sakaguchi et al, 2019; Bras et al, 2020) or human-in-the-loop annotation (Nie et al, 2019b; Kaushik et al, 2020). While promising, researchers also show that newly constructed datasets may not be fully free of hidden biased patterns (Sharma et al, 2018). It is thus crucial to complement the data collection efforts with learning algorithms that are more robust to biases, such as the recently proposed product-of-experts (Clark et al, 2019; He et al, 2019; Mahabadi and Henderson, 2019) or confidence regularization (Utama et al, 2020). Despite their effectiveness, these methods are limited by their assumption that information about the task-specific biases is available. Our framework aims to alleviate this limitation and enable them to address unknown biases.
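For concreteness, a minimal sketch of the product-of-experts combination used by these ensemble-based methods, following the general recipe of Clark et al. (2019) and assuming PyTorch; the bias model is kept frozen so that only the main model receives gradients.

```python
import torch
import torch.nn.functional as F

def product_of_experts_loss(main_logits, bias_log_probs, labels):
    """Train the main model through the combined (product-of-experts)
    distribution: softmax over the sum of the two models' log-probabilities.
    bias_log_probs must come from a frozen bias model (detached), so the
    main model is pushed to explain what the bias model cannot."""
    combined_logits = F.log_softmax(main_logits, dim=-1) + bias_log_probs
    return F.cross_entropy(combined_logits, labels)

# Tiny usage example with dummy tensors (batch of 2, three classes).
main_logits = torch.randn(2, 3, requires_grad=True)
bias_log_probs = F.log_softmax(torch.randn(2, 3), dim=-1).detach()
labels = torch.tensor([0, 2])
loss = product_of_experts_loss(main_logits, bias_log_probs, labels)
loss.backward()  # gradients flow only into main_logits
```

At test time the bias model is discarded and only the main model's own predictions are used.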
Study subjects and analysis
NLI datasets: 4
We do not tune the hyperparameters for each target dataset and use the models that we previously reported in the main results. As the target datasets, we use 4 NLI datasets: Scitail (Khot et al, 2018), Sentences Involving Compositional Knowledge (SICK) (Marelli et al, 2014), the GLUE diagnostic test set (Wang et al, 2018), and the 3-way version of RTE 1, 2, and 3 (Dagan et al, 2005; Bar-Haim et al, 2006; Giampiccolo et al, 2007). We present the results in Table 2

References
  • Roy Bar-Haim, Ido Dagan, Bill Dolan, Lisa Ferro, and Danilo Giampiccolo. 2006. The second pascal recognising textual entailment challenge. Proceedings of the Second PASCAL Challenges Workshop on Recognising Textual Entailment.
  • Yonatan Belinkov, Adam Poliak, Stuart Shieber, Benjamin Van Durme, and Alexander Rush. 2019. Don’t take the premise for granted: Mitigating artifacts in natural language inference. In Proceedings of the 57th Conference of the Association for Computational Linguistics, pages 877–891, Florence, Italy. Association for Computational Linguistics.
  • Ronan Le Bras, Swabha Swayamdipta, Chandra Bhagavatula, Rowan Zellers, Matthew E. Peters, Ashish Sabharwal, and Yejin Choi. 2020. Adversarial filters of dataset biases. ArXiv, abs/2002.04108.
  • Christopher Clark, Mark Yatskar, and Luke Zettlemoyer. 2019. Don’t take the easy way out: Ensemble based methods for avoiding known dataset biases. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4067–4080, Hong Kong, China. Association for Computational Linguistics.
  • Ido Dagan, Oren Glickman, and Bernardo Magnini. 2005. The PASCAL recognising textual entailment challenge. In Proceedings of the First International Conference on Machine Learning Challenges: Evaluating Predictive Uncertainty, Visual Object Classification, and Recognizing Textual Entailment, MLCW'05, pages 177–190, Berlin, Heidelberg. Springer-Verlag.
  • Ishita Dasgupta, Demi Guo, Andreas Stuhlmuller, Samuel J Gershman, and Noah D. Goodman. 2018. Evaluating compositionality in sentence embeddings. ArXiv, abs/1802.04302.
  • Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.
  • Tommaso Furlanello, Zachary Chase Lipton, Michael Tschannen, Laurent Itti, and Anima Anandkumar. 2018. Born-again neural networks. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, volume 80 of Proceedings of Machine Learning Research, pages 1602–1611. PMLR.
  • Danilo Giampiccolo, Bernardo Magnini, Ido Dagan, and Bill Dolan. 2007. The third PASCAL recognizing textual entailment challenge. In Proceedings of the ACL-PASCAL Workshop on Textual Entailment and Paraphrasing, pages 1–9, Prague. Association for Computational Linguistics.
  • Max Glockner, Vered Shwartz, and Yoav Goldberg. 2018. Breaking NLI systems with sentences that require simple lexical inferences. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 650–655, Melbourne, Australia. Association for Computational Linguistics.
  • Suchin Gururangan, Swabha Swayamdipta, Omer Levy, Roy Schwartz, Samuel Bowman, and Noah A. Smith. 2018. Annotation artifacts in natural language inference data. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 107–112, New Orleans, Louisiana. Association for Computational Linguistics.
  • He He, Sheng Zha, and Haohan Wang. 2019. Unlearn dataset bias in natural language inference by fitting the residual. ArXiv, abs/1908.10763.
  • Geoffrey E. Hinton, Oriol Vinyals, and Jeffrey Dean. 2015. Distilling the knowledge in a neural network. ArXiv, abs/1503.02531.
  • Robin Jia and Percy Liang. 2017. Adversarial examples for evaluating reading comprehension systems. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2021–2031, Copenhagen, Denmark. Association for Computational Linguistics.
  • Divyansh Kaushik, Eduard Hovy, and Zachary Lipton. 2020. Learning the difference that makes a difference with counterfactually-augmented data. In 8th International Conference on Learning Representations, ICLR 2020, Virtual Conference, 26 April - 1 May, 2020. OpenReview.net.
  • Divyansh Kaushik and Zachary C. Lipton. 2018. How much reading does reading comprehension require? a critical investigation of popular benchmarks. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 5010–5015, Brussels, Belgium. Association for Computational Linguistics.
  • Tushar Khot, Ashish Sabharwal, and Peter Clark. 2018. Scitail: A textual entailment dataset from science question answering. In Proceedings of the ThirtySecond AAAI Conference on Artificial Intelligence, (AAAI-18), the 30th innovative Applications of Artificial Intelligence (IAAI-18), and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence (EAAI-18), New Orleans, Louisiana, USA, February 2-7, 2018, pages 5189–5197. AAAI Press.
  • Rabeeh Karimi Mahabadi and James Henderson. 2019. Simple but effective techniques to reduce biases. arXiv preprint arXiv:1909.06321.
  • Marco Marelli, Stefano Menini, Marco Baroni, Luisa Bentivogli, Raffaella Bernardi, and Roberto Zamparelli. 2014. A SICK cure for the evaluation of compositional distributional semantic models. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC’14), pages 216–223, Reykjavik, Iceland. European Language Resources Association (ELRA).
  • Tom McCoy, Ellie Pavlick, and Tal Linzen. 2019. Right for the wrong reasons: Diagnosing syntactic heuristics in natural language inference. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3428–3448, Florence, Italy. Association for Computational Linguistics.
  • Yixin Nie, Yicheng Wang, and Mohit Bansal. 2019a. Analyzing compositionality-sensitivity of NLI models. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 6867–6874.
  • Yixin Nie, Adina Williams, Emily Dinan, Mohit Bansal, Jason Weston, and Douwe Kiela. 2019b. Adversarial NLI: A new benchmark for natural language understanding. ArXiv, abs/1910.14599.
  • Timothy Niven and Hung-Yu Kao. 2019. Probing neural network comprehension of natural language arguments. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4658–4664, Florence, Italy. Association for Computational Linguistics.
  • Adam Poliak, Jason Naradowsky, Aparajita Haldar, Rachel Rudinger, and Benjamin Van Durme. 2018. Hypothesis only baselines in natural language inference. In Proceedings of the Seventh Joint Conference on Lexical and Computational Semantics, pages 180–191, New Orleans, Louisiana. Association for Computational Linguistics.
  • Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. 2019. Winogrande: An adversarial winograd schema challenge at scale. ArXiv, abs/1907.10641.
  • Tal Schuster, Darsh Shah, Yun Jie Serene Yeo, Daniel Roberto Filizzola Ortiz, Enrico Santus, and Regina Barzilay. 2019. Towards debiasing fact verification models. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3417–3423, Hong Kong, China. Association for Computational Linguistics.
  • Roy Schwartz, Maarten Sap, Ioannis Konstas, Leila Zilles, Yejin Choi, and Noah A. Smith. 2017. The effect of different writing tasks on linguistic style: A case study of the ROC story cloze task. In Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017), pages 15–25, Vancouver, Canada. Association for Computational Linguistics.
  • Rishi Sharma, James Allen, Omid Bakhshandeh, and Nasrin Mostafazadeh. 2018. Tackling the story ending biases in the story cloze test. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 752–757, Melbourne, Australia. Association for Computational Linguistics.
  • Rion Snow, Lucy Vanderwende, and Arul Menezes. 2006. Effectively using syntax for recognizing false entailment. In Proceedings of the Human Language Technology Conference of the NAACL, Main Conference, pages 33–40, New York City, USA. Association for Computational Linguistics.
  • Damien Teney, Kushal Kafle, Robik Shrestha, Ehsan Abbasnejad, Christopher Kanan, and Anton van den Hengel. 2020. On the value of out-of-distribution testing: An example of goodhart’s law. ArXiv, abs/2005.09241.
  • James Thorne, Andreas Vlachos, Oana Cocarascu, Christos Christodoulopoulos, and Arpit Mittal. 2018. The fact extraction and VERification (FEVER) shared task. In Proceedings of the First Workshop on Fact Extraction and VERification (FEVER), pages 1–9, Brussels, Belgium. Association for Computational Linguistics.
  • Masatoshi Tsuchiya. 2018. Performance impact caused by hidden bias of training data for recognizing textual entailment. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan. European Language Resources Association (ELRA).
  • Prasetya Ajie Utama, Nafise Sadat Moosavi, and Iryna Gurevych. 2020. Mind the trade-off: Debiasing NLU models without degrading the in-distribution performance. arXiv preprint arXiv:2005.00315.
  • Lucy Vanderwende and William B. Dolan. 2006. What syntax can contribute in the entailment task. In Machine Learning Challenges. Evaluating Predictive Uncertainty, Visual Object Classification, and Recognising Textual Entailment, pages 205–216, Berlin, Heidelberg. Springer Berlin Heidelberg.
  • Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2018. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 353–355, Brussels, Belgium. Association for Computational Linguistics.
  • Adina Williams, Nikita Nangia, and Samuel Bowman. 2018. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1112–1122, New Orleans, Louisiana. Association for Computational Linguistics.
  • Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Remi Louf, Morgan Funtowicz, et al. 2019. Transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771.
  • Rowan Zellers, Yonatan Bisk, Roy Schwartz, and Yejin Choi. 2018. SWAG: A large-scale adversarial dataset for grounded commonsense inference. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 93– 104, Brussels, Belgium. Association for Computational Linguistics.
  • Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. 2019. HellaSwag: Can a machine really finish your sentence? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4791– 4800, Florence, Italy. Association for Computational Linguistics.
  • Yuan Zhang, Jason Baldridge, and Luheng He. 2019. PAWS: Paraphrase adversaries from word scrambling. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 1298–1308, Minneapolis, Minnesota. Association for Computational Linguistics.
Experimental details
  • Main model: We fine-tune the BERT base model for all settings (baseline, known-bias, and self-debiasing) using default parameters: 3 epochs of training with learning rate 5e-5. An exception is made for product-of-experts and confidence regularization, where we follow He et al. (2019) and run the training longer, i.e., 5 epochs (a minimal fine-tuning sketch follows this list).
  • HANS dataset (McCoy et al., 2019) consists of three subsets, covering different inference phenomena which happen to have lexical overlap: (a) Lexical overlap, e.g., "The doctor was paid by the actor" vs. "The doctor paid the actor"; (b) Subsequence, e.g., "The doctor near the actor danced" vs. "The actor danced"; and (c) Constituent, e.g., "If the artist slept, the actor ran" vs. "The artist slept". Each subset contains examples of both entailment and non-entailment. The 3-way predictions on MNLI are mapped to HANS by taking the max pool between the neutral and contradiction labels (a mapping sketch follows this list). We present the results of our experiments in Table 4.
  • Main model: We follow Schuster et al. (2019) in fine-tuning the BERT base model on the FEVER dataset using the following parameters: learning rate 2e-5 and 3 epochs of training.
  • Main model: We follow Utama et al. (2020) in setting the parameters for training a QQP model: learning rate 2e-5 and 3 epochs of training.
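A hedged sketch of the MNLI baseline fine-tuning described in the list above, using the Hugging Face transformers and datasets libraries; the batch size and maximum sequence length are assumptions not stated here, and this covers only the baseline objective, not the debiasing losses.

```python
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Baseline setup reported above: BERT base, 3 epochs, learning rate 5e-5
# (5 epochs for product-of-experts / confidence regularization).
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=3)

mnli = load_dataset("glue", "mnli")

def encode(batch):
    # max_length=128 is an assumption, not a value reported in the text.
    return tokenizer(batch["premise"], batch["hypothesis"],
                     truncation=True, max_length=128)

mnli = mnli.map(encode, batched=True)

args = TrainingArguments(
    output_dir="mnli-baseline",
    num_train_epochs=3,
    learning_rate=5e-5,
    per_device_train_batch_size=32,  # assumption
)

trainer = Trainer(model=model, args=args,
                  train_dataset=mnli["train"],
                  eval_dataset=mnli["validation_matched"],
                  tokenizer=tokenizer)
trainer.train()
```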
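A small sketch of the 3-way to 2-way mapping used for the HANS evaluation mentioned above, pooling neutral and contradiction into non-entailment; the MNLI label order assumed here (0 = entailment, 1 = neutral, 2 = contradiction) must match the label mapping used during training.

```python
import numpy as np

def to_hans_prediction(probs_3way):
    """Map 3-way MNLI probabilities to HANS' binary labels by max-pooling
    the neutral and contradiction scores into one non-entailment score."""
    entail = probs_3way[:, 0]
    non_entail = np.maximum(probs_3way[:, 1], probs_3way[:, 2])
    return np.where(entail >= non_entail, "entailment", "non-entailment")

# Example with two dummy prediction rows.
probs = np.array([[0.7, 0.2, 0.1],
                  [0.3, 0.5, 0.2]])
print(to_hans_prediction(probs))  # ['entailment' 'non-entailment']
```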
Authors
Prasetya Ajie Utama
Nafise Sadat Moosavi