ERASER: A Benchmark to Evaluate Rationalized NLP Models

Jay DeYoung
Sarthak Jain
Eric Lehman
Byron C. Wallace

ACL, pp. 4443-4458, 2020.


Abstract:

State-of-the-art models in NLP are now predominantly based on deep neural networks that are generally opaque in terms of how they come to specific predictions. This limitation has led to increased interest in designing more interpretable deep models for NLP that can reveal the `reasoning' underlying model outputs. But work in this direction has been conducted on different datasets and tasks, with correspondingly different aims and metrics, making it difficult to track progress; the Evaluating Rationales And Simple English Reasoning (ERASER) benchmark is proposed to address this.
Introduction
  • Interest has recently grown in designing NLP systems that can reveal why models make specific predictions.
  • The authors aim to address this issue by releasing a standardized benchmark of datasets — repurposed and augmented from pre-existing corpora, spanning a range of NLP tasks — and associated metrics for measuring different properties of rationales.
  • The authors refer to this as the Evaluating Rationales And Simple English Reasoning (ERASER) benchmark.
  • Figure example (Movie Reviews): “The acting is great! The soundtrack is run-of-the-mill, but the action more than makes up for it.” Label options: (a) Positive, (b) Negative; the figure also includes an e-SNLI example.
Highlights
  • Interest has recently grown in designing NLP systems that can reveal why models make specific predictions
  • In Evaluating Rationales And Simple English Reasoning we focus on rationales, i.e., snippets that support outputs
  • How best to measure rationale faithfulness is an open question. In this first version of Evaluating Rationales And Simple English Reasoning we propose simple metrics motivated by prior work (Zaidan et al., 2007; Yu et al., 2019); the definitions are sketched after this list
  • For models with high sufficiency scores (Movies, FEVER, Commonsense Explanations, and e-SNLI), we find that random removal is damaging to performance, indicating poor absolute ranking, whereas models with high comprehensiveness are sensitive to rationale length
  • We have introduced a new publicly available resource: the Evaluating Rationales And Simple English Reasoning (ERASER) benchmark
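    The faithfulness metrics referenced above can be summarized as follows. This is a paraphrase in our own notation rather than the paper's exact equations: m(x_i)_j is the model's predicted probability of class j for input x_i, r_i is the extracted rationale, and r_{i,k} is a rationale built from the top k% of tokens by importance score.

        \text{comprehensiveness} = m(x_i)_j - m(x_i \setminus r_i)_j
        \text{sufficiency} = m(x_i)_j - m(r_i)_j
        \text{AOPC comprehensiveness} \approx \frac{1}{|B|+1} \sum_{k \in B} \left( m(x_i)_j - m(x_i \setminus r_{i,k})_j \right)

    High comprehensiveness indicates that the prediction degrades once rationale tokens are removed; low sufficiency indicates the rationale alone supports the original prediction. The AOPC variant (Eq. 3 in the paper, per the Table 4 caption) averages the score over a set of bins B of top-scoring tokens, and the sufficiency version is defined analogously.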
Results
  • The authors present initial results for the baseline models discussed in Section 5, with respect to the metrics proposed in Section 4.
  • In Table 3 the authors evaluate models that perform discrete selection of rationales
  • The authors view these as inherently faithful, because, by construction, it is known which snippets the decoder used to make a prediction.
  • For these methods the authors report only metrics that measure agreement with human annotations; a computational sketch of such agreement follows this list.
  • In the authors' view, this highlights the need for models that can rationalize at varying levels of granularity, depending on what is appropriate.
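    A minimal sketch of token-level agreement between predicted and human rationales, in the spirit of the plausibility metrics mentioned above. This is illustrative Python, not the released ERASER evaluation code; the function name and example indices are our own.

        def token_f1(predicted_tokens, gold_tokens):
            """Precision, recall, and F1 over sets of token positions."""
            predicted, gold = set(predicted_tokens), set(gold_tokens)
            if not predicted and not gold:
                return 1.0, 1.0, 1.0
            overlap = len(predicted & gold)
            precision = overlap / len(predicted) if predicted else 0.0
            recall = overlap / len(gold) if gold else 0.0
            f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
            return precision, recall, f1

        # Example: predicted rationale covers tokens 3-7, human annotation marks 5-9.
        print(token_f1(range(3, 8), range(5, 10)))  # (0.6, 0.6, 0.6)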
Conclusion
  • The authors have introduced a new publicly available resource: the Evaluating Rationales And Simple English Reasoning (ERASER) benchmark.
  • This comprises seven datasets, all of which include both instance-level labels and corresponding supporting snippets (‘rationales’) marked by human annotators.
  • The authors believe these metrics provide reasonable means of comparing specific aspects of interpretability, but they view the problem of measuring faithfulness, in particular, as a topic ripe for additional research.
Tables
  • Table1: Overview of datasets in the ERASER benchmark. Tokens is the average number of tokens in each document.
  • Table2: Human agreement with respect to rationales. For Movie Reviews and BoolQ we calculate the mean agreement of individual annotators with the majority vote per token, over the two to three annotators we hired via Upwork and Amazon Turk, respectively. The e-SNLI dataset already comprised three annotators; for this we calculate mean agreement between individuals and the majority. For CoS-E, MultiRC, and FEVER, members of our team annotated a subset to use as a comparison to the (majority of, where appropriate) existing rationales. We collected comprehensive rationales for Evidence Inference from medical doctors; as they have a high amount of expertise, we would expect agreement to be high, but we have not collected redundant comprehensive annotations
  • Table3: Performance of models that perform hard rationale selection. All models are supervised at the rationale level except for those marked with (u), which learn only from instance-level supervision; † denotes cases in which rationale training degenerated due to the REINFORCE-style training. Perf. is accuracy (CoS-E) or macro-averaged F1 (others). Bert-To-Bert for CoS-E and e-SNLI uses a token classification objective; Bert-To-Bert for CoS-E uses the highest-scoring answer
  • Table4: Metrics for ‘soft’ scoring models. Perf. is accuracy (CoS-E) or F1 (others). Comprehensiveness and sufficiency are in terms of AOPC (Eq 3). ‘Random’ assigns random scores to tokens to induce orderings; these are averages over 10 runs. A computational sketch of comprehensiveness and sufficiency follows this list
  • Table5: Detailed breakdowns for each dataset: the number of documents, instances, evidence statements, and lengths. Additionally, we include the percentage of each relevant document that is considered a rationale. For test sets, counts are for all instances, including documents with non-comprehensive rationales
  • Table6: General dataset statistics: number of labels, instances, unique documents, and average numbers of sentences and tokens in documents, across the publicly released train/validation/test splits in ERASER. For CoS-E and e-SNLI, the sentence counts are not meaningful as the partitioning of question/sentence/answer formatting is an arbitrary choice in this framework
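    An illustrative sketch of how the comprehensiveness and sufficiency scores in Table 4 could be computed for a single instance. Here predict_proba is a hypothetical function mapping a token list to class probabilities; this is not the released ERASER code.

        def comprehensiveness_and_sufficiency(predict_proba, tokens, rationale_idx, label):
            """Contrast dropping vs. keeping only the rationale tokens via the change in p(label)."""
            rationale_idx = set(rationale_idx)
            full = predict_proba(tokens)[label]
            without_rationale = predict_proba(
                [t for i, t in enumerate(tokens) if i not in rationale_idx])[label]
            only_rationale = predict_proba(
                [t for i, t in enumerate(tokens) if i in rationale_idx])[label]
            comprehensiveness = full - without_rationale  # high: the rationale was needed
            sufficiency = full - only_rationale           # low: the rationale alone suffices
            return comprehensiveness, sufficiency

    The AOPC variants reported in Table 4 would average these quantities over rationales built from increasing fractions of the highest-scoring tokens.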
Related work
  • Interpretability in NLP is a large, fast-growing area; we do not attempt to provide a comprehensive overview here. Instead we focus on directions particularly relevant to ERASER, i.e., prior work on models that provide rationales for their predictions.

    Learning to explain. In ERASER we assume that rationales (marked by humans) are provided during training. However, such direct supervision will not always be available, motivating work on methods that can explain (or “rationalize”) model predictions using only instance-level supervision.

    In the context of modern neural models for text classification, one might use variants of attention (Bahdanau et al., 2015) to extract rationales. Attention mechanisms learn to assign soft weights to (usually contextualized) token representations, and so one can extract highly weighted tokens as rationales. However, attention weights do not in general provide faithful explanations for predictions (Jain and Wallace, 2019; Serrano and Smith, 2019; Wiegreffe and Pinter, 2019; Zhong et al., 2019; Pruthi et al., 2020; Brunner et al., 2020; Moradi et al., 2019; Vashishth et al., 2019). This likely owes to encoders entangling inputs, complicating the interpretation of attention weights on inputs over contextualized representations of the same.
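    A toy sketch of the heuristic described in the preceding paragraph: treat the most highly attended tokens as an extracted rationale. The tokens and weights below are invented for illustration; in practice the weights would come from a model's attention layer, and, as the work cited above argues, rationales obtained this way are not guaranteed to be faithful.

        def top_k_attention_rationale(tokens, attention_weights, k=3):
            """Return the k tokens with the largest attention weights, in sentence order."""
            ranked = sorted(range(len(tokens)), key=lambda i: attention_weights[i], reverse=True)
            return [tokens[i] for i in sorted(ranked[:k])]

        tokens = ["the", "acting", "is", "great", "!"]
        weights = [0.05, 0.30, 0.05, 0.50, 0.10]
        print(top_k_attention_rationale(tokens, weights))  # ['acting', 'great', '!']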
Funding
  • This work was supported in part by the NSF (CA- REER award 1750978), and by the Army Research Office (W911NF1810328)
Reference
  • David Alvarez-Melis and Tommi Jaakkola. 2017. A causal framework for explaining the predictions of black-box sequence-to-sequence models. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 412– 421.
  • Leila Arras, Franziska Horn, Gregoire Montavon, Klaus-Robert Muller, and Wojciech Samek. 2017. ”what is relevant in a text document?”: An interpretable machine learning approach. In PloS one.
  • Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings.
  • Joost Bastings, Wilker Aziz, and Ivan Titov. 2019. Interpretable neural predictions with differentiable binary variables. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2963–2977, Florence, Italy. Association for Computational Linguistics.
  • Iz Beltagy, Kyle Lo, and Arman Cohan. 2019. Scibert: Pretrained language model for scientific text. In EMNLP.
  • Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics.
  • Gino Brunner, Yang Liu, Damian Pascual, Oliver Richter, Massimiliano Ciaramita, and Roger Wattenhofer. 2020. On identifiability in transformers. In International Conference on Learning Representations.
  • Oana-Maria Camburu, Tim Rocktaschel, Thomas Lukasiewicz, and Phil Blunsom. 2018. e-SNLI: Natural language inference with natural language explanations. In Advances in Neural Information Processing Systems, pages 9539–9549.
  • Shiyu Chang, Yang Zhang, Mo Yu, and Tommi Jaakkola. 2019. A game theoretic approach to class-wise selective rationalization. In Advances in Neural Information Processing Systems, pages 10055–10065.
  • Sihao Chen, Daniel Khashabi, Wenpeng Yin, Chris Callison-Burch, and Dan Roth. 2019. Seeing things from a different angle: Discovering diverse perspectives about claims. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), pages 542– 557, Minneapolis, Minnesota.
  • Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1724– 1734, Doha, Qatar. Association for Computational Linguistics.
  • Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. 2019. Boolq: Exploring the surprising difficulty of natural yes/no questions. In NAACL.
  • Jacob Cohen. 1960. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1):37–46.
  • Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.
  • Yanzhuo Ding, Yang Liu, Huanbo Luan, and Maosong Sun. 2017. Visualizing and understanding neural machine translation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vancouver, Canada. Association for Computational Linguistics.
  • Finale Doshi-Velez and Been Kim. 2017. Towards a rigorous science of interpretable machine learning. arXiv preprint arXiv:1702.08608.
  • Mark Everingham, Luc Van Gool, Christopher K. I. Williams, John Winn, and Andrew Zisserman. 2010. The pascal visual object classes (voc) challenge. International Journal of Computer Vision, 88(2):303– 338.
  • Shi Feng, Eric Wallace, Alvin Grissom, Mohit Iyyer, Pedro Rodriguez, and Jordan L. Boyd-Graber. 2018. Pathologies of neural models make interpretation difficult. In EMNLP.
  • Matt Gardner, Joel Grus, Mark Neumann, Oyvind Tafjord, Pradeep Dasigi, Nelson F. Liu, Matthew Peters, Michael Schmitz, and Luke Zettlemoyer. 2018. AllenNLP: A deep semantic natural language processing platform. In Proceedings of Workshop for NLP Open Source Software (NLP-OSS), pages 1– 6, Melbourne, Australia. Association for Computational Linguistics.
  • Sepp Hochreiter and Jurgen Schmidhuber. 1997. Long short-term memory. Neural computation, 9(8):1735–1780.
  • Sara Hooker, Dumitru Erhan, Pieter-Jan Kindermans, and Been Kim. 2019. A benchmark for interpretability methods in deep neural networks. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alche-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32, pages 9737– 9748. Curran Associates, Inc.
  • Alon Jacovi and Yoav Goldberg. 2020. Towards faithfully interpretable nlp systems: How should we define and evaluate faithfulness? arXiv preprint arXiv:2004.03685.
  • Sarthak Jain and Byron C. Wallace. 2019. Attention is not Explanation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 3543–3556, Minneapolis, Minnesota. Association for Computational Linguistics.
  • Sarthak Jain, Sarah Wiegreffe, Yuval Pinter, and Byron C. Wallace. 2020. Learning to Faithfully Rationalize by Construction. In Proceedings of the Conference of the Association for Computational Linguistics (ACL).
  • Daniel Khashabi, Snigdha Chaturvedi, Michael Roth, Shyam Upadhyay, and Dan Roth. 2018. Looking Beyond the Surface: A Challenge Set for Reading Comprehension over Multiple Sentences. In Proc. of the Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL).
  • Diederik Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. International Conference on Learning Representations.
  • Eric Lehman, Jay DeYoung, Regina Barzilay, and Byron C Wallace. 2019. Inferring which medical treatments work from reports of clinical trials. In Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL), pages 3705–3717.
  • Tao Lei, Regina Barzilay, and Tommi Jaakkola. 2016. Rationalizing neural predictions. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 107–117.
  • Jiwei Li, Xinlei Chen, Eduard Hovy, and Dan Jurafsky. 2016. Visualizing and understanding neural models in NLP. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 681–691, San Diego, California. Association for Computational Linguistics.
  • Zachary C Lipton. 2016. The mythos of model interpretability. arXiv preprint arXiv:1606.03490.
  • Scott M Lundberg and Su-In Lee. 2017. A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems, pages 4765–4774.
  • Tyler McDonnell, Mucahid Kutlu, Tamer Elsayed, and Matthew Lease. 2017. The many benefits of annotator rationales for relevance judgments. In IJCAI, pages 4909–4913.
  • Tyler McDonnell, Matthew Lease, Mucahid Kutlu, and Tamer Elsayed. 2016. Why is that relevant? collecting annotator rationales for relevance judgments. In Fourth AAAI Conference on Human Computation and Crowdsourcing.
  • Gregoire Montavon, Sebastian Lapuschkin, Alexander Binder, Wojciech Samek, and Klaus-Robert Muller. 2017. Explaining nonlinear classification decisions with deep taylor decomposition. Pattern Recognition, 65:211–222.
  • Pooya Moradi, Nishant Kambhatla, and Anoop Sarkar. 2019. Interrogating the explanatory power of attention in neural machine translation. In Proceedings of the 3rd Workshop on Neural Generation and Translation, pages 221–230, Hong Kong. Association for Computational Linguistics.
  • Mark Neumann, Daniel King, Iz Beltagy, and Waleed Ammar. 2019. Scispacy: Fast and robust models for biomedical natural language processing. CoRR, abs/1902.07669.
  • Dong Nguyen. 2018. Comparing automatic and human evaluation of local explanations for text classification. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1069–1078.
  • Bo Pang and Lillian Lee. 2004. A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL-04), pages 271– 278, Barcelona, Spain.
  • Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. 2019. Pytorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, pages 8024–8035.
  • David J Pearce. 2005. An improved algorithm for finding the strongly connected components of a directed graph. Technical report, Victoria University, NZ.
  • Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. Glove: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, Doha, Qatar. Association for Computational Linguistics.
  • Danish Pruthi, Mansi Gupta, Bhuwan Dhingra, Graham Neubig, and Zachary C. Lipton. 2020. Learning to deceive with attention-based explanations. In Annual Conference of the Association for Computational Linguistics (ACL).
  • Sampo Pyysalo, F Ginter, Hans Moen, T Salakoski, and Sophia Ananiadou. 2013. Distributional semantics resources for biomedical text processing. Proceedings of Languages in Biology and Medicine.
  • Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training.
  • Nazneen Fatema Rajani, Bryan McCann, Caiming Xiong, and Richard Socher. 2019. Explain yourself! leveraging language models for commonsense reasoning. Proceedings of the Association for Computational Linguistics (ACL).
  • Marco Ribeiro, Sameer Singh, and Carlos Guestrin. 2016. why should i trust you?: Explaining the predictions of any classifier. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations, pages 97–101.
  • Wojciech Samek, Alexander Binder, Gregoire Montavon, Sebastian Lapuschkin, and Klaus-Robert Muller. 2016. Evaluating the visualization of what a deep neural network has learned. IEEE transactions on neural networks and learning systems, 28(11):2660–2673.
  • Tal Schuster, Darsh J Shah, Yun Jie Serene Yeo, Daniel Filizzola, Enrico Santus, and Regina Barzilay. 2019. Towards debiasing fact verification models. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics.
  • Sofia Serrano and Noah A. Smith. 2019. Is attention interpretable? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2931–2951, Florence, Italy. Association for Computational Linguistics.
  • Burr Settles. 2012. Active learning. Synthesis Lectures on Artificial Intelligence and Machine Learning, 6(1):1–114.
  • Manali Sharma, Di Zhuang, and Mustafa Bilgic. 2015. Active learning with rationales for text classification. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 441–451.
  • Kevin Small, Byron C Wallace, Carla E Brodley, and Thomas A Trikalinos. 2011. The constrained weight space svm: learning with ranked features. In Proceedings of the International Conference on International Conference on Machine Learning (ICML), pages 865–872.
  • D. Smilkov, N. Thorat, B. Kim, F. Viegas, and M. Wattenberg. 2017. SmoothGrad: removing noise by adding noise. ICML workshop on visualization for deep learning.
  • Robyn Speer. 2019. ftfy. Zenodo. Version 5.5.
  • Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15:1929–1958.
  • Julia Strout, Ye Zhang, and Raymond Mooney. 2019. Do human rationales improve machine explanations? In Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 56–62, Florence, Italy. Association for Computational Linguistics.
  • Mukund Sundararajan, Ankur Taly, and Qiqi Yan. 2017. Axiomatic attribution for deep networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 3319–3328. JMLR. org.
  • Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. 2019. CommonsenseQA: A question answering challenge targeting commonsense knowledge. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4149–4158, Minneapolis, Minnesota. Association for Computational Linguistics.
  • James Thorne, Andreas Vlachos, Christos Christodoulopoulos, and Arpit Mittal. 2018. FEVER: a Large-scale Dataset for Fact Extraction and VERification. In Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL), pages 809–819.
  • Shikhar Vashishth, Shyam Upadhyay, Gaurav Singh Tomar, and Manaal Faruqui. 2019. Attention interpretability across nlp tasks. arXiv preprint arXiv:1909.11218.
  • Byron C Wallace, Kevin Small, Carla E Brodley, and Thomas A Trikalinos. 2010. Active learning for biomedical citation screening. In Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 173– 182. ACM.
  • Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2019a. Superglue: A stickier benchmark for general-purpose language understanding systems. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alche-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32, pages 3266–3280. Curran Associates, Inc.
  • Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2019b. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In International Conference on Learning Representations.
  • Sarah Wiegreffe and Yuval Pinter. 2019. Attention is not not explanation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLPIJCNLP), pages 11–20, Hong Kong, China. Association for Computational Linguistics.
  • Ronald J Williams. 1992. Simple statistical gradientfollowing algorithms for connectionist reinforcement learning. Machine learning, 8(3-4):229–256.
  • Ronald J Williams and David Zipser. 1989. A learning algorithm for continually running fully recurrent neural networks. Neural computation, 1(2):270– 280.
  • Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. 2019. Huggingface’s transformers: State-of-the-art natural language processing. ArXiv, abs/1910.03771.
  • Mo Yu, Shiyu Chang, Yang Zhang, and Tommi Jaakkola. 2019. Rethinking cooperative rationalization: Introspective extraction and complement control. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4094–4103, Hong Kong, China. Association for Computational Linguistics.
  • Omar Zaidan, Jason Eisner, and Christine Piatko. 2007. Using annotator rationales to improve machine learning for text categorization. In Proceedings of the conference of the North American chapter of the Association for Computational Linguistics (NAACL), pages 260–267.
  • Omar F Zaidan and Jason Eisner. 2008. Modeling annotators: A generative approach to learning from annotator rationales. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 31–40.
  • Ye Zhang, Iain Marshall, and Byron C Wallace. 2016. Rationale-augmented convolutional neural networks for text classification. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), volume 2016, page 795. NIH Public Access.
  • Ruiqi Zhong, Steven Shao, and Kathleen McKeown. 2019. Fine-grained sentiment analysis with faithful attention. arXiv preprint arXiv:1908.06870.