F1 is Not Enough! Models and Evaluation Towards User-Centered Explainable Question Answering

EMNLP 2020, pp. 7076–7095 (2020)


Abstract

Explainable question answering systems predict an answer together with an explanation showing why the answer has been selected. The goal is to enable users to assess the correctness of the system and understand its reasoning process. However, we show that current models and evaluation settings have shortcomings regarding the coupling of answer and explanation.

Introduction
  • Understanding the decisions of deep learning models is of utmost importance, especially when they are deployed in critical domains, such as medicine or finance (Ribeiro et al., 2016).
  • A good explanation should satisfy the following requirements: (i) it should contain all information that the model used to predict the answer to the question.
  • This is necessary so that the user can reconstruct the model’s reasoning process.
  • Previous work on XQA mostly focuses on developing models that predict the correct answer and, independent of this, the correct explanation (Yang et al., 2018; Qi et al., 2019; Shao et al., 2020)
  • This can lead to model outputs in which the explanations do not sufficiently relate to the answers.
  • To strengthen the coupling of answer and explanation prediction in the model architecture and during training, the authors propose two novel approaches in this paper: (i) a hierarchical neural network architecture for XQA that ensures that only information included in the explanation is used to predict the answer to the question, and (ii) a regularization term for the loss function that explicitly couples answer and explanation prediction during training
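
Below is a minimal sketch (not the authors' implementation) of the coupling idea behind such a hierarchical architecture: the answer span head only sees tokens from sentences that the fact head selects as the explanation, so any predicted answer must come from the explanation. All names, tensor shapes, and the hard 0.5 threshold are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ExplanationConditionedQA(nn.Module):
    """Sketch of an explanation-conditioned answer predictor: the answer
    span can only be located inside sentences selected as the explanation."""

    def __init__(self, hidden_size: int = 128):
        super().__init__()
        self.fact_head = nn.Linear(hidden_size, 1)   # per-sentence relevance logit
        self.start_head = nn.Linear(hidden_size, 1)  # answer-start logit per token
        self.end_head = nn.Linear(hidden_size, 1)    # answer-end logit per token

    def forward(self, token_repr, sent_repr, token_to_sent):
        # token_repr:    (num_tokens, hidden)  contextual token encodings
        # sent_repr:     (num_sents, hidden)   sentence encodings
        # token_to_sent: (num_tokens,)         sentence index of each token
        fact_logits = self.fact_head(sent_repr).squeeze(-1)
        selected = torch.sigmoid(fact_logits) > 0.5   # hard selection for clarity;
                                                      # a soft mask keeps training differentiable
        token_mask = selected[token_to_sent]          # (num_tokens,)
        neg_inf = torch.finfo(token_repr.dtype).min
        start_logits = self.start_head(token_repr).squeeze(-1).masked_fill(~token_mask, neg_inf)
        end_logits = self.end_head(token_repr).squeeze(-1).masked_fill(~token_mask, neg_inf)
        return fact_logits, start_logits, end_logits
```

Masking the answer logits outside the selected sentences is only one way to realize the constraint that the answer must come from the explanation; the regularization term from approach (ii) instead couples the two predictions through the training loss.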
Highlights
  • Understanding the decisions of deep learning models is of utmost importance, especially when they are deployed in critical domains, such as medicine or finance (Ribeiro et al., 2016)
  • To strengthen the coupling of answer and explanation prediction in the model architecture and during training, we propose two novel approaches in this paper: (i) a hierarchical neural network architecture for XQA that ensures that only information included in the explanation is used to predict the answer to the question, and (ii) a regularization term for the loss function that explicitly couples answer and explanation prediction during training
  • To quantify the model’s answer-explanation coupling, we propose two novel evaluation scores: (i) the Fact-Removal Score (FARM), which tracks prediction changes when removing facts, and (ii) the Location Score (LOCA), which assesses whether the answer is contained in the explanation or not (a minimal sketch of both scores follows this list)
  • We investigated explainable question answering, revealing that existing models lack an explicit coupling of answers and explanations and that evaluation scores used in related work fail to quantify that
  • This severely impairs their applicability in real-life scenarios with human users. We addressed both modeling and evaluation, proposing a hierarchical neural architecture, a regularization term, and two new evaluation scores
  • Our user study showed that our models help the users assess their correctness and that our proposed evaluation scores are better correlated with user experience than standard measures like F1
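
The following is a minimal sketch of how the two scores described above could be computed for a single prediction; it only implements the informal descriptions given here, so the exact definitions (normalization, which facts are removed, aggregation over the data set) should be taken from the paper. The helper `model_predict` is a hypothetical stand-in for a trained XQA model.

```python
from typing import Callable, List

def loca_score(predicted_answer: str, explanation_sentences: List[str]) -> float:
    """Location score (LOCA): does the predicted answer occur in the explanation?"""
    return 1.0 if predicted_answer in " ".join(explanation_sentences) else 0.0

def farm_score(question: str,
               context_sentences: List[str],
               predicted_fact_ids: List[int],
               model_predict: Callable[[str, List[str]], str]) -> float:
    """Fact-removal score (FARM), sketched as: remove the sentences the model
    marked as relevant facts and check whether its answer changes. A model whose
    answer is actually grounded in its explanation should change its prediction."""
    original = model_predict(question, context_sentences)
    reduced = [s for i, s in enumerate(context_sentences)
               if i not in set(predicted_fact_ids)]
    return 1.0 if model_predict(question, reduced) != original else 0.0
```

Averaging such per-example values over an evaluation set would give corpus-level scores.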
Methods
  • The authors built upon the model by Qi et al. (2019) as it is an improved version of the BiDAF++ model, which is used in numerous state-of-the-art XQA models (Yang et al., 2018; Qi et al., 2019; Nishida et al., 2019; Ye et al., 2019; Qiu et al., 2019), including the best-scoring publication (Shao et al., 2020)
  • It consists of a question and context encoding part with self-attention, followed by two prediction heads: a prediction of relevant facts and a prediction of the answer to the question.
  • Because answer and explanation are predicted independently, the predicted answer does not necessarily occur in the explanation, leaving the user uninformed about where it came from
Results
  • The authors describe the data set used in the experiments as well as the results.
  • The false positive ratio deserves particular attention as a false positive corresponds to a user thinking the model answer is correct while it is not.
  • Such an error can be dangerous in safety-critical domains.
Conclusion
  • The authors investigated explainable question answering, revealing that existing models lack an explicit coupling of answers and explanations and that evaluation scores used in related work fail to quantify that.
  • The authors' user study showed that the models help the users assess their correctness and that the proposed evaluation scores are better correlated with user experience than standard measures like F1
Tables
  • Table 1: Comparison of our methods to Qi et al. (2019) regarding evaluation scores from related work and our proposed scores on the distractor dev set (SP: supporting facts). All values in %
  • Table 2: The table shows whether sorting the conditions by a human score (rows) and an automatized score (columns) results in the same order (+), the inverse order (−) or a different order (blank cell). Green cells with circles mark desirable relations, red cells without circles mark undesirable relations
  • Table 3: Comparison of the HOTPOTQA baseline model by Yang et al. (2018) and the modified model by Qi et al. (2019). The modified model outperforms the baseline on all scores. All values in %
  • Table 4: Questions and statements shown to the participants for (a) each question (upper part) and (b) in the post questionnaire (lower part). Statements were presented along with the prompt “Please rate how much you disagree/agree to each of the following statements”
  • Table 5: Pearson correlations between human and automatized scores
Funding
  • Ngoc Thang Vu is funded by Carl Zeiss Foundation
Study subjects and analysis
participants: 40
6.4 Participants and Data Cleaning. We collect the ratings of 40 participants (16 female, 24 male) with a mean age of 26.6 years (SD = 3.4). We filter out all responses with a completion time smaller than 15 seconds or larger than 5 minutes as this indicates that the participant did not read the whole explanation or was interrupted during the study
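
A minimal sketch of this data-cleaning rule; the record format is an assumption for illustration only.

```python
# Keep a response only if its completion time lies between 15 seconds and 5 minutes.
def keep_response(completion_time_seconds: float) -> bool:
    return 15 <= completion_time_seconds <= 5 * 60

# Hypothetical (participant_id, completion_time_in_seconds) records.
responses = [("p01", 12.0), ("p02", 48.5), ("p03", 340.0)]
cleaned = [r for r in responses if keep_response(r[1])]  # keeps only ("p02", 48.5)
```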

cases: 4
The regularization term is the expected cost over the model's answer and explanation predictions:

p_a · (1 − p_e) · c_1 + (1 − p_a) · (p_e · c_2 + (1 − p_e) · c_3)    (1)

with p_a corresponding to the probability of the model for the correct answer span and p_e denoting the probability of the model for the ground truth relevant facts. The term can be broken down into four cases: (i) correct answer and ground truth explanation, (ii) correct answer but non-ground truth explanation, (iii) incorrect answer but ground truth explanation and (iv) incorrect answer and non-ground truth explanation. Each case corresponds to a constant cost of 0, c_1, c_2 and c_3, respectively, with c_1, c_2, c_3 being hyperparameters; the first summand covers the correct-answer cases and the second, labelled “wrong answer”, covers the incorrect-answer cases.
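
A minimal sketch of the expected-cost term reconstructed above; the variable names mirror p_a, p_e and the cost hyperparameters, and the example values are illustrative only.

```python
def coupling_regularizer(p_a: float, p_e: float,
                         c1: float, c2: float, c3: float) -> float:
    """Expected cost over the four cases: (correct answer, ground-truth explanation) -> 0,
    (correct, non-ground-truth) -> c1, (incorrect, ground-truth) -> c2,
    (incorrect, non-ground-truth) -> c3."""
    return (p_a * p_e * 0.0
            + p_a * (1 - p_e) * c1
            + (1 - p_a) * p_e * c2
            + (1 - p_a) * (1 - p_e) * c3)

# A confident correct answer paired with an unlikely ground-truth explanation
# is penalized mainly through the c1 term.
print(coupling_regularizer(p_a=0.9, p_e=0.1, c1=1.0, c2=1.0, c3=2.0))
```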

Wikipedia articles: 10
The HOTPOTQA data set is a multi-hop open-domain explainable question answering data set containing 113k questions with crowd-sourced annotations. Each instance of the training data contains a question, a context consisting of the first paragraph of ten Wikipedia articles, the annotated answer and an explanation in the form of a selection of relevant sentences from the context. As HOTPOTQA was designed as a multi-hop data set, answering a question requires combining information from more than one article. (In HOTPOTQA, answers can stem from article titles although titles are never used as relevant facts.)
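
As a minimal sketch of the instance structure described above, the snippet below uses the field names of the commonly distributed HOTPOTQA JSON files ("question", "answer", "context" as [title, sentences] pairs, "supporting_facts" as [title, sentence index] pairs); the example content is simplified and illustrative, not copied from the data set.

```python
from typing import Dict, List

# One simplified instance: question, answer, titled paragraphs given as
# sentence lists, and supporting facts pointing to (title, sentence index) pairs.
instance: Dict = {
    "question": "Which magazine was started first, Arthur's Magazine or First for Women?",
    "answer": "Arthur's Magazine",
    "context": [
        ["Arthur's Magazine", ["Arthur's Magazine was a 19th-century American literary periodical.",
                               "It was published in Philadelphia."]],
        ["First for Women", ["First for Women is a woman's magazine launched in 1989."]],
        # ... eight more [title, [sentences]] paragraphs in the real data
    ],
    "supporting_facts": [["Arthur's Magazine", 0], ["First for Women", 0]],
}

def gold_explanation(inst: Dict) -> List[str]:
    """Collect the sentences marked as supporting facts (the gold explanation)."""
    sentences_by_title = dict(inst["context"])
    return [sentences_by_title[title][idx] for title, idx in inst["supporting_facts"]]

print(gold_explanation(instance))
```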

subjects: 3
Moreover, the study provides another way to compare our proposed methods to the model by Qi et al. (2019) and the ground truth explanations. In contrast to the human evaluation from Chen et al. (2019), we evaluate explanations in the context of the model answer, ask participants to rate the predictions along multiple dimensions and collect responses from 40 instead of 3 subjects.

Reference
  • Ashraf Abdul, Jo Vermeulen, Danding Wang, Brian Y. Lim, and Mohan Kankanhalli. 2018. Trends and trajectories for explainable, accountable and intelligible systems: An hci research agenda. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems, page 1–18, New York, NY, USA. Association for Computing Machinery.
  • Amina Adadi and Mohammed Berrada. 2018. Peeking inside the black-box: A survey on explainable artificial intelligence (XAI). IEEE Access, 6:52138– 52160.
  • Akari Asai, Kazuma Hashimoto, Hannaneh Hajishirzi, Richard Socher, and Caiming Xiong. 2020. Learning to retrieve reasoning paths over wikipedia graph for question answering. In 8th International Conference on Learning Representations, Addis Ababa, Ethiopia.
  • Or Biran and Kathleen R. McKeown. 2017. Humancentric justification of machine learning predictions. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, pages 1461–1467, Melbourne, Australia. International Joint Conferences on Artificial Intelligence.
  • Adrian Bussone, Simone Stumpf, and Dympna O’Sullivan. 2015. The role of explanations on trust and reliance in clinical decision support systems. In 2015 International Conference on Healthcare Informatics, pages 160–169, Dallas, TX, USA. IEEE Computer Society.
  • Chris Callison-Burch, Miles Osborne, and Philipp Koehn. 2006. Re-evaluating the role of Bleu in machine translation research. In 11th Conference of the European Chapter of the Association for Computational Linguistics, Trento, Italy. Association for Computational Linguistics.
  • Oana-Maria Camburu, Tim Rocktäschel, Thomas Lukasiewicz, and Phil Blunsom. 2018. e-SNLI: Natural language inference with natural language explanations. In Annual Conference on Neural Information Processing Systems 2018, pages 9560–9572, Montreal, Canada. Advances in Neural Information Processing Systems.
  • Jifan Chen, Shih-Ting Lin, and Greg Durrett. 2019. Multi-hop question answering via reasoning chains. Computing Research Repository, arXiv:1910.02610.
  • Davide Chicco and Giuseppe Jurman. 2020. The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation. BMC Genomics, 21(1):6.
  • Christopher Clark and Matt Gardner. 2018. Simple and effective multi-paragraph reading comprehension. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 845–855, Melbourne, Australia. Association for Computational Linguistics.
  • Jeremie Clos, Nirmalie Wiratunga, and Stewart Massie. 2017. Towards explainable text classification by jointly learning lexicon and modifier terms. In IJCAI-17 Workshop on Explainable AI (XAI), page 19.
  • Henriette S. M. Cramer, Vanessa Evers, Satyan Ramlal, Maarten van Someren, Lloyd Rutledge, Natalia Stash, Lora Aroyo, and Bob J. Wielinga. 2008. The effects of transparency on trust in and acceptance of a content-based art recommender. User Model. User-Adapt. Interact., 18(5):455–496.
  • Bhuwan Dhingra, Manzil Zaheer, Vidhisha Balachandran, Graham Neubig, Ruslan Salakhutdinov, and William W. Cohen. 2020. Differentiable reasoning over a virtual knowledge base. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia.
  • Yuwei Fang, Siqi Sun, Zhe Gan, Rohit Pillai, Shuohang Wang, and Jingjing Liu. 2019. Hierarchical graph network for multi-hop question answering. Computing Research Repository, arXiv:1911.03631.
  • Miriam Greis, Jakob Karolus, Hendrik Schuff, Paweł W. Wozniak, and Niels Henze. 2017a. Detecting uncertain input using physiological sensing and behavioral measurements. In Proceedings of the 16th International Conference on Mobile and Ubiquitous Multimedia, MUM ’17, page 299–304, New York, NY, USA. Association for Computing Machinery.
  • Miriam Greis, Hendrik Schuff, Marius Kleiner, Niels Henze, and Albrecht Schmidt. 2017b. Input controls for entering uncertain data: Probability distribution sliders. volume 1, New York, NY, USA. Association for Computing Machinery.
  • David Hand and Peter Christen. 2018. A note on using the f-measure for evaluating record linkage algorithms. Stat. Comput., 28(3):539–547.
  • Jonathan L. Herlocker, Joseph A. Konstan, and John Riedl. 2000. Explaining collaborative filtering recommendations. In Proceedings of the 2000 ACM Conference on Computer Supported Cooperative Work, CSCW ’00, page 241–250, New York, NY, USA. Association for Computing Machinery.
  • Been Kim, Oluwasanmi Koyejo, and Rajiv Khanna. 2016. Examples are not enough, learn to criticize! Criticism for interpretability. In Annual Conference on Neural Information Processing Systems 2016, pages 2280–2288, Barcelona, Spain. Advances in Neural Information Processing Systems.
  • Todd Kulesza, Simone Stumpf, Margaret Burnett, and Irwin Kwan. 2012. Tell me more? The effects of mental model soundness on personalizing an intelligent agent. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI ’12, page 1–10, New York, NY, USA. Association for Computing Machinery.
  • Isaac Lage, Emily Chen, Jeffrey He, Menaka Narayanan, Been Kim, Samuel J Gershman, and Finale Doshi-Velez. 2019. Human evaluation of models built for interpretability. In Proceedings of the AAAI Conference on Human Computation and Crowdsourcing, volume 7, pages 59–67.
  • Brian Y. Lim, Anind K. Dey, and Daniel Avrahami. 2009. Why and why not explanations improve the intelligibility of context-aware intelligent systems. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI ’09, page 2119–2128, New York, NY, USA. Association for Computing Machinery.
  • Chia-Wei Liu, Ryan Lowe, Iulian Serban, Mike Noseworthy, Laurent Charlin, and Joelle Pineau. 2016. How NOT to evaluate your dialogue system: An empirical study of unsupervised evaluation metrics for dialogue response generation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2122–2132, Austin, Texas. Association for Computational Linguistics.
  • Scott M. Lundberg and Su-In Lee. 2017. A unified approach to interpreting model predictions. In Annual Conference on Neural Information Processing Systems 2017, pages 4765–4774, Long Beach, CA, USA. Advances in Neural Information Processing Systems.
  • Sewon Min, Victor Zhong, Luke Zettlemoyer, and Hannaneh Hajishirzi. 2019. Multi-hop reading comprehension through question decomposition and rescoring. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 6097–6109, Florence, Italy. Association for Computational Linguistics.
  • Kosuke Nishida, Kyosuke Nishida, Masaaki Nagata, Atsushi Otsuka, Itsumi Saito, Hisako Asano, and Junji Tomita. 2019. Answering while summarizing: Multi-task learning for multi-hop QA with evidence extraction. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2335–2345, Florence, Italy. Association for Computational Linguistics.
  • Mahsan Nourani, Samia Kabir, Sina Mohseni, and Eric D Ragan. 2019. The effects of meaningful and meaningless explanations on trust and perceived system accuracy in intelligent systems. In Proceedings of the AAAI Conference on Human Computation and Crowdsourcing, volume 7, pages 97–105.
  • Kishore Papineni, Salim Roukos, Todd Ward, and WeiJing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics.
  • Ethan Perez, Patrick Lewis, Wen-tau Yih, Kyunghyun Cho, and Douwe Kiela. 2020. Unsupervised question decomposition for question answering. Computing Research Repository, arXiv:2002.09758.
  • Peng Qi, Xiaowen Lin, Leo Mehr, Zijian Wang, and Christopher D. Manning. 2019. Answering complex open-domain questions through iterative query generation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2590–2602, Hong Kong, China. Association for Computational Linguistics.
  • Peng Qian, Xipeng Qiu, and Xuanjing Huang. 2016. A new psychometric-inspired evaluation metric for Chinese word segmentation. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2185–2194, Berlin, Germany. Association for Computational Linguistics.
  • Lin Qiu, Yunxuan Xiao, Yanru Qu, Hao Zhou, Lei Li, Weinan Zhang, and Yong Yu. 2019. Dynamically fused graph network for multi-hop reasoning. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 6140–6150, Florence, Italy. Association for Computational Linguistics.
  • Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2016. “Why should I trust you?”: Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’16, page 1135–1144, New York, NY, USA. Association for Computing Machinery.
  • Min Joon Seo, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. 2017. Bidirectional attention flow for machine comprehension. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France.
  • Nan Shao, Yiming Cui, Ting Liu, Shijin Wang, and Guoping Hu. 2020. Is graph structure necessary for multi-hop reasoning? Computing Research Repository, arXiv:2004.03096.
  • Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. 2014. Deep inside convolutional networks: Visualising image classification models and saliency maps. In 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada.
  • Rashmi Sinha and Kirsten Swearingen. 2002. The role of transparency in recommender systems. In CHI ’02 Extended Abstracts on Human Factors in Computing Systems, CHI EA ’02, page 830–831, New York, NY, USA. Association for Computing Machinery.
  • Marina Sokolova, Nathalie Japkowicz, and Stan Szpakowicz. 2006. Beyond accuracy, f-score and ROC: A family of discriminant measures for performance evaluation. In AI 2006: Advances in Artificial Intelligence, 19th Australian Joint Conference on Artificial Intelligence, volume 4304 of Lecture Notes in Computer Science, pages 1015–1021, Hobart, Australia. Springer.
  • Felix Stahlberg, Danielle Saunders, and Bill Byrne. 2018. An operation sequence model for explainable neural machine translation. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 175–186, Brussels, Belgium. Association for Computational Linguistics.
  • Mukund Sundararajan, Ankur Taly, and Qiqi Yan. 2017. Axiomatic attribution for deep networks. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, volume 70 of Proceedings of Machine Learning Research, pages 3319–3328, Sydney, NSW, Australia.
  • Ming Tu, Kevin Huang, Guangtao Wang, Jing Huang, Xiaodong He, and Bowen Zhou. 2019.
  • Xin Wang, Wenhu Chen, Yuan-Fang Wang, and William Yang Wang. 2018. No metrics are perfect: Adversarial reward learning for visual storytelling. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 899–909, Melbourne, Australia. Association for Computational Linguistics.
  • Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. 2018. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2369–2380, Brussels, Belgium. Association for Computational Linguistics.
  • Deming Ye, Yankai Lin, Zhenghao Liu, Zhiyuan Liu, and Maosong Sun. 2019. Multi-paragraph reasoning with knowledge-enhanced graph neural network. Computing Research Repository, arXiv:1911.02170.
  • Chen Zhao, Chenyan Xiong, Corby Rosset, Xia Song, Paul N. Bennett, and Saurabh Tiwary. 2020. Transformer-xh: Multi-evidence reasoning with extra hop attention. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia.
Author
Hendrik Schuff