Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering

EMNLP, pp. 2381-2391, 2018.

Abstract:

We present a new kind of question answering dataset, OpenBookQA, modeled after open book exams for assessing human understanding of a subject. The open book that comes with our questions is a set of 1329 elementary level science facts. Roughly 6000 questions probe an understanding of these facts and their application to novel situations.

Introduction
  • Open book exams are a common mechanism for assessing human understanding of a subject, where test takers are allowed free access to a relevant book, study guide, or class notes when answering questions
  • In this context, the goal is not to evaluate memorization but a deeper understanding of the material and its application to new situations (Jenkins, 1995; Landsberger, 1996).
Highlights
  • Open book exams are a common mechanism for assessing human understanding of a subject, where test takers are allowed free access to a relevant book, study guide, or class notes when answering questions
  • Building upon a recent neural model for incorporating external knowledge in the story cloze setting (Mihaylov and Frank, 2018), we propose a knowledge-aware neural baseline that can utilize both the open book F and common knowledge retrieved from sources such as ConceptNet (Speer et al, 2017)
  • We present a new dataset, OpenBookQA, of about 6000 questions for open book question answering
  • The task focuses on the challenge of combining a corpus of provided science facts with external broad common knowledge
  • We show that this dataset requires simple common knowledge beyond the provided core facts, as well as multi-hop reasoning combining the two
  • The true human accuracy on Q is at least H(Q) − 3% with probability over 98.8%, and at least H(Q) − 2.5% with probability 95.6%; we report the former as our conservative estimate of human performance
  • While simple neural methods are able to achieve an accuracy of about 50%, this is still far from the human performance of 92% on this task
Results
  • Motivated by recent findings on the gameability of NLP datasets (Gururangan et al., 2018), the authors develop and evaluate simple, attention-based neural baselines, including a plausible-answer detector and an odd-one-out solver.
  • An oracle experiment using the gold science fact f for each question q (marked by the question author) and the additional knowledge k needed for q provides valuable insight into the nature of this dataset: facts from the open book F are valuable (a 5% improvement) but not sufficient
  • Using both f and k increases the accuracy to 76%, which is still far from human-level performance, suggesting the need for non-trivial reasoning to combine these facts (a minimal overlap-based sketch of this oracle setup follows this list)
  • While simple neural methods are able to achieve an accuracy of about 50%, this is still far from the human performance of 92% on this task
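As a concrete illustration of the oracle setup mentioned above, the following minimal Python sketch scores each answer choice by its token overlap with the gold science fact f and the additional fact k. This is only a bag-of-words toy under stated assumptions, not the paper's neural solver; the example question and facts are paraphrased in the style of the dataset, and all function names are illustrative.

    # Toy oracle scorer (not the paper's model): rank each answer choice by the
    # token overlap of (question + choice) with the gold science fact f and the
    # additional common-knowledge fact k.
    def tokens(text):
        return set(text.lower().replace("?", "").replace(".", "").split())

    def score_choice(question, choice, facts):
        hypothesis = tokens(question) | tokens(choice)
        support = set().union(*(tokens(f) for f in facts))
        return len(hypothesis & support)

    def answer(question, choices, facts):
        return max(choices, key=lambda c: score_choice(question, c, facts))

    question = "Which of these would let the most heat travel through?"
    choices = ["a new pair of jeans", "a steel spoon in a cafeteria",
               "cotton candy at a store", "a calvin klein cotton hat"]
    f = "metal is a thermal conductor"        # gold science fact from the open book
    k = "a steel spoon is made of metal"      # additional common knowledge
    print(answer(question, choices, [f, k]))  # -> "a steel spoon in a cafeteria"

Even this crude heuristic benefits from having both f and k available, which mirrors the oracle trend reported above; the 76% figure itself, of course, comes from a trained neural reader rather than from such a heuristic.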
Conclusion
  • The authors present a new dataset, OpenBookQA, of about 6000 questions for open book question answering.
  • The task focuses on the challenge of combining a corpus of provided science facts with external broad common knowledge
  • The authors show that this dataset requires simple common knowledge beyond the provided core facts, as well as multi-hop reasoning combining the two.
  • While simple neural methods are able to achieve an accuracy of about 50%, this is still far from the human performance of 92% on this task.
  • The authors leave closing this gap for future research, and illustrate, via oracle-style experiments, the potential of better retrieval and reasoning on this task
Tables
  • Table 1: Statistics for the full OpenBookQA dataset. Parenthetical numbers next to each average are the corresponding maximum values
  • Table 2: Percentage of questions and facts for the five most common types of additional facts. Note that % Questions does not add up to 100%, since we count the percentage of questions where at least one such fact is needed
  • Table 3: Example training questions (with their correct choices marked) along with the facts and reasoning needed. In the last example, the science fact states that lhs=“source of light becomes closer” implies rhs=“source will appear brighter”. Grounding this rule with the common-knowledge fact produces a new rule: “As headlights of the car come closer, headlights will appear brighter” (a toy grounding sketch follows this list)
  • Table 4: Scores obtained by various solvers on OpenBookQA, reported as a percentage ± the standard deviation across 5 runs with different random seeds. Other baselines are described in the corresponding referenced sections. For the oracle evaluation, we use the gold science fact f associated with each question and, optionally, the additional fact k provided by the question author. Bold denotes the best Test score in each category
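The grounding step described in the Table 3 caption can be pictured with a tiny string-substitution sketch. This is purely illustrative (the helper below is hypothetical, not the authors' reasoning procedure): the science fact is treated as an implication lhs → rhs, and the common-knowledge fact supplies the substitution that specializes both sides.

    # Hypothetical illustration of grounding an implication rule with a
    # common-knowledge substitution; not the paper's reasoning engine.
    def ground(lhs, rhs, substitution):
        for general, specific in substitution.items():
            lhs = lhs.replace(general, specific)
            rhs = rhs.replace(general, specific)
        return f"As {lhs}, {rhs}"

    print(ground(
        lhs="source of light becomes closer",
        rhs="source will appear brighter",
        substitution={"source of light": "headlights of the car",
                      "source": "headlights"},
    ))
    # -> "As headlights of the car becomes closer, headlights will appear brighter"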
Related work
  • By construction, answering OpenBookQA questions requires (i) some base science facts from a provided ‘open book’, (ii) broader understanding about the world (common or commonsense knowledge), and (iii) an ability to combine these facts (reasoning). This setup differs from several existing QA tasks, as summarized below.

    Reading Comprehension (RC) datasets have been proposed as benchmarks to evaluate the ability of systems to understand a document by answering factoid-style questions over this document. These datasets have taken various forms: multiple-choice (Richardson et al., 2013), cloze-style (Hermann et al., 2015; Onishi et al., 2016; Hill et al., 2016), and span prediction (Rajpurkar et al., 2016; Trischler et al., 2017; Joshi et al., 2017). However, analysis of these datasets (Chen et al., 2016; Sugawara et al., 2017) has shown that many of the questions can be solved with context token matching (Chen et al., 2017a; Weissenborn et al., 2017) or relatively simple paraphrasing.

    To focus on the more challenging problem of reasoning across sentences, new datasets have been proposed for multi-step RC. QAngaroo (Welbl et al., 2018) uses a knowledge base to identify entity pairs (s, o) with a known relation r that is also supported by a multi-hop path in a set of documents. It poses structured tuple queries (s, r, ?) and uses all the documents along the path as the input passage. NarrativeQA (Kocisky et al., 2017) is an RC dataset that has been shown to require iterative reasoning about the narrative of a story. Similar to OpenBookQA, its questions were generated to ensure that the answer is not a direct match or paraphrase that can be retrieved with an IR approach. Most recently, Khashabi et al. (2018) proposed MultiRC, a multiple-choice RC dataset that is designed to require multi-sentence reasoning and can have multiple correct answers. Again, like most RC datasets, it is self-contained.
Funding
  • Motivated by recent findings on the gameability of NLP datasets (Gururangan et al., 2018), we also develop and evaluate simple, attention-based neural baselines, including a plausible-answer detector (which ignores the question text completely) and an odd-one-out solver. These highlight the inevitable human bias in any crowdsourced dataset, reaching a performance of 48% on OpenBookQA
  • An oracle experiment using the gold science fact f for each question q (marked by the question author) and the additional knowledge k needed for q provides valuable insight into the nature of this dataset: facts from the open book F are valuable (a 5% improvement) but not sufficient. Using both f and k increases the accuracy to 76%, which is still far from human-level performance, suggesting the need for non-trivial reasoning to combine these facts
  • If we had kept only those questions that all 5 workers answered correctly, it would clearly be inaccurate to claim that the human accuracy on Q is 100%
  • The true human accuracy on Q is at least H(Q) − 3% with probability over 98.8%, and at least H(Q) − 2.5% with probability 95.6%; we report the former as our conservative estimate of human performance
  • The third group of results suggests that adding F to pre-trained models has a mixed effect, improving TupleInference by 8.7% but not changing DGEM
  • The “plausible answer detector” can predict the correct answer with 49.6% accuracy without even looking at the question
  • For Question Match and ESIM, we also experiment with ELMo (Peters et al., 2018), which improved their Test scores by 0.4% and 1.8%, respectively
  • When we also include facts retrieved from WordNet (Miller et al., 1990), the score improves by about 0.5% (a retrieval sketch in this spirit follows this list)
  • While simple neural methods are able to achieve an accuracy of about 50%, this is still far from the human performance of 92% on this task
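As a sketch of what such external retrieval can look like, the snippet below pulls glosses and hypernym facts from WordNet via NLTK for the words of a question or answer choice. This assumes NLTK's WordNet interface as the knowledge source and is not the paper's exact retrieval pipeline (which also draws on ConceptNet); the function name and parameters are illustrative.

    # Hedged sketch: gather simple WordNet "facts" (glosses and hypernyms) for
    # each word in a piece of text, roughly in the spirit of the external
    # knowledge retrieval described above. Requires: pip install nltk
    import nltk
    from nltk.corpus import wordnet as wn

    nltk.download("wordnet", quiet=True)  # one-time corpus download

    def wordnet_facts(text, max_senses=2):
        facts = []
        for word in sorted(set(text.lower().split())):
            for synset in wn.synsets(word)[:max_senses]:
                facts.append(f"{word}: {synset.definition()}")  # gloss as a fact
                for hyper in synset.hypernyms():                # is-a fact
                    name = hyper.lemma_names()[0].replace("_", " ")
                    facts.append(f"{word} is a kind of {name}")
        return facts

    print(wordnet_facts("steel spoon")[:4])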
Study subjects and analysis
new crowdworkers: 5
6. Question qmc is then shown to 5 new crowdworkers, who are asked to answer it.

workers: 5
7. If at least 4 out of 5 workers answer qmc correctly, it is deemed answerable and the process continues. If not, qmc is discarded

independent samples: 5
(Footnote: choice ‘A’ was the correct answer in 69% of the questions at the end of Step 4.) This process is equivalent to obtaining 5 independent samples X_{q,i}, i ∈ I, |I| = 5, from the Bernoulli distribution B(p_q). We must, however, be careful when using these data to estimate p_q, as the same 5 samples were used to decide whether q makes it into the question set Q or not

workers: 5
For instance, if we had kept only those questions that all 5 workers answered correctly, it would clearly be inaccurate to claim that the human accuracy on Q is 100%. Nevertheless, it is possible to re-use the judgments from Step 6 to approximate H(Q) with high confidence, without posing the questions to new workers; a generic illustration of the underlying concentration bound follows
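The quoted confidence figures rest on a concentration bound; Hoeffding (1963) is the inequality cited by the paper. The snippet below is a generic illustration of a one-sided Hoeffding bound, not a reproduction of the paper's exact derivation (which additionally accounts for the at-least-4-of-5 selection step); the sample size used here is a hypothetical placeholder.

    # One-sided Hoeffding bound for i.i.d. {0,1} judgments:
    #   P( empirical_mean - true_accuracy >= t ) <= exp(-2 * n * t**2)
    # so the true accuracy is at least (empirical_mean - t) with the confidence below.
    import math

    def hoeffding_confidence(n, t):
        return 1.0 - math.exp(-2.0 * n * t * t)

    # Hypothetical n = 2500 judgments (e.g., 500 questions x 5 workers each):
    print(f"{hoeffding_confidence(2500, 0.03):.3f}")   # ~0.989
    print(f"{hoeffding_confidence(2500, 0.025):.3f}")  # ~0.956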

Reference
  • M. Banko, M. J. Cafarella, S. Soderland, M. Broadhead, and O. Etzioni. 2007. Open information extraction from the web. In IJCAI.
  • D. Chen, J. Bolton, and C. D. Manning. 2016. A thorough examination of the CNN/Daily Mail reading comprehension task. In ACL, pages 2358–2367.
  • D. Chen, A. Fisch, J. Weston, and A. Bordes. 2017a. Reading Wikipedia to answer open-domain questions. In ACL.
  • Q. Chen, X. Zhu, Z.-H. Ling, S. Wei, H. Jiang, and D. Inkpen. 2017b. Enhanced LSTM for natural language inference. In ACL, pages 1657–1668.
  • P. Clark, I. Cowhey, O. Etzioni, T. Khot, A. Sabharwal, C. Schoenick, and O. Tafjord. 2018. Think you have solved question answering? Try ARC, the AI2 Reasoning Challenge. CoRR, abs/1803.05457.
  • P. Clark, O. Etzioni, T. Khot, A. Sabharwal, O. Tafjord, P. D. Turney, and D. Khashabi. 2016. Combining retrieval, statistics, and inference to answer elementary science questions. In AAAI, pages 2580–2586.
  • A. Conneau, D. Kiela, H. Schwenk, L. Barrault, and A. Bordes. 2017. Supervised learning of universal sentence representations from natural language inference data. In EMNLP, pages 670–680.
  • M. Gardner, J. Grus, M. Neumann, O. Tafjord, P. Dasigi, N. F. Liu, M. Peters, M. Schmitz, and L. S. Zettlemoyer. 2017. AllenNLP: A deep semantic natural language processing platform. CoRR, abs/1803.07640.
  • S. Gururangan, S. Swayamdipta, O. Levy, R. Schwartz, S. R. Bowman, and N. A. Smith. 2018. Annotation artifacts in natural language inference data. In NAACL.
  • K. M. Hermann, T. Kocisky, E. Grefenstette, L. Espeholt, W. Kay, M. Suleyman, and P. Blunsom. 2015. Teaching machines to read and comprehend. In NIPS, pages 1693–1701.
  • F. Hill, A. Bordes, S. Chopra, and J. Weston. 2016. The Goldilocks principle: Reading children’s books with explicit memory representations. In ICLR.
  • W. Hoeffding. 1963. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13–30.
  • P. Jansen, N. Balasubramanian, M. Surdeanu, and P. Clark. 2016. What’s in an explanation? Characterizing knowledge and inference requirements for elementary science exams. In COLING.
  • P. A. Jansen, E. Wainwright, S. Marmorstein, and C. T. Morrison. 2018. WorldTree: A corpus of explanation graphs for elementary science questions supporting multi-hop inference. In LREC.
  • T. Jenkins. 1995. Open book assessment in computing degree programmes. Technical Report 95.28, University of Leeds.
  • M. Joshi, E. Choi, D. Weld, and L. Zettlemoyer. 2017. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. In ACL, pages 1601–1611.
  • A. Kembhavi, M. J. Seo, D. Schwenk, J. Choi, A. Farhadi, and H. Hajishirzi. 2017. Are you smarter than a sixth grader? Textbook question answering for multimodal machine comprehension. In CVPR, pages 5376–5384.
  • D. Khashabi, S. Chaturvedi, M. Roth, S. Upadhyay, and D. Roth. 2018. Looking beyond the surface: A challenge set for reading comprehension over multiple sentences. In NAACL.
  • D. Khashabi, T. Khot, A. Sabharwal, P. Clark, O. Etzioni, and D. Roth. 2016. Question answering via integer programming over semi-structured knowledge. In IJCAI.
  • T. Khot, A. Sabharwal, and P. Clark. 2017. Answering complex questions using open information extraction. In ACL.
  • T. Khot, A. Sabharwal, and P. Clark. 2018. SciTail: A textual entailment dataset from science question answering. In AAAI.
  • D. P. Kingma and J. L. Ba. 2015. Adam: A method for stochastic optimization. In ICLR.
  • T. Kocisky, J. Schwarz, P. Blunsom, C. Dyer, K. M. Hermann, G. Melis, and E. Grefenstette. 2017. The NarrativeQA reading comprehension challenge. CoRR, abs/1712.07040.
  • J. Landsberger. 1996. Study guides and strategies. http://www.studygs.net/tsttak7.htm.
  • T. Mihaylov and A. Frank. 2016. Discourse relation sense classification using cross-argument semantic similarity based on word embeddings. In the CoNLL-16 shared task, pages 100–107.
  • T. Mihaylov and A. Frank. 2017. Story cloze ending selection baselines and data examination. In the LSDSem Shared Task.
  • T. Mihaylov and A. Frank. 2018. Knowledgeable Reader: Enhancing cloze-style reading comprehension with external commonsense knowledge. In ACL, pages 821–832.
  • T. Mihaylov and P. Nakov. 2016. SemanticZ at SemEval-2016 Task 3: Ranking relevant answers in community question answering using semantic similarity based on fine-tuned word embeddings. In SemEval ’16.
  • G. A. Miller. 1995. WordNet: A lexical database for English. Communications of the ACM, 38(11):39–41.
  • G. A. Miller, R. Beckwith, C. Fellbaum, D. Gross, and K. J. Miller. 1990. Introduction to WordNet: An online lexical database. International Journal of Lexicography, 3(4):235–244.
  • B. D. Mishra, L. Huang, N. Tandon, W.-t. Yih, and P. Clark. 2018. Tracking state changes in procedural text: A challenge dataset and models for process paragraph comprehension. In NAACL.
  • N. Mostafazadeh, N. Chambers, X. He, D. Parikh, D. Batra, L. Vanderwende, P. Kohli, and J. Allen. 2016. A corpus and evaluation framework for deeper understanding of commonsense stories. In NAACL.
  • P. Nakov, L. Marquez, A. Moschitti, W. Magdy, H. Mubarak, A. A. Freihat, J. Glass, and B. Randeree. 2016. SemEval-2016 Task 3: Community question answering. In SemEval ’16, pages 525–545.
  • T. Onishi, H. Wang, M. Bansal, K. Gimpel, and D. McAllester. 2016. Who did what: A large-scale person-centered cloze dataset. In EMNLP, pages 2230–2235.
  • A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. 2017. Automatic differentiation in PyTorch. In NIPS-W.
  • F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830.
  • J. Pennington, R. Socher, and C. Manning. 2014. GloVe: Global vectors for word representation. In EMNLP, pages 1532–1543.
  • M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer. 2018. Deep contextualized word representations. In NAACL.
  • P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In EMNLP, pages 2383–2392.
  • M. Richardson, C. J. Burges, and E. Renshaw. 2013. MCTest: A challenge dataset for the open-domain machine comprehension of text. In EMNLP, pages 193–203.
  • P. Singh, T. Lin, E. Mueller, G. Lim, T. Perkins, and W. Zhu. 2002. Open Mind Common Sense: Knowledge acquisition from the general public. In Lecture Notes in Computer Science, volume 2519, pages 1223–1237.
  • R. Speer, J. Chin, and C. Havasi. 2017. ConceptNet 5.5: An open multilingual graph of general knowledge. In AAAI.
  • K. Stasaski and M. A. Hearst. 2017. Multiple choice question generation utilizing an ontology. In BEA@EMNLP, 12th Workshop on Innovative Use of NLP for Building Educational Applications.
  • S. Sugawara, H. Yokono, and A. Aizawa. 2017. Prerequisite skills for reading comprehension: Multi-perspective analysis of MCTest datasets and systems. In AAAI, pages 3089–3096.
  • A. Trischler, T. Wang, X. Yuan, J. Harris, A. Sordoni, P. Bachman, and K. Suleman. 2017. NewsQA: A machine comprehension dataset. In Proceedings of the 2nd Workshop on Representation Learning for NLP, pages 191–200.
  • P. D. Turney. 2017. Leveraging term banks for answering complex questions: A case for sparse vectors. CoRR, abs/1704.03543.
  • D. Weissenborn, G. Wiese, and L. Seiffe. 2017. Making neural QA as simple as possible but not simpler. In CoNLL, pages 271–280.
  • J. Welbl, P. Stenetorp, and S. Riedel. 2018. Constructing datasets for multi-hop reading comprehension across documents. TACL.
  • Y. Zhang, H. Dai, K. Toraman, and L. Song. 2018. KG^2: Learning to reason science exam questions with contextual knowledge graph embeddings. arXiv preprint.