Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

Peter Clark
Isaac Cowhey
Oren Etzioni
Tushar Khot
Ashish Sabharwal
Carissa Schoenick
Oyvind Tafjord

arXiv preprint arXiv:1803.05457, 2018.


Abstract:

We present a new question set, text corpus, and baselines assembled to encourage AI research in advanced question answering. Together, these constitute the AI2 Reasoning Challenge (ARC), which requires far more powerful knowledge and reasoning than previous challenges such as SQuAD or SNLI. The ARC question set is partitioned into a Challenge Set and an Easy Set, where the former contains only questions answered incorrectly by both a retrieval-based algorithm and a word co-occurrence algorithm. We are also including a corpus of 14M science sentences relevant to the task, and an implementation of the three neural baseline models tested. We pose ARC as a challenge to the community.

Introduction
  • Datasets are increasingly driving progress in AI, resulting in impressive solutions to several question-answering (QA) tasks (e.g., Rajpurkar et al, 2016; Joshi et al, 2017).
  • Standardized tests have previously been proposed as a Grand Challenge for AI (Brachman, 2005; Clark and Etzioni, 2016) as they involve a wide variety of linguistic and inferential phenomena, have varying levels of difficulty, and are measurable, motivating, and ambitious
  • Making this challenge a reality is difficult, as such questions are hard to obtain
Highlights
  • Datasets are increasingly driving progress in AI, resulting in impressive solutions to several question-answering (QA) tasks (e.g., Rajpurkar et al, 2016; Joshi et al, 2017)
  • The most striking observation is that none of the algorithms score significantly higher than the random baseline on the Challenge set, where the 95% confidence interval is ±2.5%
  • Datasets have become highly influential in driving the direction of research
  • Recent datasets for QA have led to impressive advances, but have focused on factoid questions where surface-level cues alone are sufficient to find an answer, discouraging progress on questions requiring reasoning or other advanced methods
  • To help the field move towards more difficult tasks, we have presented the AI2 Reasoning Challenge (ARC), consisting of a new question set, text corpus, and baselines, whose Challenge partition is hard for retrieval and co-occurrence methods (a sketch of this partitioning rule follows this list)
  • We find that none of the baseline systems tested can significantly outperform a random baseline on the Challenge set, including two neural models that achieve high performance on SNLI and SQuAD
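A minimal sketch of this Challenge/Easy partitioning rule, in Python, assuming each question is represented as a dict with an `answer_key` field and that each baseline solver returns a predicted answer label; these interfaces are placeholders for illustration, not the paper's actual code.

```python
from typing import Callable, Dict, List, Tuple

Question = Dict[str, str]           # e.g. {"stem": ..., "answer_key": "B", ...}
Solver = Callable[[Question], str]  # returns a predicted answer label

def partition_questions(
    questions: List[Question],
    ir_solver: Solver,            # retrieval-based baseline (placeholder)
    cooccurrence_solver: Solver,  # word co-occurrence baseline (placeholder)
) -> Tuple[List[Question], List[Question]]:
    """Split questions into (challenge_set, easy_set) per the rule above:
    a question goes to the Challenge Set only if BOTH solvers answer it
    incorrectly; otherwise it goes to the Easy Set."""
    challenge, easy = [], []
    for q in questions:
        ir_wrong = ir_solver(q) != q["answer_key"]
        cooc_wrong = cooccurrence_solver(q) != q["answer_key"]
        (challenge if ir_wrong and cooc_wrong else easy).append(q)
    return challenge, easy
```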
Methods
  • Example question: "Scientists perform experiments to test hypotheses. How do scientists try to remain objective during experiments?"
  • This question is placed in the Challenge Set, as it appears to require a more advanced answering method than surface-level association (a sketch of such an association scorer follows this list).
  • Question Types
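To illustrate why surface-level association can fall short on a question like the one above, here is a minimal sketch of a word co-occurrence scorer in the spirit of a PMI-based baseline (cf. Church and Hanks, 1989); the corpus-count dictionary, the add-one smoothing, and the averaging scheme are simplifying assumptions, not the paper's implementation.

```python
import math
from itertools import product
from typing import Dict, Iterable, Tuple, Union

# Corpus statistics: counts for single terms and (term, term) co-occurrence pairs.
CountKey = Union[str, Tuple[str, str]]

def pmi(count_xy: int, count_x: int, count_y: int, total: int) -> float:
    """Pointwise mutual information, log[p(x, y) / (p(x) p(y))],
    with add-one smoothing so unseen pairs do not produce log(0)."""
    p_xy = (count_xy + 1) / total
    p_x = (count_x + 1) / total
    p_y = (count_y + 1) / total
    return math.log(p_xy / (p_x * p_y))

def score_option(
    question_terms: Iterable[str],
    option_terms: Iterable[str],
    counts: Dict[CountKey, int],
    total: int,
) -> float:
    """Average PMI between question terms and answer-option terms;
    the option with the highest average association would be selected."""
    pairs = list(product(question_terms, option_terms))
    return sum(
        pmi(counts.get((q, o), 0), counts.get(q, 0), counts.get(o, 0), total)
        for q, o in pairs
    ) / max(len(pairs), 1)

# For the example question, words like "scientists" and "experiments" may
# co-occur comparably with several answer options, so association alone
# provides little signal and a more advanced answering method is needed.
```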
Results
  • The slightly above-zero score is due to the solver occasionally picking multiple answers, resulting in partial credit on a few questions.
  • The authors include these questions in the Challenge set.
  • The most striking observation is that none of the algorithms score significantly higher than the random baseline on the Challenge set, where the 95% confidence interval is ±2.5% (a rough check of this interval follows this list)
  • Their performance on the Easy set is generally between 55% and 65%.
  • This highlights the different nature and difficulty of the Challenge set
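As a rough sanity check of the reported ±2.5% interval, the half-width of a normal-approximation 95% confidence interval for a random-guessing baseline can be computed as below; the Challenge test-set size of roughly 1,172 questions and the assumption of 4-way multiple choice (p = 0.25) are taken as assumptions here, since the paper's Table 1 contents are not reproduced on this page.

```python
import math

def random_baseline_ci_halfwidth(n_questions: int, p: float = 0.25, z: float = 1.96) -> float:
    """Half-width of a normal-approximation 95% confidence interval for an
    accuracy p estimated over n_questions independent multiple-choice questions."""
    return z * math.sqrt(p * (1 - p) / n_questions)

# Assuming ~1,172 Challenge test questions and 4 answer options (p = 0.25):
print(f"±{100 * random_baseline_ci_halfwidth(1172):.1f}%")  # prints ±2.5%
```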
Conclusion
  • Recent datasets for QA have led to impressive advances, but have focused on factoid questions where surface-level cues alone are sufficient to find an answer, discouraging progress on questions requiring reasoning or other advanced methods.
  • To help the field move towards more difficult tasks, the authors have presented the AI2 Reasoning Challenge (ARC), consisting of a new question set, text corpus, and baselines, whose Challenge partition is hard for retrieval and co-occurrence methods.
  • To access ARC, view the leaderboard, and submit new entries, visit the ARC Website at http://data.allenai.org/arc
Tables
  • Table 1: Number of questions in the ARC partitions
  • Table 2: Grade-level distribution of ARC questions (the Challenge and Easy Sets have a similar distribution of grade levels, as each grade level contains a mixture of easy and difficult questions)
  • Table 3: Properties of the ARC Dataset
  • Table 4: Types of knowledge suggested by ARC Challenge Set questions
  • Table 5: Types of reasoning suggested by ARC Challenge Set questions
  • Table 6: Performance of the different baseline systems. Scores are reported as percentages on the test sets. For up-to-date results, see the ARC leaderboard at http://data.allenai.org/arc
  • Table 7: The various question sources for ARC
Related work
  • There are numerous datasets available to drive progress in question-answering. Earlier reading comprehension datasets, e.g., MCTest (Richardson, 2013), SQuAD (Rajpurkar et al, 2016), NewsQA (Trischler et al, 2016), and CNN/DailyMail (Hermann et al, 2015), contained questions whose answers could be determined from surface-level cues alone (i.e., answers were “explicitly stated”). TriviaQA (Joshi et al, 2017) broadened this task by providing several articles with a question, and used questions authored independently of the articles. Again, though, the questions were largely factoid-style, e.g., “Who won the Nobel Peace Prize in 2009?”. Although systems can now perform well on these datasets, even matching human performance (Simonite, 2018), they can be easily fooled (Jia and Liang, 2017); the degree to which they truly understand language or domain-specific concepts remains unclear.

    To push towards more complex QA tasks, one approach has been to generate synthetic datasets, the most notable example being the bAbI dataset (Weston et al, 2015). bAbI was generated using a simple world simulator and language generator, producing data for 20 different tasks. It has stimulated work on the use of memory-network neural architectures (Weston et al, 2014), supporting a form of multi-step reasoning where a neural memory propagates information from one step to another (e.g., Henaff et al, 2016; Seo et al, 2017a). However, its use of synthetic text and a synthetic world limits the realism and difficulty of the task, with many systems scoring a perfect 100% on most tasks (e.g., Weston et al, 2014). In general, a risk of using large synthetic QA datasets is that neural methods are remarkably powerful at “reverse-engineering” the process by which a dataset was generated, or picking up on its idiosyncrasies to excel at it, without necessarily advancing language understanding or reasoning.
References
  • R. Brachman. Selected grand challenges in cognitive science. Technical Report 05-1218, MITRE, 2005.
  • K. W. Church and P. Hanks. Word association norms, mutual information and lexicography. In 27th ACL, pp. 76–83, 1989.
  • P. Clark and O. Etzioni. My computer is an honor student but how intelligent is it? Standardized tests as a measure of AI. AI Magazine, 2016.
  • P. Clark, O. Etzioni, T. Khot, A. Sabharwal, O. Tafjord, P. D. Turney, and D. Khashabi. Combining retrieval, statistics, and inference to answer elementary science questions. In AAAI, pp. 2580–2586, 2016.
  • E. Davis. How to write science questions that are easy for people and hard for computers. AI Magazine, 37(1):13–22, 2016.
  • A. Fujita, A. Kameda, A. Kawazoe, and Y. Miyao. Overview of Todai robot project and evaluation framework of its NLP-based problem solving. In Proc. LREC'14, 2014.
  • S. Gururangan, S. Swayamdipta, O. Levy, R. Schwartz, S. Bowman, and N. A. Smith. Annotation artifacts in natural language inference data. In Proc. NAACL, 2018.
  • M. Henaff, J. Weston, A. Szlam, A. Bordes, and Y. LeCun. Tracking the world state with recurrent entity networks. CoRR, abs/1612.03969, 2016.
  • K. M. Hermann, T. Kocisky, E. Grefenstette, L. Espeholt, W. Kay, M. Suleyman, and P. Blunsom. Teaching machines to read and comprehend. In Advances in Neural Information Processing Systems, pp. 1693–1701, 2015.
  • R. Jia and P. Liang. Adversarial examples for evaluating reading comprehension systems. In Proc. EMNLP'17, 2017.
  • M. Joshi, E. Choi, D. S. Weld, and L. Zettlemoyer. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. In Proc. ACL'17, 2017.
  • A. Kembhavi, M. Seo, D. Schwenk, J. Choi, A. Farhadi, and H. Hajishirzi. Are you smarter than a sixth grader? Textbook question answering for multimodal machine comprehension. In CVPR, 2017.
  • D. Khashabi, T. Khot, A. Sabharwal, P. Clark, O. Etzioni, and D. Roth. Question answering via integer programming over semi-structured knowledge. In IJCAI, 2016.
  • D. Khashabi, T. Khot, A. Sabharwal, and D. Roth. Question answering as global reasoning over semantic abstractions. In AAAI, 2018.
  • T. Khot, A. Sabharwal, and P. Clark. Answering complex questions using open information extraction. In ACL, 2017.
  • T. Khot, A. Sabharwal, and P. Clark. SciTail: A textual entailment dataset from science question answering. In AAAI, 2018.
  • NII. Evaluation of information access technologies. In Proc. 13th NTCIR Conf., 2017.
  • A. P. Parikh, O. Tackstrom, D. Das, and J. Uszkoreit. A decomposable attention model for natural language inference. In EMNLP, 2016.
  • P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang. SQuAD: 100,000+ questions for machine comprehension of text. In Proc. EMNLP'16, 2016.
  • M. Richardson. MCTest: A challenge dataset for the open-domain machine comprehension of text. In EMNLP'13, 2013.
  • C. Schoenick, P. Clark, O. Tafjord, P. Turney, and O. Etzioni. Moving beyond the Turing test with the Allen AI science challenge. Communications of the ACM, 60(9):60–64, 2017.
  • M. Seo, S. Min, A. Farhadi, and H. Hajishirzi. Query-reduction networks for question answering. In ICLR, 2017a.
  • M. J. Seo, H. Hajishirzi, A. Farhadi, and O. Etzioni. Diagram understanding in geometry questions. In AAAI, pp. 2831–2838, 2014.
  • M. J. Seo, A. Kembhavi, A. Farhadi, and H. Hajishirzi. Bidirectional attention flow for machine comprehension. In Proc. ICLR'17, 2017b.
  • T. Simonite. AI beat humans at reading! Maybe not. WIRED, 2018.
  • E. Strickland. Can an AI get into the University of Tokyo? IEEE Spectrum, August 2013.
  • A. Trischler, T. Wang, X. Yuan, J. Harris, A. Sordoni, P. Bachman, and K. Suleman. NewsQA: A machine comprehension dataset. arXiv preprint arXiv:1611.09830, 2016.
  • J. Welbl, N. F. Liu, and M. Gardner. Crowdsourcing multiple choice science questions. In Workshop on Noisy User-generated Text, 2017a.
  • J. Welbl, P. Stenetorp, and S. Riedel. Constructing datasets for multi-hop reading comprehension across documents. arXiv preprint arXiv:1710.06481, 2017b.
  • J. Weston, A. Bordes, S. Chopra, A. M. Rush, B. van Merrienboer, A. Joulin, and T. Mikolov. Towards AI-complete question answering: A set of prerequisite toy tasks. arXiv preprint arXiv:1502.05698, 2015.
  • J. Weston, S. Chopra, and A. Bordes. Memory networks. arXiv preprint arXiv:1410.3916, 2014.