ISAAQ - Mastering Textbook Questions with Pre-trained Transformers and Bottom-Up and Top-Down Attention

Conference on Empirical Methods in Natural Language Processing (2020): 5469-5479

Abstract

Textbook Question Answering is a complex task in the intersection of Machine Comprehension and Visual Question Answering that requires reasoning with multimodal information from text and diagrams. For the first time, this paper taps on the potential of transformer language models and bottom-up and top-down attention to tackle the language...

Introduction
  • Within NLP, machine understanding of textbooks is one of the grand AI challenges. As originally put by Reddy (1988): "Reading a chapter in a college freshman text (say physics or accounting) and answering the questions at the end of the chapter is a hard (AI) problem that requires advances in vision, language, problem-solving, and learning theory."
  • Information retrieval techniques used to obtain background information from the text are usually keyword-based and potentially oblivious to the different artifacts of language, such as morphological variations, conjugations, terms that may be semantically related to the question, synonyms, hypernyms, or multi-word expressions, which are frequent in the domains of the TQA dataset.
  • To address such shortcomings, the authors extend classic information retrieval approaches with pre-trained models that leverage the language understanding capabilities of transformer language models.
  • The authors concatenate the selected sentences following their ranking to compose a text passage with the desired background knowledge; a minimal sketch of this retrieval-and-concatenation step follows.
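A minimal sketch of this step, assuming a generic cross-encoder from the sentence-transformers library as a stand-in for the paper's transformer-based retrievers; the checkpoint name, word budget, and helper function are illustrative, not the authors' actual configuration.

```python
# Hypothetical sketch: rank candidate sentences against "question + answer option"
# with a cross-encoder, then concatenate the top-ranked sentences into a passage.
from sentence_transformers import CrossEncoder  # assumed dependency, not from the paper

relevance_model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # illustrative checkpoint

def build_background_passage(question, answer_option, candidate_sentences, max_words=180):
    """Score each sentence against the question/option pair and concatenate the
    highest-scoring sentences, in ranking order, until the word budget is reached."""
    query = f"{question} {answer_option}"
    scores = relevance_model.predict([(query, s) for s in candidate_sentences])
    ranked = [s for _, s in sorted(zip(scores, candidate_sentences),
                                   key=lambda pair: pair[0], reverse=True)]
    passage, used = [], 0
    for sentence in ranked:
        n = len(sentence.split())
        if used + n > max_words:
            break
        passage.append(sentence)
        used += n
    return " ".join(passage)
```

The 180-word budget loosely mirrors the maximum input sequence length reported for the MC solvers in the results section; any limit compatible with the downstream transformer would work.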
Highlights
  • Within NLP, machine understanding of textbooks is one of the grand AI challenges
  • Textbook Question Answering (TQA) is rich with diagrams that describe potentially complex concepts, such as photosynthesis, the trophic chain, and the cycle of water, which are hard to represent as a single natural image
  • We show that bottom-up and top-down (BUTD) attention, originally proposed for tasks like image captioning and visual question answering with natural images, can be effectively adapted to propose regions of interest in the diagram that are relevant to the question at hand, enabling the identification of diagram constituents and their relationships (a minimal attention sketch follows this list)
  • This results in three text background retrievers. The Information Retrieval (IR) retriever searches the whole TQA corpus to check whether question q, together with an answer option, is explicitly stated in the text
  • This paper reports on ISAAQ, the first system to achieve accuracies above 80%, 70% and 55% on TQA true/false, text and diagram MC questions
  • ISAAQ demonstrates that it is possible to master the grand AI challenge of machine textbook understanding based on modern methods for language and visual understanding, with modest infrastructure requirements
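A minimal sketch of the top-down attention step referenced above, written in plain PyTorch: given bottom-up region-of-interest features and a question embedding, it scores each region, normalizes the scores with a softmax, and returns the attention-weighted sum of region features. The dimensions and single-layer projections are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class TopDownAttention(nn.Module):
    """Attend over bottom-up region features conditioned on a question vector."""
    def __init__(self, region_dim=2048, question_dim=768, hidden_dim=512):
        super().__init__()
        self.region_proj = nn.Linear(region_dim, hidden_dim)
        self.question_proj = nn.Linear(question_dim, hidden_dim)
        self.score = nn.Linear(hidden_dim, 1)

    def forward(self, regions, question):
        # regions: (batch, num_regions, region_dim); question: (batch, question_dim)
        joint = torch.tanh(self.region_proj(regions) +
                           self.question_proj(question).unsqueeze(1))
        weights = torch.softmax(self.score(joint).squeeze(-1), dim=-1)  # (batch, num_regions)
        attended = (weights.unsqueeze(-1) * regions).sum(dim=1)         # (batch, region_dim)
        return attended, weights

# Toy usage: 36 proposed diagram regions with 2048-d features and one question embedding.
attention = TopDownAttention()
region_features = torch.randn(1, 36, 2048)
question_embedding = torch.randn(1, 768)
attended_features, region_weights = attention(region_features, question_embedding)
```

The highest-weighted regions play the role of the question-relevant diagram constituents discussed above.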
Methods
  • For each question the authors propose different retrievers to extract relevant language and visual background knowledge from the textbook.
  • Note that the authors consider both approaches based on conventional information retrieval techniques and approaches that leverage transformers pre-trained on specific tasks.
  • The retrieved background is provided along with the question and candidate answers to the solvers.
  • The authors ensemble different solvers resulting from fine-tuning one or several transformers on a multiple-choice classification task, which can be combined with others based, e.g., on information retrieval; a minimal solver sketch follows this list
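A minimal sketch of one such solver, assuming the Hugging Face transformers library; the checkpoint, example data, and sequence length are illustrative, not the authors' exact training setup (their solvers are additionally pre-trained on related MC datasets before fine-tuning on TQA).

```python
import torch
from transformers import AutoTokenizer, AutoModelForMultipleChoice

tokenizer = AutoTokenizer.from_pretrained("roberta-base")           # illustrative checkpoint
model = AutoModelForMultipleChoice.from_pretrained("roberta-base")  # MC head still untrained here

background = "Photosynthesis converts light energy into chemical energy stored in glucose."
question = "What kind of energy does photosynthesis store?"
options = ["chemical energy", "sound energy", "nuclear energy", "elastic energy"]

# Pair the shared context (retrieved background + question) with every answer option.
contexts = [f"{background} {question}"] * len(options)
encoded = tokenizer(contexts, options, padding=True, truncation=True,
                    max_length=180, return_tensors="pt")
# The multiple-choice head expects inputs shaped (batch, num_choices, seq_len).
inputs = {name: tensor.unsqueeze(0) for name, tensor in encoded.items()}

with torch.no_grad():
    logits = model(**inputs).logits        # (1, num_choices)
predicted_option = options[logits.argmax(dim=-1).item()]
```

Each fine-tuned solver of this kind produces per-option scores that can then be ensembled with the scores of other solvers, e.g. the IR-based one.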
Results
  • 5.1 Experimental settings

    The authors' approach is rather frugal in terms of hardware: all training and evaluation was done on a single server with 32 GB of RAM, a 1 TB SSD, and a single GeForce RTX 2080 Ti GPU.
  • The authors apply a Pareto analysis to select maximum input sequence lengths of 64 tokens for true/false questions and 180 for text and diagram MC.
  • The authors take Adam (Kingma and Ba, 2014) with a linearly decayed learning rate and warm-up as in (Devlin et al, 2018) and empirically select peak learning rates in the range [1e−6, 5e−5], with 1e−5 for true/false and text MC questions and 1e−6 for diagram MC (see the optimizer sketch after this list).
  • Training time per epoch is 1 minute for true/false questions, 30 minutes for text MC, and 60 minutes for diagram MC
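A minimal sketch of this optimization setup (an Adam-style optimizer with warm-up followed by linear learning-rate decay), using the Hugging Face transformers scheduler helper. The stand-in model, step counts, and 10% warm-up fraction are assumptions for illustration; only the 1e-5 peak learning rate comes from the settings above.

```python
import torch
from transformers import get_linear_schedule_with_warmup

model = torch.nn.Linear(768, 4)  # stand-in for the fine-tuned transformer (hypothetical)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)  # peak LR for true/false and text MC
total_steps = 100                                            # assumed, for illustration
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.1 * total_steps),  # assumed 10% warm-up
    num_training_steps=total_steps,
)
loss_fn = torch.nn.CrossEntropyLoss()

for step in range(total_steps):
    features = torch.randn(8, 768)        # dummy batch of pooled encodings
    labels = torch.randint(0, 4, (8,))    # dummy gold answer indices
    loss = loss_fn(model(features), labels)
    loss.backward()
    optimizer.step()
    scheduler.step()                      # learning rate warms up, then decays linearly
    optimizer.zero_grad()
```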
Conclusion
  • This paper reports on ISAAQ, the first system to achieve accuracies above 80%, 70% and 55% on TQA true/false, text and diagram MC questions.
  • ISAAQ demonstrates that it is possible to master the grand AI challenge of machine textbook understanding based on modern methods for language and visual understanding, with modest infrastructure requirements.
  • Key to this success are transformers, BUTD attention, pre-training on related datasets, and the selection of complementary background information to train and ensemble different solvers.
  • Additional effort will be needed in activities like the development of large diagram datasets, including the semantic annotation of diagram constituents and connectors, and annotating diagram questions with the reasoning and knowledge types required to answer them
Tables
  • Table 1: Dataset partition sizes (#questions)
  • Table 2: ISAAQ performance and comparison (validation set) with previous SotA for the TQA dataset
  • Table 3: ISAAQ vs. SotA in pre-training datasets (test)
  • Table 4: Results of each of our solvers and the overall ISAAQ model for TQA true/false questions
  • Table 5: Individual text MC solvers and ISAAQ. Note the large delta vs. IR solver baseline (also in Table 6). Pre-training on RACE, OBQA, ARC-Easy/Challenge
  • Table 6: Individual diagram MC solvers and ISAAQ. Pre-training on VQA abstract scenes and AI2D
  • Table 7: ISAAQ ablations for diagram MC
  • Table 8: Study of the attention on question diagrams. Examples (earth, life sciences, physics) from validation set
Related Work
  • In (Kembhavi et al, 2016) several TQA baselines were proposed that were based on Machine Comprehension (MC) models like BiDAF (Seo et al, 2017) and MemoryNet (Weston et al, 2014), as well as Visual Question Answering (VQA) (Antol et al, 2015) and diagram parsing algorithms like DsDP-net (Kembhavi et al, 2016). Their results were rather modest (50.4, 32.9, and 31.3 in true/false, text and diagram MC questions), suggesting that existing MC/VQA methods would not suffice for the TQA dataset. Indeed, diagram questions entail greater complexity than dealing with natural images, as shown in (Gomez-Perez and Ortega, 2019), where we beat the TQA baselines using visual and language information extracted from the correspondence between figures and captions in scientific literature enriched with lexicosemantic information from a knowledge graph (Denaux and Gomez-Perez, 2019). By contrast, (Li et al, 2018) focused on finding contradictions between the candidate answers and their corresponding context while (Kim et al, 2019) applied graph convolutional networks on text and diagrams to represent relevant question background information as a unified graph.

    The field of NLP has advanced substantially with the advent of large-scale language models such as ELMo (Peters et al, 2018), ULMFit (Howard and Ruder, 2018), GPT (Radford et al, 2018), BERT (Devlin et al, 2018), and RoBERTa (Liu et al, 2019). These models are trained on large amounts of text (BERT, for example, was trained on Wikipedia plus the Google Book Corpus of 10,000 books (Zhu et al, 2015)) to learn language prediction tasks such as guessing a missing word or the next sentence. Language models and particularly transformers have been used in question answering, as illustrated by the success of the Aristo system (Clark et al, 2019) in standard science tests. Transformers have also proved their worth as soft reasoners (Clark et al, 2020), exhibiting capabilities for natural language inference. Furthermore, while learning linguistic information, transformers have been shown to capture semantic knowledge and a general understanding of the world from the training text (Petroni et al, 2019), including a notion of commonsense that can be useful in question answering. Our approach is the first to leverage the language understanding and reasoning capabilities of existing transformer language models for TQA.
Funding
  • This research was funded by the Horizon 2020 grant European Language Grid-825627
Study Subjects and Analysis
For each answer option ai, we concatenate q and ai and run the query against a search engine like ElasticSearch. Based on the search engine score, we take the top n sentences (n = 10) resulting from the query, where each sentence has at least one overlapping, non-stop word with ai. This ensures that all sentences have some relevance to both q and ai, while maximizing recall. A minimal sketch of this filtering step follows.
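The sketch below uses a toy word-overlap scorer as a stand-in for the actual search-engine score (the text mentions ElasticSearch); the stop-word list and scoring function are illustrative assumptions.

```python
# Hypothetical IR retriever: keep only sentences that share a non-stop word with the
# answer option, score them against "question + option", and return the top n.
STOP_WORDS = {"the", "a", "an", "of", "and", "or", "to", "is", "are", "in", "on", "what"}

def content_words(text):
    return {word for word in text.lower().split() if word not in STOP_WORDS}

def retrieve_background(question, answer_option, corpus_sentences, n=10):
    query_words = content_words(f"{question} {answer_option}")
    option_words = content_words(answer_option)
    scored = []
    for sentence in corpus_sentences:
        words = content_words(sentence)
        if not (words & option_words):      # require overlap with the answer option
            continue
        score = len(words & query_words)    # stand-in for the search-engine relevance score
        scored.append((score, sentence))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [sentence for _, sentence in scored[:n]]
```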

References
  • Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. 2018. Bottom-up and top-down attention for image captioning and visual question answering. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6077–6086.
  • Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. 2015. Vqa: Visual question answering. 2015 IEEE International Conference on Computer Vision (ICCV), pages 2425–2433.
  • Michael Boratko, Harshit Padigela, Divyendra Mikkilineni, Pritish Yuvraj, Rajarshi Das, Andrew McCallum, Maria Chang, Achille Fokoue-Nkoutche, Pavan Kapanipathi, Nicholas Mattei, Ryan Musa, Kartik Talamadupula, and Michael Witbrock. 2018. A systematic classification of knowledge, reasoning, and context within the ARC dataset. In Proceedings of the Workshop on Machine Reading for Question Answering, pages 60–70, Melbourne, Australia. Association for Computational Linguistics.
  • Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. 2018. Think you have solved question answering? try arc, the ai2 reasoning challenge. ArXiv, abs/1803.05457.
  • Peter Clark, Oren Etzioni, Daniel Khashabi, Tushar Khot, Bhavana Dalvi Mishra, Kyle Richardson, Ashish Sabharwal, Carissa Schoenick, Oyvind Tafjord, Niket Tandon, Sumithra Bhakthavatsalam, Dirk Groeneveld, Michal Guerquin, and Michael Schmitz. 2019. From ’f’ to ’a’ on the n.y. regents science exams: An overview of the aristo project. ArXiv, abs/1909.01958.
  • Peter Clark, Oyvind Tafjord, and Kyle Richardson. 2020. Transformers as soft reasoners over language. ArXiv, abs/2002.05867.
  • Yann N. Dauphin, Angela Fan, Michael Auli, and David Grangier. 2017. Language modeling with gated convolutional networks. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, ICML'17, page 933–941. JMLR.org.
  • Ronald Denaux and Jose Manuel Gomez-Perez. 2019. Vecsigrafo: Corpus-based word-concept embeddings. Semantic Web, pages 1–28.
  • J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. 2009. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR09.
  • Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: pre-training of deep bidirectional transformers for language understanding. CoRR, abs/1810.04805.
  • Jose Manuel Gomez-Perez and Raul Ortega. 2019. Look, read and enrich - learning from scientific figures and their captions. In Proceedings of the 10th International Conference on Knowledge Capture, KCAP '19, page 101–108, New York, NY, USA. Association for Computing Machinery.
  • K. He, X. Zhang, S. Ren, and J. Sun. 2016. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778.
  • Jeremy Howard and Sebastian Ruder. 2018. Finetuned language models for text classification. CoRR, abs/1801.06146.
  • Aniruddha Kembhavi, Mike Salvato, Eric Kolve, Minjoon Seo, Hannaneh Hajishirzi, and Ali Farhadi. 2016. A diagram is worth a dozen images. In ECCV.
  • Aniruddha Kembhavi, Minjoon Seo, Dustin Schwenk, Jonghyun Choi, Ali Farhadi, and Hannaneh Hajishirzi. 2017. Are you smarter than a sixth grader? textbook question answering for multimodal machine comprehension. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5376–5384.
  • Daniel Khashabi, Tushar Khot, Ashish Sabharwal, Oyvind Tafjord, P. Clark, and Hannaneh Hajishirzi. 2020. Unifiedqa: Crossing format boundaries with a single qa system. ArXiv, abs/2005.00700.
  • DaeSik Kim, Seonhoon Kim, and Nojun Kwak. 2019. Textbook question answering with multi-modal context graph understanding and self-supervised openset comprehension. In ACL.
  • Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. CoRR, abs/1412.6980.
  • Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A. Shamma, Michael S. Bernstein, and Li Fei-Fei. 2017. Visual genome: Connecting language and vision using crowdsourced dense image annotations. Int. J. Comput. Vision, 123(1):32–73.
  • Juzheng Li, Hang Su, Jun Zhu, Siyu Wang, and Bo Zhang. 2018. Textbook question answering under instructor guidance with memory networks. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3655–3663.
  • Tsung-Yi Lin, Michael Maire, Serge Belongie, Lubomir Bourdev, Ross Girshick, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollar, and C. Lawrence Zitnick. 2014. Microsoft coco: Common objects in context. In ECCV.
  • Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. ArXiv, abs/1907.11692.
  • Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. 2018. Can a suit of armor conduct electricity? a new dataset for open book question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2381–2391, Brussels, Belgium. Association for Computational Linguistics.
  • Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237, New Orleans, Louisiana. Association for Computational Linguistics.
  • Fabio Petroni, Tim Rocktaschel, Sebastian Riedel, Patrick Lewis, Anton Bakhtin, Yuxiang Wu, and Alexander Miller. 2019. Language models as knowledge bases? In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLPIJCNLP), pages 2463–2473, Hong Kong, China. Association for Computational Linguistics.
  • Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. URL https://s3us-west-2.amazonaws.com/openai-assets/researchcovers/languageunsupervised/language understanding paper.pdf.
  • Raj Reddy. 1988. Foundations and grand challenges of artificial intelligence: Aaai presidential address. AI Magazine, 9(4):9.
  • Joseph Redmon, Santosh Kumar Divvala, Ross B. Girshick, and Ali Farhadi. 2015. You only look once: Unified, real-time object detection. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 779–788.
  • Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2017. Faster r-cnn: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell., 39(6):1137–1149.
  • Minjoon Seo, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. 2017. Bidirectional attention flow for machine comprehension. CoRR, abs/1611.01603.
  • Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. 2018. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2556–2565, Melbourne, Australia. Association for Computational Linguistics.
  • M. Shoeybi, M. Patwary, R. Puri, P. LeGresley, J. Casper, and Bryan Catanzaro. 2019. Megatronlm: Training multi-billion parameter language models using model parallelism. ArXiv, abs/1909.08053.
  • Karen Simonyan and Andrew Zisserman. 2015. Very deep convolutional networks for large-scale image recognition. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings.
  • Weijie Su, Xizhou Zhu, Yue Cao, Bin Li, Lewei Lu, Furu Wei, and Jifeng Dai. 2019. Vl-bert: Pretraining of generic visual-linguistic representations. arXiv preprint arXiv:1908.08530.
  • Kai Sun, Dian Yu, Dong Yu, and Claire Cardie. 2019. Improving machine reading comprehension with general reading strategies. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2633–2643, Minneapolis, Minnesota. Association for Computational Linguistics.
  • Hao Tan and Mohit Bansal. 2019. LXMERT: Learning cross-modality encoder representations from transformers. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5100–5111, Hong Kong, China. Association for Computational Linguistics.
  • Damien Teney, L. Liu, and A. V. D. Hengel. 2017. Graph-structured representations for visual question answering. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3233– 3241.
  • Jason Weston, Sumit Chopra, and Antoine Bordes. 2014. Memory networks. CoRR, abs/1410.3916.
  • Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), ICCV ’15, page 19–27, USA. IEEE Computer Society.