This paper reports on ISAAQ, the first system to achieve accuracies above 80%, 70% and 55% on Textbook Question Answering true/false, text and diagram MC questions
ISAAQ - Mastering Textbook Questions with Pre-trained Transformers and Bottom-Up and Top-Down Attention
Conference on Empirical Methods in Natural Language Processing, (2020): 5469-5479
Textbook Question Answering is a complex task at the intersection of Machine Comprehension and Visual Question Answering that requires reasoning with multimodal information from text and diagrams. For the first time, this paper taps into the potential of transformer language models and bottom-up and top-down attention to tackle the language...
- Within NLP, machine understanding of textbooks is one of the grand AI challenges. As originally put by (Reddy, 1988): "Reading a chapter in a college freshman text (say physics or accounting) and answering the questions at the end of the chapter is a hard (AI) problem that requires advances in vision, language, problem-solving, and learning theory."
- Information retrieval techniques that obtain background information from the text are usually keyword-based and potentially oblivious to the different artifacts of language, such as morphological variations, conjugations, terms that may be semantically related to the question, synonyms, hypernyms or multi-word expressions, which are frequent in the domains of the TQA dataset.
- To address such shortcomings, the authors extend classic information retrieval approaches with pre-trained models that leverage the language understanding capabilities of transformer language models.
- The authors concatenate the selected sentences following their ranking to compose a text passage with the desired background knowledge
- Textbook Question Answering (TQA) is rich with diagrams that describe potentially complex concepts, such as photosynthesis, the trophic chain, and the cycle of water, which are hard to represent as a single natural image
- We show that bottom-up and top-down (BUTD) attention, originally proposed for tasks like image captioning and visual question answering with natural images, can be effectively adapted to propose regions of interest in the diagram that are relevant to the question at hand, enabling the identification of diagram constituents and their relationships.
- This results in three text background retrievers. Information Retrieval (IR): the IR method searches the whole TQA dataset to see whether question q, along with an answer option, is explicitly stated in the corpus.
- For each question the authors propose different retrievers to extract relevant language and visual background knowledge from the textbook.
- Note that the authors consider both approaches based on conventional information retrieval techniques and approaches that leverage transformers pre-trained on specific tasks.
- The retrieved background is provided along with the question and candidate answers to the solvers.
- The authors ensemble different solvers resulting from fine-tuning one or several transformers on a multiple choice classification task, which can be combined with others based e.g. on information retrieval
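The summary does not say how the solver outputs are combined. As a minimal sketch of one common option, and not necessarily the authors' method, the snippet below averages the per-option probabilities of several solvers and picks the highest. The function names, the uniform weights and the example scores are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of raw scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_predict(solver_scores, weights=None):
    """Combine per-option scores from several solvers by weighted averaging.

    solver_scores: one list of raw scores per solver, aligned on answer options.
    Returns (index of the predicted option, combined probabilities).
    """
    if weights is None:
        weights = [1.0] * len(solver_scores)  # assumption: all solvers count equally
    n_options = len(solver_scores[0])
    combined = [0.0] * n_options
    for scores, w in zip(solver_scores, weights):
        for i, p in enumerate(softmax(scores)):
            combined[i] += w * p
    norm = sum(weights)
    combined = [c / norm for c in combined]
    best = max(range(n_options), key=lambda i: combined[i])
    return best, combined


# Hypothetical raw scores from three solvers for a four-option question.
scores = [
    [2.0, 0.5, 0.1, 0.1],
    [0.3, 1.8, 0.2, 0.1],
    [1.5, 1.0, 0.0, 0.2],
]
best, probs = ensemble_predict(scores)
print(best, [round(p, 3) for p in probs])
```

Passing non-uniform `weights` would let a stronger solver dominate; how such weights should be chosen is left open here.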
- 5.1 Experimental settings
The authors' approach is rather frugal in terms of hardware. All training and evaluation has been done on a single server with 32GB of RAM, 1TB SSD and a single GPU GeForce RTX 2080 Ti.
- The authors apply Pareto to select maximum input sequences of 64 tokens for true/false questions and 180 for text and diagram MC.
- The authors take Adam (Kingma and Ba, 2014) with linearly-decayed learning-rate and warm-up as in (Devlin et al, 2018) and empirically select peak learning rates in the range [1e−6, 5e−5], with 1e−5 for true/false and text MC questions and 1e−6 for diagram MC.
- Training time per epoch is 1 minute for true/false questions, 30 minutes for text MC, and 60 minutes for diagram MC.
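The learning-rate schedule described above (linear warm-up to a peak, then linear decay) can be sketched in a few lines. The peak rates are those reported in the text. The step counts, the decay-to-zero endpoint and the function name are illustrative assumptions, since the summary does not state them.

```python
def linear_warmup_decay_lr(step, total_steps, warmup_steps, peak_lr):
    """Learning rate at `step`: linear ramp up to peak_lr, then linear decay to 0."""
    if warmup_steps > 0 and step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / max(total_steps - warmup_steps, 1)


# Peak learning rates reported in the text for each question type.
PEAK_LR = {"true_false": 1e-5, "text_mc": 1e-5, "diagram_mc": 1e-6}

# Assumed step counts, for illustration only.
total, warmup = 1000, 100
schedule = [
    linear_warmup_decay_lr(s, total, warmup, PEAK_LR["text_mc"])
    for s in range(total + 1)
]
print(schedule[0], schedule[warmup], schedule[total])
```

In practice the same shape is usually obtained from a scheduler attached to the Adam optimizer rather than computed by hand.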
- This paper reports on ISAAQ, the first system to achieve accuracies above 80%, 70% and 55% on TQA true/false, text and diagram MC questions.
- ISAAQ demonstrates that it is possible to master the grand AI challenge of machine textbook understanding based on modern methods for language and visual understanding, with modest infrastructure requirements.
- Key to this success are transformers, BUTD attention, pre-training on related datasets, and the selection of complementary background information to train and ensemble different solvers.
- Additional effort will be needed in activities like the development of large diagram datasets, including the semantic annotation of diagram constituents and connectors, and annotating diagram questions with the reasoning and knowledge types required to answer them
- Table 1: Dataset partition sizes (#questions)
- Table 2: ISAAQ performance and comparison (validation set) with previous SotA for the TQA dataset
- Table 3: ISAAQ vs. SotA in pre-training datasets (test)
- Table 4: Results of each of our solvers and the overall ISAAQ model for TQA true/false questions
- Table 5: Individual text MC solvers and ISAAQ. Note the large delta vs. IR solver baseline (also in Table 6). Pre-training on RACE, OBQA, ARC-Easy/Challenge
- Table 6: Individual diagram MC solvers and ISAAQ. Pre-training on VQA abstract scenes and AI2D
- Table 7: ISAAQ ablations for diagram MC
- Table 8: Study of the attention on question diagrams. Examples (earth, life sciences, physics) from validation set
- In (Kembhavi et al, 2016) several TQA baselines were proposed that were based on Machine Comprehension (MC) models like BiDAF (Seo et al, 2017) and MemoryNet (Weston et al, 2014), as well as Visual Question Answering (VQA) (Antol et al, 2015) and diagram parsing algorithms like DsDP-net (Kembhavi et al, 2016). Their results were rather modest (50.4, 32.9, and 31.3 in true/false, text and diagram MC questions), suggesting that existing MC/VQA methods would not suffice for the TQA dataset. Indeed, diagram questions entail greater complexity than dealing with natural images, as shown in (Gomez-Perez and Ortega, 2019), where we beat the TQA baselines using visual and language information extracted from the correspondence between figures and captions in scientific literature enriched with lexicosemantic information from a knowledge graph (Denaux and Gomez-Perez, 2019). By contrast, (Li et al, 2018) focused on finding contradictions between the candidate answers and their corresponding context while (Kim et al, 2019) applied graph convolutional networks on text and diagrams to represent relevant question background information as a unified graph.
The field of NLP has advanced substantially with the advent of large-scale language models such as ELMo (Peters et al, 2018), ULMFit (Howard and Ruder, 2018), GPT (Radford et al, 2018), BERT (Devlin et al, 2018), and RoBERTa (Liu et al, 2019). These models are trained on large amounts of text to learn various language prediction tasks, such as guessing a missing word or the next sentence; BERT, for example, was trained on Wikipedia plus the Google Book Corpus of 10,000 books (Zhu et al, 2015). Language models, and particularly transformers, have been used in question answering, as illustrated by the success of the Aristo system (Clark et al, 2019) in standard science tests. Transformers have also proved their worth as soft reasoners (Clark et al, 2020), exhibiting capabilities for natural language inference. Furthermore, whilst learning linguistic information, transformers have been shown to capture semantic knowledge and a general understanding of the world from the training text (Petroni et al, 2019), including a notion of commonsense that can be useful in question answering. Our approach is the first to leverage the language understanding and reasoning capabilities of existing transformer language models for TQA.
- This research was funded by the Horizon 2020 grant European Language Grid-825627
For each answer option a_i, we concatenate q and a_i and run the query against a search engine like ElasticSearch. Based on the search engine score, we take the top n sentences (n = 10) resulting from the query, where each sentence has at least one overlapping, non-stop word with a_i. This ensures that all sentences have some relevance to both q and a_i, while maximizing recall.
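The retrieval step above can be sketched without a search engine. In this minimal example a plain content-word overlap count stands in for the search engine score, so it illustrates the filtering and ranking logic only, not ElasticSearch's actual scoring. The stop-word list, the function names and the three-sentence corpus are assumptions made for illustration.

```python
import re

# Tiny illustrative stop-word list (assumption; a real system would use a fuller one).
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "in", "on", "to",
              "and", "or", "by", "what", "which", "that", "it"}


def content_words(text):
    """Set of lowercased alphanumeric tokens that are not stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {t for t in tokens if t not in STOP_WORDS}


def retrieve_background(question, option, sentences, n=10):
    """Build a passage from the top-n sentences for a (question, option) pair.

    Sentences sharing no content word with the option are discarded. The rest
    are ranked by overlap with the concatenated query (a stand-in for the
    search engine score) and joined in ranking order.
    """
    query = content_words(question + " " + option)
    option_words = content_words(option)
    scored = []
    for idx, sent in enumerate(sentences):
        words = content_words(sent)
        if not words & option_words:
            continue  # require at least one non-stop word shared with the option
        scored.append((len(words & query), idx, sent))
    # Highest score first; ties keep the original corpus order.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return " ".join(sent for _, _, sent in scored[:n])


corpus = [
    "Photosynthesis takes place in the chloroplasts of plant cells.",
    "Mitochondria release energy through cellular respiration.",
    "Chloroplasts contain chlorophyll, which absorbs sunlight.",
]
passage = retrieve_background(
    "Where does photosynthesis take place?", "chloroplasts", corpus, n=2
)
print(passage)
```

Calling the function once per answer option yields one background passage per option, which can then be passed to a solver together with the question and that option.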
- Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. 2018. Bottom-up and top-down attention for image captioning and visual question answering. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6077–6086.
- Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. 2015. VQA: Visual question answering. 2015 IEEE International Conference on Computer Vision (ICCV), pages 2425–2433.
- Michael Boratko, Harshit Padigela, Divyendra Mikkilineni, Pritish Yuvraj, Rajarshi Das, Andrew McCallum, Maria Chang, Achille Fokoue-Nkoutche, Pavan Kapanipathi, Nicholas Mattei, Ryan Musa, Kartik Talamadupula, and Michael Witbrock. 2018. A systematic classification of knowledge, reasoning, and context within the ARC dataset. In Proceedings of the Workshop on Machine Reading for Question Answering, pages 60–70, Melbourne, Australia. Association for Computational Linguistics.
- Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. 2018. Think you have solved question answering? Try ARC, the AI2 Reasoning Challenge. ArXiv, abs/1803.05457.
- Peter Clark, Oren Etzioni, Daniel Khashabi, Tushar Khot, Bhavana Dalvi Mishra, Kyle Richardson, Ashish Sabharwal, Carissa Schoenick, Oyvind Tafjord, Niket Tandon, Sumithra Bhakthavatsalam, Dirk Groeneveld, Michal Guerquin, and Michael Schmitz. 2019. From 'F' to 'A' on the N.Y. Regents science exams: An overview of the Aristo project. ArXiv, abs/1909.01958.
- Peter Clark, Oyvind Tafjord, and Kyle Richardson. 2020. Transformers as soft reasoners over language. ArXiv, abs/2002.05867.
- Yann N. Dauphin, Angela Fan, Michael Auli, and David Grangier. 2017. Language modeling with gated convolutional networks. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, ICML’17, page 933–941. JMLR.org.
- Ronald Denaux and Jose Manuel Gomez-Perez. 2019. Vecsigrafo: Corpus-based word-concept embeddings. Semantic Web, pages 1–28.
- J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. 2009. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR09.
- Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: pre-training of deep bidirectional transformers for language understanding. CoRR, abs/1810.04805.
- Jose Manuel Gomez-Perez and Raul Ortega. 2019. Look, read and enrich - learning from scientific figures and their captions. In Proceedings of the 10th International Conference on Knowledge Capture, KCAP ’19, page 101–108, New York, NY, USA. Association for Computing Machinery.
- K. He, X. Zhang, S. Ren, and J. Sun. 2016. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778.
- Jeremy Howard and Sebastian Ruder. 2018. Finetuned language models for text classification. CoRR, abs/1801.06146.
- Aniruddha Kembhavi, Mike Salvato, Eric Kolve, Minjoon Seo, Hannaneh Hajishirzi, and Ali Farhadi. 2016. A diagram is worth a dozen images. In ECCV.
- Aniruddha Kembhavi, Minjoon Seo, Dustin Schwenk, Jonghyun Choi, Ali Farhadi, and Hannaneh Hajishirzi. 2017. Are you smarter than a sixth grader? textbook question answering for multimodal machine comprehension. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5376–5384.
- Daniel Khashabi, Tushar Khot, Ashish Sabharwal, Oyvind Tafjord, P. Clark, and Hannaneh Hajishirzi. 2020. UnifiedQA: Crossing format boundaries with a single QA system. ArXiv, abs/2005.00700.
- DaeSik Kim, Seonhoon Kim, and Nojun Kwak. 2019. Textbook question answering with multi-modal context graph understanding and self-supervised open-set comprehension. In ACL.
- Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. CoRR, abs/1412.6980.
- Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A. Shamma, Michael S. Bernstein, and Li Fei-Fei. 2017. Visual genome: Connecting language and vision using crowdsourced dense image annotations. Int. J. Comput. Vision, 123(1):32–73.
- Juzheng Li, Hang Su, Jun Zhu, Siyu Wang, and Bo Zhang. 2018. Textbook question answering under instructor guidance with memory networks. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3655–3663.
- Tsung-Yi Lin, Michael Maire, Serge Belongie, Lubomir Bourdev, Ross Girshick, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollar, and C. Lawrence Zitnick. 2014. Microsoft COCO: Common objects in context. In ECCV.
- Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. ArXiv, abs/1907.11692.
- Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. 2018. Can a suit of armor conduct electricity? a new dataset for open book question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2381–2391, Brussels, Belgium. Association for Computational Linguistics.
- Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237, New Orleans, Louisiana. Association for Computational Linguistics.
- Fabio Petroni, Tim Rocktäschel, Sebastian Riedel, Patrick Lewis, Anton Bakhtin, Yuxiang Wu, and Alexander Miller. 2019. Language models as knowledge bases? In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2463–2473, Hong Kong, China. Association for Computational Linguistics.
- Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. URL https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf.
- Raj Reddy. 1988. Foundations and grand challenges of artificial intelligence: AAAI presidential address. AI Magazine, 9(4):9.
- Joseph Redmon, Santosh Kumar Divvala, Ross B. Girshick, and Ali Farhadi. 2015. You only look once: Unified, real-time object detection. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 779–788.
- Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2017. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell., 39(6):1137–1149.
- Minjoon Seo, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. 2017. Bidirectional attention flow for machine comprehension. CoRR, abs/1611.01603.
- Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. 2018. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2556–2565, Melbourne, Australia. Association for Computational Linguistics.
- M. Shoeybi, M. Patwary, R. Puri, P. LeGresley, J. Casper, and Bryan Catanzaro. 2019. Megatron-LM: Training multi-billion parameter language models using model parallelism. ArXiv, abs/1909.08053.
- Karen Simonyan and Andrew Zisserman. 2015. Very deep convolutional networks for large-scale image recognition. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings.
- Weijie Su, Xizhou Zhu, Yue Cao, Bin Li, Lewei Lu, Furu Wei, and Jifeng Dai. 2019. VL-BERT: Pre-training of generic visual-linguistic representations. arXiv preprint arXiv:1908.08530.
- Kai Sun, Dian Yu, Dong Yu, and Claire Cardie. 2019. Improving machine reading comprehension with general reading strategies. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2633–2643, Minneapolis, Minnesota. Association for Computational Linguistics.
- Hao Tan and Mohit Bansal. 2019. LXMERT: Learning cross-modality encoder representations from transformers. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5100–5111, Hong Kong, China. Association for Computational Linguistics.
- Damien Teney, L. Liu, and A. V. D. Hengel. 2017. Graph-structured representations for visual question answering. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3233–3241.
- Jason Weston, Sumit Chopra, and Antoine Bordes. 2014. Memory networks. CoRR, abs/1410.3916.
- Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), ICCV ’15, page 19–27, USA. IEEE Computer Society.