
Multi-Step Inference for Reasoning Over Paragraphs

EMNLP 2020, pp. 3040–3050

Abstract

Complex reasoning over text requires understanding and chaining together free-form predicates and logical connectives. Prior work has largely tried to do this either symbolically or with black-box transformers. We present a middle ground between these two extremes: a compositional model reminiscent of neural module networks that can perform chained logical reasoning.

Introduction
  • Performing chained inference over natural language text is a long-standing goal in artificial intelligence (Grosz et al, 1986; Reddy, 2003)
  • This kind of inference requires understanding how natural language statements fit together in a way that permits drawing conclusions.
  • Scientists think that the earliest flowers attracted insects and other animals, which spread pollen from flower to flower
  • This greatly increased the efficiency of fertilization over wind-spread pollen, which might or might not land on another flower.
  • Category A flowers spread pollen via wind, and category B flowers spread pollen via animals
Highlights
  • Performing chained inference over natural language text is a long-standing goal in artificial intelligence (Grosz et al, 1986; Reddy, 2003)
  • We present a model that is a middle ground between these two approaches: a compositional model reminiscent of neural module networks that can perform chained logical reasoning
  • We found that the distribution over question types in the training set is similar to that in the development set, where most of the questions are of noun phrase (NP) type (85%) and the second most frequent are of adjective phrase (ADJP) type
  • We propose a multi-step reading comprehension model that performs chained inference over natural language text
  • We have demonstrated that our model substantially outperforms prior work on ROPES, a challenging new reading comprehension dataset
  • Self-assembling modules over text could lead to a single model that performs the necessary reasoning for multiple different datasets (a minimal sketch of such attention-based modules follows this list)
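The paper's modules (e.g. Q SELECT, B SELECT, B CHAIN, S CHAIN; see Table 5) are described as attentions over the text that can be freely combined. Below is a minimal sketch of that idea, not the authors' implementation: the class names, shapes, and the fusion layer are illustrative, assuming the reported hidden size of 1024 and 8 attention heads.

```python
import torch
import torch.nn as nn

HIDDEN, HEADS = 1024, 8  # sizes reported in the Methods section

class SelectModule(nn.Module):
    """Attend from one text into another, e.g. question -> background (Q SELECT)."""
    def __init__(self):
        super().__init__()
        self.attn = nn.MultiheadAttention(HIDDEN, HEADS, batch_first=True)

    def forward(self, query_reps, passage_reps):
        # softmax(QK^T / sqrt(d)) V over the passage tokens
        out, _ = self.attn(query_reps, passage_reps, passage_reps)
        return out

class ChainModule(nn.Module):
    """Chain a previous reasoning state through a new passage, e.g. S CHAIN."""
    def __init__(self):
        super().__init__()
        self.attn = nn.MultiheadAttention(HIDDEN, HEADS, batch_first=True)
        self.fuse = nn.Linear(2 * HIDDEN, HIDDEN)

    def forward(self, state, passage_reps):
        attended, _ = self.attn(state, passage_reps, passage_reps)
        # fuse what was already inferred with what was just retrieved
        return self.fuse(torch.cat([state, attended], dim=-1))

# Hypothetical composition: select from the question, then chain the
# result through the situation before span prediction.
q = torch.randn(1, 12, HIDDEN)    # question token representations
bg = torch.randn(1, 80, HIDDEN)   # background token representations
sit = torch.randn(1, 40, HIDDEN)  # situation token representations
state = SelectModule()(q, bg)
state = ChainModule()(state, sit)
```

In the actual model the token representations would come from a pre-trained transformer encoder, as the Related Work section indicates.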
Methods
  • The authors compare the performance of the systems they presented on ROPES.

    Data: The authors use 10,924 questions as the training set, 1,688 questions as the dev set, and 1,710 questions as the test set. Each question has exactly one answer, which is a span from either the situation or the question.
  • The hidden sizes of all layers are set to 1024, and the number of heads on multi-step attentions is 8.
  • All systems are trained with a learning rate of 1e-5 and a weight decay of 0.1.
  • The authors train with mini-batches of size 8 using the Adam optimizer (Kingma and Ba, 2015)
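Putting the reported hyperparameters together, a hedged training-loop sketch might look like the following; the model, dataset, and loss here are placeholders, and whether the paper used plain Adam or a decoupled-weight-decay variant is not stated on this page.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder data standing in for the 10,924 ROPES training questions.
dataset = TensorDataset(torch.randn(64, 1024), torch.randn(64, 1024))
loader = DataLoader(dataset, batch_size=8, shuffle=True)  # batch size 8

model = torch.nn.Linear(1024, 1024)  # stand-in for the multi-step model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5, weight_decay=0.1)
loss_fn = torch.nn.MSELoss()  # the real objective would be span cross-entropy

for inputs, targets in loader:
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)
    loss.backward()
    optimizer.step()
```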
Results
  • While the model still outperforms the baseline on the official test set, the gap is not nearly as large as on the dev set
  • To understand whether this was due to overfitting to the dev set or to the distributional shift mentioned in Section 3.2, Table 4 shows the results on dev-test, the split that treats the official dev set as a held-out test set.
  • The authors still see large gains of 7.2% EM from the model, suggesting that it is a distributional shift and not overfitting that is the cause of the difference in performance between the original dev and test sets.
  • Handling the distributional shift in the ROPES test set is an interesting challenge for future work
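The EM numbers above refer to exact match. As a reference point, here is the standard SQuAD-style exact-match computation (lowercase, strip punctuation and articles, normalize whitespace); the paper's exact normalization is not shown on this page, so treat this as the conventional definition.

```python
import re
import string

def normalize(text: str) -> str:
    """SQuAD-style answer normalization."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)  # drop articles
    return " ".join(text.split())                # collapse whitespace

def exact_match(prediction: str, gold: str) -> bool:
    return normalize(prediction) == normalize(gold)

assert exact_match("The pollen", "pollen.")        # equal after normalization
assert not exact_match("category A", "category B")
```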
Conclusion
  • The performance of the multi-step reranker without Q SELECT, B SELECT, or S CHAIN drops more (-5.9% EM) than that of the multi-step reranker without B CHAIN (-3.7% EM).
  • The multi-step system drops 3.7% EM on average while the multi-step reranker drops 5.4% EM on average, showing that the multi-step reranker depends more on the modules.
  • The authors propose a multi-step reading comprehension model that performs chained inference over natural language text.
Tables
  • Table 1: The examples in ROPES, where the bold red spans are answers
  • Table 2: The percentage (%) of question types in ROPES
  • Table 3: The ROPES dataset types, where less than half of the questions are NP type, and there are more questions with VP, ADJP, ADVP and other types
  • Table 4: The exact match scores by three systems
  • Table 5: The ablation results on development. Q SELECT denotes the question SELECT module; B CHAIN denotes the CHAIN module applied on the background and the question; B SELECT denotes the background SELECT module; S CHAIN denotes the CHAIN module applied on the situation and the previous chained reasoning
  • Table 6: The exact match accuracy of the four most frequent question types
  • Table 7: The average accuracy on training data for the multi-step reranker
  • Table 8: The oracle scores for top k candidates
  • Table 9: The examples of the answers to the questions by the baseline system, the multi-step system and the multi-step reranker
Related work
  • Neural Module Network The neural module network (NMN) was originally proposed for visual question answering tasks (Andreas et al, 2016b,a), and has recently been used on several reading comprehension tasks (Jiang et al, 2019; Jiang and Bansal, 2019), where module functions such as FIND and COMPARE are specialized to retrieve the relevant entities, with or without supervised signals. Instead, we generalize the modules as attentions over the text and make these basic modules freely combinable.

    Multi-Hop Reasoning Several datasets have been constructed for multi-hop reasoning, e.g. HOTPOTQA (Yang et al, 2018; Jiang et al, 2019; Jiang and Bansal, 2019; Min et al, 2019) and QANGAROO (Welbl et al, 2018; Chen et al, 2019b; Zhuang and Wang, 2019; Tu et al, 2019), which aim to find answers across documents. "Multi-hop" reasoning on these datasets resembles iterative information retrieval, where one entity is bridged to another entity with one hop. By contrast, multi-step reasoning on ROPES aims to reason over the effects described in a passage (the background and situation passages) and then answer the question in the specific situation, without retrieval over the background passage.

    Models beyond Pre-trained Transformers With the emergence of fully pre-trained transformers (Peters et al, 2018; Devlin et al, 2019; Liu et al, 2019; Radford et al.; Dai et al, 2019; Yang et al, 2019), most NLP benchmarks have seen new state-of-the-art results from models built on top of pre-trained transformers for specific tasks (e.g. syntactic parsing, semantic parsing and GLUE) (Wang et al, 2018; Kitaev and Klein, 2018; Zhang et al, 2019; Tsai et al, 2019). Our work follows the same line, adopting the advantages of pre-trained transformers, which have already acquired contextualized word representations from large amounts of data.
References
  • Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Dan Klein. 2016a. Learning to compose neural networks for question answering. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics.
  • Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Dan Klein. 2016b. Neural module networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 39–48.
  • Anthony Chen, Gabriel Stanovsky, Sameer Singh, and Matt Gardner. 2019a. Evaluating question answering evaluation. In Proceedings of the 2nd Workshop on Machine Reading for Question Answering, pages 119–124, Hong Kong, China. Association for Computational Linguistics.
  • Jifan Chen, Shih-ting Lin, and Greg Durrett. 2019b. Multi-hop question answering via reasoning chains. arXiv preprint.
  • Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc Le, and Ruslan Salakhutdinov. 2019. Transformer-XL: Attentive language models beyond a fixed-length context. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2978–2988, Florence, Italy. Association for Computational Linguistics.
  • Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186.
  • Mor Geva, Yoav Goldberg, and Jonathan Berant. 2019. Are we modeling the task or the annotator? An investigation of annotator bias in natural language understanding datasets.
  • Barbara J. Grosz, Karen Sparck-Jones, and Bonnie Lynn Webber. 1986. Readings in Natural Language Processing.
  • Ronghang Hu, Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Kate Saenko. 2017. Learning to reason: End-to-end module networks for visual question answering. In 2017 IEEE International Conference on Computer Vision (ICCV), pages 804–813.
  • Yichen Jiang and Mohit Bansal. 2019. Self-assembling modular networks for interpretable multi-hop reasoning. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP).
  • Yichen Jiang, Nitish Joshi, Yen-Chun Chen, and Mohit Bansal. 2019. Explore, propose, and assemble: An interpretable model for multi-hop reading comprehension. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2714–2725, Florence, Italy.
  • Daniel Khashabi, Tushar Khot, Ashish Sabharwal, Peter Clark, Oren Etzioni, and Dan Roth. 2016. Question answering via integer programming over semi-structured knowledge. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, pages 1145–1152. AAAI Press.
  • Tushar Khot, Niranjan Balasubramanian, Eric Gribkoff, Ashish Sabharwal, Peter Clark, and Oren Etzioni. 2015. Exploring Markov logic networks for question answering. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 685–694.
  • Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7–9, 2015, Conference Track Proceedings.
  • Nikita Kitaev and Dan Klein. 2018. Constituency parsing with a self-attentive encoder. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia. Association for Computational Linguistics.
  • Kevin Lin, Oyvind Tafjord, Peter Clark, and Matt Gardner. 2019. Reasoning over paragraph effects in situations. In Proceedings of the 2nd Workshop on Machine Reading for Question Answering, pages 58–62.
  • Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.
  • Sewon Min, Victor Zhong, Luke Zettlemoyer, and Hannaneh Hajishirzi. 2019. Multi-hop reading comprehension through question decomposition and rescoring. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 6097–6109, Florence, Italy. Association for Computational Linguistics.
  • Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of NAACL-HLT, pages 2227–2237.
  • Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392.
  • Raj Reddy. 2003. Three open problems in AI. Journal of the ACM (JACM), 50(1):83–86.
  • Henry Tsai, Jason Riesa, Melvin Johnson, Naveen Arivazhagan, Xin Li, and Amelia Archer. 2019. Small and practical BERT models for sequence labeling. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3623–3627.
  • Ming Tu, Guangtao Wang, Jing Huang, Yun Tang, Xiaodong He, and Bowen Zhou. 2019. Multi-hop reading comprehension across multiple documents by reasoning over heterogeneous graphs. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2704–2713, Florence, Italy. Association for Computational Linguistics.
  • Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2018. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 353–355.
  • Johannes Welbl, Pontus Stenetorp, and Sebastian Riedel. 2018. Constructing datasets for multi-hop reading comprehension across documents. Transactions of the Association for Computational Linguistics, 6:287–302.
  • Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. arXiv preprint arXiv:1906.08237.
  • Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. 2018. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2369–2380.
  • Sheng Zhang, Xutai Ma, Kevin Duh, and Benjamin Van Durme. 2019. AMR parsing as sequence-to-graph transduction. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 80–94, Florence, Italy. Association for Computational Linguistics.
  • Yimeng Zhuang and Huadong Wang. 2019. Token-level dynamic self-attention network for multi-passage reading comprehension. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2252–2262, Florence, Italy. Association for Computational Linguistics.