SciTaiL: A Textual Entailment Dataset from Science Question Answering

AAAI, 2018.


Abstract:

We present a new dataset and model for textual entailment, derived from treating multiple-choice question-answering as an entailment problem. SCITAIL is the first entailment set that is created solely from natural sentences that already exist independently “in the wild” rather than sentences authored specifically for the entailment task. …

Introduction
  • Recognizing textual entailment (RTE) involves assessing whether a given textual premise entails or implies a given hypothesis.
  • To facilitate the development of strong RTE systems, increasingly larger datasets have been proposed, ranging in size from 100s to over 500,000 annotated premise-hypothesis pairs.
  • Datasets such as RTE-n (Dagan, Glickman, and Magnini 2005), SICK (Marelli et al 2014), and SNLI (Bowman et al 2015) have played an important role in advancing the field.
Highlights
  • Recognizing textual entailment (RTE) involves assessing whether a given textual premise entails or implies a given hypothesis
  • We present the largest entailment dataset that is directly derived from an end task and consists of naturally occurring text as both premise and hypothesis
  • We find that current RTE systems, including neural entailment models, have mediocre performance on this dataset, whether trained on their original datasets or on SCITAIL
  • We describe a general methodology for annotating such an entailment dataset starting from a multiple-choice question set, and discuss specifics of the SCITAIL dataset
  • We present a new natural dataset for textual entailment, SCITAIL, derived directly from an end task, namely that of Science question answering
  • On the highly studied SNLI dataset, the 75% accuracy of even the basic entailment models is much higher than the 33.3% majority baseline
  • We propose a new neural entailment architecture that can use any graph-based syntactic/semantic structure from the hypothesis
Methods
  • The authors compare their system against two state-of-the-art neural entailment systems, along with a simple overlap-based model trained on the SCITAIL dataset.
  • The authors compute the proportion of unigrams, 1-skip bigrams, and 1-skip trigrams (Guthrie et al 2006) in the hypothesis that are present in the premise as three features.
  • The authors feed these three features into a two-layer perceptron (20 parameters); a minimal sketch follows this list.
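
To make the overlap baseline concrete, here is a minimal sketch, assuming a lowercased whitespace tokenizer, the "at most k total skipped tokens" reading of k-skip-n-grams, toy training pairs, and a scikit-learn classifier; the paper itself only states that the three features feed a two-layer perceptron with roughly 20 parameters.

    from itertools import combinations
    from sklearn.neural_network import MLPClassifier

    def skip_ngrams(tokens, n, k):
        # k-skip-n-grams, read as "at most k total skipped tokens"
        # (one common reading of Guthrie et al 2006).
        return {
            tuple(tokens[i] for i in idx)
            for idx in combinations(range(len(tokens)), n)
            if (idx[-1] - idx[0]) - (n - 1) <= k
        }

    def overlap_features(premise, hypothesis):
        # The three features: proportion of hypothesis unigrams,
        # 1-skip bigrams, and 1-skip trigrams present in the premise.
        p, h = premise.lower().split(), hypothesis.lower().split()
        feats = []
        for n, k in [(1, 0), (2, 1), (3, 1)]:
            h_grams = skip_ngrams(h, n, k)
            feats.append(len(h_grams & skip_ngrams(p, n, k)) / max(len(h_grams), 1))
        return feats

    # Toy fit; real training would use the SciTaiL pairs. hidden_layer_sizes=(4,)
    # yields a model of roughly the 20-parameter scale mentioned above.
    X = [overlap_features("plants produce oxygen during photosynthesis",
                          "plants make oxygen"),
         overlap_features("the sun is mostly hydrogen and helium",
                          "water boils at one hundred degrees")]
    y = ["entails", "neutral"]
    clf = MLPClassifier(hidden_layer_sizes=(4,), max_iter=5000,
                        random_state=0).fit(X, y)
    print(clf.predict([overlap_features("clouds form from water droplets",
                                        "clouds are formed from water droplets")]))
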
Results
  • Table 6 shows the accuracies of the baseline systems on this dataset.
  • State-of-the-art neural methods achieve 10-12% above the majority class baseline.
  • The n-gram-based model is able to achieve similar results on the test set.
  • This suggests that the sequence-based neural models barely go beyond simple word overlap on this dataset.
  • The importance of considering structure is further illustrated by the drop in test accuracy when the authors ignore the edge probabilities in the model.
Conclusion
  • The authors present a new natural dataset for textual entailment, SCITAIL, derived directly from an end task, namely that of Science question answering
  • The authors show that this is a challenging dataset for current state-of-the-art models.
  • The authors propose a new neural entailment architecture that can use any graph-based syntactic/semantic structure from the hypothesis; a sketch of one such hypothesis graph follows this list.
  • This additional use of structure results in a 5% improvement on this dataset.
  • Exploring other possible syntactic representations for the hypothesis and comparing against newly developed approaches for using structure (Chen et al 2017) remain interesting directions for future work, as does the translation of improvements on this entailment sub-task to more effective question-answering systems for the Science domain
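
Since the conclusion highlights a model that consumes "any graph-based syntactic/semantic structure from the hypothesis", here is a minimal sketch of one such hypothesis graph, built from a dependency parse with spaCy. This is a stand-in illustration: the authors' experiments use Open IE derived structure, and the en_core_web_sm model is an external dependency assumed here, not part of the paper.

    # Requires: pip install spacy && python -m spacy download en_core_web_sm
    import spacy

    nlp = spacy.load("en_core_web_sm")

    def hypothesis_graph(hypothesis):
        # Nodes are tokens; edges are (head, dependency label, child) triples.
        # A dependency parse is just one choice of graph over the hypothesis.
        doc = nlp(hypothesis)
        nodes = [tok.text for tok in doc]
        edges = [(tok.head.text, tok.dep_, tok.text)
                 for tok in doc if tok.head.i != tok.i]
        return nodes, edges

    nodes, edges = hypothesis_graph("Clouds are formed from water droplets.")
    for head, label, child in edges:
        print(f"{head} --{label}--> {child}")
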
Tables
  • Table1: Randomly selected examples from the entailment dataset. The first sentence supports the right answer but also provides a lot more information that needs to be ignored. The second example has some word overlap but cannot be used to answer the question. In the third example, we only have partial support for the question, i.e., “Plasma comprises the sun and other stars.”
  • Table2: Average number of stop-word filtered tokens in the premise and hypothesis in the training set per gold label
  • Table3: Average proportion of the hypothesis tokens that overlap with the premise and average difference between the number of tokens in premise and the hypothesis in the training set per gold label
  • Table4: Distribution of entailment examples and underlying questions in the SCITAIL train/dev/test split (a loading sketch for these records follows this list)
  • Table5: Percentage of sentences with ‘S’-rooted parse trees, percentage of sentences with at least one Open IE extraction, and number of distinct words in the SCITAIL dataset
  • Table6: Validation and test set accuracy on the entailment dataset. Our proposed model outperforms the state-of-the-art by exploiting the structure of the hypothesis
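
Tables 1 and 4 describe the examples and the train/dev/test split; as a minimal loading sketch, assuming a tab-separated release with one premise, hypothesis, and label ("entails" or "neutral") per line, reading the data could look like the following. The schema and the filename are assumptions to verify against the actual SCITAIL download.

    from collections import Counter

    def load_scitail_tsv(path):
        # Assumed schema: premise <TAB> hypothesis <TAB> label per line;
        # check the actual release before relying on this column order.
        examples = []
        with open(path, encoding="utf-8") as f:
            for line in f:
                premise, hypothesis, label = line.rstrip("\n").split("\t")
                examples.append({"premise": premise,
                                 "hypothesis": hypothesis,
                                 "label": label})
        return examples

    train = load_scitail_tsv("scitail_1.0_train.tsv")  # hypothetical filename
    print(len(train), Counter(ex["label"] for ex in train))
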
Related work
  • We discuss prior work on textual entailment and question answering that is most closely related to SCITAIL.

    Textual Entailment

    The PASCAL RTE challenges (Dagan, Glickman, and Magnini 2005) have played an important role in developing our understanding of the linguistic entailment problem. Due to the small size of these datasets, most earlier approaches relied on hand-designed features and alignment systems (Androutsopoulos and Malakasiotis 2010). With the advent of large entailment datasets (Bowman et al 2015), novel neural network architectures have been developed for the entailment task. However, these datasets were designed in isolation from any end task and with synthesized sentences. As a result, while they help advance our understanding of entailment, they do not necessarily capture entailment queries that naturally arise in an end task.
Funding
  • As a step forward, we demonstrate that one can improve accuracy on SCITAIL by 5% using a new neural model that exploits linguistic structure
  • The state-of-the-art Decomposable Attention Model (Parikh et al 2016) achieves an accuracy of 72.3%, which is only 2% higher than a simple n-gram overlap model and 12% higher than the majority class prediction baseline of 60.3%
  • On the highly studied SNLI dataset, the 75% accuracy of even the basic entailment models is much higher than the 33.3% majority baseline
  • On the other hand, our structure-based approach is able to achieve about 5% gain over the best baseline system on this task
Study subjects and analysis
Sentences from these datasets: 115,564
We used Amazon Mechanical Turk to annotate our sentences. In total, we annotated 3,234 questions and 115,564 sentences from these datasets. About 43.3% of the questions did not have a single supporting sentence, indicating that …

Popular datasets: 4
We next compare this dataset with previously published datasets to highlight some of the challenges relative to these datasets. Dataset size: We compare SCITAIL against four popular datasets, listed chronologically.

Cases: 2
Even though the model is not able to find support for the phrase ‘from water droplets’, it is able to use the edge probability model on the LSTM embeddings to identify the ‘from’ relation between ‘water droplets’ and ‘are formed’ (a hedged sketch of such an edge scorer follows this passage). Comparison to Decomposable Attention: Next, we present two cases where the decomposable attention model incorrectly labels the example but our model is able to use structure to label the example accurately. Consider the following entails example:
premise: Upwelling upward movement of deep (abyssal), cold water to the surface.
hypothesis: Upwelling is the term for when deep ocean water rises to the surface
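
As a hedged sketch of what an edge-probability component like the one above could look like, the PyTorch module below scores whether a relation such as ‘from’ holds between two contextual (e.g., LSTM) span embeddings. The dimensions, the concatenate-and-score architecture, and the relation vocabulary are illustrative assumptions, not the paper's exact model.

    import torch
    import torch.nn as nn

    class EdgeScorer(nn.Module):
        # Illustrative edge-probability model: embed the relation, concatenate
        # it with the two span embeddings, and squash to a probability.
        def __init__(self, dim, num_relations):
            super().__init__()
            self.rel_emb = nn.Embedding(num_relations, dim)
            self.score = nn.Linear(3 * dim, 1)

        def forward(self, head_vec, child_vec, rel_id):
            rel = self.rel_emb(rel_id)
            return torch.sigmoid(self.score(
                torch.cat([head_vec, child_vec, rel], dim=-1)))

    scorer = EdgeScorer(dim=64, num_relations=8)
    head = torch.randn(1, 64)   # e.g., LSTM embedding of 'water droplets'
    child = torch.randn(1, 64)  # e.g., LSTM embedding of 'are formed'
    rel = torch.tensor([3])     # e.g., index of the 'from' relation
    print(scorer(head, child, rel))  # edge probability in (0, 1)
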

References
  • Androutsopoulos, I., and Malakasiotis, P. 2010. A survey of paraphrasing and textual entailment methods. J. Artif. Intell. Res. 38:135–187.
  • Bahdanau, D.; Cho, K.; and Bengio, Y. 2015. Neural machine translation by jointly learning to align and translate. In ICLR.
  • Banko, M.; Cafarella, M. J.; Soderland, S.; Broadhead, M.; and Etzioni, O. 2007. Open information extraction from the web. In IJCAI.
  • Bentivogli, L.; Clark, P.; Dagan, I.; and Giampiccolo, D. 2010. The Sixth PASCAL Recognizing Textual Entailment Challenge. In TAC.
  • Bowman, S. R.; Angeli, G.; Potts, C.; and Manning, C. D. 2015. A large annotated corpus for learning natural language inference. In EMNLP.
  • Chen, Q.; Zhu, X.; Ling, Z.; Wei, S.; Jiang, H.; and Inkpen, D. 2017. Enhanced LSTM for natural language inference. In ACL.
  • Clark, P.; Etzioni, O.; Khot, T.; Sabharwal, A.; Tafjord, O.; Turney, P.; and Khashabi, D. 2016. Combining retrieval, statistics, and inference to answer elementary science questions. In AAAI.
  • Dagan, I.; Roth, D.; Sammons, M.; and Zanzotto, F. M. 2013. Recognizing textual entailment: Models and applications. Synthesis Lectures on Human Language Technologies 6(4):1–220.
  • Dagan, I.; Glickman, O.; and Magnini, B. 2005. The PASCAL Recognising Textual Entailment Challenge. In MLCW.
  • de Marneffe, M.-C., and Manning, C. D. 2008. The Stanford typed dependencies representation.
  • Gardner, M.; Grus, J.; Neumann, M.; Tafjord, O.; Dasigi, P.; Liu, N.; Peters, M.; Schmitz, M.; and Zettlemoyer, L. 2017. AllenNLP: A deep semantic natural language processing platform. Technical report.
  • Glorot, X.; Bordes, A.; and Bengio, Y. 2011. Deep sparse rectifier neural networks. In AISTATS.
  • Guthrie, D.; Allison, B.; Liu, W.; Guthrie, L.; and Wilks, Y. 2006. A closer look at skip-gram modelling. In LREC.
  • Hochreiter, S., and Schmidhuber, J. 1997. Long short-term memory. Neural Computation 9(8):1735–1780.
  • Ji, H., and Grishman, R. 2011. Knowledge base population: Successful approaches and challenges. In ACL.
  • Joshi, M.; Choi, E.; Weld, D. S.; and Zettlemoyer, L. S. 2017. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. In ACL.
  • Khashabi, D.; Khot, T.; Sabharwal, A.; Clark, P.; Etzioni, O.; and Roth, D. 2016. Question answering via integer programming over semi-structured knowledge. In IJCAI.
  • Khashabi, D.; Khot, T.; Sabharwal, A.; and Roth, D. 2017. Learning what is essential in questions. In CoNLL, 80–89.
  • Khot, T.; Balasubramanian, N.; Gribkoff, E.; Sabharwal, A.; Clark, P.; and Etzioni, O. 2015. Exploring Markov logic networks for question answering. In EMNLP.
  • Khot, T.; Sabharwal, A.; and Clark, P. 2017. Answering complex questions using open information extraction. In ACL.
  • Kingma, D. P., and Ba, J. 2015. Adam: A method for stochastic optimization. In ICLR.
  • Magnini, B.; Zanoli, R.; Dagan, I.; Eichler, K.; Neumann, G.; Noh, T.-G.; Pado, S.; Stern, A.; and Levy, O. 2014. The Excitement Open Platform for textual inferences. In ACL.
  • Marelli, M.; Menini, S.; Baroni, M.; Bentivogli, L.; Bernardi, R.; and Zamparelli, R. 2014. A SICK cure for the evaluation of compositional distributional semantic models. In LREC.
  • Miwa, M., and Bansal, M. 2016. End-to-end relation extraction using LSTMs on sequences and tree structures. In ACL.
  • Mou, L.; Men, R.; Li, G.; Xu, Y.; Zhang, L.; Yan, R.; and Jin, Z. 2016. Natural language inference by tree-based convolution and heuristic matching. In ACL.
  • Parikh, A. P.; Tackstrom, O.; Das, D.; and Uszkoreit, J. 2016. A decomposable attention model for natural language inference. In EMNLP.
  • Pennington, J.; Socher, R.; and Manning, C. D. 2014. GloVe: Global vectors for word representation. In EMNLP.
  • Rajpurkar, P.; Zhang, J.; Lopyrev, K.; and Liang, P. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In EMNLP.
  • Richardson, M.; Burges, C. J. C.; and Renshaw, E. 2013. MCTest: A challenge dataset for the open-domain machine comprehension of text. In EMNLP.
  • Srivastava, N.; Hinton, G. E.; Krizhevsky, A.; Sutskever, I.; and Salakhutdinov, R. 2014. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15:1929–1958.
  • Welbl, J.; Liu, N. F.; and Gardner, M. 2017. Crowdsourcing multiple choice science questions. In Workshop on Noisy User-generated Text.
  • Zhao, K.; Huang, L.; and Ma, M. 2016. Textual entailment with structured attentions and composition. In COLING.