Benefits of Intermediate Annotations in Reading Comprehension

Dheeru Dua

ACL, pp. 5627-5634, 2020.

Abstract:

Complex, compositional reading comprehension datasets require performing latent sequential decisions that are learned via supervision from the final answer. A large combinatorial space of possible decision paths that result in the same answer, compounded by the lack of intermediate supervision to help choose the right path, makes the learning particularly hard. This work studies the benefit of collecting intermediate reasoning annotations alongside the final answer and shows that, under low budget constraints, annotating up to 10% of the training data (2-5% of the total budget) improves performance by 4-5 F1 points on DROP and Quoref and helps alleviate dataset collection biases.
Introduction
  • Many reading comprehension datasets requiring complex and compositional reasoning over text have been introduced, including HotpotQA (Yang et al, 2018), DROP (Dua et al, 2019), Quoref (Dasigi et al, 2019), and ROPES (Lin et al, 2019).
  • Models trained on these datasets (Hu et al, 2019; Andor et al, 2019) only have the final answer as supervision, leaving the model guessing at the correct latent reasoning.
  • Example sentences from a DROP passage about a Bears-Vikings game (an illustrative intermediate annotation for this example is sketched after this list):
  • The Bears increased their lead over the Vikings with Cutler’s 3-yard TD pass to tight end Desmond Clark.
  • The Bears responded with Cutler firing a 20-yard TD pass to wide receiver Earl Bennett.
  • The Vikings completed the remarkable comeback with Favre finding wide receiver Sidney Rice on a 6-yard TD pass on 4th-and-goal with 15 seconds left in regulation.
  • With the loss, the Vikings fell to 11-4 and surrendered homefield advantage to the Saints.
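To make the idea concrete, the sketch below shows what an intermediate annotation could look like for the passage above. The question, the answer, and the field names are hypothetical illustrations, not items from the DROP release or the paper's annotation interface.

```python
# Illustrative (hypothetical) example of a QA pair with an intermediate annotation:
# the annotator marks the evidence spans needed before the final comparison step.
example = {
    "question": "How many yards was the shortest touchdown pass?",  # hypothetical question
    "answer": "3",
    # Intermediate annotation: passage spans an annotator would highlight as evidence.
    "annotated_spans": [
        "Cutler's 3-yard TD pass to tight end Desmond Clark",
        "Cutler firing a 20-yard TD pass to wide receiver Earl Bennett",
        "Favre finding wide receiver Sidney Rice on a 6-yard TD pass",
    ],
}
```

With only the final answer "3" as supervision, many decision paths could produce it; the annotated spans narrow the model toward the intended reasoning.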
Highlights
  • We find that the sweet spot is to allocate roughly 2% of the budget, covering ∼10% of the training set, to intermediate annotations (a toy cost calculation follows this list)
  • We investigate the effects of spending a small additional budget, either on more QA pairs or on intermediate annotations, on dataset collection bias
  • We show that under low budget constraints, collecting these annotations for up to 10% of the training data (2-5% of the total budget) can improve performance by 4-5% in F1
  • Our work shows that collecting intermediate annotations for a fraction of the dataset is cost-effective and helps alleviate dataset collection biases to a degree
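To make the 2%-of-budget / ∼10%-of-training-set split concrete, here is a toy calculation. The per-item costs and the 10k training-set size are assumed for illustration only; they are not figures reported in the paper.

```python
# Toy budget split using hypothetical costs (not figures from the paper).
qa_pair_cost = 0.70      # assumed cost of collecting one QA pair, in dollars
annotation_cost = 0.15   # assumed cost of one intermediate annotation, in dollars
train_size = 10_000      # e.g. a 10k-example training set, as in the DROP baseline

qa_budget = train_size * qa_pair_cost                        # $7,000 spent on QA pairs
annotated = int(0.10 * train_size)                           # annotate ~10% of examples
annotation_budget = annotated * annotation_cost              # $150 spent on annotations
share = annotation_budget / (qa_budget + annotation_budget)  # ~0.02, i.e. ~2% of the total
print(f"{annotated} annotations ≈ {share:.1%} of the total budget")
```

Under these assumed costs, covering 10% of the training set with annotations consumes only about 2% of the overall collection budget, which is the regime the paper identifies as the sweet spot.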
Results
  • The authors train multiple models for the DROP and Quoref datasets, and evaluate the benefits of intermediate annotations as compared to traditional QA pairs.
  • The authors study the impact of annotations on DROP with two models at the top of the leaderboard: NABERT and MTMSN (Hu et al, 2019).
  • Both models employ a similar arithmetic block, introduced in the baseline model (Dua et al, 2019), on top of contextual representations from BERT (Devlin et al, 2019).
  • For Quoref, the authors use the baseline XLNet (Yang et al, 2019) model released with the dataset.
  • The authors supervise these models with the annotations in a simple way, by jointly predicting the intermediate annotation and the final answer.
  • The first loss term is a cross-entropy between the gold annotations (g) and the predicted annotations, which are obtained by passing the final BERT representations through a linear layer to get a per-token score (p) and normalizing each token’s score of being selected as part of the annotation; the second term is the model’s usual answer-prediction loss (a minimal sketch of this joint objective is given below).
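A minimal sketch of this joint objective, assuming a PyTorch implementation: the module name AnnotationSupervision, the tensors bert_out and gold_mask, the softmax normalization over tokens, and the annotation_weight term are illustrative assumptions, and answer_loss stands in for whatever loss the underlying model (NABERT, MTMSN, or the XLNet baseline) already computes.

```python
# Hedged sketch of jointly supervising annotations and answers (assumed PyTorch code,
# not the authors' released implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F

class AnnotationSupervision(nn.Module):
    """Scores every token and penalizes divergence from the gold annotation span(s)."""

    def __init__(self, hidden_size: int):
        super().__init__()
        self.scorer = nn.Linear(hidden_size, 1)  # one scalar score p per token

    def forward(self, bert_out: torch.Tensor, gold_mask: torch.Tensor,
                answer_loss: torch.Tensor, annotation_weight: float = 1.0) -> torch.Tensor:
        # bert_out: (batch, seq_len, hidden); gold_mask: float (batch, seq_len), 1s on annotated tokens.
        scores = self.scorer(bert_out).squeeze(-1)   # (batch, seq_len) token scores p
        log_probs = F.log_softmax(scores, dim=-1)    # normalize scores over the sequence
        # Turn the 0/1 gold mask into a distribution g over tokens, then take cross-entropy.
        gold = gold_mask / gold_mask.sum(dim=-1, keepdim=True).clamp(min=1)
        annotation_loss = -(gold * log_probs).sum(dim=-1).mean()
        # Joint objective: the model's usual answer loss plus the weighted annotation term.
        return answer_loss + annotation_weight * annotation_loss
```

Presumably the annotation term is only applied to the fraction of training examples that actually carry intermediate annotations, with the remaining examples trained on the answer loss alone.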
Tables
  • Table 1: F1 performance and confusion loss (lower is better) of models in three settings: the baseline with 10k (DROP) or 5k (Quoref) QA pairs, additional QA pairs worth 250 (DROP) or 100 (Quoref), and additional annotations worth 250 (DROP) or 100 (Quoref). To put confusion loss in perspective, the best confusion loss, i.e. perfect diffusion, is 90.1 for DROP and 87.0 for Quoref
Related work
  • Similar to our work, Zaidan et al (2007) studied the impact of providing explicit supervision via rationales, rather than generating them, for varying fractions of the training set in text classification. However, we study the benefits of such supervision for complex, compositional reading comprehension datasets. In the field of computer vision, Donahue and Grauman (2011) collected similar annotations for visual recognition, where crowd-workers highlighted relevant regions in images.

    Within reading comprehension, works such as HotpotQA (Yang et al, 2018) and CoQA (Reddy et al, 2019) have collected similar reasoning steps for the entire dataset. Our work shows that collecting intermediate annotations for a fraction of the dataset is cost-effective and helps alleviate dataset collection biases to a degree. Another line of work (Ning et al, 2019) explores the cost vs. benefit of collecting full vs. partial annotations for various structured prediction tasks; however, it does not focus on the intermediate reasoning required to learn the task.
Funding
  • This work was supported in part by the Allen Institute for Artificial Intelligence, in part by Amazon, and in part by the National Science Foundation (NSF) grant #CNS-1730158
Reference
  • Daniel Andor, Luheng He, Kenton Lee, and Emily Pitler. 2019. Giving BERT a calculator: Finding operations and arguments with reading comprehension. Annual Meeting of the Association for Computational Linguistics (ACL).
  • Ming-Wei Chang, Lev Ratinov, and Dan Roth. 2007. Guiding semi-supervision with constraint-driven learning. In Annual Meeting of the Association for Computational Linguistics (ACL), pages 280–287.
  • Pradeep Dasigi, Nelson F. Liu, Ana Marasović, Noah A. Smith, and Matt Gardner. 2019. Quoref: A reading comprehension dataset with questions requiring coreferential reasoning. Annual Meeting of the Association for Computational Linguistics (ACL).
  • Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. NAACL.
  • Jeff Donahue and Kristen Grauman. 2011. Annotator rationales for visual recognition. In 2011 International Conference on Computer Vision (ICCV), pages 1395–1402. IEEE.
  • Dheeru Dua, Yizhong Wang, Pradeep Dasigi, Gabriel Stanovsky, Sameer Singh, and Matt Gardner. 2019. DROP: A reading comprehension benchmark requiring discrete reasoning over paragraphs. In NAACL.
  • Kuzman Ganchev, João Graça, Jennifer Gillenwater, and Ben Taskar. 2010. Posterior regularization for structured latent variable models. Journal of Machine Learning Research (JMLR).
  • Mor Geva, Yoav Goldberg, and Jonathan Berant. 2019. Are we modeling the task or the annotator? An investigation of annotator bias in natural language understanding datasets. Annual Meeting of the Association for Computational Linguistics (ACL).
  • Suchin Gururangan, Swabha Swayamdipta, Omer Levy, Roy Schwartz, Samuel R. Bowman, and Noah A. Smith. 2018. Annotation artifacts in natural language inference data. In NAACL.
  • Minghao Hu, Yuxing Peng, Zhen Huang, and Dongsheng Li. 2019. A multi-type multi-span network for reading comprehension that requires discrete reasoning. Annual Meeting of the Association for Computational Linguistics (ACL).
  • Sokol Koço and Cécile Capponi. 2013. On multi-class classification through the minimization of the confusion matrix norm. In Asian Conference on Machine Learning (ACML), pages 277–292.
  • Tao Lei, Regina Barzilay, and Tommi Jaakkola. 2016. Rationalizing neural predictions. Annual Meeting of the Association for Computational Linguistics (ACL).
  • Kevin Lin, Oyvind Tafjord, Peter Clark, and Matt Gardner. 2019. Reasoning over paragraph effects in situations. MRQA Workshop.
  • Pierre Machart and Liva Ralaivola. 2012. Confusion matrix stability bounds for multiclass classification. arXiv preprint arXiv:1202.6221.
  • Varun Manjunatha, Nirat Saini, and Larry S. Davis. 2019. Explicit bias discovery in visual question answering models. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 9562–9571.
  • Gideon S. Mann and Andrew McCallum. 2008. Generalized expectation criteria for semi-supervised learning of conditional random fields. In Annual Meeting of the Association for Computational Linguistics (ACL), pages 870–878.
  • Sewon Min, Eric Wallace, Sameer Singh, Matt Gardner, Hannaneh Hajishirzi, and Luke Zettlemoyer. 2019. Compositional questions do not necessitate multi-hop reasoning. Annual Meeting of the Association for Computational Linguistics (ACL).
  • Qiang Ning, Hangfeng He, Chuchu Fan, and Dan Roth. 2019. Partial or complete, that’s the question. Annual Meeting of the Association for Computational Linguistics (ACL).
  • Siva Reddy, Danqi Chen, and Christopher D. Manning. 2019. CoQA: A conversational question answering challenge. Transactions of the Association for Computational Linguistics, 7:249–266.
  • Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2016. “Why should I trust you?”: Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1135–1144. ACM.
  • Tim Rocktäschel, Sameer Singh, and Sebastian Riedel. 2015. Injecting logical background knowledge into embeddings for relation extraction. In Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL).
  • Andrew Slavin Ross, Michael C. Hughes, and Finale Doshi-Velez. 2017. Right for the right reasons: Training differentiable models by constraining their explanations. IJCAI.
  • Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. NeurIPS.
  • Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W. Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. 2018. HotpotQA: A dataset for diverse, explainable multi-hop question answering. Annual Meeting of the Association for Computational Linguistics (ACL).
  • Omar Zaidan, Jason Eisner, and Christine Piatko. 2007. Using annotator rationales to improve machine learning for text categorization. In Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics (NAACL-HLT), pages 260–267.