Controlled Crowdsourcing for High-Quality QA-SRL Annotation

Paul Roit
Ayal Klein
Daniela Stepanov
Jonathan Mamou
Julian Michael
Gabriel Stanovsky

ACL, pp. 7008-7013, 2020.


Abstract:

Question-answer driven Semantic Role Labeling (QA-SRL) was proposed as an attractive open and natural flavour of SRL, potentially attainable from laymen. Recently, a large-scale crowdsourced QA-SRL corpus and a trained parser were released. Trying to replicate the QA-SRL annotation for new texts, we found that the resulting annotations were lacking in quality and coverage, making them insufficient for further research and evaluation.
Introduction
  • Semantic Role Labeling (SRL) provides explicit annotation of predicate-argument relations.
  • SRL annotation of new texts requires substantial effort, involving expert annotation and possibly lexicon extension, which limits scalability.
  • Aiming to address these limitations, Question-Answer driven Semantic Role Labeling (QA-SRL) (He et al., 2015) labels each predicate-argument relationship with a question-answer pair, where natural language questions represent semantic roles and answers correspond to arguments (see the sketch after this list).
  • The importance of implicit arguments has been recognized in the literature (Cheng and Erk, 2018; Do et al., 2017; Gerber and Chai, 2012), yet they are mostly overlooked by common SRL formalisms and tools.
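
To make the QA-SRL representation concrete, the following is a minimal, hypothetical sketch of one annotated predicate as a Python record, in the spirit of the examples in Table 1; the sentence and question-answer pairs are illustrative and are not taken from the paper's data.

```python
# Hypothetical QA-SRL record: one verbal predicate, each semantic role
# expressed as a natural-language question paired with answer span(s).
qa_srl_entry = {
    "sentence": "The committee approved the budget after a long debate .".split(),
    "predicate": {"index": 2, "verb": "approved"},
    "qa_pairs": [
        # Multiple answers to one question are separated by "|" in Table 1;
        # here they are simply listed.
        {"question": "Who approved something?", "answers": ["The committee"]},
        {"question": "What did someone approve?", "answers": ["the budget"]},
        {"question": "When did someone approve something?", "answers": ["after a long debate"]},
    ],
}
```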
Highlights
  • Semantic Role Labeling (SRL) provides explicit annotation of predicate-argument relations
  • SRL annotation of new texts requires substantial effort, involving expert annotation and possibly lexicon extension, which limits scalability. Aiming to address these limitations, Question-Answer driven Semantic Role Labeling (QA-SRL) (He et al., 2015) labels each predicate-argument relationship with a question-answer pair, where natural language questions represent semantic roles and answers correspond to arguments
  • We show that our annotation protocol and dataset are of high quality and coverage, enabling subsequent QA-SRL research
  • Evaluation in QA-SRL involves, for each verb, aligning its predicted argument spans to a reference set of arguments, and evaluating question equivalence, i.e., whether predicted and gold questions for aligned spans correspond to the same semantic role
  • Since detecting question equivalence is still an open challenge, we propose both unlabeled and labeled evaluation metrics
  • As seen in Table 4, our gold set yields comparable precision with drastically higher recall, in line with our 25% higher yield
  • We suggest that our simple yet rigorous controlled crowdsourcing protocol would be effective for other challenging annotation tasks, which often prove to be a hurdle for research projects
Results
  • Evaluation Metrics

    Evaluation in QA-SRL involves, for each verb, aligning its predicted argument spans to a reference set of arguments, and evaluating question equivalence, i.e., whether predicted and gold questions for aligned spans correspond to the same semantic role.
  • Since detecting question equivalence is still an open challenge, the authors propose both unlabeled and labeled evaluation metrics.
  • Unlabeled Argument Detection (UA): Inspired by the method presented in Fitzgerald et al. (2018), argument spans are matched using a token-based criterion of intersection-over-union (IOU) ≥ 0.5.
  • To credit each argument only once, the authors employ maximal bipartite matching between the two sets of arguments, drawing an edge for each pair that passes the above-mentioned criterion.
  • The resulting maximal matching determines the true-positive set, while remaining non-aligned arguments become false positives or false negatives (see the sketch after this list).
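
The following is a minimal sketch of the unlabeled argument detection (UA) matching described above, assuming argument spans are represented as sets of token indices; the function names and the small usage example are illustrative and not the authors' released implementation.

```python
from typing import Dict, List, Set, Tuple

def iou(a: Set[int], b: Set[int]) -> float:
    """Token-level intersection-over-union between two argument spans."""
    return len(a & b) / len(a | b)

def match_arguments(pred_spans: List[Set[int]],
                    gold_spans: List[Set[int]],
                    threshold: float = 0.5) -> List[Tuple[int, int]]:
    """Draw an edge for every (predicted, gold) pair with IOU >= threshold,
    then compute a maximal bipartite matching (Kuhn's augmenting paths) so
    that each argument is credited at most once."""
    edges = [[g for g, gold in enumerate(gold_spans) if iou(pred, gold) >= threshold]
             for pred in pred_spans]
    match_of_gold: Dict[int, int] = {}  # gold index -> matched predicted index

    def try_assign(p: int, visited: Set[int]) -> bool:
        for g in edges[p]:
            if g in visited:
                continue
            visited.add(g)
            if g not in match_of_gold or try_assign(match_of_gold[g], visited):
                match_of_gold[g] = p
                return True
        return False

    for p in range(len(pred_spans)):
        try_assign(p, set())
    return [(p, g) for g, p in match_of_gold.items()]

# Hypothetical usage: matched pairs are true positives; the rest are
# false positives (unmatched predictions) or false negatives (unmatched gold).
pred = [{0, 1}, {4, 5, 6}, {9}]
gold = [{0, 1}, {4, 5}]
true_positives = match_arguments(pred, gold)
fp = len(pred) - len(true_positives)
fn = len(gold) - len(true_positives)
```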
Conclusion
  • Applying the proposed controlled crowdsourcing protocol to QA-SRL successfully attains truly scalable high-quality annotation by laymen, facilitating future research of this paradigm.
  • The released software and annotation protocol enable easy future dataset production and evaluation for QA-SRL, as well as possible extensions of the QA-based semantic annotation paradigm.
  • The authors suggest that the simple yet rigorous controlled crowdsourcing protocol would be effective for other challenging annotation tasks, which often prove to be a hurdle for research projects.
Tables
  • Table 1: QA-SRL examples. The bar (|) separates multiple answers. Implicit arguments are highlighted.
  • Table 2: Examples of the question template corresponding to the 7 slots. The first two examples are semantically equivalent.
  • Table 3: Example annotations for the consolidation task. A1 and A2 refer to question-answer pairs of the original annotators, while C refers to the consolidator-selected question and corrected answers.
  • Table 4: Automatic and manually-corrected evaluation of our gold standard and Dense (Fitzgerald et al., 2018) against the integrated expert set.
  • Table 5: Performance analysis when considering PropBank as reference (all roles, core roles, and adjuncts).
  • Table 6: Automatic parser evaluation against our test set, complemented by automatic and manual evaluations on the Wikinews part of the dev set (manual evaluation is over 50 sampled predicates).
  • Table 7: Examples where Fitzgerald et al. (2018)'s parser generates redundant arguments. The first two rows illustrate different, partly redundant, argument spans for the same question, while the bottom rows illustrate two paraphrased questions for the same role.
Funding
  • This work was supported in part by an Intel Labs grant, the Israel Science Foundation grant 1951/17 and the German Research Foundation through the German-Israeli Project Cooperation (DIP, grant DA 1600/1-1)
Reference
  • Omri Abend and Ari Rappoport. 2013. Universal Conceptual Cognitive Annotation (UCCA). In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 228–238.
  • Collin F. Baker, Charles J. Fillmore, and John B. Lowe. 1998. The Berkeley FrameNet project. In Proceedings of the 17th International Conference on Computational Linguistics - Volume 1, COLING ’98, pages 86–90, Stroudsburg, PA, USA. Association for Computational Linguistics.
  • Shany Barhom, Vered Shwartz, Alon Eirew, Michael Bugert, Nils Reimers, and Ido Dagan. 2019. Revisiting joint modeling of cross-document entity and event coreference resolution. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4179–4189, Florence, Italy. Association for Computational Linguistics.
  • Yun-Nung Chen, William Yang Wang, and Alexander I Rudnicky. 2013. Unsupervised induction and filling of semantic slots for spoken dialogue systems using frame-semantic parsing. In 2013 IEEE Workshop on Automatic Speech Recognition and Understanding, pages 120–125. IEEE.
  • Pengxiang Cheng and Katrin Erk. 2018. Implicit argument prediction with event knowledge. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 831–840.
  • Quynh Ngoc Thi Do, Steven Bethard, and Marie-Francine Moens. 2017. Improving implicit semantic role labeling by predicting semantic frame arguments. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 90–99.
  • Nicholas Fitzgerald, Julian Michael, Luheng He, and Luke S. Zettlemoyer. 2018. Large-scale QA-SRL parsing. In ACL.
  • Matthew Gerber and Joyce Y Chai. 2012. Semantic role labeling of implicit arguments for nominal predicates. Computational Linguistics, 38(4):755–798.
  • Jan Hajič, Massimiliano Ciaramita, Richard Johansson, Daisuke Kawahara, Maria Antònia Martí, Lluís Màrquez, Adam Meyers, Joakim Nivre, Sebastian Padó, Jan Štěpánek, et al. 2009. The CoNLL-2009 shared task: Syntactic and semantic dependencies in multiple languages. In Proceedings of the Thirteenth Conference on Computational Natural Language Learning: Shared Task, pages 1–18. Association for Computational Linguistics.
  • Hangfeng He, Qiang Ning, and Dan Roth. 2020. QuASE: Question-answer driven sentence encoding. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics.
  • Luheng He, Mike Lewis, and Luke S. Zettlemoyer. 2015. Question-answer driven semantic role labeling: Using natural language to annotate natural language. In EMNLP.
  • Martha Palmer, Dan Gildea, and Paul Kingsbury. 2005. The Proposition Bank: An annotated corpus of semantic roles. Computational Linguistics, 31(1).
  • Gabriel Stanovsky and Ido Dagan. 2016. Creating a large benchmark for open information extraction. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2300–2305.
  • Hai Wang, Mohit Bansal, Kevin Gimpel, and David McAllester. 2015. Machine comprehension with syntax, frames, and semantics. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), volume 2, pages 700–706.
  • Evaluating Redundant Annotations: Recent datasets and parser outputs of QA-SRL (Fitzgerald et al., 2018) produce redundant arguments. On the other hand, our consolidated gold data, as is typical, consists of a single non-redundant annotation, where arguments are non-overlapping. In order to fairly evaluate such redundant annotations against our gold standard, we ignore predicted arguments that match the ground truth but are not selected by the bipartite matching due to redundancy. After connecting unmatched predicted arguments that overlap, we count one false positive for every connected component, aiming to avoid penalizing precision too harshly when predictions are redundant (see the sketch below).
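
Below is a minimal sketch of this redundancy-aware false-positive counting, assuming the matched-but-redundant predictions have already been filtered out and spans are sets of token indices; the function name and example spans are hypothetical.

```python
from typing import List, Set

def count_redundant_false_positives(unmatched_pred: List[Set[int]]) -> int:
    """Union-find over unmatched predicted spans: spans that overlap are
    connected, and each connected component counts as one false positive."""
    parent = list(range(len(unmatched_pred)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(unmatched_pred)):
        for j in range(i + 1, len(unmatched_pred)):
            if unmatched_pred[i] & unmatched_pred[j]:  # overlapping spans
                parent[find(i)] = find(j)

    return len({find(i) for i in range(len(unmatched_pred))})

# Hypothetical example: two overlapping spans plus one disjoint span
# count as two false positives instead of three.
print(count_redundant_false_positives([{3, 4}, {4, 5, 6}, {10, 11}]))  # -> 2
```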