Divide and Conquer: Crowdsourcing the Creation of Cross-Lingual Textual Entailment Corpora.

EMNLP '11: Proceedings of the Conference on Empirical Methods in Natural Language Processing (2011)

We address the creation of cross-lingual textual entailment corpora by means of crowdsourcing. Our goal is to define a cheap and replicable data collection methodology that minimizes the manual work done by expert annotators, without resorting to preprocessing tools or already annotated monolingual datasets. In line with recent work emphasizing the need for large-scale annotation efforts for textual entailment, our work aims to: i) tackle the scarcity of data available to train and evaluate systems, and ii) promote the use of crowdsourcing as an effective way to reduce the costs of data collection without sacrificing quality. We show that a complex data creation task, for which even experts usually achieve low agreement scores, can be effectively decomposed into simple subtasks assigned to non-expert annotators. The resulting dataset, obtained from a pipeline of different jobs routed to Amazon Mechanical Turk, contains more than 1,600 aligned pairs for each combination of texts-hypotheses in English, Italian and German.
complex data creation task, data collection, replicable data collection methodology, cross-lingual textual entailment corpus, expert annotators, manual work, recent work, textual entailment, Amazon Mechanical Turk, annotated monolingual datasets