Are The Existing Training Corpora Unnecessarily Large?

Procesamiento del Lenguaje Natural (2012)

Abstract
This paper addresses the problem of optimizing the training treebank data, since the size and quality of such data have always been a bottleneck for training. In previous studies we observed that the corpora currently used for training machine learning-based dependency parsers contain a significant proportion of redundant information at the level of syntactic structure. Since developing such training corpora requires a large effort, we argue that an appropriate process for selecting the sentences to be included in them can yield parsing models as accurate as those obtained when training with larger, non-optimized corpora (or, alternatively, greater accuracy for an equivalent annotation effort). This argument is supported by the results of the study presented in this paper. The paper therefore demonstrates that existing training corpora contain more information than is needed to train accurate data-driven dependency parsers.
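The abstract does not specify the selection procedure itself. Purely as an illustration of the idea of pruning syntactically redundant sentences, the sketch below drops sentences from a CoNLL-X-formatted treebank whose dependency-structure signature (POS tag, relative head offset, and dependency relation per token) has already been kept. The signature heuristic and the column layout are assumptions for this sketch, not the authors' method.

    # Illustrative sketch only: prune a CoNLL-X treebank by discarding sentences
    # whose syntactic-structure signature duplicates one already retained. The
    # signature heuristic below is an assumption, not the paper's criterion.

    def sentence_signature(tokens):
        """Map a sentence (list of CoNLL-X token lines) to a hashable
        summary of its dependency structure."""
        sig = []
        for tok in tokens:
            cols = tok.split("\t")
            idx, pos = int(cols[0]), cols[3]          # token id, coarse POS tag
            head, deprel = int(cols[6]), cols[7]      # head id, dependency relation
            sig.append((pos, head - idx, deprel))     # head offset abstracts over position
        return tuple(sig)

    def prune_treebank(lines):
        """Yield only the first sentence seen for each structural signature."""
        seen = set()
        sentence = []
        for line in lines:
            line = line.rstrip("\n")
            if line:
                sentence.append(line)
                continue
            if sentence:                              # blank line closes a sentence
                sig = sentence_signature(sentence)
                if sig not in seen:
                    seen.add(sig)
                    yield from sentence
                    yield ""                          # keep CoNLL sentence separator
                sentence = []
        if sentence and sentence_signature(sentence) not in seen:
            yield from sentence
            yield ""

    if __name__ == "__main__":
        import sys
        with open(sys.argv[1], encoding="utf-8") as f:
            for out_line in prune_treebank(f):
                print(out_line)

Coarser or finer signatures (e.g., ignoring relation labels, or including word forms) would trade off how aggressively the corpus is reduced against how much syntactic variety is preserved.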
Keywords
Dependency parsing, CoNLL Shared Tasks, Design principles for treebanks, Optimization