Linguistically Motivated Bigrams in Part-of-Speech Tagging of Language Corpora.

The Prague Bulletin of Mathematical Linguistics (2002)

Abstract
After a brief discussion of corpus representativity, this paper presents a simple yet in practice very efficient technique for automatically detecting those positions in a Part-of-Speech tagged corpus where an error is to be suspected. The approach is based on creating and then applying a set of invalid bigrams, i.e. pairs of adjacent Part-of-Speech tags which constitute an incorrect configuration in a tagged text of a particular language (in English, e.g., the bigram [ARTICLE, FINITE VERB]). The paper then describes the generalization of invalid bigrams into a set of invalid n-grams, for any natural n, which provides a powerful tool for error detection in a corpus. Some implementation issues are also presented, together with an evaluation of the approach when used for error detection in the NEGRA corpus of German. Finally, general implications for the quality of results of statistical taggers are discussed. Illustrative examples in the text are taken mainly from German, so at least a basic command of the language is helpful for understanding them; because of the complexity of the explanation they would require, the examples are neither glossed nor translated. The central ideas of the paper should nevertheless be understandable without any knowledge of German.
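The bigram check described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation: the tag names, the `INVALID_BIGRAMS` set, and the `suspect_positions` function are hypothetical, and only the pair [ARTICLE, FINITE VERB] comes from the abstract. The sketch covers adjacent tag pairs only and does not attempt the generalization to invalid n-grams.

```python
from typing import List, Sequence, Set, Tuple

Bigram = Tuple[str, str]

# Hypothetical set of invalid bigrams. Only (ARTICLE, FINITE_VERB) is
# taken from the abstract; a real set would be built per language.
INVALID_BIGRAMS: Set[Bigram] = {("ARTICLE", "FINITE_VERB")}


def suspect_positions(
    tags: Sequence[str],
    invalid: Set[Bigram] = INVALID_BIGRAMS,
) -> List[int]:
    """Return each index i where (tags[i], tags[i + 1]) is an invalid bigram."""
    return [
        i
        for i, pair in enumerate(zip(tags, tags[1:]))
        if pair in invalid
    ]


if __name__ == "__main__":
    # A toy tagged sentence with a deliberate tagging error at position 0.
    tagged = [("the", "ARTICLE"), ("walks", "FINITE_VERB"), ("dog", "NOUN")]
    tags = [tag for _, tag in tagged]
    for i in suspect_positions(tags):
        print(f"suspect bigram at tokens {i}-{i + 1}: {tagged[i]} {tagged[i + 1]}")
```

A flagged position marks only a suspicion: either of the two tags may be wrong, so each hit still needs manual inspection.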
Keywords
tagging, language, part-of-speech