Simulating Morphological Analyzers With Stochastic Taggers For Confidence Estimation

CLEF'09: Proceedings of the 10th Cross-Language Evaluation Forum Conference on Multilingual Information Access Evaluation: Text Retrieval Experiments (2010)

Abstract
We propose a method for providing stochastic confidence estimates for rule-based and black-box natural language (NL) processing systems. Our method does not require labeled training data: we simply train stochastic models on the output of the original NL systems. Numeric confidence estimates enable both minimum Bayes risk style optimization and principled system combination for these knowledge-based and black-box systems. In our specific experiments, we enrich ParaMor, a rule-based system for unsupervised morphology induction, with probabilistic segmentation confidences by training a statistical natural language tagger to simulate ParaMor's morphological segmentations. By adjusting the numeric threshold above which the simulator proposes morpheme boundaries, we improve F1 of morpheme identification on a Hungarian corpus by 5.9% absolute. With numeric confidences in hand, we also combine ParaMor's segmentation decisions with those of a second (black-box) unsupervised morphology induction system, Morfessor. Our joint ParaMor-Morfessor system improves F1 by a further 3.4% absolute, ultimately moving F1 from 41.4% to 50.7%.
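
The thresholding and combination steps described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `segment` and `combine`, the example probabilities, and the use of linear interpolation to merge two systems' confidences are all assumptions made for illustration. The only idea taken from the abstract is that a morpheme boundary is proposed wherever a numeric confidence exceeds an adjustable threshold.

```python
from typing import List, Sequence


def segment(word: str, boundary_probs: Sequence[float], threshold: float = 0.5) -> List[str]:
    """Split `word` after character i whenever boundary_probs[i] > threshold.

    `boundary_probs` holds one confidence per internal position,
    so it must have len(word) - 1 entries. (Hypothetical helper.)
    """
    if not word or len(boundary_probs) != len(word) - 1:
        raise ValueError("need a non-empty word and len(word) - 1 probabilities")
    morphs, start = [], 0
    for i, p in enumerate(boundary_probs):
        if p > threshold:
            morphs.append(word[start:i + 1])
            start = i + 1
    morphs.append(word[start:])  # final morph after the last boundary
    return morphs


def combine(probs_a: Sequence[float], probs_b: Sequence[float], weight_a: float = 0.5) -> List[float]:
    """Linearly interpolate two systems' boundary confidences.

    Assumption: interpolation is one simple way to merge confidences;
    the abstract does not specify the combination rule.
    """
    if len(probs_a) != len(probs_b):
        raise ValueError("both systems must score the same positions")
    return [weight_a * a + (1.0 - weight_a) * b for a, b in zip(probs_a, probs_b)]
```

Raising the threshold trades recall for precision (fewer boundaries are proposed), which is the knob the abstract describes tuning. The reported gains are also additive: 41.4% + 5.9% + 3.4% = 50.7%.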
Keywords
Natural Language Processing, Machine Translation, Statistical Machine Translation, Confidence Estimate, Segmentation Point