Learning Extractors from Unlabeled Text using Relevant Databases

msra(2007)

Abstract
Supervised machine learning algorithms for information extraction generally require large amounts of training data. In many cases where labeling training data is burdensome, there may, however, already exist an incomplete database relevant to the task at hand. Records from this database can be used to label text strings that express the same information. For tasks where text strings do not follow the same format or layout, and additionally may contain extra information, labeling the strings completely may be problematic. This paper presents a method for training extractors which fill in missing labels of a text sequence that is partially labeled using simple high-precision heuristics. Furthermore, we improve the algorithm by utilizing labeled fields from the database. In experiments with BibTeX records and research paper citation strings, we show a significant improvement in extraction accuracy over a baseline that only relies on the database for training data.
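
The partial-labeling step described in the abstract can be illustrated with a minimal sketch. It assumes one simple high-precision heuristic: a token span in the citation string receives a field label only when it exactly matches the tokenized value of a database field. The function names `tokenize` and `partial_label`, the record layout, and the sample data are hypothetical and are not taken from the paper; the heuristics the paper actually uses may differ.

```python
import re


def tokenize(text):
    # Lowercase alphanumeric tokens; punctuation is dropped for matching.
    return re.findall(r"[a-z0-9]+", text.lower())


def partial_label(citation, record):
    """Label citation tokens that exactly match a database field value.

    Tokens with no exact match keep the label None. These are the
    "missing labels" that a trained extractor would later fill in.
    """
    tokens = tokenize(citation)
    labels = [None] * len(tokens)
    for field, value in record.items():
        needle = tokenize(value)
        n = len(needle)
        if n == 0:
            continue
        for start in range(len(tokens) - n + 1):
            span_is_free = all(lab is None for lab in labels[start:start + n])
            if span_is_free and tokens[start:start + n] == needle:
                labels[start:start + n] = [field] * n
                break  # label only the first exact match per field
    return list(zip(tokens, labels))


# Hypothetical BibTeX-like record and citation string, for illustration only.
record = {"author": "J. Smith", "title": "Learning to Extract", "year": "2007"}
citation = "J. Smith and A. Jones. Learning to Extract. In Proc. of AAAI, 2007."

for token, label in partial_label(citation, record):
    print(f"{token:10s} {label}")
```

In this sketch the second author and the venue stay unlabeled, because the record is incomplete and the citation contains extra information. That is the situation the abstract describes: the database labels only part of the sequence, and the extractor is trained to fill in the rest.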