A simple approach to the design of site-level extractors using domain-centric principles.

CIKM'12: 21st ACM International Conference on Information and Knowledge Management Maui Hawaii USA October, 2012(2012)

引用 2|浏览26
暂无评分
摘要
We consider the problem of extracting, in a domain-centric fashion, a given set of attributes from a large number of semi-structured websites. Previous approaches [7, 5] to solve this problem are based on page level inference. We propose a distinct new approach that directly chooses attribute extractors for a site using a scoring mechanism that is designed at the domain level via simple classification methods using a training set from a small number of sites. To keep the number of candidate extractors in each site manageably small we use two observations that hold in most domains: (a) imprecise annotators can be used to identify a small set of candidate extractors for a few attributes (anchors); and (b) non-anchor attributes lie in close proximity to the anchor attributes. Experiments on three domains (Events, Books and Restaurants) show that our approach is very effective in spite of its simplicity.
更多
查看译文
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要