Constructing a Comprehensive Events Database from the Web
Proceedings of the 28th ACM International Conference on Information and Knowledge Management (2019)
Abstract
In this paper, we consider the problem of constructing a comprehensive database of events taking place around the world. Events include small hyper-local events such as farmers' markets and neighborhood garage sales, as well as larger concerts and festivals. Designing a high-precision and high-recall event extractor for unstructured pages across the whole web is a challenging problem. We cannot rely heavily on domain-specific strategies, since the extractor needs to work on all web pages, including those from new domains; we also need to account for variations in page layout and structure across websites. Further, we need to deal with low-quality pages on the web that have limited structure. We have built an ML-powered extraction system to solve this problem, using schema.org annotations as training data. Our extraction system operates in two phases. In the first phase, we generate raw event information from individual web pages. To do this, an event page classifier predicts whether a web page contains any event information; a single/multiple classifier then decides whether the page contains a single event or multiple events; the first phase concludes by applying event extractors that extract the key fields of a public event (the title, the date/time information, and the location information). In the second phase, we further improve the extraction quality via three novel algorithms (repeated patterns, event consolidation, and wrapper induction), which take the raw event extractions as input and generate events of significantly higher quality. We evaluate our extraction models on two large-scale, publicly available web corpora, Common Crawl and ClueWeb12. Experimental analysis shows that our methodology achieves over 95% extraction precision and recall on both datasets.
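To make the idea of using schema.org annotations as training data concrete, here is a minimal sketch of how labels for the first-phase models might be derived from a page's embedded markup. It is an illustration under stated assumptions, not the authors' pipeline: the abstract does not say how annotations are harvested, and the sketch reads only JSON-LD blocks (schema.org data can also appear as Microdata or RDFa). The names `JsonLdCollector`, `schema_org_events`, and `page_labels` are hypothetical. A page with no annotation gives only a weak negative signal, since many event pages carry no markup at all.

```python
import json
from html.parser import HTMLParser


class JsonLdCollector(HTMLParser):
    """Collects the text of every <script type="application/ld+json"> block."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.blocks.append("".join(self._buffer))
            self._in_jsonld = False


def schema_org_events(html):
    """Return title/date/location for each top-level schema.org Event (hypothetical helper)."""
    collector = JsonLdCollector()
    collector.feed(html)
    events = []
    for block in collector.blocks:
        try:
            data = json.loads(block)
        except json.JSONDecodeError:
            continue  # malformed markup is common on the open web; skip it
        items = data if isinstance(data, list) else [data]
        for item in items:
            if isinstance(item, dict) and item.get("@type") == "Event":
                location = item.get("location")
                events.append({
                    "title": item.get("name"),
                    "date": item.get("startDate"),
                    "location": location.get("name") if isinstance(location, dict) else location,
                })
    return events


def page_labels(html):
    """Derive labels for the three first-phase models (assumed labeling scheme)."""
    events = schema_org_events(html)
    return {
        "is_event_page": bool(events),    # label for the event page classifier
        "is_multiple": len(events) > 1,   # label for the single/multiple classifier
        "events": events,                 # field-level targets for the event extractors
    }
```

Calling `page_labels(html)` on a page annotated with a single Event yields one positive example for the event page classifier, a "single" label for the single/multiple classifier, and the three key fields as extraction targets.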
Keywords
consolidation, event data extraction, structure data, wrapper