Bootstrapping Domain-Specific Content Discovery on the Web
WWW '19: The Web Conference on The World Wide Web Conference WWW 2019(2019)
摘要
The ability to continuously discover domain-specific content from the Web is critical for many applications. While focused crawling strategies have been shown to be effective for discovery, configuring a focused crawler is difficult and time-consuming. Given a domain of interest D, subject-matter experts (SMEs) must search for relevant websites and collect a set of representative Web pages to serve as training examples for creating a classifier that recognizes pages in D, as well as a set of pages to seed the crawl. In this paper, we propose DISCO, an approach designed to bootstrap domain-specific search. Given a small set of websites , DISCO aims to discover a large collection of relevant websites . DISCO uses a ranking-based framework that mimics the way users search for information on the Web: it iteratively discovers new pages, distills, and ranks them. It also applies multiple discovery strategies, including keyword-based and related queries issued to search engines, backward and forward crawling. By systematically combining these strategies, DISCO is able to attain high harvest rates and coverage for a variety of domains. We perform extensive experiments in four social-good domains, using data gathered by SMEs in the respective domains, and show that our approach is effective and outperforms state-of-the-art methods.
更多查看译文
关键词
Domain-specific Website discovery, Focused crawling, Meta search
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络