Web-scale information extraction with Vertex

Data Engineering (2011)

Abstract
Vertex is a wrapper induction system developed at Yahoo! for extracting structured records from template-based Web pages. To operate at Web scale, Vertex employs a host of novel algorithms for (1) grouping similar structured pages in a Web site, (2) picking the appropriate sample pages for wrapper inference, (3) learning XPath-based extraction rules that are robust to variations in site structure, (4) detecting site changes by monitoring sample pages, and (5) optimizing editorial costs by reusing rules. The system is deployed in production and currently extracts more than 250 million records from more than 200 Web sites. To the best of our knowledge, Vertex is the first system to do high-precision information extraction at Web scale.
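To make the notions of an XPath-based extraction rule and of robustness to site structure concrete, here is a minimal sketch. It is not Vertex's learning algorithm, which the abstract does not specify. The two pages, the rule dictionaries, and the helper names `extract_record` and `rule_is_broken` are all hypothetical, and the pages are written as well-formed XHTML so that Python's standard `xml.etree.ElementTree` (which supports only a subset of XPath) can parse them.

```python
import xml.etree.ElementTree as ET

# Two hypothetical pages generated from the same template.
# PAGE_B carries an extra promo block, a typical structural variation.
PAGE_A = """<html><body>
<div class="nav">Home</div>
<div class="product"><h1 class="title">Red Kettle</h1><span class="price">$25</span></div>
</body></html>"""

PAGE_B = """<html><body>
<div class="nav">Home</div>
<div class="promo">Sale!</div>
<div class="product"><h1 class="title">Blue Mug</h1><span class="price">$8</span></div>
</body></html>"""

# Hypothetical rule keyed on attributes: one XPath per record field.
ATTRIBUTE_RULE = {
    "title": ".//div[@class='product']/h1[@class='title']",
    "price": ".//div[@class='product']/span[@class='price']",
}

# Hypothetical rule keyed on position: fragile when blocks are inserted.
POSITIONAL_RULE = {
    "title": "./body/div[2]/h1",
    "price": "./body/div[2]/span",
}


def extract_record(page, rule):
    """Apply each field's XPath to the page; a field that matches nothing is None."""
    root = ET.fromstring(page)
    record = {}
    for field, xpath in rule.items():
        node = root.find(xpath)
        has_text = node is not None and node.text
        record[field] = node.text.strip() if has_text else None
    return record


def rule_is_broken(sample_pages, rule):
    """Crude change check: the rule fails if any sample page yields an empty field."""
    return any(None in extract_record(p, rule).values() for p in sample_pages)


if __name__ == "__main__":
    for page in (PAGE_A, PAGE_B):
        print(extract_record(page, ATTRIBUTE_RULE))
    print(rule_is_broken([PAGE_A, PAGE_B], POSITIONAL_RULE))
```

In this toy setting the attribute-based rule extracts a complete record from both pages, while the positional rule loses its fields on the page with the extra block. Re-running a rule over a fixed set of sample pages and checking for empty fields is one simple way to picture the monitoring in item (4); the production system may well use a different signal.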
Keywords
web-scale information extraction, sample page, site structure, high-precision information extraction, detecting site change, appropriate sample page, learning xpath-based extraction rule, template-based web page, wrapper induction system, web scale, web site, data mining, web pages, clustering algorithms, internet, noise measurement, information retrieval, information extraction