Extracting and managing structured web data

Oren Etzioni, Dan Suciu, Michael John Cafarella

Extracting and managing structured web data (2009)

Abstract

The Web contains a large amount of structured data embedded in natural language text, two-dimensional tables, and other forms. This “Structured Web” of data is vast, messy, and diverse; it also promises new and compelling applications. Unfortunately, existing tools such as search engines and relational databases ignore Structured Web data entirely. This dissertation identifies four design criteria for a successful Structured Web management system. Such systems are: (1) Extraction-Focused—They obtain structured data wherever it can be found. (2) Domain-Independent—They are not tied to one particular topic area. (3) Domain-Scalable—They can effectively manage many domains simultaneously. (4) Computationally-Efficient—They can handle the Web's enormous size. We also describe three working Structured Web management systems that observe these criteria. TEXTRUNNER is an extractor for processing natural language Web text. WEBTABLES extracts and provides applications on top of relations in HTML tables. Finally, OCTOPUS provides integration services over extracted Structured Web data. Together, these three systems demonstrate that managing structured data on the Web is possible today, and also suggest directions for future systems.

Keywords

Structured Web, natural language Web text, Web data, structured web data, Web management system, structured data, compelling application, WEBTABLES extract, HTML table, natural language text, successful Structured Web management
