Navigating Extracted Data with Schema Discovery

Michael J. Cafarella, Dan Suciu, Oren Etzioni

WebDB (2007)


Abstract

Open Information Extraction (OIE) is a recently introduced type of information extraction that extracts small individual pieces of data from input text without any domain-specific guidance such as special training data or extraction rules. For example, an OIE system might discover the triple (Frenzy, year, 1972) from a set of documents about movies. Because OIE is domain-independent, it promises to help users when they have a corpus of structured data, but that structure is unknown, such as when browsing a novel domain or formulating a query. We can describe the structure to the user by displaying a relational schema that fits the extracted data. Unfortunately, the extractions do not carry full schema information: we have extracted values, but not the correct relations, their rows, or their columns. In response, we propose TGen, an algorithm for schema discovery, which automatically derives a high-quality relational schema for the extracted data. Different applications have different schema-design requirements, which can be encoded as input to TGen. We show that our data-mining approach runs in minutes on millions of documents while still resulting in schemas that are useful for exploring unfamiliar data or for composing queries over extracted data.
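
To make the schema-discovery problem concrete, the sketch below shows one naive way to turn extracted triples into candidate tables: entities that carry exactly the same set of attributes are placed in the same table, with one column per attribute. This is only an illustrative heuristic and is not the TGen algorithm, whose details the abstract does not give. Apart from the (Frenzy, year, 1972) triple, the sample triples and the function name `naive_schema` are assumptions made up for this example.

```python
from collections import defaultdict

# Hypothetical OIE output as (entity, attribute, value) triples.
# Only the first triple comes from the abstract; the rest are illustrative.
triples = [
    ("Frenzy", "year", "1972"),
    ("Frenzy", "director", "Alfred Hitchcock"),
    ("Vertigo", "year", "1958"),
    ("Vertigo", "director", "Alfred Hitchcock"),
    ("Alfred Hitchcock", "born", "1899"),
]


def naive_schema(triples):
    """Group entities that share exactly the same attribute set into one table.

    A naive illustrative heuristic, NOT the TGen algorithm.
    Returns a dict mapping a tuple of column names to a list of rows,
    where each row is (entity, value_for_col_1, value_for_col_2, ...).
    """
    # Collect the attribute -> value map for each entity.
    rows = defaultdict(dict)
    for entity, attribute, value in triples:
        rows[entity][attribute] = value

    # Entities with identical (sorted) attribute sets land in the same table.
    tables = defaultdict(list)
    for entity, attrs in rows.items():
        columns = tuple(sorted(attrs))
        tables[columns].append((entity,) + tuple(attrs[c] for c in columns))
    return dict(tables)


schema = naive_schema(triples)
for columns, table_rows in schema.items():
    print(("entity",) + columns, table_rows)
```

Real extractions are noisy and incomplete, so exact attribute-set matching would fragment the data into many tiny tables; handling that trade-off is precisely the gap a schema-discovery algorithm has to fill.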


Keywords

structured data, data mining, information extraction
