Identifying relations for open information extraction

EMNLP, pp. 1535-1545, 2011.

Cited by: 1108

Abstract:

Open Information Extraction (IE) is the task of extracting assertions from massive corpora without requiring a pre-specified vocabulary. This paper shows that the output of state-of-the-art Open IE systems is rife with uninformative and incoherent extractions. To overcome these problems, we introduce two simple syntactic and lexical constraints on binary relations expressed by verbs. We implemented the constraints in the REVERB Open IE system, which more than doubles the area under the precision-recall curve relative to previous extractors such as TEXTRUNNER and WOEpos. The paper concludes with an analysis of REVERB's errors, suggesting directions for future work.

Introduction
  • Introduction and Motivation

    Typically, Information Extraction (IE) systems learn an extractor for each target relation from labeled training examples (Kim and Moldovan, 1993; Riloff, 1996; Soderland, 1999).
  • The authors implemented the constraints in the REVERB Open IE system, which more than doubles the area under the precision-recall curve relative to previous extractors such as TEXTRUNNER and WOEpos.
  • While the syntactic constraint greatly reduces uninformative extractions, it can sometimes match relation phrases that are so specific that they have only a few possible instances, even in a Web-scale corpus.
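The syntactic constraint amounts to matching a single part-of-speech pattern over each sentence (a verb, optionally followed by nouns, adjectives, adverbs, pronouns, or determiners, and ending in a preposition or particle) and keeping the longest match. The Python sketch below is an illustrative simplification, not the authors' implementation; the tag sets, the longest_relation_phrase helper, and the hand-supplied POS tags are assumptions made for the example.

```python
# A minimal sketch (not the authors' code) of the syntactic constraint:
# a relation phrase must match  V | V P | V W* P , where
#   V = a verb,
#   W = a noun, adjective, adverb, pronoun, or determiner,
#   P = a preposition, particle, or the infinitive marker "to",
# and the longest match starting at the verb is kept.  Penn Treebank POS
# tags are assumed; particles and adverbs inside V are omitted for brevity.

VERB = {"VB", "VBD", "VBG", "VBN", "VBP", "VBZ"}
WORD = {"NN", "NNS", "NNP", "JJ", "RB", "PRP", "DT"}
PREP = {"IN", "TO", "RP"}

def longest_relation_phrase(tokens, tags, start):
    """Greedily match V (W* P)? beginning at the verb at index `start`."""
    if tags[start] not in VERB:
        return None
    end = start + 1                      # V alone is a valid relation phrase
    i = start + 1
    while i < len(tags) and tags[i] in WORD:
        i += 1                           # absorb intervening words ...
    if i < len(tags) and tags[i] in PREP:
        end = i + 1                      # ... but only if they end in P
    return tokens[start:end]

# "Faust made a deal with the Devil" -> "made a deal with", not just "made"
tokens = ["Faust", "made", "a", "deal", "with", "the", "Devil"]
tags   = ["NNP", "VBD", "DT", "NN", "IN", "DT", "NNP"]
print(" ".join(longest_relation_phrase(tokens, tags, start=1)))
```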
Highlights
  • Introduction and Motivation

    Typically, Information Extraction (IE) systems learn an extractor for each target relation from labeled training examples (Kim and Moldovan, 1993; Riloff, 1996; Soderland, 1999)
  • This approach to Information Extraction does not scale to corpora where the number of target relations is very large, or where the target relations cannot be specified in advance
  • Open Information Extraction solves this problem by identifying relation phrases—phrases that denote relations in English sentences (Banko et al, 2007)
  • REVERB achieves an area under the precision-recall curve (AUC) that is 30% higher than WOEparse and more than double that of WOEpos or TEXTRUNNER (see the AUC sketch after this list)
  • The lexical constraint provides a significant boost in performance, with REVERB achieving an area under the curve 23% higher than REVERB¬lex
  • We found that 65% of the incorrect extractions returned by REVERB were cases where a relation phrase was correctly identified, but the argument-finding heuristics failed
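The AUC numbers quoted above are areas under precision-recall curves. As a rough illustration of how such an area can be computed from (recall, precision) points, here is a minimal sketch using the trapezoidal rule; the sample points are invented for illustration and are not the paper's measurements.

```python
# Minimal sketch: area under a precision-recall curve via the trapezoidal
# rule.  The (recall, precision) points are invented for illustration; in
# practice they come from sweeping a confidence threshold over extractions.

def pr_auc(points):
    """points: (recall, precision) pairs sorted by increasing recall."""
    area = 0.0
    for (r0, p0), (r1, p1) in zip(points, points[1:]):
        area += (r1 - r0) * (p0 + p1) / 2.0   # trapezoid between the points
    return area

system_a = [(0.0, 1.0), (0.1, 0.9), (0.3, 0.8), (0.5, 0.6)]   # hypothetical
system_b = [(0.0, 1.0), (0.1, 0.6), (0.2, 0.4), (0.3, 0.3)]   # hypothetical
print(pr_auc(system_a), pr_auc(system_b))   # system_a covers more area
```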
Results
  • The syntactic constraint reduces uninformative extractions by capturing relation phrases expressed via light verb constructions (LVCs): expressions composed of a verb and a noun, with the noun carrying the semantic content of the predicate (Grefenstette and Teufel, 1995). For example, the POS pattern matched against the sentence “Faust made a deal with the Devil” yields the relation phrase made a deal with, instead of the uninformative made.
  • REVERB first identifies relation phrases that satisfy the syntactic and lexical constraints, and then finds a pair of NP arguments for each identified relation phrase.
  • The results in Table 3 are similar to Banko and Etzioni’s findings that a set of eight POS patterns cover a large fraction of binary verbal relation phrases.
  • Previous open extractors require labeled training data to learn a model of relations, which is used to extract relation phrases from text.
  • The authors passed these labeled examples to TEXTRUNNER’s training procedure, which learns a linear-chain CRF using closed-class features such as POS tags, capitalization, and punctuation. TEXTRUNNER-R uses the argument-first extraction algorithm described in Section 2.
  • WOEpos is Wu and Weld’s modification to TEXTRUNNER, which uses a model of relations learned from extractions heuristically generated from Wikipedia.
  • The lexical constraint boosts recall over REVERB¬lex, since REVERB is able to find a correct relation phrase where REVERB¬lex finds an overspecified one (a minimal sketch of this filter follows this list).
  • REVERB proves to be a useful source of training data, with TEXTRUNNER-R achieving a substantially higher AUC than the original TEXTRUNNER.
  • To highlight the role of each system’s relational model, the authors also evaluate precision-recall curves for each system on a relation phrase-only evaluation.
  • TEXTRUNNER-R was able to learn a model that predicts contiguous relation phrases, but it still returns incoherent relation phrases.
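The lexical constraint can be read as a corpus-frequency filter: a candidate relation phrase is kept only if it takes at least some minimum number of distinct argument pairs in a large corpus, which screens out overly specific phrases. Below is a minimal sketch of that filter; the threshold value, the helper name, and the toy triples are assumptions made for illustration, not the authors' settings (the paper computes the counts over a large Web corpus).

```python
from collections import defaultdict

# Minimal sketch of the lexical constraint: keep a normalized relation
# phrase only if it occurs with at least `min_pairs` distinct (arg1, arg2)
# pairs in a background set of extractions.  The threshold and the toy
# triples are illustrative only.

def passes_lexical_constraint(relation, extractions, min_pairs=20):
    """extractions: iterable of (arg1, relation_phrase, arg2) triples."""
    pairs = defaultdict(set)
    for arg1, rel, arg2 in extractions:
        pairs[rel].add((arg1, arg2))
    return len(pairs[relation]) >= min_pairs

# A general phrase like "has a population of" appears with many distinct
# argument pairs and passes; an overly specific phrase like
# "is offering only modest greenhouse gas reduction targets at"
# appears with very few and is filtered out.
```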
Conclusion
  • These errors hurt both precision and recall, since each case results in the extractor overlooking a correct relation phrase and choosing another.
  • The authors found that 65% of the incorrect extractions returned by REVERB were cases where a relation phrase was correctly identified, but the argument-finding heuristics failed (a simplified sketch of this heuristic follows this list).
  • One common mistake that REVERB made was extracting a relation phrase that expresses an n-ary relationship.
  • As Downey’s work predicts, precision increased in both systems for extractions found multiple times, compared with extractions found only once.
  • In future work, the authors plan to explore utilizing the constraints to improve the performance of learned CRF models. Roth et al. have shown how to incorporate constraints into CRF learners (Roth and Yih, 2005), so it is natural to consider whether the combination of heuristically labeled training examples, CRF learning, and these constraints results in superior performance. The error analysis in Section 5.2 also suggests natural directions for future work; for instance, since many of REVERB’s errors are due to incorrect arguments, improved methods for argument extraction are in order.
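The argument-finding step referred to above can be sketched as a nearest-noun-phrase heuristic: take the closest NP ending before the relation phrase as the first argument and the closest NP starting after it as the second. This is an assumed simplification for illustration (the find_arguments helper and the hand-supplied chunks are not from the paper); it also suggests why argument finding is a common failure point on more complex sentences.

```python
# Minimal sketch of a nearest-noun-phrase argument heuristic: take the
# closest NP chunk ending before the relation phrase as arg1 and the
# closest NP chunk starting after it as arg2.  NP chunks are hand-supplied
# here; a real system would produce them with a chunker.

def find_arguments(np_chunks, rel_start, rel_end):
    """np_chunks: list of (start, end, text) token spans; rel_*: relation span."""
    left  = [c for c in np_chunks if c[1] <= rel_start]
    right = [c for c in np_chunks if c[0] >= rel_end]
    arg1 = max(left,  key=lambda c: c[1], default=None)   # nearest on the left
    arg2 = min(right, key=lambda c: c[0], default=None)   # nearest on the right
    return arg1, arg2

# "Faust made a deal with the Devil": relation phrase spans tokens 1..4
chunks = [(0, 1, "Faust"), (2, 4, "a deal"), (5, 7, "the Devil")]
print(find_arguments(chunks, rel_start=1, rel_end=5))
# -> ((0, 1, 'Faust'), (5, 7, 'the Devil'))
```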
Tables
  • Table1: Examples of incoherent extractions. Incoherent extractions make up approximately 13% of TEXTRUNNER’s output, 15% of WOEpos’s output, and 30% of WOEparse’s output
  • Table2: Examples of uninformative relations (left) and their completions (right). Uninformative relations occur in approximately 4% of WOEparse’s output, 6% of WOEpos’s output, and 7% of TEXTRUNNER’s output
  • Table3: Approximately 85% of the binary verbal relation phrases in a sample of Web sentences satisfy our constraints
  • Table4: REVERB uses these features to assign a confidence score to an extraction (x, r, y) from a sentence s using a logistic regression classifier (a minimal sketch follows this list)
  • Table5: The majority of the incorrect extractions returned by REVERB are due to errors in argument extraction
  • Table6: The majority of extractions that were missed by REVERB were cases where the correct relation phrase was found, but the arguments were not correctly identified
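Table 4's confidence function can be pictured as a logistic regression over binary indicator features of an extraction (x, r, y) and its sentence s. The sketch below is a hypothetical stand-in: the feature names, weights, and bias are invented for illustration and are not the features or weights reported in the paper.

```python
import math

# Minimal sketch of a logistic-regression confidence score for an
# extraction (x, r, y) from sentence s.  Feature names, weights, and the
# bias are invented placeholders, not the values learned in the paper.

WEIGHTS = {
    "relation_matches_pos_pattern": 1.2,     # hypothetical weight
    "sentence_under_10_words": 0.6,          # hypothetical weight
    "arg2_is_proper_noun": 0.9,              # hypothetical weight
    "relation_ends_with_preposition": -0.3,  # hypothetical weight
}
BIAS = -1.0

def confidence(features):
    """features: dict mapping feature name -> 0/1 indicator value."""
    score = BIAS + sum(WEIGHTS[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-score))   # logistic link

print(confidence({
    "relation_matches_pos_pattern": 1,
    "sentence_under_10_words": 1,
    "arg2_is_proper_noun": 1,
    "relation_ends_with_preposition": 0,
}))   # a confidence value between 0 and 1
```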
Funding
  • Although the syntactic constraint significantly reduces incoherent and uninformative extractions, it allows overly-specific relation phrases such as is offering only modest greenhouse gas reduction targets at. Examples of relation phrases built on common verbs include: took (took place in, took control over, took advantage of), gave (gave birth to, gave a talk at, gave new meaning to), and got (got tickets to, got a deal on, got funding from).
Reference
  • David J. Allerton. 2002. Stretched Verb Constructions in English. Routledge Studies in Germanic Linguistics. Routledge (Taylor and Francis), New York.
  • Collin F. Baker, Charles J. Fillmore, and John B. Lowe. 1998. The Berkeley FrameNet Project. In Proceedings of the 17th International Conference on Computational Linguistics, pages 86–90.
  • Michele Banko and Oren Etzioni. 2008. The tradeoffs between open and traditional relation extraction. In Proceedings of ACL-08: HLT, pages 28–36, Columbus, Ohio. Association for Computational Linguistics.
  • Michele Banko, Michael J. Cafarella, Stephen Soderland, Matt Broadhead, and Oren Etzioni. 2007. Open information extraction from the web. In Proceedings of the 20th International Joint Conference on Artificial Intelligence, pages 2670–2676.
  • Jonathan Berant, Ido Dagan, and Jacob Goldberger. 2011. Global learning of typed entailment rules. In Proceedings of ACL, Portland, OR.
  • Janara Christensen, Mausam, Stephen Soderland, and Oren Etzioni. 2010. Semantic role labeling for open information extraction. In Proceedings of the NAACL HLT 2010 First International Workshop on Formalisms and Methodology for Learning by Reading (FAM-LbR ’10), pages 52–60, Stroudsburg, PA, USA. Association for Computational Linguistics.
  • Doug Downey, Oren Etzioni, and Stephen Soderland. 2005. A probabilistic model of redundancy in information extraction. In IJCAI, pages 1034–1041.
  • Gregory Grefenstette and Simone Teufel. 1995. Corpus-based method for automatic identification of support verbs for nominalizations. In Proceedings of the Seventh Conference of the European Chapter of the Association for Computational Linguistics, pages 98–103, San Francisco, CA, USA. Morgan Kaufmann Publishers Inc.