Interpreting Open Domain Modifiers: Decomposition of Wikipedia Categories into Disambiguated Property Value Pairs

Marius Pasca

EMNLP 2020.

Keywords:
knowledge repository, Wikipedia category, Wikipedia article, open domain, 20th century

Abstract:

This paper proposes an open-domain method for automatically annotating modifier constituents (“20th-century”) within Wikipedia categories (“20th-century male writers”) with properties (“date of birth”). The annotations offer a semantically-anchored understanding of the role of the constituents in defining the underlying meaning of the cat...

Introduction
  • Motivation: As Web search moves towards returning structured answers rather than flat sets of document links in response to users’ queries, the need for high-quality, wide-coverage knowledge to support such answers is growing stronger.
  • Contributions: The main contributions of this paper are as follows:
  • It provides a precise, semantically-anchored understanding of the role of the constituents in defining the underlying meaning of Wikipedia categories.
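To make the notion of an annotation concrete, the following is a minimal sketch of how one annotated modifier constituent might be represented. The class name, field names and the `is_well_formed` helper are assumptions made for illustration; they are not the paper's data format. The example values come from the abstract.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ModifierAnnotation:
    """One annotation of a modifier constituent (hypothetical schema)."""
    category: str                # full Wikipedia category label
    modifier: str                # modifier constituent inside the label
    prop: str                    # property the modifier is annotated with
    value: Optional[str] = None  # disambiguated value, if one is available


def is_well_formed(annotation: ModifierAnnotation) -> bool:
    """Check that the modifier actually occurs in the category label."""
    return annotation.modifier.lower() in annotation.category.lower()


# Example taken from the abstract: the modifier "20th-century" within the
# category "20th-century male writers" is annotated with "date of birth".
example = ModifierAnnotation(
    category="20th-century male writers",
    modifier="20th-century",
    prop="date of birth",
)
```

The `value` field is left empty here because the summary does not state which value the method pairs with this property.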
Highlights
  • Motivation: As Web search moves towards returning structured answers rather than flat sets of document links in response to users’ queries, the need for high-quality, wide-coverage knowledge to support such answers is growing stronger
  • Millions of Wikipedia articles are connected to parent categories, which are in turn connected to their own, iteratively broader categories
  • Current work explores the utility of alternative sources besides Wikidata, in increasing the coverage of the annotations; and the role of the annotations in generating plausible categories for Wikipedia articles
Results
  • Coverage: When coverage is measured as the fraction of Wikipedia categories for which some annotations are extracted, the proposed method outperforms the baseline run Bwcn in Table 3.
  • Figure 2 shows the annotations extracted most frequently by run Bwcn and by the proposed method.
  • There are 1,612 unique annotations extracted by Bwcn, but virtually all of them are type annotations rather than property annotations.
  • The proposed method extracts as many as 519 unique property annotations.
  • Evaluation Metrics: To automatically assess the annotations extracted by an experimental run, over the target categories in the evaluation set, their cor-
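The coverage measure described above (the fraction of categories for which some annotations are extracted) can be sketched as follows. The function name, the dictionary layout and the toy categories are assumptions for illustration only; the actual figures are the ones reported in Table 3.

```python
from typing import Dict, Iterable, List


def coverage(annotations_by_category: Dict[str, List[str]],
             reference_categories: Iterable[str]) -> float:
    """Fraction of reference categories with at least one extracted annotation."""
    reference = set(reference_categories)
    if not reference:
        return 0.0
    covered = sum(1 for c in reference if annotations_by_category.get(c))
    return covered / len(reference)


# Toy data (hypothetical): two of four reference categories have annotations;
# an empty annotation list does not count as covered.
toy_annotations = {
    "20th-century male writers": ["date of birth"],
    "Armin van Buuren albums": ["performer"],
    "German typographers": [],
}
toy_reference = [
    "20th-century male writers",
    "Armin van Buuren albums",
    "German typographers",
    "Swiss passports",
]
toy_coverage = coverage(toy_annotations, toy_reference)
```

Changing the reference set between all Wikipedia categories and the gold evaluation categories corresponds to the two columns described in the caption of Table 3.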
Conclusion
  • This paper takes advantage of data from Wikidata to extract annotations for understanding the role played by various constituents in determining the meaning of Wikipedia categories.
  • The annotations are semantically-anchored properties and values, rather than ambiguous strings.
  • They offer a better trade-off between precision and recall.
  • Current work explores the utility of alternative sources besides Wikidata, in increasing the coverage of the annotations; and the role of the annotations in generating plausible categories for Wikipedia articles
Tables
  • Table 1: Correctness labels assigned to triples of a target Wikipedia category, modifier constituent and annotation in the evaluation set (Label=correctness label; Score=score of correctness label; Ip?=ignored during computation of precision?; Ir?=ignored during computation of recall?). Micro- and macro-averaged precision and recall scores are computed out of individual correctness scores. Micro-averaged scores are computed as an average over all annotations extracted by an experimental run. Macro-averaged scores are first computed separately for each target category, then averaged over all target categories
  • Table 2: Examples of entries from the evaluation set. An entry is a tuple of a target category, a modifier constituent, an extracted (or unspecified) annotation and a correctness label (Label=correctness label)
  • Table 3: Fraction of Wikipedia categories for which various runs extract some annotations for at least one modifier constituent. Computed as a fraction of the reference sets of all Wikipedia categories (All Wiki) and also of all Wikipedia categories from the gold evaluation set (Gold Wiki)
  • Table 4: Precision (P) and recall (R) (F=F1-score)
  • Table 5: Examples of annotations extracted by runs Bwcn vs. Rprp for a sample of target categories (B=run Bwcn; R=run Rprp; prp=Property)
  • Table 6: Examples of modifier constituents (underlined) within categories from the gold evaluation set, for which some annotation(s) are extracted only by Bwcn vs. only by Rprp (Cnt=total count of unique such modifier constituents from the gold evaluation set, for the respective run)
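The difference between the micro- and macro-averaged scores described for Table 1 can be illustrated with a short sketch. It assumes each extracted annotation has already been mapped to a numeric correctness score, and that entries flagged as ignored have been filtered out beforehand. The function names and the toy scores are hypothetical, and the F1 helper is the standard harmonic mean rather than anything specific to the paper.

```python
from collections import defaultdict
from statistics import mean
from typing import List, Tuple

Scored = List[Tuple[str, float]]  # (target category, correctness score)


def micro_average(scored: Scored) -> float:
    """Average over all extracted annotations, regardless of category."""
    scores = [score for _, score in scored]
    return mean(scores) if scores else 0.0


def macro_average(scored: Scored) -> float:
    """Average per target category first, then average over categories."""
    per_category = defaultdict(list)
    for category, score in scored:
        per_category[category].append(score)
    category_means = [mean(scores) for scores in per_category.values()]
    return mean(category_means) if category_means else 0.0


def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    total = precision + recall
    return 2 * precision * recall / total if total > 0 else 0.0


# Toy scores (hypothetical): category A has three annotations, B has one.
toy_scored = [("A", 1.0), ("A", 1.0), ("A", 0.0), ("B", 0.0)]
toy_micro = micro_average(toy_scored)  # 2 correct out of 4 annotations
toy_macro = macro_average(toy_scored)  # mean of 2/3 (A) and 0 (B)
```

In this toy case the micro-average is dominated by the category with more annotations, while the macro-average gives each category equal weight, which is why the two can diverge.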
Related work
  • As it extracts semantic annotations over open-domain concepts (namely, over categories from Wikipedia), the proposed method falls under the area of open-domain information extraction (Ernst et al, 2018; Qu et al, 2018; Sun et al, 2018; Zhu et al, 2019; Zhan and Zhao, 2020; Dash et al, 2020; Cao et al, 2020). Previous work in that area often uses Wikipedia data (Tsurel et al, 2017; Konovalov et al, 2017; Korn et al, 2019; Bornemann et al, 2020).

    In previous work, annotations for modifier constituents within compositional noun phrases may be extracted out of an unbound set of ambiguous strings, with no explicit semantics and possibly redundant (“from”, “born in”, “born at”) (Hendrickx et al, 2013; Nakov and Hearst, 2013). Alternatively, when annotations are selected out of a small, manually-created set of candidate annota-
Funding
  • The proposed method gives uniformly higher F1-scores than the baseline
Study subjects and analysis
articles: 5,101,643
The filtered Wikipedia snapshot connects 5,101,643 articles to 1,124,679 categories. The last is extracted via alignment to values of more descendant Wikipedia articles (e.g., “art:Imagine (Armin van Buuren album)”, “art:10 Years (Armin van Buuren album)”, “art:Intense”) than the other candidates

unique pairs: 1316
Consequently, the evaluation set can be used to compute not just the precision but also the recall of a given experimental run. Overall, the evaluation set contains one or more annotations for each of 1,316 unique pairs of a target category and a modifier constituent. Note that the count is larger than the number of entries in evaluation sets previously introduced for the evaluation of tasks related to compositionality analysis (Hendrickx et al, 2013; Pasca and Buisman, 2015; Pasca, 2017)

References
  • C. Bizer, J. Lehmann, G. Kobilarov, S. Auer, C. Becker, R. Cyganiak, and S. Hellmann. 2009. DBpedia - a crystallization point for the Web of data. Journal of Web Semantics, 7(3):154–165.
  • K. Bollacker, C. Evans, P. Paritosh, T. Sturge, and J. Taylor. 2008. Freebase: A collaboratively created graph database for structuring human knowledge. In Proceedings of the 2008 International Conference on Management of Data (SIGMOD-08), pages 1247–1250, Vancouver, Canada.
  • L. Bornemann, T. Bleifuß, D. Kalashnikov, F. Naumann, and D. Srivastava. 2020. Natural key discovery in Wikipedia tables. In Proceedings of the 2020 Web Conference (WWW-20), pages 2789–2795, Taipei, Taiwan.
  • E. Cao, D. Wang, J. Huang, and W. Hu. 2020. Open knowledge enrichment for long-tail entities. In Proceedings of the 2020 Web Conference (WWW-20), pages 384–394, Taipei, Taiwan.
  • D. Chen, A. Fisch, J. Weston, and A. Bordes. 2017. Reading Wikipedia to answer open-domain questions. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL-17), pages 1870–1879, Vancouver, Canada.
  • A. Chisholm, W. Radford, and B. Hachey. 2017. Learning to generate one-sentence biographies from Wikidata. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics (EACL-17), pages 633–642, Valencia, Spain.
  • S. Dash, F. Chowdhury, A. Gliozzo, N. Mihindukulasooriya, and N. Fauceglia. 2020. Hypernym detection using strict partial order networks. In Proceedings of the 34th National Conference on Artificial Intelligence (AAAI-20), New York, New York.
  • F. Ensan and E. Bagheri. 2017. Document retrieval model through semantic linking. In Proceedings of the 10th ACM Conference on Web Search and Data Mining (WSDM-17), pages 181–190, Cambridge, United Kingdom.
  • P. Ernst, A. Siu, and G. Weikum. 2018. HighLife: Higher-arity fact harvesting. In Proceedings of the 2018 Web Conference (WWW-18), pages 1013–1022, Lyon, France.
  • C. Fellbaum, editor. 1998. WordNet: An Electronic Lexical Database and Some of its Applications. MIT Press.
  • T. Flati, D. Vannella, T. Pasini, and R. Navigli. 2014. Two is bigger (and better) than one: the Wikipedia Bitaxonomy project. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (ACL-14), pages 945–955, Baltimore, Maryland.
  • A. Gupta, R. Lebret, H. Harkous, and K. Aberer. 2018. 280 birds with one stone: Inducing multilingual taxonomies from Wikipedia using character-level classification. In Proceedings of the 32nd National Conference on Artificial Intelligence (AAAI-18), pages 4824–4831, New Orleans, Louisiana.
  • P. Gupta, S. Rajaram, H. Schutze, and T. Runkler. 2019. Neural relation extraction within and across sentence boundaries. In Proceedings of the 33rd National Conference on Artificial Intelligence (AAAI-19), pages 6513–6520, Honolulu, Hawaii.
  • I. Hendrickx, Z. Kozareva, P. Nakov, D. O Seaghdha, S. Szpakowicz, and T. Veale. 2013. SemEval-2013 task 4: Free paraphrases of noun compounds. In Proceedings of the 7th International Workshop on Semantic Evaluation (SemEval-13), pages 138–143, Atlanta, Georgia.
  • J. Hoffart, F. Suchanek, K. Berberich, and G. Weikum. 2013. YAGO2: a spatially and temporally enhanced knowledge base from Wikipedia. Artificial Intelligence Journal. Special Issue on Artificial Intelligence, Wikipedia and Semi-Structured Resources, 194:28–61.
  • A. Konovalov, B. Strauss, A. Ritter, and B. O’Connor. 2017. Learning to extract events from knowledge base revisions. In Proceedings of the 26th World Wide Web Conference (WWW-17), pages 1007–1014, Perth, Australia.
  • F. Korn, X. Wang, Y. Wu, and C. Yu. 2019. Automatically generating interesting facts from Wikipedia tables. In Proceedings of the 2019 International Conference on Management of Data (SIGMOD-19), pages 349–361, Amsterdam, Netherlands.
  • D. Ma, Y. Chen, K. Chang, and X. Du. 2018. Leveraging fine-grained Wikipedia categories for entity search. In Proceedings of the 2018 Web Conference (WWW-18), pages 1623–1632, Lyon, France.
  • A. Moniruzzaman, R. Nayak, M. Tang, and T. Balasubramaniam. 2019. Fine-grained type inference in knowledge graphs via probabilistic and tensor factorization methods. In Proceedings of the 2019 Web Conference (WWW-19), pages 3093–3100, San Francisco, California.
  • S. Murty, P. Verga, L. Vilnis, I. Radovanovic, and A. McCallum. 2018. Hierarchical losses and new resources for fine-grained entity typing and linking. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL-18), pages 97–109, Melbourne, Australia.
  • P. Nakov and M. Hearst. 2013. Semantic interpretation of noun compounds using verbal and other paraphrases. ACM Transactions on Speech and Language Processing, 10(3):1–51.
  • V. Nastase and M. Strube. 2013. Transforming Wikipedia into a large scale multilingual concept network. Artificial Intelligence, 194:62–85.
  • M. Pasca. 2017. German typographers vs. German grammar: Decomposition of Wikipedia category labels into attribute-value pairs. In Proceedings of the 10th ACM Conference on Web Search and Data Mining (WSDM-17), pages 315–324, Cambridge, United Kingdom.
  • M. Pasca. 2018. Finding needles in an encyclopedic haystack: Detecting classes among Wikipedia articles. In Proceedings of the 2018 Web Conference (WWW-18), pages 1267–1276, Lyon, France.
  • M. Pasca and H. Buisman. 2015. Dissecting German grammar and Swiss passports: Open-domain decomposition of compositional entries in large-scale knowledge repositories. In Proceedings of the 24th International Joint Conference on Artificial Intelligence (IJCAI-15), pages 896–902, Buenos Aires, Argentina.
  • T. Piccardi, M. Catasta, L. Zia, and R. West. 2018. Structuring Wikipedia articles with section recommendations. In Proceedings of the 41st International Conference on Research and Development in Information Retrieval (SIGIR-18), pages 665–674, Ann Arbor, Michigan.
  • S. Ponzetto and M. Strube. 2007. Deriving a large scale taxonomy from Wikipedia. In Proceedings of the 22nd National Conference on Artificial Intelligence (AAAI-07), pages 1440–1447, Vancouver, British Columbia.
  • M. Qu, X. Ren, Y. Zhang, and J. Han. 2018. Weakly-supervised relation extraction by pattern-enhanced embedding learning. In Proceedings of the 2018 Web Conference (WWW-18), pages 1257–1266, Lyon, France.
  • L. Ratinov, D. Roth, D. Downey, and M. Anderson. 2011. Local and global algorithms for disambiguation to Wikipedia. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics (ACL-11), pages 1375–1384, Portland, Oregon.
  • V. Shwartz and C. Waterson. 2018. Olive oil is made of olives, baby oil is made for babies: Interpreting noun compounds using paraphrases in a neural model. In Proceedings of the 2018 Conference of the North American Association for Computational Linguistics (NAACL-HLT-18), pages 218–224, New Orleans, Louisiana.
  • M. Sun, X. Li, X. Wang, M. Fan, Y. Feng, and P. Li. 2018. Logician: A unified end-to-end neural approach for open-domain information extraction. In Proceedings of the 11th ACM Conference on Web Search and Data Mining (WSDM-18), pages 556–564, Marina del Rey, California.
  • S. Tratz and E. Hovy. 2010. A taxonomy, dataset, and classifier for automatic noun compound interpretation. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics (ACL-10), pages 678–687, Uppsala, Sweden.
  • D. Tsurel, D. Pelleg, I. Guy, and D. Shahaf. 2017. Fun facts: Automatic trivia fact extraction from Wikipedia. In Proceedings of the 10th ACM Conference on Web Search and Data Mining (WSDM-17), pages 345–354, Cambridge, United Kingdom.
  • D. Vrandecic and M. Krotzsch. 2014. Wikidata: A free collaborative knowledge base. Communications of the ACM, 57:78–85.
  • W. Wu, H. Li, H. Wang, and K. Zhu. 2012. Probase: a probabilistic taxonomy for text understanding. In Proceedings of the 2012 International Conference on Management of Data (SIGMOD-12), pages 481–492, Scottsdale, Arizona.
  • J. Zhan and H. Zhao. 2020. Span model for open information extraction on accurate corpus. In Proceedings of the 34th National Conference on Artificial Intelligence (AAAI-20), New York, New York.
  • S. Zhang and K. Balog. 2018. Ad hoc table retrieval using semantic similarity. In Proceedings of the 2018 Web Conference (WWW-18), pages 1553–1562, Lyon, France.
  • Q. Zhu, X. Ren, J. Shang, Y. Zhang, A. El-Kishky, and J. Han. 2019. Integrating local context and global cohesiveness for open information extraction. In Proceedings of the 12th ACM Conference on Web Search and Data Mining (WSDM-19), pages 42–50, Melbourne, Australia.