Design Challenges for Entity Linking

Trans. Assoc. Comput. Linguistics, pp. 315-328, 2015.

Abstract:

Recent research on entity linking (EL) has introduced a plethora of promising techniques, ranging from deep neural networks to joint inference. But despite numerous papers there is surprisingly little understanding of the state of the art in EL. We attack this confusion by analyzing differences between several versions of the EL problem and presenting a simple yet effective, modular, unsupervised system, called VINCULUM, for entity linking. We conduct an extensive evaluation on nine data sets, comparing VINCULUM with two state-of-the-art systems, and elucidate key aspects of the system that include mention extraction, candidate generation, entity type prediction, entity coreference, and coherence.

Introduction
  • Entity Linking (EL) is a central task in information extraction — given a textual passage, identify entity mentions and link them to the corresponding entry in a given Knowledge Base (KB, e.g. Wikipedia or Freebase).
  • JetBlue begins direct service between Barnstable Airport and JFK International.
  • “JetBlue” should be linked to the entity KB:JetBlue, “Barnstable Airport” to KB:Barnstable Municipal Airport, and “JFK International” to KB:John F. Kennedy International Airport (a toy sketch of such an annotation follows this list).
  • Many other NLP applications can benefit from such links, such as distantly-supervised relation extraction (Craven and Kumlien, 1999; Riedel et al, 2010; Hoffmann et al, 2011; Koch et al, 2014) that uses EL to create training data, and some coreference systems that use EL for disambiguation (Hajishirzi et al, 2013; Zheng et al, 2013; Durrett and Klein, 2014).
  • In spite of numerous papers on the topic and several published data sets, there is surprisingly little understanding about state-of-the-art performance
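
The toy sketch referenced above: my own illustration (in Python, not from the paper) of what a gold EL annotation for the JetBlue sentence could look like, with each mention represented as a character span paired with a KB entry.

```python
# Hypothetical representation of the gold annotation for the example sentence.
passage = "JetBlue begins direct service between Barnstable Airport and JFK International."

# (start, end, surface form) -> KB entity title
gold_links = {
    (0, 7, "JetBlue"): "KB:JetBlue",
    (38, 56, "Barnstable Airport"): "KB:Barnstable Municipal Airport",
    (61, 78, "JFK International"): "KB:John F. Kennedy International Airport",
}

for (start, end, surface), entity in gold_links.items():
    assert passage[start:end] == surface  # spans must match the text exactly
    print(f"{surface!r} -> {entity}")
```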
Highlights
  • Entity Linking (EL) is a central task in information extraction — given a textual passage, identify entity mentions and link them to the corresponding entry in a given Knowledge Base (KB, e.g. Wikipedia or Freebase)
  • We show an extensive set of experimental results conducted on nine data sets as well as a detailed ablation analysis to assess each subcomponent of a linking system
  • We examine 9 Entity Linking data sets and discuss the inconsistencies among them
  • To have a better understanding of an Entity Linking system, we implement a simple yet effective, unsupervised system, VINCULUM, and conduct extensive ablation tests to measure the relative impact of each component
  • We show that a strong candidate generation component (CrossWikis) leads to a surprisingly good result; using fine-grained entity types helps filter out incorrect links; and a simple unsupervised system like VINCULUM can achieve performance comparable to existing machine-learned linking systems, and is suitable as a strong baseline for future research
  • We hope to catalyze agreement on a more precise Entity Linking annotation guideline that resolves the issues discussed in Section 3
Methods
  • The authors start by using Stanford NER for mention extraction and measure its efficacy by the recall of correct mentions shown in Table 3.
  • Some of the missing mentions are noun phrases without capitalization, a well-known limitation of automated extractors.
  • The authors experiment with an NP chunker (NP) and a deterministic noun phrase extractor based on parse trees (DP).
  • Although these extractors are expected to introduce spurious mentions, the purpose is to estimate an upper bound for mention recall (a sketch of this recall measurement follows this list).
  • Note that the recall of mention extraction is an upper bound of the recall of end-to-end predictions
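
The sketch referenced above: a minimal illustration of how mention-extraction recall against gold mention spans can be computed (my own code, assuming exact-span matching; not the authors' evaluation scripts).

```python
def mention_recall(gold_mentions, predicted_mentions):
    """Fraction of gold mentions whose exact span is recovered by the extractor.

    Both arguments are collections of (doc_id, start, end) tuples. A mention
    that is never extracted can never be linked, so this recall upper-bounds
    the recall of end-to-end predictions.
    """
    gold = set(gold_mentions)
    if not gold:
        return 1.0
    return len(gold & set(predicted_mentions)) / len(gold)

# Example: an NER-based extractor misses an uncapitalized noun-phrase mention.
gold_spans = {("doc1", 0, 7), ("doc1", 38, 56), ("doc1", 61, 78)}
ner_spans = {("doc1", 0, 7), ("doc1", 61, 78)}
print(round(mention_recall(gold_spans, ner_spans), 3))  # 0.667
```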
Results
  • Evaluation Metrics

    While a variety of metrics have been used for evaluation, there is little agreement on which one to use.
  • Bag-of-Concept F1 (ACE, MSNBC): For each document, a gold bag of Wikipedia entities is evaluated against a bag of system output entities requiring exact segmentation match
  • This metric may be retained for historical comparability, but it is flawed: an annotation in which every mention is linked to the wrong entity still obtains 100% F1 as long as its bag of entities equals the gold bag (see the sketch after this list).
  • B³+ F1 (TAC-KBP): The overall data set is evaluated using an entity-cluster-based B³+ F1, where mentions linked to the same entity are grouped into clusters
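
The sketch referenced above: a minimal illustrative implementation of Bag-of-Concept F1 (my own code, assuming per-document entity sets), demonstrating the flaw: swapping the links of two mentions leaves the bag, and hence the score, unchanged.

```python
def bag_of_concept_f1(gold_entities, predicted_entities):
    """Per-document F1 over *sets* of linked entities, ignoring which mention
    produced which link (an illustrative sketch, not the original evaluation code)."""
    gold, pred = set(gold_entities), set(predicted_entities)
    if not gold or not pred:
        return 0.0
    tp = len(gold & pred)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(gold)
    return 2 * precision * recall / (precision + recall)

# Two mentions, each linked to the *wrong* entity, yet the bags coincide:
gold_per_mention = ["KB:JetBlue", "KB:Barnstable Municipal Airport"]
pred_per_mention = ["KB:Barnstable Municipal Airport", "KB:JetBlue"]  # swapped
print(bag_of_concept_f1(gold_per_mention, pred_per_mention))  # 1.0, the flaw noted above
```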
Conclusion
  • Conclusion and Future Work

    Despite recent progress in Entity Linking, the community has had little success in reaching an agreement on annotation guidelines or building a standard benchmark for evaluation.
  • The authors show that a strong candidate generation component (CrossWikis) leads to a surprisingly good result; using fine-grained entity types helps filter out incorrect links; and a simple unsupervised system like VINCULUM can achieve performance comparable to existing machine-learned linking systems, and is suitable as a strong baseline for future research.
  • The authors hope to catalyze agreement on a more precise EL annotation guideline that resolves the issues discussed in Section 3.
  • The authors hope to design a joint model that avoids cascading errors from the current pipeline (Wick et al, 2013; Durrett and Klein, 2014)
Tables
  • Table 1: Characteristics of the nine NEL data sets. Entity types: The AIDA data sets include named entities in four NER classes: Person (PER), Organization (ORG), Location (LOC) and Misc. In the TAC KBP data sets, both Person (PER_T) and Organization (ORG_T) entities are defined differently from their NER counterparts, and geo-political entities (GPE), unlike LOC, exclude places like KB:Central California. KB (Sec. 2.2): the knowledge base in use when each data set was developed. Evaluation Metric (Sec. 2.3): Bag-of-Concept F1 is used as the evaluation metric in (Ratinov et al., 2011; Cheng and Roth, 2013). B³+ F1, used in TAC KBP, measures accuracy in terms of entity clusters, grouped by the mentions linked to the same entity
  • Table 2: A sample of papers on entity linking with the data sets used in each paper (ordered chronologically). TAC-KBP proceedings comprise additional papers (McNamee and Dang, 2009; Ji et al., 2010; Mayfield et al., 2012). Our intention is not to exhaust related work but to illustrate how sparse evaluation impedes comparison
  • Table 3: Performance (%; R: Recall, P: Precision) of the correct mentions using different mention extraction strategies. ACE and MSNBC only annotate a subset of all the mentions, and therefore the absolute values of precision are largely underestimated
  • Table 4: Performance (%) after incorporating entity types, comparing two sets of entity types (NER and FIGER). Using a set of fine-grained entity types (FIGER) generally achieves better results
  • Table 5: Performance (%) after re-ranking candidates using coherence scores, comparing two coherence measures (NGD and REL). “no COH”: no coherence-based re-ranking is used. “+BOTH”: an average of the two scores is used for re-ranking. Coherence in general helps: a combination of both measures often achieves the best effect, and NGD has a slight advantage over REL
  • Table 6: End-to-end performance (%): We compare VINCULUM at different stages with two state-of-the-art systems, AIDA and WIKIFIER. The column “Overall” lists the average performance across the nine data sets for each approach. CrossWikis appears to be a strong baseline. VINCULUM is 0.6% shy of WIKIFIER, each winning in four data sets; AIDA tops both VINCULUM and WIKIFIER on AIDA-test
  • Table 7: Comparison of entity linking pipeline architectures. VINCULUM components are described in detail in Section 4 and correspond to Figure 2. Components found to be most useful for VINCULUM are highlighted
  • Table 8: We divide linking errors into six error categories and provide an example for each class
  • Table 9: Error analysis: We analyze a random sample of 250 of VINCULUM’s errors, categorize the errors into six classes, and display the frequencies of each type across the nine data sets
Related work
  • Most related work has been discussed in the earlier sections; see Shen et al. (2014) for an EL survey. Two other papers deserve comparison. Cornolti et al. (2013) present a variety of evaluation measures and experimental results on five systems compared head-to-head. In a similar spirit, Hachey et al. (2014) provide an easy-to-use evaluation toolkit on the AIDA data set. In contrast, our analysis focuses on the problem definition and annotations, revealing the lack of consistent evaluation and a clear annotation guideline. We also show an extensive set of experimental results conducted on nine data sets, as well as a detailed ablation analysis to assess each subcomponent of a linking system.
Funding
  • This work is supported in part by the Air Force Research Laboratory (AFRL) under prime contract no. FA8750-13-2-0019, an ONR grant N00014-12-1-0211, a WRF / TJ Cable Professorship, a gift from Google, an ARO grant W911NF-13-1-0246, and by TerraSwarm, one of six centers of STARnet, a Semiconductor Research Corporation program sponsored by MARCO and DARPA.
Study subjects and analysis
data sets: 9
We attack this confusion by analyzing differences between several versions of the EL problem and presenting a simple yet effective, modular, unsupervised system, called VINCULUM, for entity linking. We conduct an extensive evaluation on nine data sets, comparing VINCULUM with two state-of-the-art systems, and elucidate key aspects of the system that include mention extraction, candidate generation, entity type prediction, entity coreference, and coherence. In this section, we present experiments to address the following questions:

• Is NER sufficient to identify mentions? (Sec. 5.1)

• How much does candidate generation affect final EL performance? (Sec. 5.2)

• How much does entity type prediction help EL? What type set is most appropriate? (Sec. 5.3)

• How much does coherence improve the EL results? (Sec. 5.4)

• How well does VINCULUM perform compared to the state-of-the-art? (Sec. 5.5)

• Finally, which of VINCULUM’s components contribute the most to its performance? (Sec. 5.6)

5.1 Mention Extraction

We start by using Stanford NER for mention extraction and measure its efficacy by the recall of correct mentions shown in Table 3

data sets: 9
We attack this confusion by analyzing differences between several versions of the EL problem and presenting a simple yet effective, modular, unsupervised system, called VINCULUM, for entity linking. We conduct an extensive evaluation on nine data sets, comparing VINCULUM with two state-of-the-art systems, and elucidate key aspects of the system that include mention extraction, candidate generation, entity type prediction, entity coreference, and coherence. Entity Linking (EL) is a central task in information extraction — given a textual passage, identify entity mentions (substrings corresponding to world entities) and link them to the corresponding entry in a given Knowledge Base (KB, e.g. Wikipedia or Freebase)

data sets: 9
We make the implementation open source and publicly available for future research (Section 4). • We compare VINCULUM to 2 state-of-the-art systems on an extensive evaluation of 9 data sets. We also investigate several key aspects of the system, including mention extraction, candidate generation, entity type prediction, entity coreference, and coherence between entities. (Section 5)

data sets: 9
2.1 Data Sets. Nine data sets are in common use for EL evaluation; we partition them into three groups: the UIUC group (ACE and MSNBC data sets) (Ratinov et al., 2011), the AIDA group (with dev and test sets) (Hoffart et al., 2011), and the TAC-KBP group (with data sets ranging from the 2009 through 2012 competitions) (McNamee and Dang, 2009).

data sets: 9
The Freebase API provides scores for the entities using a combination of text similarity and an in-house entity relevance score. We compute candidates for the union of all the non-NIL mentions from all 9 data sets and measure their efficacy by recall@k. From Figure 3, it is clear that CrossWikis outperforms both the intra-Wikipedia dictionary and the Freebase Search API for almost all k.
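
A minimal sketch of the recall@k measurement used to compare candidate generators (illustrative code with a made-up data layout, not the paper's implementation):

```python
def recall_at_k(gold_and_candidates, k):
    """gold_and_candidates: iterable of (gold_entity, ranked_candidates) pairs,
    one per non-NIL mention. Returns the fraction of mentions whose gold entity
    appears among the top-k candidates produced by the generator."""
    pairs = list(gold_and_candidates)
    hits = sum(1 for gold, cands in pairs if gold in cands[:k])
    return hits / len(pairs)

# Toy usage: sweep k (the paper uses 30 as its cut-off value).
mentions = [
    ("KB:JetBlue", ["KB:JetBlue", "KB:JetBlue Park"]),
    ("KB:John F. Kennedy International Airport",
     ["KB:John F. Kennedy", "KB:John F. Kennedy International Airport"]),
]
for k in (1, 2, 30):
    print(k, recall_at_k(mentions, k))
```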

data sets: 9
Two coherence measures suggested in Section 4.5 are tested in isolation to better understand their effects on linking performance (Table 5). In general, the link-based NGD works slightly better than the relational facts in 6 out of 9 data sets (comparing row “+NGD” with row “+REL”). We hypothesize that the inferior results of REL may be due to the incompleteness of Freebase triples, which makes it less robust than NGD.
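
For reference, the link-based NGD score referred to here is, to my understanding, the Wikipedia inlink-overlap relatedness derived from Normalized Google Distance; the sketch below is written under that assumption and is not necessarily VINCULUM's exact formulation.

```python
import math

def ngd_relatedness(inlinks_a: set, inlinks_b: set, total_pages: int) -> float:
    """Inlink-based relatedness between two candidate entities, derived from
    Normalized Google Distance (Milne & Witten style). A sketch of a link-based
    coherence score, not necessarily the formulation used by VINCULUM.
    inlinks_a / inlinks_b: sets of KB pages linking to each entity."""
    common = inlinks_a & inlinks_b
    if not common:
        return 0.0
    sizes = (len(inlinks_a), len(inlinks_b))
    distance = (math.log(max(sizes)) - math.log(len(common))) / \
               (math.log(total_pages) - math.log(min(sizes)))
    return max(0.0, 1.0 - distance)

# Entities whose inlink sets overlap heavily score close to 1.0.
print(ngd_relatedness({"p1", "p2", "p3"}, {"p2", "p3", "p4"}, total_pages=1_000_000))
```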

data sets: 9
Table 6 shows the performance of VINCULUM after each stage of candidate generation (CrossWikis), entity type prediction (+FIGER), coreference (+Coref) and coherence (+Coherence). The column “Overall” displays the average of the performance numbers for the nine data sets for each approach. WIKIFIER achieves the highest overall performance.
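
An illustrative sketch of how such a staged pipeline can be organized (placeholder stage functions of my own; this mirrors the ablation order of candidate generation, type filtering, coreference, and coherence re-ranking, but is not VINCULUM's actual code):

```python
from typing import Callable, List, Optional

Stage = Callable[[List[str], str], List[str]]  # (candidates, mention) -> candidates

def link_mention(mention: str,
                 generate: Callable[[str], List[str]],
                 stages: List[Stage]) -> Optional[str]:
    """Apply candidate generation, then each narrowing/re-ranking stage in order,
    returning the top surviving candidate or None (a NIL prediction)."""
    candidates = generate(mention)
    for stage in stages:
        candidates = stage(candidates, mention)
    return candidates[0] if candidates else None

# Toy usage with stand-in stages (real stages would use CrossWikis, FIGER types,
# coreference, and NGD/REL coherence, as in Table 6).
generate = lambda m: ["KB:JetBlue Park", "KB:JetBlue"]
type_filter: Stage = lambda cands, m: [c for c in cands if "Park" not in c]
print(link_mention("JetBlue", generate, [type_filter]))  # -> KB:JetBlue
```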

data sets: 9
VINCULUM performs quite comparably, only 0.6% shy of WIKIFIER, despite its simplicity and unsupervised nature. Looking at the performance per data set, VINCULUM and WIKIFIER are each superior in 4 out of 9 data sets, while AIDA tops the performance only on AIDA-test. The performance of all the systems on TAC12 is generally lower than on the other data sets, mainly because of a low recall in the candidate generation stage.

data sets: 9
In the example, the location name appearing in the byline of a news article is usually a city name; VINCULUM, without knowledge of this convention, mistakenly links it to a state with the same name. The distribution of errors shown in Table 9 provides valuable insights into VINCULUM’s varying performance across the nine data sets. First, we observe a notably high percentage of metonymy-related errors.

data sets: 9
In contrast, our analysis focuses on the problem definition and annotations, revealing the lack of consistent evaluation and a clear annotation guideline. We also show an extensive set of experimental results conducted on nine data sets as well as a detailed ablation analysis to assess each subcomponent of a linking system.

EL data sets: 9
When complex EL systems are introduced, only limited ablation studies are provided to help readers interpret the results. In this paper, we examine 9 EL data sets and discuss the inconsistencies among them. To have a better understanding of an EL system, we implement a simple yet effective, unsupervised system, VINCULUM, and conduct extensive ablation tests to measure the relative impact of each component.

data sets: 9
Figure captions: The process of finding the best entity for a mention; all possible entities are sifted through as VINCULUM proceeds at each stage, with a widening range of context in consideration. Recall@k on an aggregate of nine data sets, comparing three candidate generation methods. Recall@k using CrossWikis for candidate generation, split by data set; 30 is chosen as the cut-off value in consideration of both efficiency and accuracy.

Reference
  • Jonathan Bragg, Andrey Kolobov, and Daniel S. Weld. 2014. Parallel task routing for crowdsourcing. In Second AAAI Conference on Human Computation and Crowdsourcing.
  • Xiao Cheng and Dan Roth. 2013. Relational inference for wikification. In EMNLP.
  • Andrew Chisholm and Ben Hachey. 2015. Entity disambiguation with web links. Transactions of the Association for Computational Linguistics, 3:145–156.
  • Marco Cornolti, Paolo Ferragina, and Massimiliano Ciaramita. 2013. A framework for benchmarking entity-annotation systems. In Proceedings of the 22nd International Conference on World Wide Web, pages 249–260. International World Wide Web Conferences Steering Committee.
  • Mark Craven and Johan Kumlien. 1999. Constructing biological knowledge bases by extracting information from text sources. In Proceedings of the Seventh International Conference on Intelligent Systems for Molecular Biology (ISMB-1999), pages 77–86.
  • Silviu Cucerzan. 2007. Large-scale named entity disambiguation based on Wikipedia data. In Proceedings of EMNLP-CoNLL, pages 708–716.
  • Silviu Cucerzan. 2012. The MSR system for entity linking at TAC 2012. In Text Analysis Conference 2012.
  • Greg Durrett and Dan Klein. 2014. A joint model for entity analysis: Coreference, typing, and linking. Transactions of the Association for Computational Linguistics, 2:477–490.
  • Paolo Ferragina and Ugo Scaiella. 2012. Fast and accurate annotation of short texts with Wikipedia pages. IEEE Software, 29(1):70–75.
  • J.R. Finkel, T. Grenager, and C. Manning. 2005. Incorporating non-local information into information extraction systems by Gibbs sampling. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics, pages 363–370. Association for Computational Linguistics.
  • Ben Hachey, Joel Nothman, and Will Radford. 2014. Cheap and easy entity evaluation. In ACL.
  • Hannaneh Hajishirzi, Leila Zilles, Daniel S. Weld, and Luke Zettlemoyer. 2013. Joint coreference resolution and named-entity linking with multi-pass sieves. In EMNLP.
  • Xianpei Han and Le Sun. 2012. An entity-topic model for entity linking. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 105–115. Association for Computational Linguistics.
  • Zhengyan He, Shujie Liu, Mu Li, Ming Zhou, Longkai Zhang, and Houfeng Wang. 2013a. Learning entity representation for entity disambiguation. In Proc. ACL 2013.
  • Zhengyan He, Shujie Liu, Yang Song, Mu Li, Ming Zhou, and Houfeng Wang. 2013b. Efficient collective entity linking with stacking. In EMNLP, pages 426–435.
  • Johannes Hoffart, Mohamed A. Yosef, Ilaria Bordino, Hagen Furstenau, Manfred Pinkal, Marc Spaniol, Bilyana Taneva, Stefan Thater, and Gerhard Weikum. 2011. Robust disambiguation of named entities in text. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 782–792. Association for Computational Linguistics.
  • Johannes Hoffart, Yasemin Altun, and Gerhard Weikum. 2014. Discovering emerging entities with ambiguous names. In Proceedings of the 23rd International Conference on World Wide Web, pages 385–396. International World Wide Web Conferences Steering Committee.
  • Raphael Hoffmann, Congle Zhang, Xiao Ling, Luke Zettlemoyer, and Daniel S. Weld. 2011. Knowledge-based weak supervision for information extraction of overlapping relations. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, volume 1, pages 541–550.
  • Heng Ji, Ralph Grishman, Hoa Trang Dang, Kira Griffitt, and Joe Ellis. 2010. Overview of the TAC 2010 knowledge base population track. In Text Analysis Conference (TAC 2010).
  • Mitchell Koch, John Gilmer, Stephen Soderland, and Daniel S. Weld. 2014. Type-aware distantly supervised relation extraction with linked arguments. In EMNLP.
  • Sayali Kulkarni, Amit Singh, Ganesh Ramakrishnan, and Soumen Chakrabarti. 2009. Collective annotation of Wikipedia entities in web text. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
  • Valentin I. Spitkovsky and Angel X. Chang. 2012. A cross-lingual dictionary for English Wikipedia concepts. In LREC, pages 3168–3175.
  • 2012. TAC KBP entity selection. http://www.nist.gov/tac/2012/KBP/task_guidelines/TAC_KBP_Entity_Selection_V1.1.pdf.
  • Michael Wick, Sameer Singh, Harshal Pandya, and Andrew McCallum. 2013. A joint model for discovering and linking entities. In CIKM Workshop on Automated Knowledge Base Construction (AKBC).
  • Jiaping Zheng, Luke Vilnis, Sameer Singh, Jinho D. Choi, and Andrew McCallum. 2013. Dynamic knowledge-base alignment for coreference resolution. In Conference on Computational Natural Language Learning (CoNLL).