Learning Named Entity Tagger using Domain-Specific Dictionary

EMNLP, pp. 2054–2064, 2018.

Keywords:
entity recognition, human effort, break scheme, conditional random field, PubMed Central

Abstract:

Recent advances in deep neural models allow us to build reliable named entity recognition (NER) systems without handcrafting features. However, such methods require large amounts of manually-labeled training data. There have been efforts on replacing human annotations with distant supervision (in conjunction with external dictionaries), but ...

Introduction
  • Extensive efforts have been made on building reliable named entity recognition (NER) models without handcrafting features (Liu et al., 2018; Ma and Hovy, 2016; Lample et al., 2016).
  • Most existing methods require large amounts of manually annotated sentences for training supervised models (Liu et al., 2018; Ma and Hovy, 2016; Lample et al., 2016; Finkel et al., 2005).
  • Open knowledge bases are becoming increasingly popular, such as WikiData and YAGO in the general domain, as well as MeSH and CTD in the biomedical domain.
  • The existence of such dictionaries makes it possible to generate training data for NER at a large scale without additional human effort.
Highlights
  • Recently, extensive efforts have been made on building reliable named entity recognition (NER) models without handcrafting features (Liu et al., 2018; Ma and Hovy, 2016; Lample et al., 2016).
  • We propose AutoNER, a novel neural model with the new Tie or Break tagging scheme for the distantly supervised named entity recognition task (see the sketch after this list).
  • We explore how to learn an effective named entity recognition model by using, and only using, dictionaries.
  • We discuss how to refine the distant supervision for better named entity recognition performance, including incorporating high-quality phrases mined from the corpus as well as tailoring the dictionary to the given corpus, and demonstrate their effectiveness in ablation experiments.
  • The proposed framework can be further extended to other sequence labeling tasks, such as noun phrase chunking.
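As a rough illustration of the Tie or Break scheme named above, the sketch below labels the gap between every pair of adjacent tokens: Tie when both tokens fall inside the same dictionary-matched entity span, Unknown when the gap touches an untyped high-quality phrase (such gaps can simply be skipped during training), and Break otherwise. The helper functions and the span representation are illustrative assumptions, not the authors' implementation.

```python
# A minimal sketch of "Tie or Break" labeling between adjacent tokens.
# Span boundaries are assumed to come from dictionary / phrase matching;
# the helper names below are illustrative, not the authors' actual code.

def tie_or_break_labels(tokens, entity_spans, unknown_spans):
    """Label each gap between adjacent tokens as Tie, Unknown, or Break.

    entity_spans:  list of (start, end) token ranges matched to the typed
                   dictionary (end exclusive).
    unknown_spans: list of (start, end) ranges matched only to untyped
                   high-quality phrases.
    """
    def same_span(spans, i):
        # True if tokens i and i+1 fall inside one span from `spans`.
        return any(s <= i and i + 1 < e for s, e in spans)

    def touches(spans, i):
        # True if token i or i+1 lies inside any span from `spans`.
        return any(s <= j < e for s, e in spans for j in (i, i + 1))

    labels = []
    for i in range(len(tokens) - 1):
        if same_span(entity_spans, i):
            labels.append("Tie")
        elif touches(unknown_spans, i):
            labels.append("Unknown")   # skipped when computing the loss
        else:
            labels.append("Break")
    return labels


tokens = ["ceftriaxone", "-", "associated", "biliary", "pseudolithiasis"]
print(tie_or_break_labels(tokens, entity_spans=[(3, 5)], unknown_spans=[(0, 1)]))
# ['Unknown', 'Break', 'Break', 'Tie']
```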
Methods
  • Dictionary Match is the proposed distant supervision generation method; the authors apply it to the test set directly to obtain entity mentions whose surface names exactly match entries in the dictionary (a hedged sketch follows this list).
  • SwellShark, in the biomedical domain, is arguably the best distantly supervised model, especially on the BC5CDR and NCBI-Disease datasets (Fries et al., 2017).
  • Although it needs no human-annotated data, it requires extra expert effort for entity span detection, such as building a POS tagger, designing effective regular expressions, and hand-tuning special cases.
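The Dictionary Match baseline described in the first item can be pictured as an exact, greedy longest-match-first pass over each test sentence, as in the hedged sketch below; the greedy strategy, helper names, and example lexicon are assumptions for illustration rather than the authors' exact procedure.

```python
# A hedged sketch of an exact dictionary-matching tagger: a mention is
# tagged only when its surface form appears verbatim in the dictionary.

def dictionary_match(tokens, dictionary, max_len=6):
    """Return (start, end, type) spans whose surface form is in `dictionary`.

    dictionary: dict mapping a lower-cased surface string to an entity type.
    """
    spans, i = [], 0
    while i < len(tokens):
        match = None
        for length in range(min(max_len, len(tokens) - i), 0, -1):
            surface = " ".join(tokens[i:i + length]).lower()
            if surface in dictionary:
                match = (i, i + length, dictionary[surface])
                break
        if match:
            spans.append(match)
            i = match[1]          # skip past the matched mention
        else:
            i += 1
    return spans


sentence = "Ceftriaxone induced biliary pseudolithiasis in children".split()
lexicon = {"ceftriaxone": "Chemical", "biliary pseudolithiasis": "Disease"}
print(dictionary_match(sentence, lexicon))
# [(0, 1, 'Chemical'), (2, 4, 'Disease')]
```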
Results
  • The authors examine how the test F1 score changes as the amount of distantly supervised text grows (a minimal entity-level F1 sketch follows).
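For reference, test F1 here is the standard entity-level score: a predicted mention counts as correct only if both its boundaries and its type match a gold annotation. A minimal computation, with made-up spans purely for illustration:

```python
# Entity-level precision / recall / F1, as commonly used for NER evaluation.
# The example spans below are made up purely for illustration.

def entity_f1(predicted, gold):
    """predicted, gold: sets of (sentence_id, start, end, type) tuples."""
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


gold = {(0, 2, 4, "Disease"), (0, 0, 1, "Chemical"), (1, 5, 7, "Disease")}
pred = {(0, 2, 4, "Disease"), (1, 5, 6, "Disease")}   # one boundary error
print(round(entity_f1(pred, gold), 3))                # 0.4
```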
Conclusion
  • The authors explore how to learn an effective NER model by using, and only using, dictionaries.
  • The authors discuss how to refine the distant supervision for better NER performance, including incorporating high-quality phrases mined from the corpus as well as tailoring the dictionary to the given corpus, and demonstrate their effectiveness in ablation experiments (see the tailoring sketch after this list).
  • Going beyond the classical NER setting in this paper, it would be interesting to further explore distantly supervised methods for nested and multi-typed entity recognition in the future.
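The corpus-aware dictionary tailoring mentioned above can be read as keeping only those dictionary entries whose canonical name actually occurs in the raw target corpus, roughly as sketched below; the data layout and the exact filtering criterion are illustrative assumptions, not a faithful reproduction of the paper's procedure.

```python
# A hedged sketch of corpus-aware dictionary tailoring: keep an entry
# (canonical name plus aliases) only if its canonical name is observed
# in the raw corpus, to reduce false-positive matches from rare aliases.

def tailor_dictionary(entries, corpus_text):
    """entries: list of dicts like {"canonical": str, "aliases": [str], "type": str}."""
    corpus = corpus_text.lower()
    kept = []
    for entry in entries:
        if entry["canonical"].lower() in corpus:
            kept.append(entry)
    return kept


entries = [
    {"canonical": "biliary pseudolithiasis", "aliases": ["pseudolithiasis"], "type": "Disease"},
    {"canonical": "myocardial infarction", "aliases": ["heart attack"], "type": "Disease"},
]
corpus = "Ceftriaxone induced biliary pseudolithiasis in children ..."
print([e["canonical"] for e in tailor_dictionary(entries, corpus)])
# ['biliary pseudolithiasis']
```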
Tables
  • Table 1: Dataset Overview
  • Table 2
  • Table 3
  • Table 4: Ablation Experiments for Dictionary Refinement. The dictionary for the LaptopReview dataset contains no aliases, so corpus-aware dictionary tailoring is not applicable
Funding
  • We would like to thank Yu Zhang from the University of Illinois at Urbana-Champaign for providing results of supervised benchmark methods on the BC5CDR and NCBI datasets. Research was sponsored in part by the U.S. Army Research Lab under Cooperative Agreement No. W911NF-09-2-0053 (NSCTA), DARPA under Agreement No. W911NF-17-C-0099, National Science Foundation grants IIS 16-18481, IIS 17-04532, and IIS-17-41317, DTRA HDTRA11810026, a Google Ph.D. Fellowship, and grant 1U54GM114838 awarded by NIGMS through funds provided by the trans-NIH Big Data to Knowledge (BD2K) initiative (www.bd2k.nih.gov).
Reference
  • Mark Craven, Johan Kumlien, et al. 1999. Constructing biological knowledge bases by extracting information from text sources. In ISMB, volume 1999, pages 77–86.
  • Franck Dernoncourt, Ji Young Lee, Ozlem Uzuner, and Peter Szolovits. 2017. De-identification of patient notes with recurrent neural networks. Journal of the American Medical Informatics Association, 24(3):596–606.
  • Oren Etzioni, Michael Cafarella, Doug Downey, Ana-Maria Popescu, Tal Shaked, Stephen Soderland, Daniel S. Weld, and Alexander Yates. 2005. Unsupervised named-entity extraction from the web: An experimental study. Artificial Intelligence, 165(1):91–134.
  • Jenny Rose Finkel, Trond Grenager, and Christopher Manning. 2005. Incorporating non-local information into information extraction systems by Gibbs sampling. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics, pages 363–370. Association for Computational Linguistics.
  • Jason Fries, Sen Wu, Alex Ratner, and Christopher Ré. 2017. SwellShark: A generative model for biomedical named entity recognition without labeled data. arXiv preprint arXiv:1704.06360.
  • Athanasios Giannakopoulos, Claudiu Musat, Andreea Hossmann, and Michael Baeriswyl. 2017. Unsupervised aspect term extraction with B-LSTM & CRF using automatically labelled datasets. In Proceedings of the 8th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, pages 180–188.
  • Daniel Hanisch, Katrin Fundel, Heinz-Theodor Mevissen, Ralf Zimmer, and Juliane Fluck. 2005. ProMiner: rule-based protein and gene entity recognition. BMC Bioinformatics, 6(1):S14.
  • Wenqi He. 2017. AutoEntity: automated entity detection from massive text corpora. M.S. thesis, Computer Science, University of Illinois at Urbana-Champaign.
  • Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.
  • Pavel P. Kuksa and Yanjun Qi. 2010. Semi-supervised bio-named entity recognition with word-codebook learning. In Proceedings of the 2010 SIAM International Conference on Data Mining, pages 25–36. SIAM.
  • Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. 2016. Neural architectures for named entity recognition. In Proceedings of NAACL-HLT, pages 260–270.
  • Robert Leaman and Graciela Gonzalez. 2008. BANNER: an executable survey of advances in biomedical named entity recognition. In Biocomputing 2008, pages 652–663. World Scientific.
  • Thomas Lin, Oren Etzioni, et al. 2012. Entity linking at web scale. In Proceedings of the Joint Workshop on Automatic Knowledge Base Construction and Web-scale Knowledge Extraction, pages 84–88. Association for Computational Linguistics.
  • Liyuan Liu, Jingbo Shang, Frank Xu, Xiang Ren, Huan Gui, Jian Peng, and Jiawei Han. 2018. Empower sequence labeling with task-aware neural language model. In AAAI.
  • Xuezhe Ma and Eduard Hovy. 2016. End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1064–1074.
  • Mike Mintz, Steven Bills, Rion Snow, and Dan Jurafsky. 2009. Distant supervision for relation extraction without labeled data. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 2, pages 1003–1011. Association for Computational Linguistics.
  • Maria Pontiki, Dimitrios Galanis, John Pavlopoulos, Haris Papageorgiou, Ion Androutsopoulos, and Suresh Manandhar. 2014. SemEval-2014 Task 4: Aspect based sentiment analysis. In Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014), pages 27–35.
  • Sampo Pyysalo, Filip Ginter, Hans Moen, Tapio Salakoski, and Sophia Ananiadou. 2013. Distributional semantics resources for biomedical text processing. In Proceedings of the 5th International Symposium on Languages in Biology and Medicine, Tokyo, Japan, pages 39–43.
  • Lev Ratinov and Dan Roth. 2009. Design challenges and misconceptions in named entity recognition. In CoNLL.
  • Xiang Ren, Ahmed El-Kishky, Chi Wang, Fangbo Tao, Clare R. Voss, and Jiawei Han. 2015. ClusType: Effective entity recognition and typing by relation phrase-based clustering. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 995–1004. ACM.
  • Sunil Sahu and Ashish Anand. 2016. Recurrent neural network models for disease name recognition using domain invariant features. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2216–2225.
  • Burr Settles. 2004. Biomedical named entity recognition using conditional random fields and rich feature sets. In Proceedings of the International Joint Workshop on Natural Language Processing in Biomedicine and its Applications, pages 104–107. Association for Computational Linguistics.
  • Jingbo Shang, Jialu Liu, Meng Jiang, Xiang Ren, Clare R. Voss, and Jiawei Han. 2018. Automated phrase mining from massive text corpora. IEEE Transactions on Knowledge and Data Engineering.
  • Buzhou Tang, Hongxin Cao, Xiaolong Wang, Qingcai Chen, and Hua Xu. 2014. Evaluating word representation features in biomedical named entity recognition tasks. BioMed Research International, 2014.
  • Andreas Vlachos and Caroline Gasperin. 2006. Bootstrapping and evaluating named entity recognition in the biomedical domain. In Proceedings of the HLT-NAACL BioNLP Workshop on Linking Natural Language and Biology, pages 138–145. Association for Computational Linguistics.
  • Hongning Wang, Yue Lu, and ChengXiang Zhai. 2011. Latent aspect rating analysis without aspect keyword supervision. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 618–626. ACM.
  • Xuan Wang, Yu Zhang, Xiang Ren, Yuhao Zhang, Marinka Zitnik, Jingbo Shang, Curtis Langlotz, and Jiawei Han. 2018. Cross-type biomedical named entity recognition with deep multi-task learning. arXiv preprint arXiv:1801.09851.