Interpretable Multi-dataset Evaluation for Named Entity Recognition

Jinlan Fu
Pengfei Liu
Graham Neubig

EMNLP 2020.

Other Links: arxiv.org
Keywords:
broadcast news, entity recognition, broadcast conversation, magazine genre
Weibo:
This paper has provided a framework where we can convert our understanding of the Named Entity Recognition task into interpretable evaluation aspects, and define axes through which we can apply them to acquire insights and make model improvements.

Abstract:

With the proliferation of models for natural language processing tasks, it is even harder to understand the differences between models and their relative merits. Simply looking at differences between holistic metrics such as accuracy, BLEU, or F1 does not tell us why or how particular methods perform differently and how diverse datasets...

Introduction
Highlights
Results
  • As shown in Fig. 6, the value of each cell in the heat maps denotes the relative improvement achieved by the larger-context method.
  • Different evaluation attributes allow the authors to understand the source of improvement from diverse perspectives: 1) in terms of label consistency, test entities with lower label consistency achieve larger improvements when more contextual sentences are provided.
  • 2) In three of the five datasets, more contextual sentences lead to worse performance on longer test entities.
Conclusion
  • This paper has provided a framework where the authors can convert their understanding of the NER task into interpretable evaluation aspects, and define axes through which they can apply them to acquire insights and make model improvements.
  • This is just a first step towards the goal of fully-automated interpretable evaluation, and applications to new attributes and tasks beyond NER are promising future directions.
Summary
  • Introduction:

    With improvements in model architectures (Hochreiter and Schmidhuber, 1997; Kalchbrenner et al, 2014; Lample et al, 2016; Collobert et al, 2011) and learning of pre-trained embeddings (Peters et al, 2018; Akbik et al, 2018, 2019; Devlin et al, 2018; Pennington et al, 2014), Named Entity Recognition (NER) systems are evolving rapidly and quickly reaching a performance plateau (Akbik et al, 2018, 2019).
  • This proliferation of methods poses a great challenge for the current evaluation methodology, which is usually based on comparing systems using a single holistic accuracy score.
  • The authors calculate the performance for each bucket of test entities (a minimal sketch of this bucketing follows this list).
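To make the bucketing step concrete, here is a minimal Python sketch. It is not the authors' released tool: the attribute function (entity token length), the equal-width four-way XS/S/L/XL split, and the per-bucket recall scorer are illustrative assumptions (the paper reports entity-level F1 per bucket).

```python
from collections import defaultdict

def bucketize(gold_entities, attribute_fn, names=("XS", "S", "L", "XL")):
    """Split gold test entities into equal-width buckets over an attribute value."""
    values = [attribute_fn(e) for e in gold_entities]
    lo, hi = min(values), max(values)
    width = (hi - lo) / len(names) or 1.0   # guard against all-equal attribute values
    buckets = defaultdict(list)
    for ent, v in zip(gold_entities, values):
        idx = min(int((v - lo) / width), len(names) - 1)
        buckets[names[idx]].append(ent)
    return buckets

def per_bucket_recall(buckets, predicted_entities):
    """Score each bucket separately; recall is used only to keep the sketch short."""
    predicted = set(predicted_entities)
    return {name: sum(e in predicted for e in ents) / len(ents)
            for name, ents in buckets.items() if ents}

# Toy usage: bucket gold entities by token length (eLen) and score each bucket.
gold = [("New", "York"), ("Obama",), ("United", "States", "of", "America"), ("Paris",)]
pred = [("New", "York"), ("Paris",)]
print(per_bucket_recall(bucketize(gold, attribute_fn=len), pred))
```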
Tables
  • Table1: Neural NER systems with different architectures. CRF++ is a Conditional Random Fields (Lafferty et al, 2001) method based on feature engineering. Bold marks the best performance on a given dataset according to F1. In the model names, “C” refers to “Char/Subword” and “W” refers to “Word”. For example, “CnonWrandLstmCrf” is a model with no character features, randomly initialized word embeddings, an LSTM sentence encoder, and a CRF decoder
  • Table2: Model-wise measures (percentage) Sρ_{i,j} and Sσ_{i,j}, averaged over all datasets. The F1 score for a model is also averaged over all datasets. Values in grey denote attributes that do not pass a significance test (p ≥ 0.05). Values in green and in pink support observation 1 and observation 2, respectively. Bold marks the maximum value in each attribute column
  • Table3: Self-diagnosis and comparative diagnosis (Sec. 5.3.1) of different NER systems. M1 and M2 denote two models. Attribute values are classified into four categories: extra-small (XS), small (S), large (L), and extra-large (XL). In the self-diagnosis histograms, green (red) x-tick labels mark the bucket of a given attribute on which the system achieved its best (worst) performance; gray bins show the worst performance and blue bins the gap between best and worst performance. In the comparative diagnosis histograms, green (red) x-tick labels mark the bucket of a given attribute on which system M1 surpasses (under-performs) M2 by the largest margin, illustrated by a green (red) bin
  • Table4: Comparative diagnosis of different NER systems. M1 and M2 denote two models. Attribute values are classified into four categories: extra-small (XS), small (S), large (L), and extra-large (XL). Green (red) x-tick labels mark the bucket of a given attribute on which system M1 surpasses (under-performs) M2 by the largest margin, illustrated by a green (red) bin
  • Table5: p-values from the Friedman test. The null hypothesis is that, for a given dataset, the buckets of an attribute have the same mean performance
  • Table6: p-values from the Friedman test. The null hypothesis is that, for a given model, the buckets of an attribute have the same mean performance. Pink regions denote attributes that do not pass a significance test (p ≥ 0.05) for the given model
Funding
  • This material is based on research sponsored by the Air Force Research Laboratory under agreement number FA8750-19-2-0200
Study subjects and analysis
datasets: 6
[Figure: dataset-level measures of the attributes tCon, eDen, eFre, eLen, and tFre on the six datasets; panel (a) shows ζ and panel (b) shows ρ.] We normalize ζ for each attribute by dividing by the maximum ζ over the six datasets, so that ρ ∈ [0, 1]. The dataset-level measure is ζj(E, φ(·)) = (1/N) Σ_{x∈E} φj(x) (Eq. 10), i.e., the average value of attribute φj over the N test entities; a sketch of this computation follows.
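A minimal sketch of how these dataset-level statistics could be computed, assuming ζ is the average attribute value over a dataset's test entities and ρ is ζ normalized by the maximum over datasets; the function names and toy data are assumptions, not the authors' code.

```python
def dataset_measure(entities, attribute_fn):
    """zeta_j: average value of attribute phi_j over the N test entities of one dataset."""
    return sum(attribute_fn(x) for x in entities) / len(entities)

def normalized_measures(zeta_per_dataset):
    """rho: each dataset's zeta divided by the maximum zeta over datasets, so rho is in [0, 1]."""
    zmax = max(zeta_per_dataset.values())
    return {name: z / zmax for name, z in zeta_per_dataset.items()}

# Toy usage with entity token length (eLen) as the attribute on three hypothetical datasets.
datasets = {
    "conll": [("New", "York"), ("Obama",)],
    "wnut": [("San", "Francisco", "Bay")],
    "weibo": [("Beijing",), ("Shanghai",)],
}
zeta = {name: dataset_measure(ents, attribute_fn=len) for name, ents in datasets.items()}
rho = normalized_measures(zeta)  # conll -> 0.5, wnut -> 1.0, weibo -> 0.33...
print(rho)
```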

datasets: 6
The benefits of using CRF on sentences with high entity density (eDen:XL) are remarkably stable, and improvement can be seen in all datasets (p = 1.8 × 10−5 < 0.05). (From the footnotes: the significance test requires more than two groups (Zimmerman and Zumbo, 1993); the BERT-based system was restarted twice on the six datasets, yielding 12 best and 12 worst F1 scores for a given attribute.) A sketch of this kind of significance check follows.
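A hedged sketch of such a significance check, using scipy.stats.friedmanchisquare on made-up per-bucket F1 scores; the numbers are illustrative only and are not results from the paper.

```python
from scipy.stats import friedmanchisquare

# Hypothetical F1 scores of one model on four attribute buckets (XS, S, L, XL),
# measured on six datasets; each list holds one bucket's scores across the datasets.
f1_xs = [0.91, 0.88, 0.72, 0.80, 0.69, 0.75]
f1_s  = [0.89, 0.86, 0.70, 0.78, 0.66, 0.73]
f1_l  = [0.85, 0.83, 0.65, 0.74, 0.61, 0.70]
f1_xl = [0.80, 0.79, 0.60, 0.70, 0.55, 0.66]

# Null hypothesis: the buckets have the same mean performance.
stat, p_value = friedmanchisquare(f1_xs, f1_s, f1_l, f1_xl)
print(f"Friedman chi-square = {stat:.3f}, p = {p_value:.4f}")
if p_value < 0.05:
    print("Reject the null: performance differs significantly across buckets.")
```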

datasets: 6
6.1 Experimental Setting. We choose CbertWnoneLstmMlp as the base model and train it with different numbers (K = 1, 2, 3, 4, 5, 6, 10) of contextual sentences on all six datasets. For example, K = 2 means that each training sample is constructed by concatenating two consecutive sentences from the original dataset (K = 1), as in the sketch below.
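A minimal sketch of the K-sentence construction described above; the helper name and the token/label representation are assumptions for illustration, not the authors' preprocessing code.

```python
def concat_contexts(sentences, labels, k):
    """Build training samples by concatenating k consecutive sentences (k=1 keeps the
    original dataset); the label sequences are concatenated in the same way."""
    samples = []
    for i in range(0, len(sentences), k):
        chunk_tokens, chunk_labels = [], []
        for sent, labs in zip(sentences[i:i + k], labels[i:i + k]):
            chunk_tokens.extend(sent)
            chunk_labels.extend(labs)
        samples.append((chunk_tokens, chunk_labels))
    return samples

# Toy usage: k=2 concatenates every two consecutive sentences into one sample.
sents = [["John", "lives", "in", "Paris"], ["He", "works", "at", "Google"], ["Nice", "."]]
labs  = [["B-PER", "O", "O", "B-LOC"], ["O", "O", "O", "B-ORG"], ["O", "O"]]
for tokens, tags in concat_contexts(sents, labs, k=2):
    print(tokens, tags)
```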

Reference
  • Alan Akbik, Tanja Bergmann, and Roland Vollgraf. 2019. Pooled contextualized embeddings for named entity recognition. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 724–728.
  • Alan Akbik, Duncan Blythe, and Roland Vollgraf. 2018. Contextual string embeddings for sequence labeling. In Proceedings of the 27th International Conference on Computational Linguistics, pages 1638–1649.
  • Hui Chen, Zijia Lin, Guiguang Ding, Jianguang Lou, Yusen Zhang, and Borje Karlsson. 2019. GRN: Gated relation network to enhance convolutional neural network for named entity recognition. In Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence, 33(01):6236–6243.
  • Jason P. C. Chiu and Eric Nichols. 2015. Named entity recognition with bidirectional LSTM-CNNs. arXiv preprint arXiv:1511.08308.
  • Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12(Aug):2493–2537.
  • Leon Derczynski, Diana Maynard, Giuseppe Rizzo, Marieke van Erp, Genevieve Gorrell, Raphaël Troncy, Johann Petrak, and Kalina Bontcheva. 2015. Analysis of named entity recognition and linking for tweets. Information Processing & Management, 51(2):32–49.
  • Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
  • Jinlan Fu, Pengfei Liu, Qi Zhang, and Xuanjing Huang. 2020. Rethinking generalization of neural models: A named entity recognition case study. In AAAI, pages 7732–7739.
  • Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.
  • Anwen Hu, Zhicheng Dou, Jirong Wen, and Jianyun Nie. 2020. Leveraging multi-token entities in document-level named entity recognition. In Proceedings of the Thirty-Fourth AAAI Conference on Artificial Intelligence.
  • Zhiheng Huang, Wei Xu, and Kai Yu. 2015. Bidirectional LSTM-CRF models for sequence tagging. arXiv preprint arXiv:1508.01991.
  • Masaaki Ichihara, Kanako Komiya, Tomoya Iwakura, and Maiko Yamazaki. 2015. Error analysis of named entity recognition in BCCWJ. Recall, 61:2641.
  • Nal Kalchbrenner, Edward Grefenstette, and Phil Blunsom. 2014. A convolutional neural network for modelling sentences. In Proceedings of ACL.
  • John Lafferty, Andrew McCallum, and Fernando Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of ICML, pages 282–289.
  • Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. 2016. Neural architectures for named entity recognition. In Proceedings of NAACL-HLT, pages 260–270.
  • Xiaoya Li, Jingrong Feng, Yuxian Meng, Qinghong Han, Fei Wu, and Jiwei Li. 2020. A unified MRC framework for named entity recognition. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics.
  • Bill Yuchen Lin, Dong-Ho Lee, Ming Shen, Ryan Moreno, Xiao Huang, Prashant Shiralkar, and Xiang Ren. 2020. TriggerNER: Learning with entity triggers as explanations for named entity recognition. In Proceedings of ACL.
  • Ying Luo, Fengshun Xiao, and Hai Zhao. 2020. Hierarchical contextualized representation for named entity recognition.
  • Xuezhe Ma and Eduard Hovy. 2016. End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF. In Proceedings of the 54th Annual Meeting of the ACL, Volume 1, pages 1064–1074.
  • Mavuto M. Mukaka. 2012. A guide to appropriate use of correlation coefficient in medical research. Malawi Medical Journal, 24(3):69–71.
  • Graham Neubig, Zi-Yi Dou, Junjie Hu, Paul Michel, Danish Pruthi, and Xinyi Wang. 2019. compare-mt: A tool for holistic comparison of language generation systems. arXiv preprint arXiv:1903.07926.
  • Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.
  • Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of NAACL, Volume 1, pages 2227–2237.
  • Yujie Qian, Enrico Santus, Zhijing Jin, Jiang Guo, and Regina Barzilay. 2018. GraphIE: A graph-based framework for information extraction. arXiv preprint.
  • Erik F. Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. arXiv preprint cs/0306050.
  • Benjamin Strauss, Bethany Toma, Alan Ritter, Marie-Catherine de Marneffe, and Wei Xu. 2016. Results of the WNUT16 named entity recognition shared task. In Proceedings of the 2nd Workshop on Noisy User-generated Text (WNUT), pages 138–144.
  • Ralph Weischedel, Martha Palmer, Mitchell Marcus, Eduard Hovy, Sameer Pradhan, Lance Ramshaw, Nianwen Xue, Ann Taylor, Jeff Kaufman, Michelle Franchini, et al. 2013. OntoNotes Release 5.0 LDC2013T19. Linguistic Data Consortium, Philadelphia, PA.
  • Frank Wilcoxon, S. K. Katti, and Roberta A. Wilcox. 1970. Critical values and probability levels for the Wilcoxon rank sum test and the Wilcoxon signed rank test. Selected Tables in Mathematical Statistics, 1:171–259.
  • Donald W. Zimmerman and Bruno D. Zumbo. 1993. Relative power of the Wilcoxon test, the Friedman test, and repeated-measures ANOVA on ranks. The Journal of Experimental Education, 62(1):75–86.