Uncertainty-Aware Label Refinement for Sequence Labeling

Jiacheng Ye
Zhengyan Li
Zichu Fei

EMNLP 2020, pp. 2316-2326, 2020.

Other Links: arxiv.org|academic.microsoft.com

Abstract:

Conditional random fields (CRFs) for label decoding have become ubiquitous in sequence labeling tasks. However, the locality of label dependencies and the inefficiency of Viterbi decoding remain problems to be solved. In this work, we introduce a novel two-stage label decoding framework that models long-term label dependencies while being much more computationally efficient.

Introduction
  • Linguistic sequence labeling is one of the fundamental tasks in natural language processing.
  • It has the goal of predicting a linguistic label for each word, including part-of-speech (POS) tagging, text chunking, and named entity recognition (NER).
  • The use of representation learning to obtain better text representation is very successful.
  • (Figure: a label refinement example with rows for Input, Draft Label, and True Label, showing the word "United" and the tags B-LOC and I-LOC.)
Highlights
  • Linguistic sequence labeling is one of the fundamental tasks in natural language processing
  • The above label refinement operations can be processed in parallel, which avoids the Viterbi decoding of the conditional random field (CRF) and allows faster prediction
  • We propose a novel sequence labeling framework that incorporates Bayesian neural networks to estimate the epistemic uncertainty of the draft labels
  • We find that the model uncertainty can effectively indicate the labels with a high probability of being wrong
  • Experimental results across three sequence labeling datasets demonstrate that the proposed method significantly outperforms previous methods; a rough sketch of the uncertainty-gated refinement rule follows this list
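
The gating rule behind these highlights can be written down compactly. The sketch below is illustrative only and is not the authors' released code: it assumes that per-token draft labels, refined labels, and uncertainty scores have already been computed, and it uses the 0.35 threshold mentioned in the Table 7 caption as a default; all function and variable names are hypothetical.

```python
import numpy as np

def uncertainty_gated_decode(draft_labels, refined_labels, uncertainties, gamma=0.35):
    """Keep the draft label where the model is confident, and fall back to the
    refined label only where the estimated uncertainty exceeds the threshold gamma."""
    draft = np.asarray(draft_labels)
    refined = np.asarray(refined_labels)
    unc = np.asarray(uncertainties)
    needs_refine = unc > gamma          # tokens whose draft labels are likely wrong
    return np.where(needs_refine, refined, draft)

# Toy example: only the second token (uncertainty 0.60) gets refined.
print(uncertainty_gated_decode([1, 2, 3], [1, 5, 3], [0.05, 0.60, 0.10]))  # -> [1 5 3]
```
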
Methods
  • The authors mainly focus on improving decoding efficiency and enhancing label dependencies.
  • The authors make comparisons with classic methods that use different decoding layers, such as Softmax, CRF, and LAN.
  • The authors also compare with recent competitive methods, such as the Transformer, IntNet (Xin et al., 2018), and BERT (Devlin et al., 2019).
  • BiLSTM-Softmax: this baseline uses a bidirectional LSTM to represent the sequence and a per-token softmax classifier to predict labels (a minimal sketch follows this list).
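
For reference, here is a minimal PyTorch sketch of a BiLSTM-Softmax baseline of the kind described above. The architecture and hyperparameters (embedding size, hidden size, class names) are generic assumptions rather than values from the paper; the point is that prediction is a per-token argmax with no Viterbi decoding.

```python
import torch
import torch.nn as nn

class BiLSTMSoftmaxTagger(nn.Module):
    """Embed tokens, encode them with a bidirectional LSTM, and classify
    each position independently with a linear + softmax layer."""

    def __init__(self, vocab_size, num_labels, emb_dim=100, hidden_dim=200):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.bilstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden_dim, num_labels)

    def forward(self, token_ids):
        embedded = self.embedding(token_ids)   # (batch, seq_len, emb_dim)
        encoded, _ = self.bilstm(embedded)     # (batch, seq_len, 2 * hidden_dim)
        return self.classifier(encoded)        # per-token logits; softmax is applied by the loss

# Usage: greedy per-token prediction, no Viterbi decoding involved.
model = BiLSTMSoftmaxTagger(vocab_size=10000, num_labels=17)
tokens = torch.randint(0, 10000, (2, 6))
predictions = model(tokens).argmax(dim=-1)     # (2, 6) label ids
```
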
Results
  • In the Results and Analysis section, the authors present the experimental results of the proposed and baseline models.
  • Table 3 reports model performance on the CoNLL2003, OntoNotes, and WSJ datasets, showing that the proposed method achieves state-of-the-art results on NER and is also effective on other sequence labeling tasks such as POS tagging.
  • Previous methods leverage rich handcrafted features (Huang et al., 2015; Chiu and Nichols, 2016), CRF decoding (Strubell et al., 2017), and longer-range label dependencies (Zhang et al., 2018; Cui and Zhang, 2019).
  • Compared with these methods, the UANet model gives better results.
  • It outperforms the LAN and seq2seq models on all three datasets.
Conclusion
  • When Γ is too large, the model mainly uses draft labels as final predictions, resulting in performance degradation, which verifies the motivation that a reasonable uncertainty threshold can avoid side effects on correct draft labels.
  • The proposed model can capture different ranges of label dependencies and word-label interactions in parallel, avoiding the Viterbi decoding of the CRF and enabling faster prediction.
Tables
  • Table 1: Results of LAN with uncertainty estimation evaluated on the CoNLL2003 test set; correct and wrong predictions are marked separately. We use Bayesian neural networks (Kendall and Gal, 2017) to estimate the uncertainty. The uncertainty value of incorrect predictions is 29 times larger than that of correct predictions, so uncertainty can effectively indicate the incorrect predictions.
  • Table 2: Statistics of the CoNLL2003, OntoNotes, and WSJ datasets, where # denotes the number of tokens. The number of classes for the NER datasets is counted under the BIOES tagging scheme.
  • Table 3: Main results on the three sequence labeling datasets. ∗ indicates results obtained by running Cui and Zhang (2019)'s released code.
  • Table 4: Ablation study of UANet.
  • Table 5: Results on the CoNLL2003 test set. We implement BERT for the NER task without document-level information. The original result of BERT in (Devlin et al., 2019) was not achieved with the current version of the library; see the discussion in (Stanislawek et al., 2019) and the results reported in (Zhang et al., 2019).
  • Table 6: Comparison of inference speed. M denotes the number of samples. We report how many sentences the model can process per second.
  • Table 7: NER case analysis. Contents in bold red and italic blue represent incorrect and correct entities, respectively. Draft labels with uncertainty greater than 0.35 are refined.
Related work
  • Related Work and Background

    2.1 Sequence Labeling

    Traditional sequence labeling models use statistical approaches such as Hidden Markov Models (HMM) and Conditional Random Fields (CRF) (Passos et al., 2014; Cuong et al., 2014; Luo et al., 2015) with handcrafted features and task-specific resources. With advances in deep learning, neural models can achieve competitive performance without massive handcrafted feature engineering (Chiu and Nichols, 2016; Santos and Zadrozny, 2014). In recent years, modeling label dependencies has been another focus of sequence labeling research, for example by using a CRF layer integrated with neural encoders to capture label transition patterns (Zhou and Xu, 2015; Ma and Hovy, 2016), or by introducing label embeddings to manage longer ranges of dependencies (Vaswani et al., 2016; Zhang et al., 2018; Cui and Zhang, 2019). Our work is an extension of the label embedding methods; it applies label dependencies and word-label interactions to refine only the labels with a high probability of being incorrect. The probability of making a mistake is estimated using Bayesian neural networks, which are described in the next subsection.

    2.2 Bayesian Neural Networks

    The predictive probabilities obtained from the softmax output are often erroneously interpreted as model confidence. However, a model can be uncertain in its predictions even with a high softmax output (Gal and Ghahramani, 2016a). Gal and Ghahramani (2016a) show that simply using predictive probabilities to estimate uncertainty leads to extrapolations with unjustifiably high confidence for points far from the training data. They verified that modeling a distribution over the parameters through Bayesian NNs can effectively reflect the uncertainty, and that dropout training can be interpreted as approximate Bayesian inference, so-called Monte Carlo (MC) dropout (Gal and Ghahramani, 2016a).
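
The paper's uncertainty estimates build on this line of work (Gal and Ghahramani, 2016a; Kendall and Gal, 2017). As a rough illustration of the idea, and not the authors' implementation, the sketch below keeps dropout active at test time, averages several stochastic forward passes, and uses predictive entropy as one possible per-token uncertainty score; the demo model at the end is purely hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def mc_dropout_predict(model, token_ids, num_samples=16):
    """MC dropout: keep dropout stochastic at test time, run several forward
    passes, and summarize the spread of the predictions as uncertainty."""
    model.train()  # enables dropout; no parameters are updated here
    with torch.no_grad():
        samples = torch.stack(
            [F.softmax(model(token_ids), dim=-1) for _ in range(num_samples)]
        )                                      # (num_samples, batch, seq_len, num_labels)
    mean_probs = samples.mean(dim=0)
    # Predictive entropy of the averaged distribution as a simple per-token score.
    uncertainty = -(mean_probs * mean_probs.clamp_min(1e-12).log()).sum(dim=-1)
    return mean_probs, uncertainty

# Tiny demo model with a dropout layer so there is something stochastic to sample.
demo = nn.Sequential(nn.Embedding(100, 32), nn.Dropout(0.3), nn.Linear(32, 5))
probs, unc = mc_dropout_predict(demo, torch.randint(0, 100, (2, 7)))
print(unc.shape)  # torch.Size([2, 7])
```
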
Funding
  • This work was partially funded by the China National Key R&D Program (No. 2018YFC0831105, 2018YFB1005104, 2017YFB1002104), the National Natural Science Foundation of China (No. 61751201, 61976056, 61532011), the Shanghai Municipal Science and Technology Major Project (No. 2018SHZDZX01), and the Science and Technology Commission of Shanghai Municipality Grant (No. 18DZ1201000, 17JC1420200).
Study subjects and analysis
sequence labeling datasets: 3
4.1 Datasets. We conduct experiments on three sequence labeling datasets. The statistics are listed in Table 2.

datasets: 3
Moreover, different from the seq2seq and LAN models that also leverage label dependencies, our UANet model integrates model uncertainty into the refinement stage to avoid side effects on correct draft labels. As a result, it outperforms the LAN and seq2seq models on all three datasets.

datasets: 3
5.2 Ablation Study. To study the contribution of each component in BiLSTM-UANet, we conduct ablation experiments on the three datasets and display the results in Table 4. The results show that the model's performance is degraded if the draft label information is removed, indicating that label dependencies are useful in the refinement stage.

datasets: 3
Uncertainty Threshold. In order to investigate the influence of the uncertainty threshold Γ, we evaluate the performance with different uncertainty thresholds on the three datasets, as shown in Figure 4. Γ = 0 means that the model uses all of the refined labels as final predictions.
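
To make the threshold experiment concrete, the sketch below shows the kind of sweep described here (cf. Figure 4), with hypothetical arrays and simple token accuracy standing in for the span-level F1 reported in the paper: each candidate Γ gates between the draft and the refined label at every token.

```python
import numpy as np

def sweep_uncertainty_threshold(draft, refined, uncertainty, gold,
                                thresholds=(0.0, 0.1, 0.2, 0.35, 0.5, 1.0)):
    """Token accuracy as a function of the uncertainty threshold Gamma.
    Gamma = 0 uses all refined labels; a very large Gamma keeps all draft labels."""
    draft, refined = np.asarray(draft), np.asarray(refined)
    uncertainty, gold = np.asarray(uncertainty), np.asarray(gold)
    results = {}
    for gamma in thresholds:
        final = np.where(uncertainty > gamma, refined, draft)
        results[gamma] = float((final == gold).mean())
    return results

# Toy run with made-up arrays, just to show the shape of the experiment.
print(sweep_uncertainty_threshold(
    draft=[1, 2, 3, 4], refined=[1, 5, 3, 4],
    uncertainty=[0.05, 0.60, 0.10, 0.40], gold=[1, 5, 3, 4]))
```
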

sequence labeling datasets: 3
In addition, the proposed model can capture different ranges of label dependencies and word-label interactions in parallel, which avoids the Viterbi decoding of the CRF and enables faster prediction. Experimental results across three sequence labeling datasets demonstrate that the proposed method significantly outperforms previous methods.
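
For contrast with the parallel refinement above, the following is a standard textbook Viterbi decoder for a linear-chain CRF, written in NumPy and not taken from the paper: the dynamic program must sweep the sentence position by position and score all label-to-label transitions at each step, which is exactly the sequential cost (roughly O(n·|Y|²) per sentence) that the proposed framework avoids.

```python
import numpy as np

def viterbi_decode(emissions, transitions):
    """Viterbi decoding for a linear-chain CRF.
    emissions:   (seq_len, num_labels) per-token label scores.
    transitions: (num_labels, num_labels) score of moving from label i to label j.
    The loop over positions is inherently sequential."""
    seq_len, num_labels = emissions.shape
    score = emissions[0].copy()                       # best score ending in each label
    backpointers = np.zeros((seq_len, num_labels), dtype=int)
    for t in range(1, seq_len):
        # candidate[i, j] = score[i] + transitions[i, j] + emissions[t, j]
        candidate = score[:, None] + transitions + emissions[t][None, :]
        backpointers[t] = candidate.argmax(axis=0)
        score = candidate.max(axis=0)
    best = [int(score.argmax())]                      # start from the best final label
    for t in range(seq_len - 1, 0, -1):
        best.append(int(backpointers[t, best[-1]]))
    return best[::-1]

# Toy example with 4 tokens and 3 labels.
rng = np.random.default_rng(0)
print(viterbi_decode(rng.normal(size=(4, 3)), rng.normal(size=(3, 3))))
```
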

Reference
  • Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. 2015. Scheduled sampling for sequence prediction with recurrent neural networks. In Advances in Neural Information Processing Systems, pages 1171–1179.
  • Hui Chen, Zijia Lin, Guiguang Ding, Jian-Guang Lou, Yusen Zhang, and Borje F. Karlsson. 2019. Grn: Gated relation network to enhance convolutional neural network for named entity recognition. In AAAI.
  • Jason PC Chiu and Eric Nichols. 2016. Named entity recognition with bidirectional lstm-cnns. Transactions of the Association for Computational Linguistics, 4:357–370.
  • Michael Collins. 2002. Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. In Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing (EMNLP 2002), pages 1–8. Association for Computational Linguistics.
  • Ronan Collobert, Jason Weston, Leon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. Journal of machine learning research, 12(Aug):2493–2537.
  • Leyang Cui and Yue Zhang. 2019. Hierarchically-refined label attention network for sequence labeling. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4106–4119.
  • Yarin Gal and Zoubin Ghahramani. 2015. Bayesian convolutional neural networks with bernoulli approximate variational inference. arXiv preprint arXiv:1506.02158.
  • Yarin Gal and Zoubin Ghahramani. 2016a. Dropout as a bayesian approximation: Representing model uncertainty in deep learning. In international conference on machine learning, pages 1050–1059.
  • Yarin Gal and Zoubin Ghahramani. 2016b. A theoretically grounded application of dropout in recurrent neural networks. In Advances in neural information processing systems, pages 1019–1027.
  • Zhiheng Huang, Wei Xu, and Kai Yu. 2015. Bidirectional lstm-crf models for sequence tagging. arXiv preprint arXiv:1508.01991.
  • Michael I Jordan, Zoubin Ghahramani, Tommi S Jaakkola, and Lawrence K Saul. 1999. An introduction to variational methods for graphical models. Machine learning, 37(2):183–233.
  • Alex Kendall and Yarin Gal. 2017. What uncertainties do we need in bayesian deep learning for computer vision? In Advances in neural information processing systems, pages 5574–5584.
  • Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
  • Vijay Krishnan and Christopher D Manning. 2006. An effective two-stage model for exploiting nonlocal dependencies in named entity recognition. In Proceedings of the 21st International Conference on Computational Linguistics and the 44th annual meeting of the Association for Computational Linguistics, pages 1121–1128. Association for Computational Linguistics.
  • Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. 2016. Neural architectures for named entity recognition. In Proceedings of NAACL-HLT, pages 260–270.
  • Liyuan Liu, Jingbo Shang, Xiang Ren, Frank Fangzheng Xu, Huan Gui, Jian Peng, and Jiawei Han. 2018. Empower sequence labeling with task-aware neural language model. In Thirty-Second AAAI Conference on Artificial Intelligence.
  • Gang Luo, Xiaojiang Huang, Chin-Yew Lin, and Zaiqing Nie. 2015. Joint entity recognition and disambiguation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 879–888.
  • Xuezhe Ma and Eduard Hovy. 2016. End-to-end sequence labeling via bi-directional lstm-cnns-crf. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1064–1074.
  • Christopher D. Manning. 2011. Part-of-speech tagging from 97% to 100%: Is it time for some linguistics? In Proceedings of the 12th International Conference on Computational Linguistics and Intelligent Text Processing - Volume Part I, CICLing’11, pages 171– 189, Berlin, Heidelberg. Springer-Verlag.
  • Mitchell P. Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. 1993. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330.
  • Alexandre Passos, Vineet Kumar, and Andrew McCallum. 2014. Lexicon infused phrase embeddings for named entity resolution. In Proceedings of the Eighteenth Conference on Computational Natural Language Learning, pages 78–86.
  • Matthew E Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of NAACL-HLT, pages 2227–2237.
  • Cicero D Santos and Bianca Zadrozny. 2014. Learning character-level representations for part-of-speech tagging. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 1818–1826.
  • Tomasz Stanislawek, Anna Wroblewska, Alicja Wojcicka, Daniel Ziembicki, and Przemyslaw Biecek. 2019. Named entity recognition-is there a glass ceiling? In Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL), pages 624–633.
  • Emma Strubell, Patrick Verga, David Belanger, and Andrew McCallum. 2017. Fast and accurate entity recognition with iterated dilated convolutions. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2670–2680.
  • Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, pages 142–147.
  • Ashish Vaswani, Yonatan Bisk, Kenji Sagae, and Ryan Musa. 2016. Supertagging with lstms. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 232–237.
  • Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in neural information processing systems, pages 5998–6008.
  • Ralph Weischedel, Martha Palmer, Mitchell Marcus, Eduard Hovy, Sameer Pradhan, Lance Ramshaw, Nianwen Xue, Ann Taylor, Jeff Kaufman, Michelle Franchini, et al. 2013. Ontonotes release 5.0 ldc2013t19. Linguistic Data Consortium, Philadelphia, PA, 23.
  • Yingwei Xin, Ethan Hart, Vibhuti Mahajan, and Jean-David Ruvini. 2018. Learning better internal structure of words for sequence labeling. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2584–2593, Brussels, Belgium. Association for Computational Linguistics.
  • Jie Yang, Shuailong Liang, and Yue Zhang. 2018. Design challenges and misconceptions in neural sequence labeling. In Proceedings of the 27th International Conference on Computational Linguistics, pages 3879–3889, Santa Fe, New Mexico, USA. Association for Computational Linguistics.
  • Jie Yang and Yue Zhang. 2018. NCRF++: An open-source neural sequence labeling toolkit. In Proceedings of ACL 2018, System Demonstrations, pages 74–79, Melbourne, Australia. Association for Computational Linguistics.
  • Zhixiu Ye and Zhen-Hua Ling. 2018. Hybrid semi-Markov CRF for neural sequence labeling. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 235–240.
  • Yuan Zhang, Hongshen Chen, Yihong Zhao, Qun Liu, and Dawei Yin. 2018. Learning tag dependencies for sequence tagging. In IJCAI, pages 4581–4587.
  • Zhuosheng Zhang, Bingjie Tang, Zuchao Li, and Hai Zhao. 2019. Modeling named entity embedding distribution into hypersphere. arXiv preprint arXiv:1909.01065.
  • Jie Zhou and Wei Xu. 2015. End-to-end learning of semantic role labeling using recurrent neural networks. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1127–1137.