Deep Active Learning for Named Entity Recognition

International Conference on Learning Representations, 2017.


Abstract:

Deep learning has yielded state-of-the-art performance on many natural language processing tasks including named entity recognition (NER). However, this typically requires large amounts of labeled data. In this work, we demonstrate that the amount of labeled training data can be drastically reduced when deep learning is combined with active learning.

Introduction
  • Over the past several years, a series of papers have used deep neural networks (DNNs) to advance the state of the art in named entity recognition (NER) (Collobert et al., 2011; Huang et al., 2015; Lample et al., 2016; Chiu and Nichols, 2015; Yang et al., 2016).
  • The CNN-CNN-LSTM model uses two convolutional neural networks (CNNs) (LeCun et al., 1995) to encode characters and words respectively, and a long short-term memory (LSTM) recurrent neural network (Hochreiter and Schmidhuber, 1997) as a decoder.
  • This model achieves the best F1 scores on the OntoNotes-5.0 English and Chinese datasets, and its use of CNNs in the encoders enables faster training than previous work relying on LSTM encoders (Lample et al., 2016; Chiu and Nichols, 2015).
  • To batch the computation of multiple sentences, sentences of similar length are grouped into buckets, and [PAD] tokens are appended so that all sentences inside a bucket have the same length, as sketched below.
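The formatting and bucketing described above can be pictured with a minimal sketch. Only the [PAD], [BOW], and [EOW] special tokens come from the paper; the bucket boundaries, whitespace tokenization, and helper names below are illustrative assumptions, not the authors' implementation.

```python
from collections import defaultdict

PAD, BOW, EOW = "[PAD]", "[BOW]", "[EOW]"

def format_word(word):
    """Wrap a word's characters in [BOW]/[EOW] markers for the character-level CNN."""
    return [BOW] + list(word) + [EOW]

def bucket_and_pad(sentences, bucket_sizes=(10, 20, 40, 80)):
    """Group sentences of similar length and append [PAD] tokens so that all
    sentences inside a bucket share the same length (assumed bucket sizes)."""
    buckets = defaultdict(list)
    for sentence in sentences:
        words = sentence.split()  # whitespace tokenization, for the sketch only
        # Smallest bucket that fits; longer sentences fall into the largest bucket unpadded.
        size = next((b for b in bucket_sizes if len(words) <= b), bucket_sizes[-1])
        padded = [format_word(w) for w in words] + [[PAD]] * max(0, size - len(words))
        buckets[size].append(padded)
    return buckets
```

With this representation, every sentence in a bucket forms a row of the same length, and each word is itself a fixed-format character sequence for the character encoder.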
Highlights
  • Over the past several years, a series of papers have used deep neural networks (DNNs) to advance the state of the art in named entity recognition (NER) (Collobert et al., 2011; Huang et al., 2015; Lample et al., 2016; Chiu and Nichols, 2015; Yang et al., 2016)
  • We present positive preliminary results demonstrating the effectiveness of deep active learning
  • The model uses two convolutional neural networks (CNNs) (LeCun et al., 1995) to encode characters and words respectively, and a long short-term memory (LSTM) recurrent neural network (Hochreiter and Schmidhuber, 1997) as a decoder. This model achieves the best F1 scores on the OntoNotes-5.0 English and Chinese datasets, and its use of CNNs in the encoders enables faster training than previous work relying on LSTM encoders (Lample et al., 2016; Chiu and Nichols, 2015)
  • Maximum Normalized Log-Probability (MNLP): Our preliminary analysis revealed that the Least Confidence (LC) method disproportionately selects longer sentences, so MNLP instead normalizes the log-probability of the most likely tag sequence by the sentence length (see the sketch after this list)
  • All active learning algorithms perform significantly better than the random baseline
  • We proposed deep active learning algorithms for NER and empirically demonstrated that they achieve state-of-the-art performance with much less data than models trained in the standard supervised fashion
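To make the LC/MNLP distinction concrete, here is a minimal sketch, assuming each unlabeled sentence comes with the per-token log-probabilities of its most likely (greedily decoded) tag sequence; the function names and the pool interface are illustrative assumptions, not the authors' code.

```python
import numpy as np

def lc_score(token_log_probs):
    """Least Confidence: negative log-probability of the most likely tag
    sequence. The sum grows with sentence length, biasing selection
    toward longer sentences."""
    return -np.sum(token_log_probs)

def mnlp_score(token_log_probs):
    """Maximum Normalized Log-Probability: the same quantity divided by the
    number of tokens, which removes the length bias."""
    return -np.mean(token_log_probs)

def most_uncertain_first(pool, score_fn):
    """Rank sentences so the highest-scoring (most uncertain) come first.
    `pool` maps each sentence to the array of its per-token log-probabilities
    (an assumed representation)."""
    return sorted(pool, key=lambda s: score_fn(pool[s]), reverse=True)
```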
Methods
  • The authors use OntoNotes-5.0 English and Chinese data (Pradhan et al., 2013) for the experiments.
  • The training datasets contain 1,088,503 words and 756,063 words respectively.
  • State-of-the-art models trained on the full training sets achieve F1 scores of 86.86 and 75.63 on the test sets (Yun, 2017).
  • [Figure] Test F1 score vs. percent of words annotated, comparing MNLP, LC, BALD, and random (RAND) selection against the best deep model trained on the full data; panel (a): OntoNotes-5.0 English.
Results
  • The authors evaluate the performance of each algorithm by its F1 score on the test dataset. All active learning algorithms perform significantly better than the random baseline.
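Concretely, each acquisition function is plugged into a standard pool-based active learning loop: train on the current labeled set, score the unlabeled pool, send the most uncertain sentences for annotation, and retrain. The sketch below is schematic; the per-round word budget, number of rounds, and the `fit`/`token_log_probs`/`oracle` interfaces are assumptions, not the paper's exact experimental protocol.

```python
def active_learning_loop(model, labeled, unlabeled, score_fn, oracle,
                         words_per_round=20000, rounds=10):
    """Schematic pool-based active learning driver.

    `model` is assumed to expose fit(labeled_pairs) and token_log_probs(sentence);
    `oracle` stands in for a human annotator returning gold tags for a sentence.
    """
    for _ in range(rounds):
        model.fit(labeled)  # (re)train on everything labeled so far
        ranked = sorted(unlabeled,
                        key=lambda s: score_fn(model.token_log_probs(s)),
                        reverse=True)  # most uncertain sentences first
        batch, budget = [], words_per_round
        for sentence in ranked:  # greedily fill the per-round annotation budget
            if budget <= 0:
                break
            batch.append(sentence)
            budget -= len(sentence.split())
        labeled = labeled + [(s, oracle(s)) for s in batch]
        unlabeled = [s for s in unlabeled if s not in batch]
    return model
```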
Conclusion
  • The authors proposed deep active learning algorithms for NER and empirically demonstrated that they achieve state-of-the-art performance with much less data than models trained in the standard supervised fashion.
Tables
  • Table1: Example formatted sentence. To avoid clutter, [BOW] and [EOW] symbols are not shown
Study subjects and analysis
Training datasets: 3
The OntoNotes datasets consist of six genres: broadcast conversation (bc), broadcast news (bn), magazine (mz), newswire (nw), telephone conversation (tc), and weblogs (wb). We created three training datasets: half-data, which contains a random 50% of the original training data; nw-data, which contains sentences only from newswire (51.5% of the words in the original data); and no-nw-data, which is the complement of nw-data. We then trained the CNN-CNN-LSTM model on each dataset, as sketched below.
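A minimal sketch of how such splits could be constructed, assuming each training example carries its genre code (bc, bn, mz, nw, tc, wb); the function name and random seed are illustrative assumptions.

```python
import random

def make_splits(examples, seed=0):
    """Build the three training sets described above from (genre, sentence) pairs."""
    rng = random.Random(seed)
    half_data = rng.sample(examples, len(examples) // 2)   # random 50% across all genres
    nw_data = [ex for ex in examples if ex[0] == "nw"]     # newswire only
    no_nw_data = [ex for ex in examples if ex[0] != "nw"]  # complement of nw-data
    return half_data, nw_data, no_nw_data
```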

Reference
  • Jason P.C. Chiu and Eric Nichols. 2015. Named entity recognition with bidirectional LSTM-CNNs. arXiv preprint arXiv:1511.08308.
  • Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. Journal of Machine Learning Research 12(Aug):2493–2537.
  • Aron Culotta and Andrew McCallum. 2005. Reducing labeling effort for structured prediction tasks. In AAAI, volume 5, pages 746–751.
  • Yarin Gal and Zoubin Ghahramani. 2016. A theoretically grounded application of dropout in recurrent neural networks. In Advances in Neural Information Processing Systems, pages 1019–1027.
  • Yarin Gal, Riashat Islam, and Zoubin Ghahramani. 2017. Deep Bayesian active learning with image data. arXiv preprint arXiv:1703.02910.
  • Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778.
  • Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation 9(8):1735–1780.
  • Zhiheng Huang, Wei Xu, and Kai Yu. 2015. Bidirectional LSTM-CRF models for sequence tagging. arXiv preprint arXiv:1508.01991.
  • Yoon Kim. 2014. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882.
  • John Lafferty, Andrew McCallum, and Fernando Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learning (ICML), volume 1, pages 282–289.
  • Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. 2016. Neural architectures for named entity recognition. arXiv preprint arXiv:1603.01360.
  • Yann LeCun, Yoshua Bengio, et al. 1995. Convolutional networks for images, speech, and time series. The Handbook of Brain Theory and Neural Networks 3361(10):1995.
  • David D. Lewis and William A. Gale. 1994. A sequential algorithm for training text classifiers. In Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 3–12.
  • Vinod Nair and Geoffrey E. Hinton. 2010. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 807–814.
  • Sameer Pradhan, Alessandro Moschitti, Nianwen Xue, Hwee Tou Ng, Anders Björkelund, Olga Uryupina, Yuchen Zhang, and Zhi Zhong. 2013. Towards robust linguistic analysis using OntoNotes. In CoNLL, pages 143–152.
  • Burr Settles. 2010. Active learning literature survey. Technical report, University of Wisconsin–Madison.
  • Nitish Srivastava, Geoffrey E. Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1):1929–1958.
  • Zhilin Yang, Ruslan Salakhutdinov, and William Cohen. 2016. Multi-task cross-lingual sequence tagging from scratch. arXiv preprint arXiv:1603.06270.
  • Hyokun Yun. 2017. Design choices for named entity recognition. Manuscript in preparation.