Fast and Accurate Entity Recognition with Iterated Dilated Convolutions.

Empirical Methods in Natural Language Processing (EMNLP), pp. 2660-2670 (2017)


Abstract

Today when many practitioners run basic NLP on the entire web and large-volume traffic, faster methods are paramount to saving time and energy costs. Recent advances in GPU hardware have led to the emergence of bi-directional LSTMs as a standard method for obtaining per-token vector representations serving as input to labeling tasks such as NER (often followed by prediction in a linear-chain CRF).

Introduction
  • In order to democratize large-scale NLP and information extraction while minimizing the environmental footprint, the authors require fast, resource-efficient methods for sequence tagging tasks such as part-of-speech tagging and named entity recognition (NER).
  • Rather than composing representations incrementally over each token in a sequence, they apply filters in parallel across the entire sequence at once (a minimal sketch of such a dilated convolution stack follows this list).
  • Their computational cost grows with the number of layers, but not with the input size, up to the memory and threading limitations of the hardware.
  • O(D) prediction is simple and parallelizable across the entire sequence.
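    The following is a minimal NumPy sketch of the dilated-convolution token encoder idea described above. It is not the authors' implementation (the paper builds on TensorFlow); the width-3 filters, ReLU nonlinearity, zero padding, and dilation widths doubling per layer are assumptions chosen only to illustrate how context grows exponentially with depth while per-layer cost is independent of sequence length.

      import numpy as np

      def dilated_conv1d(x, w, b, dilation):
          """x: (seq_len, in_dim); w: (3, in_dim, out_dim); b: (out_dim,).
          Width-3 convolution whose taps are `dilation` tokens apart, with
          zero padding so the output keeps one vector per input token."""
          seq_len, in_dim = x.shape
          pad = np.zeros((dilation, in_dim))
          xp = np.concatenate([pad, x, pad], axis=0)
          out = np.empty((seq_len, w.shape[-1]))
          for t in range(seq_len):
              taps = xp[[t, t + dilation, t + 2 * dilation]]   # left, center, right
              out[t] = np.maximum(np.einsum('ki,kio->o', taps, w) + b, 0.0)  # ReLU
          return out

      # Toy usage: 4 layers with dilations 1, 2, 4, 8 give every output token a
      # receptive field of 31 input tokens; the work per token grows only with
      # the number of layers, not with the sequence length.
      rng = np.random.default_rng(0)
      h = rng.normal(size=(50, 16))                    # 50 tokens, 16-dim vectors
      for dilation in (1, 2, 4, 8):
          w = rng.normal(scale=0.1, size=(3, 16, 16))
          h = dilated_conv1d(h, w, np.zeros(16), dilation)
      print(h.shape)                                   # (50, 16): resolution preserved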
Highlights
  • In order to democratize large-scale NLP and information extraction while minimizing our environmental footprint, we require fast, resource-efficient methods for sequence tagging tasks such as part-of-speech tagging and named entity recognition (NER).
  • We describe experiments on two benchmark English named entity recognition datasets.
  • On CoNLL-2003 English named entity recognition, our Iterated Dilated Convolutional Neural Network (ID-CNN) performs on par with a bidirectional LSTM when used to produce per-token logits for structured inference; moreover, the ID-CNN with greedy decoding performs on par with the bidirectional LSTM-CRF while running at more than 14 times the speed.
  • In Table 6 we show that, in addition to being more accurate, our ID-CNN model is much faster than the bidirectional LSTM-CRF when incorporating context from entire documents, decoding at almost 8 times the speed.
  • We present iterated dilated convolutional neural networks, fast token encoders that efficiently aggregate broad context without losing resolution; the receptive-field arithmetic below makes this concrete.
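    As a worked example (assuming width-3 filters and dilation doubling at each layer, matching the sketch above rather than wording quoted from the paper), the receptive field of a stack of dilated convolutions grows exponentially with its depth:

      % r_L = receptive field after L layers with dilations d_l = 2^{l-1};
      % each layer adds 2 d_l tokens of context around the single token seen at depth 0.
      r_L = 1 + 2\sum_{l=1}^{L} 2^{\,l-1} = 2^{\,L+1} - 1
      % e.g. L = 4 layers cover 31 tokens and L = 8 cover 511, while per-token
      % computation grows only linearly in L and the output stays one vector per token.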
Results
  • The authors describe experiments on two benchmark English named entity recognition datasets. On CoNLL-2003 English NER, the ID-CNN performs on par with a Bi-LSTM when used to produce per-token logits for structured inference; moreover, the ID-CNN with greedy decoding performs on par with the Bi-LSTM-CRF while running at more than 14 times the speed.
  • The authors use the same OntoNotes data split used for co-reference resolution in the CoNLL-2012 shared task (Pradhan et al., 2012).
  • For both datasets, the authors convert the IOB boundary encoding to BILOU, as previous work found this encoding to result in improved performance (Ratinov and Roth, 2009); a minimal conversion sketch follows this list.
  • A more detailed description of the data, evaluation, optimization and data pre-processing can be found in the Appendix.
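    The following is a minimal sketch of that boundary-encoding conversion. It assumes BIO-style (IOB2) input tags such as "B-PER"/"I-PER"/"O" purely for illustration and is not the authors' preprocessing code.

      def bio_to_bilou(tags):
          """Convert BIO tags to BILOU: single-token spans become U-, span
          ends become L-, everything else keeps its B-/I-/O label."""
          bilou = []
          for i, tag in enumerate(tags):
              if tag == "O":
                  bilou.append(tag)
                  continue
              prefix, etype = tag.split("-", 1)
              next_tag = tags[i + 1] if i + 1 < len(tags) else "O"
              continues = next_tag == "I-" + etype        # does the span go on?
              if prefix == "B":
                  bilou.append(("B-" if continues else "U-") + etype)
              else:                                       # prefix == "I"
                  bilou.append(("I-" if continues else "L-") + etype)
          return bilou

      # Example: a two-token entity becomes B-/L-, a single token becomes U-.
      print(bio_to_bilou(["B-PER", "I-PER", "O", "B-LOC", "O"]))
      # ['B-PER', 'L-PER', 'O', 'U-LOC', 'O']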
Conclusion
  • The authors present iterated dilated convolutional neural networks, fast token encoders that efficiently aggregate broad context without losing resolution.
  • In the future, the authors hope to extend this work to NLP tasks with richer structured output, such as parsing.
Tables
  • Table1: F1 score of models observing sentence-level context. No models use character embeddings or lexicons. Top models are greedy, bottom models use Viterbi inference
  • Table2: Relative test-time speed of sentence models, using the fastest batch size for each model
  • Table3: Comparison of models trained with and without expectation-linear dropout regularization (DR). DR improves all models
  • Table4: F1 score of models trained to predict document-at-a-time. Our greedy ID-CNN model performs as well as the Bi-LSTM-CRF
  • Table5: Comparing ID-CNNs with 1) backpropagating loss only from the final layer (1-loss) and 2) untied parameters across blocks (noshare)
  • Table6: Relative test-time speed of document models (fastest batch size for each model)
  • Table7: F1 score of sentence and document models on OntoNotes
Related work
  • The state-of-the-art models for sequence labeling include an inference step that searches the space of possible output sequences of a chain-structured graphical model, or approximates this search with a beam (Collobert et al., 2011; Weiss et al., 2015; Lample et al., 2016; Ma and Hovy, 2016; Chiu and Nichols, 2016). These outperform similar systems that use the same features but make independent local predictions. On the other hand, the greedy sequential prediction (Daume III et al., 2009) approach of Ratinov and Roth (2009), which employs lexicalized features, gazetteers, and word clusters, outperforms CRFs with similar features. A minimal sketch contrasting the two prediction modes follows this paragraph.
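    To make the contrast concrete, here is a minimal sketch (not taken from any of the cited systems) of the two decoding modes over per-token label scores: greedy argmax, which makes independent local predictions and is trivially parallel, versus Viterbi search over a chain model with pairwise transition scores.

      import numpy as np

      def greedy_decode(emissions):
          # Independent local predictions: one argmax per token, fully parallel.
          return emissions.argmax(axis=1)

      def viterbi_decode(emissions, transitions):
          # Exact search over the chain-structured output space: O(T * K^2)
          # work and inherently sequential in the sequence length T.
          T, K = emissions.shape
          score = emissions[0].copy()
          back = np.zeros((T, K), dtype=int)
          for t in range(1, T):
              cand = score[:, None] + transitions + emissions[t]  # prev x cur
              back[t] = cand.argmax(axis=0)
              score = cand.max(axis=0)
          path = [int(score.argmax())]
          for t in range(T - 1, 0, -1):
              path.append(int(back[t, path[-1]]))
          return path[::-1]

      # Illustrative random scores: 6 tokens, 5 labels.
      rng = np.random.default_rng(1)
      emissions, transitions = rng.normal(size=(6, 5)), rng.normal(size=(5, 5))
      print(greedy_decode(emissions), viterbi_decode(emissions, transitions))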

    LSTMs (Hochreiter and Schmidhuber, 1997) were used for NER as early as the CoNLL shared task in 2003 (Hammerton, 2003; Tjong Kim Sang and De Meulder, 2003). More recently, a wide variety of neural network architectures for NER have been proposed. Collobert et al. (2011) employ a one-layer CNN with pre-trained word embeddings, capitalization and lexicon features, and CRF-based prediction. Huang et al. (2015) achieved state-of-the-art accuracy on part-of-speech tagging, chunking and NER using a Bi-LSTM-CRF. Lample et al. (2016) proposed two models which incorporated Bi-LSTM-composed character embeddings alongside words: a Bi-LSTM-CRF, and a greedy stack LSTM which uses a simple shift-reduce grammar to compose words into labeled entities. Their Bi-LSTM-CRF obtained the state of the art on four languages without word shape or lexicon features. Ma and Hovy (2016) use CNNs rather than LSTMs to compose characters in a Bi-LSTM-CRF, achieving state-of-the-art performance on part-of-speech tagging and CoNLL NER without lexicons. Chiu and Nichols (2016) evaluate a similar network but propose a novel method for encoding lexicon matches, presenting results on CoNLL and OntoNotes NER. Yang et al. (2016) use GRU-CRFs with GRU-composed character embeddings of words to train a single network on many tasks and languages.
Funding
  • This work was supported in part by the Center for Intelligent Information Retrieval, in part by DARPA under agreement number FA8750-13-2-0020, in part by Defense Advanced Research Projects Agency (DARPA) contract number HR0011-15-2-0036, in part by the National Science Foundation (NSF) grant number DMR-1534431, and in part by the National Science Foundation (NSF) grant number IIS-1514053.
Reference
  • Martín Abadi, Andy Davis, Jeffrey Dean, Matthieu Devin, et al. 2015. TensorFlow: Large-scale machine learning on heterogeneous systems. Software available from tensorflow.org.
  • Razvan Bunescu and Raymond J. Mooney. 2004. Collective information extraction with relational Markov networks. In ACL, pages 439–446.
  • Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L. Yuille. 2015. Semantic image segmentation with deep convolutional nets and fully connected CRFs. In ICLR.
  • Jason P.C. Chiu and Eric Nichols. 2016. Named entity recognition with bidirectional LSTM-CNNs. Transactions of the Association for Computational Linguistics, 4:357–370.
  • Ronan Collobert, Jason Weston, Leon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12(Aug):2493–2537.
  • Hal Daume III, John Langford, and Daniel Marcu. 2009. Search-based structured prediction. Machine Learning, 75(3):297–325.
  • Greg Durrett and Dan Klein. 2014. A joint model for entity analysis: Coreference, typing and linking. Transactions of the Association for Computational Linguistics, 2:477–490.
  • Jenny Rose Finkel, Trond Grenager, and Christopher Manning. 2005. Incorporating non-local information into information extraction systems by Gibbs sampling. In ACL, pages 363–370.
  • Xavier Glorot, Antoine Bordes, and Yoshua Bengio. 2011. Deep sparse rectifier neural networks. In AISTATS.
  • Çağlar Gülçehre and Yoshua Bengio. 2016. Knowledge matters: Importance of prior information for optimization. Journal of Machine Learning Research, 17(8):1–32.
  • James Hammerton. 2003. Named entity recognition with long short-term memory. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL, pages 172–175. Association for Computational Linguistics.
  • Sepp Hochreiter. 1998. The vanishing gradient problem during learning recurrent neural nets and problem solutions. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 6(02):107–116.
  • Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.
  • Eduard Hovy, Mitchell Marcus, Martha Palmer, Lance Ramshaw, and Ralph Weischedel. 2006. OntoNotes: the 90% solution. In Proceedings of the Human Language Technology Conference of the NAACL, Companion Volume: Short Papers, pages 57–60.
  • Zhiheng Huang, Wei Xu, and Kai Yu. 2015. Bidirectional LSTM-CRF models for sequence tagging. arXiv preprint arXiv:1508.01991.
  • Nal Kalchbrenner, Lasse Espeholt, Karen Simonyan, Aaron van den Oord, Alex Graves, and Koray Kavukcuoglu. 2016. Neural machine translation in linear time. arXiv preprint arXiv:1610.10099.
  • Nal Kalchbrenner, Edward Grefenstette, and Phil Blunsom. 2014. A convolutional neural network for modelling sentences. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics.
  • Yoon Kim. 2014. Convolutional neural networks for sentence classification. In EMNLP.
  • John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learning (ICML), pages 282–289.
  • Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. 2016. Neural architectures for named entity recognition. In NAACL.
  • Chen-Yu Lee, Saining Xie, Patrick W. Gallagher, Zhengyou Zhang, and Zhuowen Tu. 2015. Deeply-supervised nets. In AISTATS, volume 2, page 5.
  • Tao Lei, Regina Barzilay, and Tommi Jaakkola. 2015. Molding CNNs for text: non-linear, non-consecutive convolutions. In Empirical Methods in Natural Language Processing.
  • Percy Liang, Hal Daume III, and Dan Klein. 2008. Structure compilation: trading structure for features. In Proceedings of the 25th International Conference on Machine Learning, pages 592–599. ACM.
  • Wang Ling, Tiago Luís, Luís Marujo, Ramon Fernandez Astudillo, Silvio Amir, Chris Dyer, Alan W. Black, and Isabel Trancoso. 2015. Finding function in form: Compositional character models for open vocabulary word representation. In EMNLP.
  • Ben London, Bert Huang, and Lise Getoor. 2016. Stability and generalization in structured prediction. Journal of Machine Learning Research, 17(222):1–52.
  • Xuezhe Ma, Yingkai Gao, Zhiting Hu, Yaoliang Yu, Yuntian Deng, and Eduard Hovy. 2017. Dropout with expectation-linear regularization. In ICLR.
  • Xuezhe Ma and Eduard Hovy. 2016. End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, pages 1064–1074.
  • Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. 2016. WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499.
  • Alexandre Passos, Vineet Kumar, and Andrew McCallum. 2014. Lexicon infused phrase embeddings for named entity resolution. In CoNLL.
  • Sameer Pradhan, Alessandro Moschitti, Nianwen Xue, Hwee Tou Ng, Anders Björkelund, Olga Uryupina, Yuchen Zhang, and Zhi Zhong. 2013. Towards robust linguistic analysis using OntoNotes. In Proceedings of the Seventeenth Conference on Computational Natural Language Learning, pages 143–152.
  • Sameer Pradhan, Alessandro Moschitti, Nianwen Xue, Olga Uryupina, and Yuchen Zhang. 2012. CoNLL-2012 shared task: Modeling multilingual unrestricted coreference in OntoNotes. In Proceedings of the Joint Conference on EMNLP and CoNLL: Shared Task, pages 1–40.
  • Lance A. Ramshaw and Mitchell P. Marcus. 1999. Text chunking using transformation-based learning. In Natural Language Processing Using Very Large Corpora, pages 157–176. Springer.
  • Lev Ratinov and Dan Roth. 2009. Design challenges and misconceptions in named entity recognition. In Proceedings of the Thirteenth Conference on Computational Natural Language Learning, pages 147–155. Association for Computational Linguistics.
  • Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, and Yoshua Bengio. 2014. FitNets: Hints for thin deep nets. arXiv preprint arXiv:1412.6550.
  • Nitish Srivastava, Geoffrey E. Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958.
  • Charles Sutton and Andrew McCallum. 2004. Collective segmentation and labeling of distant entities in information extraction. In ICML Workshop on Statistical Relational Learning.
  • Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. 2015. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1–9.
  • Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, Volume 4, pages 142–147. Association for Computational Linguistics.
  • Kristina Toutanova, Danqi Chen, Patrick Pantel, Hoifung Poon, Pallavi Choudhury, and Michael Gamon. 2015. Representing text for joint embedding of text and knowledge bases. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1499–1509. Association for Computational Linguistics.
  • Joseph Turian, Lev Ratinov, and Yoshua Bengio. 2010. Word representations: a simple and general method for semi-supervised learning. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 384–394. Association for Computational Linguistics.
  • David Weiss, Chris Alberti, Michael Collins, and Slav Petrov. 2015. Structured training for neural network transition-based parsing. In Annual Meeting of the Association for Computational Linguistics.
  • Zhilin Yang, Ruslan Salakhutdinov, and William Cohen. 2016. Multi-task cross-lingual sequence tagging from scratch. arXiv preprint arXiv:1603.06270.
  • Fisher Yu and Vladlen Koltun. 2016. Multi-scale context aggregation by dilated convolutions. In International Conference on Learning Representations (ICLR).
  • Xiang Zhang, Junbo Zhao, and Yann LeCun. 2015. Character-level convolutional networks for text classification. In Advances in Neural Information Processing Systems 28 (NIPS).