Fast and Accurate Entity Recognition with Iterated Dilated Convolutions.
Empirical Methods in Natural Language Processing (EMNLP), pp. 2660–2670, 2017.
Today, when many practitioners run basic NLP on the entire web and on large-volume traffic, faster methods are paramount to saving time and energy costs. Recent advances in GPU hardware have led to the emergence of bi-directional LSTMs as a standard method for obtaining per-token vector representations serving as input to labeling tasks such as …
- In order to democratize large-scale NLP and information extraction while minimizing the environmental footprint, the authors require fast, resource-efficient methods for sequence tagging tasks such as part-of-speech tagging and named entity recognition (NER).
- Rather than composing representations incrementally over each token in a sequence, they apply filters in parallel across the entire sequence at once.
- Their computational cost grows with the number of layers, but not with the input size, up to the memory and threading limitations of the hardware.
- O(D) prediction is simple and parallelizable across the sequence.
- We describe experiments on two benchmark English named entity recognition datasets.
- On CoNLL-2003 English NER, our Iterated Dilated Convolutional Neural Network (ID-CNN) produces per-token logits for structured inference on par with a bidirectional LSTM, and with greedy decoding it performs on par with the Bi-LSTM-CRF while running at more than 14 times the speed.
- In Table 6 we show that, in addition to being more accurate, our ID-CNN model is much faster than the Bi-LSTM-CRF when incorporating context from entire documents, decoding at almost 8 times the speed.
- We present iterated dilated convolutional neural networks, fast token encoders that efficiently aggregate broad context without losing resolution.
- The authors use the same OntoNotes data split used for co-reference resolution in the CoNLL-2012 shared task (Pradhan et al., 2012).
- For both datasets, the authors convert the IOB boundary encoding to BILOU as previous work found this encoding to result in improved performance (Ratinov and Roth, 2009).
- A more detailed description of the data, evaluation, optimization, and data pre-processing can be found in the Appendix.
- In the future the authors hope to extend this work to NLP tasks with richer structured output, such as parsing.
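The dilated-convolution idea in the bullets above can be made concrete with a minimal sketch. This is an illustrative NumPy toy, not the authors' TensorFlow implementation: `dilated_conv1d` and `receptive_field` are hypothetical names, and the width-3 filters with dilations doubling per layer (1, 2, 4, 8) follow the general dilated-CNN recipe. It shows how the receptive field grows exponentially with depth while the output keeps per-token resolution.

```python
# Illustrative sketch of iterated dilated convolutions (not the paper's code).
import numpy as np

def dilated_conv1d(x, w, dilation):
    """Width-3 dilated convolution over a (seq_len, dim) input.

    Each output position sees inputs at offsets -dilation, 0, +dilation,
    so stacking layers with dilations 1, 2, 4, ... widens context quickly.
    """
    seq_len, _ = x.shape
    out = np.zeros((seq_len, w.shape[2]))
    padded = np.pad(x, ((dilation, dilation), (0, 0)))  # zero-pad both ends
    for t in range(seq_len):
        # padded[t] is token t - dilation, padded[t + dilation] is token t, etc.
        window = np.stack([padded[t], padded[t + dilation], padded[t + 2 * dilation]])
        out[t] = np.maximum(0.0, np.einsum('kd,kde->e', window, w))  # ReLU
    return out

def receptive_field(num_layers, width=3):
    """Tokens visible to one output after layers with dilations 1, 2, 4, ..."""
    r = 1
    for l in range(num_layers):
        r += (width - 1) * (2 ** l)
    return r

rng = np.random.default_rng(0)
x = rng.normal(size=(20, 8))                # 20 tokens, 8-dim embeddings
weights = [rng.normal(scale=0.1, size=(3, 8, 8)) for _ in range(4)]
h = x
for layer, w in enumerate(weights):
    h = dilated_conv1d(h, w, dilation=2 ** layer)   # dilations 1, 2, 4, 8

print(h.shape)              # (20, 8): one vector per token, resolution preserved
print(receptive_field(4))   # 31 tokens of context after only 4 layers
```

Note how the per-layer cost is independent of how much context is aggregated: depth, not input length, controls the receptive field, which is the property the speed claims above rely on.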
- Table 1: F1 score of models observing sentence-level context. No models use character embeddings or lexicons. Top models are greedy; bottom models use Viterbi inference.
- Table 2: Relative test-time speed of sentence models, using the fastest batch size for each model.
- Table 3: Comparison of models trained with and without expectation-linear dropout regularization (DR). DR improves all models.
- Table 4: F1 score of models trained to predict document-at-a-time. Our greedy ID-CNN model performs as well as the Bi-LSTM-CRF.
- Table 5: Comparing ID-CNNs with 1) backpropagating loss only from the final layer (1-loss) and 2) untied parameters across blocks (noshare).
- Table 6: Relative test-time speed of document models (fastest batch size for each model).
- Table 7: F1 score of sentence and document models on OntoNotes.
- The state-of-the-art models for sequence labeling include an inference step that searches the space of possible output sequences of a chain-structured graphical model, or approximates this search with a beam (Collobert et al., 2011; Weiss et al., 2015; Lample et al., 2016; Ma and Hovy, 2016; Chiu and Nichols, 2016). These outperform similar systems that use the same features but make independent local predictions. On the other hand, the greedy sequential prediction (Daume III et al., 2009) approach of Ratinov and Roth (2009), which employs lexicalized features, gazetteers, and word clusters, outperforms CRFs with similar features.
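The contrast drawn above, between structured Viterbi inference over a chain model and independent greedy prediction, can be sketched in a few lines. This is a generic toy, not the paper's model: the function names and the toy scores are hypothetical, with per-token emission scores plus a label-to-label transition matrix, as in a chain CRF.

```python
# Illustrative sketch: Viterbi vs. greedy decoding over per-token label scores.
import numpy as np

def viterbi(emissions, transitions):
    """Best label path under sum_t emissions[t, y_t] + sum_t transitions[y_{t-1}, y_t]."""
    T, K = emissions.shape
    score = emissions[0].copy()          # best score ending in each label
    back = np.zeros((T, K), dtype=int)   # backpointers
    for t in range(1, T):
        total = score[:, None] + transitions + emissions[t][None, :]
        back[t] = total.argmax(axis=0)
        score = total.max(axis=0)
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]

def greedy(emissions):
    """Independent per-token argmax: no search, trivially parallel."""
    return emissions.argmax(axis=1).tolist()

emissions = np.array([[2.0, 1.0], [1.0, 1.1], [2.0, 1.0]])
transitions = np.array([[0.5, -2.0], [-2.0, 0.5]])  # favors staying in a label

print(greedy(emissions))                # [0, 1, 0]: flips label mid-sequence
print(viterbi(emissions, transitions))  # [0, 0, 0]: transitions smooth it out
```

Greedy decoding is what makes the ID-CNN's 14x speedup possible, and the point of the paper's design is that a wide-context encoder can recover most of what the transition scores would otherwise contribute.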
LSTMs (Hochreiter and Schmidhuber, 1997) were used for NER as early as the CoNLL shared task in 2003 (Hammerton, 2003; Tjong Kim Sang and De Meulder, 2003). More recently, a wide variety of neural network architectures for NER have been proposed. Collobert et al. (2011) employ a one-layer CNN with pre-trained word embeddings, capitalization and lexicon features, and CRF-based prediction. Huang et al. (2015) achieved state-of-the-art accuracy on part-of-speech tagging, chunking, and NER using a Bi-LSTM-CRF. Lample et al. (2016) proposed two models which incorporate Bi-LSTM-composed character embeddings alongside words: a Bi-LSTM-CRF, and a greedy stack LSTM which uses a simple shift-reduce grammar to compose words into labeled entities. Their Bi-LSTM-CRF obtained the state of the art on four languages without word shape or lexicon features. Ma and Hovy (2016) use CNNs rather than LSTMs to compose characters in a Bi-LSTM-CRF, achieving state-of-the-art performance on part-of-speech tagging and CoNLL NER without lexicons. Chiu and Nichols (2016) evaluate a similar network but propose a novel method for encoding lexicon matches, presenting results on CoNLL and OntoNotes NER. Yang et al. (2016) use GRU-CRFs with GRU-composed character embeddings of words to train a single network on many tasks and languages.
- This work was supported in part by the Center for Intelligent Information Retrieval, in part by DARPA under agreement number FA8750-13-2-0020, in part by Defense Advanced Research Projects Agency (DARPA) contract number HR0011-15-2-0036, in part by National Science Foundation (NSF) grant number DMR-1534431, and in part by National Science Foundation (NSF) grant number IIS-1514053.
- Martín Abadi, Andy Davis, Jeffrey Dean, Matthieu Devin, et al. 2015. TensorFlow: Large-scale machine learning on heterogeneous systems. Software available from tensorflow.org.
- Razvan Bunescu and Raymond J. Mooney. 2004. Collective information extraction with relational markov networks. In ACL, pages 439–446.
- Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L. Yuille. 2015. Semantic image segmentation with deep convolutional nets and fully connected crfs. In ICLR.
- Jason PC Chiu and Eric Nichols. 2016. Named entity recognition with bidirectional lstm-cnns. Transactions of the Association for Computational Linguistics, 4:357–370.
- Ronan Collobert, Jason Weston, Leon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12(Aug):2493–2537.
- Hal Daume III, John Langford, and Daniel Marcu. 2009. Search-based structured prediction. Machine Learning, 75(3):297–325.
- Greg Durrett and Dan Klein. 2014. A joint model for entity analysis: Coreference, typing and linking. Transactions of the Association for Computational Linguistics, 2:477–490.
- Jenny Rose Finkel, Trond Grenager, and Christopher Manning. 2005. Incorporating non-local information into information extraction systems by gibbs sampling. In ACL, pages 363–370.
- Xavier Glorot, Antoine Bordes, and Yoshua Bengio. 2011. Deep sparse rectifier neural networks. In AISTATS.
- Caglar Gulcehre and Yoshua Bengio. 2016. Knowledge matters: Importance of prior information for optimization. Journal of Machine Learning Research, 17(8):1–32.
- James Hammerton. 2003. Named entity recognition with long short-term memory. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL, pages 172–175. Association for Computational Linguistics.
- Sepp Hochreiter. 1998. The vanishing gradient problem during learning recurrent neural nets and problem solutions. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 6(02):107–116.
- Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.
- Eduard Hovy, Mitchell Marcus, Martha Palmer, Lance Ramshaw, and Ralph Weischedel. 2006. Ontonotes: the 90% solution. In Proceedings of the Human Language Technology Conference of the NAACL, Companion Volume: Short Papers, pages 57–60.
- Zhiheng Huang, Wei Xu, and Kai Yu. 2015. Bidirectional LSTM-CRF models for sequence tagging. arXiv preprint arXiv:1508.01991.
- Nal Kalchbrenner, Lasse Espeholt, Karen Simonyan, Aaron van den Oord, Alex Graves, and Koray Kavukcuoglu. 2016. Neural machine translation in linear time. arXiv preprint arXiv:1610.10099.
- Nal Kalchbrenner, Edward Grefenstette, and Phil Blunsom. 2014. A convolutional neural network for modelling sentences. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics.
- Yoon Kim. 2014. Convolutional neural networks for sentence classification. In EMNLP.
- John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learning (ICML), pages 282–289.
- Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. 2016. Neural architectures for named entity recognition. In NAACL.
- Chen-Yu Lee, Saining Xie, Patrick W Gallagher, Zhengyou Zhang, and Zhuowen Tu. 2015. Deeplysupervised nets. In AISTATS, volume 2, page 5.
- Tao Lei, Regina Barzilay, and Tommi Jaakkola. 2015. Molding cnns for text: non-linear, non-consecutive convolutions. Empirical Methods in Natural Language Processing.
- Percy Liang, Hal Daume III, and Dan Klein. 2008. Structure compilation: trading structure for features. In Proceedings of the 25th international conference on Machine learning, pages 592–599. ACM.
- Wang Ling, Tiago Luís, Luís Marujo, Ramón Fernandez Astudillo, Silvio Amir, Chris Dyer, Alan W Black, and Isabel Trancoso. 2015. Finding function in form: Compositional character models for open vocabulary word representation. In EMNLP.
- Ben London, Bert Huang, and Lise Getoor. 2016. Stability and generalization in structured prediction. Journal of Machine Learning Research, 17(222):1– 52.
- Xuezhe Ma, Yingkai Gao, Zhiting Hu, Yaoliang Yu, Yuntian Deng, and Eduard Hovy. 2017. Dropout with expectation-linear regularization. In ICLR.
- Xuezhe Ma and Eduard Hovy. 2016. End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, pages 1064–1074.
- Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. 2016. Wavenet: A generative model for raw audio. arXiv preprint arXiv:1609.03499.
- Alexandre Passos, Vineet Kumar, and Andrew McCallum. 2014. Lexicon infused phrase embeddings for named entity resolution. In CoNLL.
- Sameer Pradhan, Alessandro Moschitti, Nianwen Xue, Hwee Tou Ng, Anders Björkelund, Olga Uryupina, Yuchen Zhang, and Zhi Zhong. 2013. Towards robust linguistic analysis using OntoNotes. In Proceedings of the Seventeenth Conference on Computational Natural Language Learning, pages 143–152.
- Sameer Pradhan, Alessandro Moschitti, Nianwen Xue, Olga Uryupina, and Yuchen Zhang. 2012. CoNLL-2012 shared task: Modeling multilingual unrestricted coreference in OntoNotes. In Proceedings of the Joint Conference on EMNLP and CoNLL: Shared Task, pages 1–40.
- Lance A Ramshaw and Mitchell P Marcus. 1999. Text chunking using transformation-based learning. In Natural language processing using very large corpora, pages 157–176. Springer.
- Lev Ratinov and Dan Roth. 2009. Design challenges and misconceptions in named entity recognition. In Proceedings of the Thirteenth Conference on Computational Natural Language Learning, pages 147– 155. Association for Computational Linguistics.
- Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, and Yoshua Bengio. 2014. Fitnets: Hints for thin deep nets. arXiv preprint arXiv:1412.6550.
- Nitish Srivastava, Geoffrey E Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958.
- Charles Sutton and Andrew McCallum. 2004. Collective segmentation and labeling of distant entities in information extraction. In ICML Workshop on Statistical Relational Learning.
- Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. 2015. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1–9.
- Erik F Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In Proceedings of the seventh conference on Natural language learning at HLT-NAACL 2003-Volume 4, pages 142–147. Association for Computational Linguistics.
- Kristina Toutanova, Danqi Chen, Patrick Pantel, Hoifung Poon, Pallavi Choudhury, and Michael Gamon. 2015. Representing text for joint embedding of text and knowledge bases. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1499–1509. Association for Computational Linguistics.
- Joseph Turian, Lev Ratinov, and Yoshua Bengio. 2010. Word representations: a simple and general method for semi-supervised learning. In Proceedings of the 48th annual meeting of the association for computational linguistics, pages 384–394. Association for Computational Linguistics.
- David Weiss, Chris Alberti, Michael Collins, and Slav Petrov. 2015. Structured training for neural network transition-based parsing. In Annual Meeting of the Association for Computational Linguistics.
- Zhilin Yang, Ruslan Salakhutdinov, and William Cohen. 2016. Multi-task cross-lingual sequence tagging from scratch. arXiv preprint arXiv:1603.06270.
- Fisher Yu and Vladlen Koltun. 2016. Multi-scale context aggregation by dilated convolutions. In International Conference on Learning Representations (ICLR).
- Xiang Zhang, Junbo Zhao, and Yann LeCun. 2015. Character-level convolutional networks for text classification. In Advances in Neural Information Processing Systems 28 (NIPS).