
An Exploration of Arbitrary Order Sequence Labeling via Energy Based Inference Networks

EMNLP 2020, pp. 5569–5582 (2020)


Abstract

Many tasks in natural language processing involve predicting structured outputs, e.g., sequence labeling, semantic role labeling, parsing, and machine translation. Researchers are increasingly applying deep representation learning to these problems, but the structured component of these approaches is usually quite simplistic. In this work...

Introduction
  • Conditional random fields (CRFs; Lafferty et al, 2001) have been shown to perform well in various sequence labeling tasks.
  • A major challenge with CRFs is the complexity of training and inference, which are quadratic in the number of output labels for first order models and grow exponentially when higher order dependencies are considered.
  • This explains why the most common type of CRF used in practice is a first order model, referred to as a “linear chain” CRF.
  • Joint training of energy functions and inference networks. Belanger and McCallum (2016) proposed a structured hinge loss for learning the energy function parameters Θ, using gradient descent for the “cost-augmented” inference step required during learning. Tu and Gimpel (2018) replaced the cost-augmented inference step in the structured hinge loss with training of a “cost-augmented inference network” FΦ(x) trained with the following goal: FΦ(x) ≈ arg min over y of (EΘ(x, y) − Δ(y, y∗)), where Δ(y, y∗) is the cost of a candidate output y relative to the gold output y∗ (a rough training sketch follows this list).
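
To make this objective concrete, below is a minimal PyTorch-style sketch of the alternating updates, using a toy linear-chain energy and a BiLSTM inference network. The module sizes, the soft Hamming cost, the optimizers, and the toy data are illustrative assumptions, not the paper's configuration (which, among other things, also trains a separate test-time inference network).

    # Hedged sketch of SPEN training with a cost-augmented inference network
    # (after Tu and Gimpel, 2018). Sizes and hyperparameters are placeholders.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    T, L, H, V = 8, 5, 64, 100   # toy sequence length, label set size, hidden size, vocab size

    class InferenceNet(nn.Module):
        """Maps a token sequence to a relaxed label sequence (one distribution per position)."""
        def __init__(self):
            super().__init__()
            self.emb = nn.Embedding(V, H)
            self.rnn = nn.LSTM(H, H, batch_first=True, bidirectional=True)
            self.out = nn.Linear(2 * H, L)

        def forward(self, x):                       # x: (B, T) token ids
            h, _ = self.rnn(self.emb(x))
            return F.softmax(self.out(h), dim=-1)   # (B, T, L) relaxed labels

    class Energy(nn.Module):
        """Toy energy: local (unary) terms plus a first-order label-pair term."""
        def __init__(self):
            super().__init__()
            self.emb = nn.Embedding(V, H)
            self.rnn = nn.LSTM(H, H, batch_first=True, bidirectional=True)
            self.unary = nn.Linear(2 * H, L)
            self.trans = nn.Parameter(0.01 * torch.randn(L, L))

        def forward(self, x, y):                    # y: (B, T, L) relaxed or one-hot labels
            h, _ = self.rnn(self.emb(x))
            local = -(self.unary(h) * y).sum(dim=(1, 2))
            pair = -torch.einsum('btl,lm,btm->b', y[:, :-1], self.trans, y[:, 1:])
            return local + pair                     # lower energy = better label sequence

    energy, cost_aug = Energy(), InferenceNet()
    opt_theta = torch.optim.Adam(energy.parameters(), lr=1e-3)
    opt_phi = torch.optim.Adam(cost_aug.parameters(), lr=1e-3)

    x = torch.randint(0, V, (4, T))                             # toy batch of token ids
    y_gold = F.one_hot(torch.randint(0, L, (4, T)), L).float()  # toy gold labels

    def hinge():
        y_hat = cost_aug(x)
        delta = (y_hat * (1.0 - y_gold)).sum(dim=(1, 2))        # soft Hamming cost Δ(y_hat, y*)
        return torch.relu(delta - energy(x, y_hat) + energy(x, y_gold)).mean()

    for step in range(10):
        # Φ ascends the hinge: find outputs that are low-energy but high-cost ...
        opt_phi.zero_grad()
        (-hinge()).backward()
        opt_phi.step()
        # ... while Θ descends it: push gold outputs below them by the cost margin.
        opt_theta.zero_grad()
        hinge().backward()
        opt_theta.step()

At test time a trained inference network produces labels in a single forward pass, which is why this family of models can match the decoding speed of a simple local classifier.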
Highlights
  • Conditional random fields (CRFs; Lafferty et al, 2001) have been shown to perform well in various sequence labeling tasks
  • A major challenge with CRFs is the complexity of training and inference, which are quadratic in the number of output labels for first order models and grow exponentially when higher order dependencies are considered
  • While the optimal energy function varies by task, we find strong performance from skip-chain terms with short skip distances, convolutional networks with filters that consider label trigrams, and recurrent networks and self-attention networks that consider large subsequences of labels (a sketch of such a convolutional energy term follows this list)
  • Here we find that the framework of SPEN learning with inference networks can support a wide range of high-order energies for sequence labeling
  • In Section 3.3, we considered several ways to define the high-order energy function F
  • We explore arbitrary-order models with different neural parameterizations on sequence labeling tasks via energy-based inference networks
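
To illustrate the convolutional parameterization mentioned above, the sketch below scores every window of M+1 consecutive relaxed label vectors with a bank of 1-D convolution filters. The filter count, the ReLU-plus-linear scoring, and the omission of input-dependent features are simplifying assumptions, not the paper's exact CNN energy.

    import torch
    import torch.nn as nn

    class CNNLabelEnergy(nn.Module):
        """Sketch of a high-order energy term: convolution filters score every
        window of M+1 consecutive (relaxed) label vectors; lower output = better."""
        def __init__(self, num_labels, num_filters=50, M=2):   # M=2 -> label trigrams
            super().__init__()
            self.conv = nn.Conv1d(num_labels, num_filters, kernel_size=M + 1)
            self.score = nn.Linear(num_filters, 1)

        def forward(self, y):                                  # y: (B, T, L) label vectors
            g = torch.relu(self.conv(y.transpose(1, 2)))       # (B, filters, T - M)
            window_scores = self.score(g.transpose(1, 2)).squeeze(-1)   # (B, T - M)
            return -window_scores.sum(dim=1)                   # good n-grams lower the energy

    y = torch.softmax(torch.randn(4, 10, 9), dim=-1)   # toy relaxed label sequences (B, T, L)
    print(CNNLabelEnergy(num_labels=9)(y).shape)       # torch.Size([4])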
Methods
  • The authors perform experiments on four tasks: Twitter part-of-speech tagging (POS), named entity recognition (NER), CCG supertagging (CCG), and semantic role labeling (SRL).
  • The authors use the annotated data from Gimpel et al (2011) and Owoputi et al (2013) which contains 25 POS tags.
  • The authors use the 100-dimensional skip-gram embeddings from Tu et al (2017) which were trained on a dataset of 56 million English tweets using word2vec (Mikolov et al, 2013).
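
As a point of reference, skip-gram embeddings of this kind can be trained with word2vec; the hedged gensim (4.x API) sketch below uses a stand-in corpus and hyperparameters rather than those of Tu et al. (2017).

    from gensim.models import Word2Vec   # gensim 4.x API assumed

    # Stand-in corpus of tokenized tweets; the real corpus (56 million tweets) and
    # the exact hyperparameters from Tu et al. (2017) are not reproduced here.
    tweets = [["just", "landed", "in", "nyc"], ["nyc", "is", "so", "cold", "lol"]]

    model = Word2Vec(sentences=tweets, vector_size=100, sg=1,   # sg=1 -> skip-gram
                     window=5, min_count=1, workers=4, epochs=5)
    vec = model.wv["nyc"]                # a 100-dimensional embedding for a token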
Results
  • In Section 3.3, the authors considered several ways to define the high-order energy function F.
  • The CNN high-order energy is best when M = 2 for the three tasks.
  • The authors consider the impact of the structured energy terms in noisy data settings.
  • The authors' motivation for these experiments stems from the assumption that structured energies will be more helpful when there is a weaker relationship between the observations and the labels.
  • The authors see that NER shows more benefit from structured energies, so the authors focus on NER and consider two settings: UnkTest: train on clean text, evaluate on noisy text; and UnkTrain: train on noisy text, evaluate on noisy text.
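
The UnkTest/UnkTrain noising step amounts to independently replacing each word with an unknown-word symbol with probability α; a minimal sketch is below, with "<unk>" as an assumed placeholder for whatever symbol the authors actually use.

    import random

    def corrupt(tokens, alpha, unk="<unk>", seed=None):
        """Replace each token independently with the unknown-word symbol
        with probability alpha ("<unk>" is an assumed placeholder)."""
        rng = random.Random(seed)
        return [unk if rng.random() < alpha else tok for tok in tokens]

    print(corrupt(["West", "Indian", "all-rounder", "Phil", "Simmons"], alpha=0.3, seed=0))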
Conclusion
  • The authors explore arbitrary-order models with different neural parameterizations on sequence labeling tasks via energy-based inference networks.
  • This approach achieves substantial improvements using high-order energy terms, especially in noisy data conditions, while having the same decoding speed as simple local classifiers.
Tables
  • Table1: Time complexity and number of parameters of different methods during training and inference, where T is the sequence length, L is the label set size, Θ denotes the parameters of the energy function, and Φ, Ψ are the parameters of the two energy-based inference networks. For arbitrary-order energy functions or different parameterizations, the size of Θ can differ (a reference Viterbi decoding sketch follows this list)
  • Table2: Development results for different parameterizations of high-order energies when increasing the window size M of consecutive labels, where “all” denotes the whole relaxed label sequence. The inference network architecture is a one-layer BiLSTM. We ran t-tests for the mean performance (over five runs) of our proposed energies (the settings in bold) and the linear-chain energy. All differences are significant at p < 0.001 for NER and p < 0.005 for other tasks
  • Table3: Test results on all tasks for local classifiers (BiLSTM) and different structured energy functions. POS/CCG use accuracy while NER/SRL use F1. The architecture of inference networks is one-layer BiLSTM. More results are shown in the appendix
  • Table4: Test results when inference networks have 2 layers (so the local classifier baseline also has 2 layers)
  • Table5: UnkTest setting for NER: words in the test set are replaced by the unknown word symbol with probability α. The CNN energies (the settings in bold) differ significantly from the linear-chain energy (p < 0.001)
  • Table6: UnkTrain setting for NER: training on noisy text, evaluating on noisy test sets. Words are replaced by the unknown word symbol with probability α. The CNN energies (the settings in bold) differ significantly from the linear-chain energy (p < 0.001)
  • Table7: Test results for NER when using BERT. When using energy-based inference networks (our framework), BERT is used in both the energy function and as the inference network architecture
  • Table8: Top 10 CNN filters with high inner product with 3 consecutive labels for NER
  • Table9: Results on all tasks for local classifiers and different structured energy functions: linear-chain energy, Kronecker product high-order energies, skip-chain energy, and fully-connected energies. The metrics for the four tasks (POS, NER, CCG, SRL) are accuracy, F1, accuracy, and F1, respectively. The architecture of inference networks is one-layer BiLSTM
  • Table10: Results when inference networks use 2-layer BiLSTMs (so the local classifier baseline also has 2 layers)
  • Table11: UnkTest setting for NER: Words in the test set are randomly replaced by the unknown word symbol with probability α
  • Table12: UnkTrain setting for NER: training on noisy text, evaluating on noisy test sets. Words are randomly replaced by the unknown word symbol with probability α
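
For reference, the O(T · L²) inference cost listed in Table 1 for first-order models corresponds to standard Viterbi decoding, sketched below. This illustrates the baseline CRF decoder, not the paper's inference networks, which replace the dynamic program with a single forward pass.

    import numpy as np

    def viterbi(unary, trans):
        """First-order Viterbi decoding; the inner loop is the O(T * L^2) cost
        contrasted with inference networks in Table 1.
        unary: (T, L) local scores; trans: (L, L) transition scores (higher = better)."""
        T, L = unary.shape
        score = unary[0].copy()
        back = np.zeros((T, L), dtype=int)
        for t in range(1, T):                                   # T * L^2 work
            cand = score[:, None] + trans + unary[t][None, :]   # (prev label, current label)
            back[t] = cand.argmax(axis=0)
            score = cand.max(axis=0)
        path = [int(score.argmax())]
        for t in range(T - 1, 0, -1):                           # backtrace
            path.append(int(back[t, path[-1]]))
        return path[::-1]

    print(viterbi(np.random.randn(6, 5), np.random.randn(5, 5)))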
Funding
  • This research was supported in part by an Amazon Research Award to K
Study subjects and analysis
datasets: 3
The deeper inference networks reach higher performance across all tasks compared to 1-layer inference networks. We observe that inference networks trained with skip-chain energies and high-order energies achieve better results than BiLSTM-CRF on the three datasets (the Viterbi algorithm is used for BiLSTM-CRF decoding). M values are retuned based on dev sets when using 2-layer inference networks.

Reference
  • David Belanger and Andrew McCallum. 2016. Structured prediction energy networks. In Proceedings of the 33rd International Conference on Machine Learning.
  • David Belanger, Bishan Yang, and Andrew McCallum. 2017. End-to-end learning for structured prediction energy networks. In Proceedings of the 34th International Conference on Machine Learning.
  • Xavier Carreras and Lluís Màrquez. 2005. Introduction to the CoNLL-2005 shared task: Semantic role labeling. In Proceedings of the Ninth Conference on Computational Natural Language Learning (CoNLL-2005), pages 152–164, Ann Arbor, Michigan. Association for Computational Linguistics.
  • David Chiang. 2007. Hierarchical phrase-based translation. Computational Linguistics, 33(2):201–228.
  • William W. Cohen and Vitor Rocha de Carvalho. 2005. Stacked sequential learning. In IJCAI-05, Proceedings of the Nineteenth International Joint Conference on Artificial Intelligence, Edinburgh, Scotland, UK, July 30 - August 5, 2005, pages 671–676.
  • Ronan Collobert, Jason Weston, Leon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. JMLR.
  • Nguyen Viet Cuong, Nan Ye, Wee Sun Lee, and Hai Leong Chieu. 2014. Conditional random field with high-order dependencies for sequence labeling and segmentation. Journal of Machine Learning Research, 15(28):981–1009.
  • Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.
  • Jenny Rose Finkel, Trond Grenager, and Christopher Manning. 2005. Incorporating non-local information into information extraction systems by Gibbs sampling. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL’05), pages 363–370, Ann Arbor, Michigan. Association for Computational Linguistics.
  • Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative adversarial nets. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 2672–2680. Curran Associates, Inc.
  • Sepp Hochreiter and Jurgen Schmidhuber. 1997. Long short-term memory. Neural Computation.
  • Julia Hockenmaier and Mark Steedman. 2002. Acquiring compact lexicalized grammars from a cleaner treebank. In Proceedings of the Third International Conference on Language Resources and Evaluation (LREC’02), Las Palmas, Canary Islands - Spain. European Language Resources Association (ELRA).
  • Liang Huang and David Chiang. 2007. Forest rescoring: Faster decoding with integrated language models. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pages 144–151, Prague, Czech Republic. Association for Computational Linguistics.
  • Yoon Kim. 2014. Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1746–1751, Doha, Qatar. Association for Computational Linguistics.
  • Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
  • Vijay Krishnan and Christopher D. Manning. 2006. An effective two-stage model for exploiting non-local dependencies in named entity recognition. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, pages 1121–1128, Sydney, Australia. Association for Computational Linguistics.
  • John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proc. of ICML.
  • Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. 2016. Neural architectures for named entity recognition. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 260–270.
  • Yann LeCun, Sumit Chopra, Raia Hadsell, Marc’Aurelio Ranzato, and Fu-Jie Huang. 2006. A tutorial on energy-based learning. In Predicting Structured Data. MIT Press.
  • Olutobi Owoputi, Brendan O’Connor, Chris Dyer, Kevin Gimpel, Nathan Schneider, and Noah A. Smith. 2013. Improved part-of-speech tagging for online conversational text with word clusters. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 380–390.
  • Tao Lei, Yu Xin, Yuan Zhang, Regina Barzilay, and Tommi Jaakkola. 2014. Low-rank tensors for scoring dependency structures. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1381–1391, Baltimore, Maryland. Association for Computational Linguistics.
  • Bruce T. Lowerre. 1976. The HARPY Speech Recognition System. Ph.D. thesis, Pittsburgh, PA, USA.
  • Xuezhe Ma and Eduard Hovy. 2016. End-to-end sequence labeling via bi-directional LSTM-CNNsCRF. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1064–1074.
  • Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Proc. of EMNLP.
  • Xian Qian, Xiaoqian Jiang, Qi Zhang, Xuanjing Huang, and Lide Wu. 2009. Sparse higher order conditional random fields for improved sequence labeling. In ICML.
  • Dan Roth and Wen-tau Yih. 2004. A linear programming formulation for global inference in natural language tasks. In Proceedings of the Eighth Conference on Computational Natural Language Learning (CoNLL-2004) at HLT-NAACL 2004, pages 1–8, Boston, Massachusetts, USA. Association for Computational Linguistics.
  • Andre Martins, Noah Smith, Mario Figueiredo, and Pedro Aguiar. 2011. Dual decomposition with many overlapping components. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 238–249, Edinburgh, Scotland, UK. Association for Computational Linguistics.
  • Alexander M. Rush, David Sontag, Michael Collins, and Tommi Jaakkola. 2010. On dual decomposition and linear programming relaxations for natural language processing. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 1–11, Cambridge, MA. Association for Computational Linguistics.
  • Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 3111–3119. Curran Associates, Inc.
  • Thomas Mueller, Helmut Schmid, and Hinrich Schutze. 2013. Efficient higher-order CRFs for morphological tagging. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 322–332, Seattle, Washington, USA. Association for Computational Linguistics.
  • Kevin P. Murphy, Yair Weiss, and Michael I. Jordan. 1999. Loopy belief propagation for approximate inference: An empirical study. In UAI.
  • Franz Josef Och, Daniel Gildea, Sanjeev Khudanpur, Anoop Sarkar, Kenji Yamada, Alex Fraser, Shankar Kumar, Libin Shen, David Smith, Katherine Eng, Viren Jain, Zhen Jin, and Dragomir Radev. 2004. A smorgasbord of features for statistical machine translation. In Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics: HLT-NAACL 2004, pages 161–168, Boston, Massachusetts, USA. Association for Computational Linguistics.
  • Sunita Sarawagi and William W Cohen. 2005. SemiMarkov conditional random fields for information extraction. In L. K. Saul, Y. Weiss, and L. Bottou, editors, Advances in Neural Information Processing Systems 17, pages 1185–1192. MIT Press.
  • David Smith and Jason Eisner. 2008. Dependency parsing by belief propagation. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, pages 145–156, Honolulu, Hawaii. Association for Computational Linguistics.
  • Vivek Srikumar and Christopher D Manning. 2014. Learning distributed representations for structured output prediction. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 3266–3274. Curran Associates, Inc.
  • Emma Strubell, Patrick Verga, Daniel Andor, David Weiss, and Andrew McCallum. 2018. Linguistically-informed self-attention for semantic role labeling. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 5027–5038, Brussels, Belgium. Association for Computational Linguistics.
  • Charles Sutton and Andrew McCallum. 2004. Collective segmentation and labeling of distant entities in information extraction. In ICML Workshop on Statistical Relational Learning and Its Connections to Other Fields.
  • Zhixing Tan, Mingxuan Wang, Jun Xie, Yidong Chen, and Xiaodong Shi. 2018. Deep semantic role labeling with self-attention. In Proceedings of AAAI.
  • Ben Taskar, Carlos Guestrin, and Daphne Koller. 2004. Max-margin Markov networks. In S. Thrun, L. K. Saul, and B. Scholkopf, editors, Advances in Neural Information Processing Systems 16, pages 25–32.
  • Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, pages 142–147.
  • Ioannis Tsochantaridis, Thomas Hofmann, Thorsten Joachims, and Yasemin Altun. 2004. Support vector machine learning for interdependent and structured output spaces. In Proceedings of the Twenty-first International Conference on Machine Learning.
  • Lifu Tu and Kevin Gimpel. 2018. Learning approximate inference networks for structured prediction. In Proceedings of International Conference on Learning Representations (ICLR).
  • Lifu Tu and Kevin Gimpel. 2019. Benchmarking approximate inference methods for neural structured prediction. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 3313–3324, Minneapolis, Minnesota. Association for Computational Linguistics.
  • Lifu Tu, Kevin Gimpel, and Karen Livescu. 2017. Learning to embed words in context for syntactic tasks. In Proceedings of the 2nd Workshop on Representation Learning for NLP, pages 265–275.
  • Lifu Tu, Richard Yuanzhe Pang, and Kevin Gimpel. 2020a. Improving joint training of inference networks and structured prediction energy networks. Proceedings of the 4th Workshop on Structured Prediction for NLP.
  • Lifu Tu, Richard Yuanzhe Pang, Sam Wiseman, and Kevin Gimpel. 2020b. ENGINE: Energy-based inference networks for non-autoregressive machine translation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics.
  • Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 5998–6008. Curran Associates, Inc.
  • Yi Yang and Jacob Eisenstein. 2013. A log-linear model for unsupervised text normalization. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 61–72, Seattle, Washington, USA. Association for Computational Linguistics.
  • Nan Ye, Wee S. Lee, Hai L. Chieu, and Dan Wu. 2009. Conditional random fields with high-order features for sequence labeling. In Y. Bengio, D. Schuurmans, J. D. Lafferty, C. K. I. Williams, and A. Culotta, editors, Advances in Neural Information Processing Systems 22, pages 2196–2204. Curran Associates, Inc.
  • Mo Yu, Mark Dredze, Raman Arora, and Matthew R. Gormley. 2016. Embedding lexical features via lowrank tensors. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1019–1029, San Diego, California. Association for Computational Linguistics.
  • Jie Zhou and Wei Xu. 2015. End-to-end learning of semantic role labeling using recurrent neural networks. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1127–1137, Beijing, China. Association for Computational Linguistics.