# Training for Gibbs Sampling on Conditional Random Fields with Neural Scoring Factors

EMNLP 2020, pp.4999-5011, (2020)

Abstract

Most recent improvements in NLP come from changes to the neural network architectures modeling the text input. Yet, state-of-the-art models often rely on simple approaches to model the label space, e.g. bigram Conditional Random Fields (CRFs) in sequence tagging. More expressive graphical models are rarely used due to their prohibitive co...

Introduction

- Complex probabilistic graphical models were widely adopted for NLP tasks before the prevalence of deep learning (e.g. the skip-chain CRF of Finkel et al (2005) and Sutton and Mccallum (2004) for NER).
- Consider two contrasting approaches to structured prediction: transition-based models and graphical models.
- Transition-based models (e.g. the sequence-to-sequence models of Sutskever et al (2014)) have enjoyed recent success thanks to their ability to have unbounded memory of past transitions when predicting subsequent ones; yet because no conditional independence assumptions are made, inference is typically restricted to greedy search and its variants.
- Graphical models make strong conditional independence assumptions, but enjoy a wealth of inference algorithms, both exact and approximate, as a result.
- The authors focus on the latter approach, graphical models.

Highlights

- Complex probabilistic graphical models were widely adopted for NLP tasks before the prevalence of deep learning (e.g. the skip-chain Conditional Random Fields (CRFs) of Finkel et al (2005) and Sutton and Mccallum (2004) for named entity recognition (NER))
- To decode a neural CRF model, we find the output that maximizes the scoring function by sampling from the conditional distribution defined in Eq 1 with Markov Chain Monte Carlo (MCMC)
- We present the NER results of our neural skip-chain CRF model with Flair embedding in Table 2
- The skip-chain CRF has context dependent transition and skip-chain factors, and is trained with Neural SampleRank (NSR), while all other models are trained with standard MLE
- When trained with Flair embedding, our neural skip-chain CRF model does not improve over baseline for English and German
- We observe that block Gibbs sampling can improve the performance of the skip-chain model, which effectively leverages long-range context dependencies
- We have proposed Neural SampleRank (NSR), an efficient algorithm for approximate inference and training for CRF models with neural network factors

Methods

- 5.1 Dataset and Model Configuration

The authors evaluate Neural SampleRank for sequence tagging models on the CoNLL-02 Dutch (Tjong Kim Sang, 2002) and CoNLL-03 English and German NER datasets (Tjong Kim Sang and De Meulder, 2003). Summary statistics of the training set for each language are shown in Table 1.
- The authors use GloVe (Pennington et al, 2014) for English, and fastText (Bojanowski et al, 2017) for German and Dutch.
- The authors use Flair (Akbik et al, 2019) in its recommended settings for each language.

Results

- The authors present the NER results of the neural skip-chain CRF model with Flair embedding in Table 2.
- The skip-chain CRF has context dependent transition and skip-chain factors, and is trained with Neural SampleRank (NSR), while all other models are trained with standard MLE.
- When trained with Flair embedding, the neural skip-chain CRF model does not improve over baseline for English and German.
- The F1 score difference between baseline and neural skip-chain CRF on German is not statistically significant.
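- The authors' skip-chain neural CRF model significantly improves the Flair model's performance on Dutch (p < 0.01), achieving a new state of the art.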

Conclusion

- The authors have proposed Neural SampleRank (NSR), an efficient algorithm for approximate inference and training for CRF models with neural network factors.
- NSR is computationally efficient for arbitrarily complex graphical models, applicable to a wide range of structured prediction tasks.
- The authors' proposed method paves the way for new neural graphical models to be designed for these tasks.
- The linear-chain model runs at 8,000 tokens per second for training with MLE.


- Table 1: Training set statistics of CoNLL-03 English and German, and CoNLL-02 Dutch
- Table 2: NER F1 score comparison on the CoNLL-03 English and German, and CoNLL-02 Dutch datasets, with contextualized embeddings. Bold indicates the highest score, “*” indicates statistical significance compared with baseline
- Table 3: NER F1 score comparison for English, without contextualized word embeddings
- Table 4: NER F1 score comparison for German, without contextualized word embeddings
- Table 5: Ablation results on the development set for English. Each row changes one component while keeping all of the others
- Table 6: Number of entities of each type in the 2003 and 2006 versions of ground truth labels for CoNLL-03 German
- Table 7: NER F1 score comparisons on the CoNLL-03 German dataset, between 2003 and 2006 ground truth label versions. Bold indicates the highest score, “*” indicates statistical significance compared with baseline

Related work

- Various approaches have been taken in NLP to combine graphical models and neural architectures. For sequence tagging tasks like NER, it is common to use a linear-chain CRF model (Huang et al, 2015; Lample et al, 2016), for which exact inference can be done in polynomial time with forward-backward. Malaviya et al (2018) adopt a factorized CRF to model the output space of morphological tagging, for which exact inference is tractable with belief propagation. Ganea and Hofmann (2017) propose a fully connected binary CRF to model mention sequences for the entity linking task, and they use loopy belief propagation for approximate inference.
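To make the polynomial-time claim for linear-chain CRFs concrete, the following is a minimal sketch of the forward pass that computes the log-partition function. The function names and the list-of-lists layout for `emissions` and `transitions` are assumptions made for illustration and are not taken from the paper.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) over a list of floats."""
    m = max(values)
    if m == float("-inf"):
        return m
    return m + math.log(sum(math.exp(v - m) for v in values))


def log_partition(emissions, transitions):
    """Forward algorithm for a linear-chain CRF (illustrative layout).

    emissions[t][k]   : score of label k at position t
    transitions[j][k] : score of moving from label j to label k
    Returns log Z, the log of the sum of exp(score) over all label sequences.
    """
    num_labels = len(transitions)
    alpha = list(emissions[0])  # log-scores of all length-1 prefixes
    for t in range(1, len(emissions)):
        alpha = [
            emissions[t][k]
            + logsumexp([alpha[j] + transitions[j][k] for j in range(num_labels)])
            for k in range(num_labels)
        ]
    return logsumexp(alpha)
```

For a sequence of length T with K labels this costs O(T K^2) operations, whereas summing over all K^T label sequences directly is exponential in T. The recursion relies on each factor touching only adjacent labels, which is the property that skip-chain factors break.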

Other approaches have been proposed to adopt expressive graphical models while keeping inference computationally feasible, but they have not been applied to deep neural networks. Steinhardt and Liang (2015) propose to select non-local contexts while keeping the model feasible for exact inference. Finkel et al (2005) use Gibbs sampling with simulated annealing for fast approximate inference in models with non-local factors. Sutton and Mccallum (2004) propose a skip-chain CRF for NER learned with loopy belief propagation. SampleRank (Wick et al, 2011; Zhang et al, 2014) introduces a training objective targeted at sampling-based inference that is efficient in terms of both computation cost and task performance.

In prior work, Gibbs sampling has been used with deep neural networks for Bayesian posterior inference (Shi et al, 2017; Tran et al, 2016) and for sampling from conditional sequence models (Lin and Eisner, 2018). Gibbs sampling was widely applied to discriminative models only before the prevalence of deep learning, and it has been restricted to generative models when used with neural models (Das et al, 2015; Nguyen et al, 2015; Xun et al, 2017). To the best of our knowledge, we are the first to use Gibbs sampling to obtain point estimates for neural network graphical model hybrids, for the task of structured prediction.
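The idea behind a sampling-targeted ranking objective can be illustrated with a generic pairwise hinge loss over two sampled label sequences, ranked by negative Hamming distance to the gold labels. This is only a sketch of the general SampleRank idea: the function names, the fixed `margin`, and the exact form of the loss are assumptions, and the loss used by Neural SampleRank in the paper may differ.

```python
def neg_hamming(pred, gold):
    """Negative Hamming distance: 0 for a perfect match, lower is worse."""
    return -sum(p != g for p, g in zip(pred, gold))


def pairwise_rank_loss(score_a, score_b, labels_a, labels_b, gold, margin=1.0):
    """Hinge loss that asks the model to score the better sample higher.

    score_a, score_b   : model scores of two sampled label sequences
    labels_a, labels_b : the sampled label sequences themselves
    margin             : hypothetical fixed margin (an assumption of this sketch)
    """
    metric_a = neg_hamming(labels_a, gold)
    metric_b = neg_hamming(labels_b, gold)
    if metric_a == metric_b:
        return 0.0  # the metric gives no preference, so there is nothing to learn
    if metric_a > metric_b:
        better, worse = score_a, score_b
    else:
        better, worse = score_b, score_a
    return max(0.0, margin - (better - worse))
```

Because the loss only compares the scores of two samples, it never needs the partition function, which is what keeps this style of training cheap for models where exact inference is intractable.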

Funding

- On English, we are able to achieve F1 scores comparable to those of other contextualized embedding models, yet unable to match Akbik et al (2019)
- Our skip-chain neural CRF model significantly improves the Flair model’s performance on Dutch (p < 0.01), achieving a new state of the art
- As shown in Table 3 and Table 4, we are able to significantly improve F1 over baseline on both languages (p < 0.05)
- We also observe that block Gibbs sampling can improve the performance of the skip-chain model, which effectively leverages long-range context dependencies

Study subjects and analysis

cycles of samples: 10

For training, we use negative Hamming distance as the metric in the SampleRank loss and the Adam optimizer (Kingma and Ba, 2014). For Gibbs sampling, at training time we take 10 cycles of samples for each update. (We resample the full label sequence in each cycle.) At decoding time, we set the initial temperature to 10, the annealing rate to 0.95 and take 120 cycles of samples. We ensemble model predictions over 3 runs with majority vote.
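The decoding settings above (initial temperature 10, annealing rate 0.95, 120 cycles, full resampling per cycle) can be sketched on a toy linear-chain model as follows. Several details are assumptions of this sketch rather than facts from the paper: the temperature is multiplied by the rate after every cycle, the highest-scoring sample seen so far is returned, and the scoring uses plain emission and transition tables instead of neural factors.

```python
import math
import random


def sequence_score(labels, emissions, transitions):
    """Total score of a label sequence under a toy linear-chain model."""
    score = sum(emissions[t][y] for t, y in enumerate(labels))
    score += sum(transitions[a][b] for a, b in zip(labels, labels[1:]))
    return score


def annealed_gibbs_decode(emissions, transitions, cycles=120,
                          init_temp=10.0, rate=0.95, seed=0):
    """Gibbs sampling with a geometric cooling schedule (assumed form)."""
    rng = random.Random(seed)
    num_labels = len(transitions)
    labels = [rng.randrange(num_labels) for _ in emissions]
    best, best_score = list(labels), sequence_score(labels, emissions, transitions)
    temp = init_temp
    for _ in range(cycles):
        for t in range(len(labels)):  # one cycle resamples every position
            local = []
            for k in range(num_labels):
                s = emissions[t][k]
                if t > 0:
                    s += transitions[labels[t - 1]][k]
                if t + 1 < len(labels):
                    s += transitions[k][labels[t + 1]]
                local.append(s / temp)  # temperature flattens or sharpens
            m = max(local)
            weights = [math.exp(s - m) for s in local]
            labels[t] = rng.choices(range(num_labels), weights=weights)[0]
        score = sequence_score(labels, emissions, transitions)
        if score > best_score:
            best, best_score = list(labels), score
        temp *= rate  # cool down after each full cycle
    return best
```

Under this assumed schedule the temperature falls from 10 to roughly 0.02 after 120 cycles, so early cycles explore broadly while late cycles behave almost like greedy coordinate ascent.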

samples: 120

Following Keith et al (2018), we approximate the probability of each sample (i.e. tag sequence) with its frequency when calculating the entropy, then plot this empirical entropy against document length. We take 120 samples, by collecting the sample at the end of each cycle (i.e. resampling of the full tag sequence), then split the 120 samples into four 30-sample stages. In Figure 5, we compare the sample entropy of standard MCMC (i.e. without annealing), and MCMC decoding with annealing.
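The empirical entropy described above can be computed directly from sample frequencies. The sketch below is illustrative only: the function names are hypothetical, and the use of natural logarithms (nats) is an assumption, since the text does not state the base.

```python
import math
from collections import Counter


def empirical_entropy(samples):
    """Entropy of the empirical distribution over sampled tag sequences."""
    counts = Counter(tuple(s) for s in samples)
    total = len(samples)
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def staged_entropy(samples, num_stages=4):
    """Split samples into consecutive equal-size stages and score each one."""
    size = len(samples) // num_stages
    return [
        empirical_entropy(samples[i * size:(i + 1) * size])
        for i in range(num_stages)
    ]
```

With 120 samples and four stages, each stage holds 30 samples, so the largest value a stage can report is log(30), reached when every sample in the stage is distinct. An entropy near zero in the last stage indicates that the sampler has settled on a single tag sequence.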

Reference

- Alan Akbik, Tanja Bergmann, and Roland Vollgraf. 2019. Pooled contextualized embeddings for named entity recognition. In NAACL 2019, 2019 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 724–728.
- Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135–146.
- Koby Crammer and Yoram Singer. 2003. Ultraconservative online algorithms for multiclass problems. Journal of Machine Learning Research, 3(Jan):951–991.
- Rajarshi Das, Manzil Zaheer, and Chris Dyer. 2015. Gaussian LDA for topic models with word embeddings. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 795–804, Beijing, China. Association for Computational Linguistics.
- Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
- Jenny Rose Finkel, Trond Grenager, and Christopher Manning. 2005. Incorporating non-local information into information extraction systems by gibbs sampling. In Proceedings of the 43rd annual meeting on association for computational linguistics, pages 363–370. Association for Computational Linguistics.
- Octavian-Eugen Ganea and Thomas Hofmann. 2017. Deep joint entity disambiguation with local neural attention. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2619–2629, Copenhagen, Denmark. Association for Computational Linguistics.
- Zhiheng Huang, Wei Xu, and Kai Yu. 2015. Bidirectional lstm-crf models for sequence tagging. arXiv preprint arXiv:1508.01991.
- Katherine Keith, Su Lin Blodgett, and Brendan O’Connor. 2018. Monte Carlo syntax marginals for exploring and using dependency parses. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 917–928, New Orleans, Louisiana. Association for Computational Linguistics.
- Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. CoRR, abs/1412.6980.
- John Lafferty, Andrew McCallum, and Fernando Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the eighteenth international conference on machine learning, ICML, volume 1, pages 282–289.
- Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. 2016. Neural architectures for named entity recognition. In Proceedings of NAACL-HLT, pages 260–270.
- Chu-Cheng Lin and Jason Eisner. 2018. Neural particle smoothing for sampling from conditional sequence models. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), volume 1, pages 929–941.
- Gang Luo, Xiaojiang Huang, Chin-Yew Lin, and Zaiqing Nie. 2015. Joint entity recognition and disambiguation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 879–888.
- Xuezhe Ma and Eduard Hovy. 2016. End-to-end sequence labeling via bi-directional lstm-cnns-crf. arXiv preprint arXiv:1603.01354.
- Chaitanya Malaviya, Matthew R Gormley, and Graham Neubig. 2018. Neural factor graph models for cross-lingual morphological tagging. arXiv preprint arXiv:1805.04570.
- Dat Quoc Nguyen, Richard Billingsley, Lan Du, and Mark Johnson. 2015. Improving topic models with latent feature word representations. Transactions of the Association for Computational Linguistics, 3:299–313.
- Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. 2019. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, pages 8024–8035.
- Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. Glove: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, Doha, Qatar. Association for Computational Linguistics.
- Matthew Peters, Waleed Ammar, Chandra Bhagavatula, and Russell Power. 2017. Semi-supervised sequence tagging with bidirectional language models. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 1756–1765.
- Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), volume 1, pages 2227–2237.
- Martin Riedl and Sebastian Pado. 2018. A named entity recognition shootout for German. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 120–125, Melbourne, Australia. Association for Computational Linguistics.
- Jiaxin Shi, Jianfei Chen, Jun Zhu, Shengyang Sun, Yucen Luo, Yihong Gu, and Yuhao Zhou. 2017. Zhusuan: A library for bayesian deep learning. arXiv preprint arXiv:1709.05870.
- Jacob Steinhardt and Percy Liang. 2015. Reified context models. In International Conference on Machine Learning, pages 1043–1052.
- Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning with neural networks. In Advances in neural information processing systems, pages 3104–3112.
- Charles Sutton and Andrew Mccallum. 2004. Collective segmentation and labeling of distant entities in information extraction. In ICML Workshop on Statistical Relational Learning and Its Connections. Citeseer.
- Erik F. Tjong Kim Sang. 2002. Introduction to the CoNLL-2002 shared task: Language-independent named entity recognition. In COLING-02: The 6th Conference on Natural Language Learning 2002 (CoNLL-2002).
- Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, pages 142–147.
- Dustin Tran, Alp Kucukelbir, Adji B Dieng, Maja Rudolph, Dawen Liang, and David M Blei. 2016. Edward: A library for probabilistic modeling, inference, and criticism. arXiv preprint arXiv:1610.09787.
- Michael L Wick, Khashayar Rohanimanesh, Kedar Bellare, Aron Culotta, and Andrew McCallum. 2011. Samplerank: Training factor graphs with atomic gradients. In ICML, pages 777–784.
- Guangxu Xun, Yaliang Li, Wayne Xin Zhao, Jing Gao, and Aidong Zhang. 2017. A correlated topic model using word embeddings. In Proceedings of the 26th International Joint Conference on Artificial Intelligence, IJCAI17, pages 4207–4213. AAAI Press.
- Zhilin Yang, Ruslan Salakhutdinov, and William Cohen. 2016. Multi-task cross-lingual sequence tagging from scratch. arXiv preprint arXiv:1603.06270.
- Zhilin Yang, Ruslan Salakhutdinov, and William W Cohen. 2017. Transfer learning for sequence tagging with hierarchical recurrent networks. arXiv preprint arXiv:1703.06345.
- Alexander Yeh. 2000. More accurate tests for the statistical significance of result differences. In COLING 2000 Volume 2: The 18th International Conference on Computational Linguistics.
- Yuan Zhang, Tao Lei, Regina Barzilay, Tommi Jaakkola, and Amir Globerson. 2014. Steps to excellence: Simple inference with refined scoring of dependency trees. Association for Computational Linguistics.
