Improving Multimodal Named Entity Recognition via Entity Span Detection with Unified Multimodal Transformer
ACL, pp. 3342-3352, 2020.
Abstract:
In this paper, we study Multimodal Named Entity Recognition (MNER) for social media posts. Existing approaches for MNER mainly suffer from two drawbacks: (1) despite generating word-aware visual representations, their word representations are insensitive to the visual context; (2) most of them ignore the bias brought by the visual context...
Introduction
- Recent years have witnessed the explosive growth of user-generated content on social media platforms such as Twitter.
- While empowering users with rich information, the flourishing of social media creates a pressing need to automatically extract important information from this massive amount of unstructured content.
- As a crucial component of many information extraction tasks, named entity recognition (NER) aims to discover named entities in free text and classify them into pre-defined types, such as person (PER), location (LOC) and organization (ORG).
- Vote for [King of the Jungle MISC] — [Kian PER] or [David PER] ?
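In the BIO tagging scheme commonly used for NER (Tjong Kim Sang and Veenstra, 1999), a B- or I- prefix marks the beginning or inside of an entity span and O marks non-entity tokens. Below is a minimal sketch of how the example above would be encoded and decoded; the tokenization shown is illustrative:

```python
# BIO encoding of the example tweet above (tokenization is illustrative).
tokens = ["Vote", "for", "King", "of", "the", "Jungle", "-", "Kian", "or", "David", "?"]
labels = ["O", "O", "B-MISC", "I-MISC", "I-MISC", "I-MISC", "O", "B-PER", "O", "B-PER", "O"]

def decode_bio(tokens, labels):
    """Recover (entity text, entity type) pairs from a BIO label sequence."""
    entities, span, span_type = [], [], None
    for tok, lab in zip(tokens, labels):
        if lab.startswith("B-"):
            if span:
                entities.append((" ".join(span), span_type))
            span, span_type = [tok], lab[2:]
        elif lab.startswith("I-") and span:
            span.append(tok)
        else:
            if span:
                entities.append((" ".join(span), span_type))
            span = []
    if span:
        entities.append((" ".join(span), span_type))
    return entities

print(decode_bio(tokens, labels))
# [('King of the Jungle', 'MISC'), ('Kian', 'PER'), ('David', 'PER')]
```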
Highlights
- Recent years have witnessed the explosive growth of user-generated content on social media platforms such as Twitter
- We propose a Multimodal Transformer model for the task of Multimodal Named Entity Recognition, which empowers the Transformer with a multimodal interaction module to capture the inter-modality dynamics between words and images (a schematic sketch follows this list)
- Modified conditional random field (CRF) layer for Multimodal Named Entity Recognition: after obtaining the conversion matrix, we further propose to fully leverage the text-based entity span predictions to guide the final predictions of Multimodal Named Entity Recognition (also covered in the sketch below)
- We first presented a Multimodal Transformer architecture for the task of Multimodal Named Entity Recognition, which captures the inter-modal interactions with a multimodal interaction module
- To alleviate the bias of the visual context, we further proposed a Unified Multimodal Transformer (UMT), which incorporates an entity span detection module to guide the final predictions for Multimodal Named Entity Recognition
- Experimental results show that our Unified Multimodal Transformer approach can consistently achieve the best performance on two benchmark datasets
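To make the two ideas above concrete, the sketch below shows (i) cross-modal attention in which word representations attend over image-region features, and (ii) a conversion matrix that maps auxiliary entity span detection (ESD) scores into the MNER label space to guide the final CRF emissions. This is a minimal schematic under stated assumptions, not the authors' exact UMT implementation: the module names, tensor shapes, additive fusion rule, and learned conversion matrix are all illustrative, and the CRF layer comes from the third-party pytorch-crf package.

```python
# Schematic UMT-style model; names, shapes, and the fusion rule are assumptions.
import torch
import torch.nn as nn
from torchcrf import CRF  # third-party package: pip install pytorch-crf

NUM_SPAN_TAGS = 3   # ESD labels: B, I, O
NUM_MNER_TAGS = 9   # e.g., B-/I- for {PER, LOC, ORG, MISC} plus O

class UMTSketch(nn.Module):
    def __init__(self, hidden=768, num_heads=12):
        super().__init__()
        # Cross-modal attention: words (queries) attend to image regions (keys/values).
        self.word_to_image = nn.MultiheadAttention(hidden, num_heads, batch_first=True)
        self.esd_head = nn.Linear(hidden, NUM_SPAN_TAGS)       # text-only span detector
        self.mner_head = nn.Linear(2 * hidden, NUM_MNER_TAGS)  # multimodal emissions
        # Conversion matrix: maps span-tag scores into the MNER label space, so a
        # token the ESD module considers O discourages all entity labels, etc.
        self.conversion = nn.Parameter(torch.zeros(NUM_SPAN_TAGS, NUM_MNER_TAGS))
        self.crf = CRF(NUM_MNER_TAGS, batch_first=True)

    def forward(self, words, regions, tags=None, mask=None):
        # words:   (batch, seq_len, hidden), e.g., BERT outputs
        # regions: (batch, 49, hidden), e.g., a projected 7x7 ResNet feature map
        img_aware, _ = self.word_to_image(words, regions, regions)
        fused = torch.cat([words, img_aware], dim=-1)

        esd_scores = self.esd_head(words)      # purely text-based span scores
        emissions = self.mner_head(fused)      # image-aware MNER emission scores
        # Guide the final emissions with the converted span predictions.
        guided = emissions + esd_scores.softmax(-1) @ self.conversion

        if tags is not None:  # training: negative log-likelihood under the CRF
            return -self.crf(guided, tags, mask=mask)
        return self.crf.decode(guided, mask=mask)  # inference: best tag paths

# Hypothetical usage with random features standing in for BERT/ResNet outputs.
model = UMTSketch()
words, regions = torch.randn(2, 16, 768), torch.randn(2, 49, 768)
print(model(words, regions)[0])  # decoded MNER tag ids for the first sequence
```

In the paper itself, the interaction module produces both image-aware word representations and word-aware visual representations with a visual gate, and the ESD and MNER modules each carry their own CRF layer; the sketch above only conveys the overall data flow.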
Methods
- To reduce feature engineering efforts, a number of recent studies proposed to couple different neural network architectures with a CRF layer (Lafferty et al., 2001) for word-level predictions, including convolutional neural networks (Collobert et al., 2011), recurrent neural networks (Chiu and Nichols, 2016; Lample et al., 2016), and their hierarchical combinations (Ma and Hovy, 2016); a minimal usage sketch follows this list
- These neural approaches have been shown to achieve state-of-the-art performance on different benchmark datasets based on formal text (Yang et al., 2018).
- To the best of our knowledge, this is the first work to apply the Transformer to the task of MNER
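As a concrete illustration of the neural-network-plus-CRF pattern mentioned above, here is a minimal BiLSTM-CRF sequence labeler in the spirit of Lample et al. (2016); the hyperparameters are arbitrary, and the CRF layer again comes from the third-party pytorch-crf package:

```python
# Minimal BiLSTM-CRF sequence labeler (illustrative hyperparameters).
import torch
import torch.nn as nn
from torchcrf import CRF  # third-party package: pip install pytorch-crf

class BiLSTMCRF(nn.Module):
    def __init__(self, vocab_size, num_tags, embed_dim=100, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden // 2, batch_first=True,
                            bidirectional=True)
        self.emit = nn.Linear(hidden, num_tags)      # per-token emission scores
        self.crf = CRF(num_tags, batch_first=True)   # learns tag-transition scores

    def loss(self, token_ids, tags, mask):
        feats, _ = self.lstm(self.embed(token_ids))
        return -self.crf(self.emit(feats), tags, mask=mask)

    def predict(self, token_ids, mask):
        feats, _ = self.lstm(self.embed(token_ids))
        return self.crf.decode(self.emit(feats), mask=mask)

# Hypothetical usage on a toy batch of word-id sequences.
model = BiLSTMCRF(vocab_size=1000, num_tags=9)
ids = torch.randint(0, 1000, (2, 12))
mask = torch.ones(2, 12, dtype=torch.bool)
print(model.predict(ids, mask))  # best BIO tag ids per sequence
```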
Results
- Experimental results show that the Unified Multimodal Transformer (UMT) brings consistent performance gains over several highly competitive unimodal and multimodal methods, and outperforms the state of the art by a relative improvement of 3.7% and 3.8% on the two benchmarks, respectively (relative improvement is illustrated in the note after this list).
The main contributions of this paper can be summarized as follows:
- The authors propose a Multimodal Transformer model for the task of MNER, which empowers the Transformer with a multimodal interaction module to capture the inter-modality dynamics between words and images.
- It is easy to see that empowering BERT with a CRF layer can further boost the performance
- All these observations indicate that contextualized word representations are quite helpful for the NER task on social media texts, due to their context-aware nature.
- The authors can see that while GVATT-HBiLSTM-CRF and AdaCAN-CNN-BiLSTM-CRF significantly outperform their unimodal baselines, the performance gains become relatively limited when their sentence encoders are replaced with BERT
- This highlights the challenge of the task and the necessity of a more effective multimodal approach
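For reference, a relative improvement expresses a gain as a fraction of the previous best score, i.e., (new - old) / old. A minimal illustration follows; the F1 values in the snippet are hypothetical placeholders, not the paper's reported numbers:

```python
# Relative improvement of a new F1 score over an old one, in percent.
def relative_improvement(new_f1: float, old_f1: float) -> float:
    return (new_f1 - old_f1) / old_f1 * 100

# Hypothetical scores: a 2.7-point absolute gain over 70.7 F1
# corresponds to roughly a 3.8% relative improvement.
print(f"{relative_improvement(73.4, 70.7):.1f}%")  # -> 3.8%
```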
Conclusion
- The authors first presented a Multimodal Transformer architecture for the task of MNER, which captures the inter-modal interactions with a multimodal interaction module.
- Despite bringing performance improvements over existing MNER methods, the UMT approach still fails to perform well on social media posts with unmatched text and images, as analyzed in Section 3.5.
- Since the size of existing MNER datasets is relatively small, the authors plan to leverage the large amount of unlabeled social media posts on different platforms, and to propose an effective framework that combines them with the small amount of annotated data to obtain a more robust MNER model
Tables
- Table 1: The basic statistics of our two TWITTER datasets
- Table 2: Performance comparison on our two TWITTER datasets. † indicates that UMT-BERT-CRF is significantly better than GVATT-BERT-CRF and AdaCAN-BERT-CRF with p-value < 0.05 based on a paired t-test
- Table 3: Ablation study of the Unified Multimodal Transformer
- Table 4: The second row shows several representative samples together with their manually labeled entities in the test set of our two TWITTER datasets, and the bottom four rows show the entities predicted by different methods on these test samples
Related work
- As a crucial component of many information extraction tasks including entity linking (Derczynski et al., 2015), opinion mining (Maynard et al., 2012), and event detection (Ritter et al., 2012), named entity recognition (NER) has attracted much attention in the research community over the past two decades (Li et al., 2018).
Funding
- This research is supported by the National Research Foundation, Singapore under its International Research Centres in Singapore Funding Initiative, and the Natural Science Foundation of China under Grant 61672288
Reference
- Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. 2016. Layer normalization. arXiv preprint arXiv:1607.06450.
- Timothy Baldwin, Marie-Catherine de Marneffe, Bo Han, Young-Bum Kim, Alan Ritter, and Wei Xu. 2015. Shared tasks of the 2015 workshop on noisy user-generated text: Twitter lexical normalization and named entity recognition. In Proceedings of the Workshop on Noisy User-generated Text, pages 126–135.
- Hai Leong Chieu and Hwee Tou Ng. 2002. Named entity recognition: a maximum entropy approach using global information. In Proceedings of COLING.
- Jason PC Chiu and Eric Nichols. 2016. Named entity recognition with bidirectional lstm-cnns. Transactions of the Association for Computational Linguistics, 4:357–370.
- Ronan Collobert, Jason Weston, Leon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12:2493–2537.
- Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. 2016. Neural architectures for named entity recognition. In Proceedings of NAACL-HLT.
- Chenliang Li, Aixin Sun, Jianshu Weng, and Qi He. 2014. Tweet segmentation and its application to named entity recognition. IEEE Transactions on knowledge and data engineering, 27(2):558–570.
- Chenliang Li, Jianshu Weng, Qi He, Yuxia Yao, Anwitaman Datta, Aixin Sun, and Bu-Sung Lee. 2012. Twiner: named entity recognition in targeted twitter stream. In Proceedings of SIGIR.
- Jing Li, Aixin Sun, Jianglei Han, and Chenliang Li. 2018. A survey on deep learning for named entity recognition. arXiv preprint arXiv:1812.09449.
- Nut Limsopatham and Nigel Collier. 2016. Bidirectional LSTM for named entity recognition in twitter messages. In Proceedings of the 2nd Workshop on Noisy User-generated Text (WNUT).
- Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255.
- Leon Derczynski, Diana Maynard, Giuseppe Rizzo, Marieke Van Erp, Genevieve Gorrell, Raphael Troncy, Johann Petrak, and Kalina Bontcheva. 2015. Analysis of named entity recognition and linking for tweets. Information Processing & Management, 51(2):32–49.
- Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
- Jenny Rose Finkel, Trond Grenager, and Christopher Manning. 2005. Incorporating non-local information into information extraction systems by gibbs sampling. In Proceedings of ACL.
- Radu Florian, Abe Ittycheriah, Hongyan Jing, and Tong Zhang. 2003. Named entity recognition through classifier combination. In Proceedings of NAACL.
- Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of CVPR, pages 770–778.
- Zhiheng Huang, Wei Xu, and Kai Yu. 2015. Bidirectional lstm-crf models for sequence tagging. arXiv preprint arXiv:1508.01991.
- John Lafferty, Andrew McCallum, and Fernando CN Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of ICML.
- Bill Yuchen Lin, Frank F Xu, Zhiyi Luo, and Kenny Zhu. 2017. Multi-channel bilstm-crf model for emerging named entity recognition in social media. In Proceedings of the 3rd Workshop on Noisy User-generated Text.
- Dekang Lin and Xiaoyun Wu. 2009. Phrase clustering for discriminative learning. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP.
- Di Lu, Leonardo Neves, Vitor Carvalho, Ning Zhang, and Heng Ji. 2018. Visual attention model for name tagging in multimodal social media. In Proceedings of ACL, pages 1990–1999.
- Gang Luo, Xiaojiang Huang, Chin-Yew Lin, and Zaiqing Nie. 2015. Joint entity recognition and disambiguation. In Proceedings of EMNLP.
- Xuezhe Ma and Eduard Hovy. 2016. End-to-end sequence labeling via bi-directional lstm-cnns-crf. In Proceedings of ACL.
- Diana Maynard, Kalina Bontcheva, and Dominic Rout. 2012. Challenges in developing opinion mining tools for social media. In Proceedings of the @NLP can u tag #usergeneratedcontent?! Workshop at LREC.
- Seungwhan Moon, Leonardo Neves, and Vitor Carvalho. 2018. Multimodal named entity recognition for short social media posts. In Proceedings of NAACL.
- Alexandre Passos, Vineet Kumar, and Andrew McCallum. 2014. Lexicon infused phrase embeddings for named entity resolution. In Proceedings of CoNLL.
- Lev Ratinov and Dan Roth. 2009. Design challenges and misconceptions in named entity recognition. In Proceedings of CoNLL.
- Alan Ritter, Sam Clark, Mausam, and Oren Etzioni. 2011. Named entity recognition in tweets: an experimental study. In Proceedings of ACL.
- Alan Ritter, Oren Etzioni, and Sam Clark. 2012. Open domain event extraction from twitter. In Proceedings of SIGKDD.
- Erik F. Tjong Kim Sang and Jorn Veenstra. 1999. Representing text chunks. In Proceedings of EACL.
- Jun'ichi Kazama and Kentaro Torisawa. 2007. Exploiting Wikipedia as external knowledge for named entity recognition. In Proceedings of EMNLP-CoNLL.
- Yao-Hung Hubert Tsai, Shaojie Bai, Paul Pu Liang, J Zico Kolter, Louis-Philippe Morency, and Ruslan Salakhutdinov. 2019. Multimodal transformer for unaligned multimodal language sequences. In Proceedings of ACL.
- Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proceedings of NIPS, pages 5998–6008.
- Vikas Yadav and Steven Bethard. 2018. A survey on recent advances in named entity recognition from deep learning models. In Proceedings of COLING.
- Jie Yang, Shuailong Liang, and Yue Zhang. 2018. Design challenges and misconceptions in neural sequence labeling. In Proceedings of COLING.
- Qi Zhang, Jinlan Fu, Xiaoyu Liu, and Xuanjing Huang. 2018. Adaptive co-attention network for named entity recognition in tweets. In Proceedings of AAAI, pages 5674–5681.
- GuoDong Zhou and Jian Su. 2002. Named entity recognition using an hmm-based chunk tagger. In Proceedings of ACL.