Improving Multimodal Named Entity Recognition via Entity Span Detection with Unified Multimodal Transformer

ACL, pp. 3342-3352, 2020.

Abstract:

In this paper, we study Multimodal Named Entity Recognition (MNER) for social media posts. Existing approaches for MNER mainly suffer from two drawbacks: (1) despite generating word-aware visual representations, their word representations are insensitive to the visual context; (2) most of them ignore the bias brought by the visual context…

Introduction
  • Recent years have witnessed the explosive growth of user-generated contents on social media platforms such as Twitter.
  • While empowering users with rich information, the rapid growth of social media also creates a pressing need to automatically extract important information from this massive unstructured content.
  • As a crucial component of many information extraction tasks, named entity recognition (NER) aims to discover named entities in free text and classify them into pre-defined types, such as person (PER), location (LOC) and organization (ORG).
  • Vote for [King of the Jungle MISC] — [Kian PER] or [David PER] ?
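The bracketed example above follows the standard BIO tagging scheme, in which each token is labeled B-TYPE (beginning of an entity), I-TYPE (inside), or O (outside). As a minimal sketch, the snippet below decodes such a tag sequence back into entities; the token split is a hypothetical simplification of the example tweet, not the paper's tokenization.

```python
# Illustrative only: BIO tags for the example tweet above.
tokens = ["Vote", "for", "King", "of", "the", "Jungle", "—", "Kian", "or", "David", "?"]
tags   = ["O", "O", "B-MISC", "I-MISC", "I-MISC", "I-MISC", "O", "B-PER", "O", "B-PER", "O"]

def extract_entities(tokens, tags):
    """Collect (entity_text, type) spans from a BIO tag sequence."""
    entities, span, etype = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):            # a new entity starts; flush any open span
            if span:
                entities.append((" ".join(span), etype))
            span, etype = [tok], tag[2:]
        elif tag.startswith("I-") and span: # continue the currently open span
            span.append(tok)
        else:                               # O tag closes any open span
            if span:
                entities.append((" ".join(span), etype))
            span, etype = [], None
    if span:
        entities.append((" ".join(span), etype))
    return entities

print(extract_entities(tokens, tags))
# → [('King of the Jungle', 'MISC'), ('Kian', 'PER'), ('David', 'PER')]
```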
Highlights
  • Recent years have witnessed the explosive growth of user-generated contents on social media platforms such as Twitter
  • We propose a Multimodal Transformer model for the task of Multimodal Named Entity Recognition, which empowers Transformer with a multimodal interaction module to capture the inter-modality dynamics between words and images
  • Modified conditional random field (CRF) layer for Multimodal Named Entity Recognition: after obtaining the conversion matrix, we further propose to fully leverage the text-based entity span predictions to guide the final predictions of Multimodal Named Entity Recognition
  • We first presented a Multimodal Transformer architecture for the task of Multimodal Named Entity Recognition, which captures the inter-modal interactions with a multimodal interaction module
  • To alleviate the bias of the visual context, we further proposed a Unified Multimodal Transformer (UMT), which incorporates an entity span detection module to guide the final predictions for Multimodal Named Entity Recognition
  • Experimental results show that our Unified Multimodal Transformer approach can consistently achieve the best performance on two benchmark datasets
Methods
  • To reduce the feature engineering efforts, a number of recent studies proposed to couple different neural network architectures with a CRF layer (Lafferty et al, 2001) for word-level predictions, including convolutional neural networks (Collobert et al, 2011), recurrent neural networks (Chiu and Nichols, 2016; Lample et al, 2016), and their hierarchical combinations (Ma and Hovy, 2016)
  • These neural approaches have been shown to achieve the state-of-the-art performance on different benchmark datasets based on formal text (Yang et al, 2018).
  • To the best of the authors' knowledge, this is the first work to apply Transformer to the task of MNER
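The CRF layer mentioned above scores whole label sequences rather than independent per-word labels, so illegal transitions such as O followed by I-PER can be penalized at decoding time. Below is a toy sketch of the Viterbi decoding a CRF layer performs; the emission and transition scores are made-up numbers for illustration, not learned parameters.

```python
import numpy as np

labels = ["O", "B-PER", "I-PER"]
emissions = np.array([            # one row of per-label scores per word
    [2.0, 0.5, 0.1],              # e.g. "Vote"
    [0.3, 2.5, 0.2],              # e.g. "Kian"
    [0.4, 0.6, 1.8],              # hypothetical continuation token
])
transitions = np.array([          # transitions[i][j]: score of label i -> label j
    [1.0,  0.5, -5.0],            # O -> I-PER is strongly penalized
    [0.2,  0.1,  1.5],
    [0.5,  0.3,  1.0],
])

def viterbi(emissions, transitions):
    """Return the highest-scoring label sequence under emission + transition scores."""
    n, k = emissions.shape
    score = emissions[0].copy()
    back = np.zeros((n, k), dtype=int)
    for t in range(1, n):
        # cand[i, j] = score of ending at label j via previous label i
        cand = score[:, None] + transitions + emissions[t]
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0)
    path = [int(score.argmax())]
    for t in range(n - 1, 0, -1):   # follow backpointers to recover the path
        path.append(int(back[t][path[-1]]))
    return [labels[i] for i in reversed(path)]

print(viterbi(emissions, transitions))
# → ['O', 'B-PER', 'I-PER']
```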
Results
  • Experimental results show that the Unified Multimodal Transformer (UMT) brings consistent performance gains over several highly competitive unimodal and multimodal methods, and outperforms the state of the art by a relative improvement of 3.7% and 3.8% on two benchmarks, respectively.
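Note that the 3.7% and 3.8% figures are relative improvements, i.e. gains expressed as a fraction of the previous best score rather than absolute F1 points. A quick sketch with hypothetical scores:

```python
# Relative vs. absolute improvement; the F1 values below are illustrative
# placeholders, not taken from the paper.
def relative_improvement(new, old):
    return (new - old) / old

old_f1, new_f1 = 70.0, 72.59
print(f"absolute: {new_f1 - old_f1:.2f} points")              # absolute: 2.59 points
print(f"relative: {relative_improvement(new_f1, old_f1):.1%}")  # relative: 3.7%
```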

    The main contributions of this paper can be summarized as follows:

    The authors propose a Multimodal Transformer model for the task of MNER, which empowers Transformer with a multimodal interaction module to capture the inter-modality dynamics between words and images.
  • It is easy to see that empowering BERT with a CRF layer can further boost the performance.
  • All these observations indicate that contextualized word representations are quite helpful for the NER task on social media texts, due to their context-aware characteristics.
  • While GVATT-HBiLSTM-CRF and AdaCAN-CNN-BiLSTM-CRF significantly outperform their unimodal baselines, the performance gains become relatively limited when their sentence encoders are replaced with BERT.
  • This suggests both the challenge and the necessity of devising a more effective multimodal approach.
Conclusion
  • The authors first presented a Multimodal Transformer architecture for the task of MNER, which captures the inter-modal interactions with a multimodal interaction module.
  • Despite bringing performance improvements over existing MNER methods, the UMT approach still fails to perform well on social media posts with unmatched text and images, as analyzed in Section 3.5.
  • Since the size of existing MNER datasets is relatively small, the authors plan to leverage the large amount of unlabeled social media posts on different platforms, and to propose an effective framework that combines them with the small amount of annotated data to obtain a more robust MNER model
Tables
  • Table1: The basic statistics of our two Twitter datasets
  • Table2: Performance comparison on our two TWITTER datasets. † indicates that UMT-BERT-CRF is significantly better than GVATT-BERT-CRF and AdaCAN-BERT-CRF with p-value < 0.05 based on paired t-test
  • Table3: Ablation Study of Unified Multimodal Transformer
  • Table4: The second row shows several representative samples together with their manually labeled entities in the test set of our two TWITTER datasets, and the bottom four rows show predicted entities of different methods on these test samples
Funding
  • This research is supported by the National Research Foundation, Singapore under its International Research Centres in Singapore Funding Initiative, and the Natural Science Foundation of China under Grant 61672288
References
  • Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. 2016. Layer normalization. arXiv preprint arXiv:1607.06450.
  • Timothy Baldwin, Marie-Catherine de Marneffe, Bo Han, Young-Bum Kim, Alan Ritter, and Wei Xu. 2015. Shared tasks of the 2015 workshop on noisy user-generated text: Twitter lexical normalization and named entity recognition. In Proceedings of the Workshop on Noisy User-generated Text, pages 126–135.
  • Hai Leong Chieu and Hwee Tou Ng. 2002. Named entity recognition: A maximum entropy approach using global information. In Proceedings of COLING.
  • Jason P. C. Chiu and Eric Nichols. 2016. Named entity recognition with bidirectional LSTM-CNNs. Transactions of the Association for Computational Linguistics, 4:357–370.
  • Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12:2493–2537.
  • Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. ImageNet: A large-scale hierarchical image database. In Proceedings of CVPR, pages 248–255.
  • Leon Derczynski, Diana Maynard, Giuseppe Rizzo, Marieke van Erp, Genevieve Gorrell, Raphaël Troncy, Johann Petrak, and Kalina Bontcheva. 2015. Analysis of named entity recognition and linking for tweets. Information Processing & Management, 51(2):32–49.
  • Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
  • Jenny Rose Finkel, Trond Grenager, and Christopher Manning. 2005. Incorporating non-local information into information extraction systems by Gibbs sampling. In Proceedings of ACL.
  • Radu Florian, Abe Ittycheriah, Hongyan Jing, and Tong Zhang. 2003. Named entity recognition through classifier combination. In Proceedings of NAACL.
  • Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of CVPR, pages 770–778.
  • Zhiheng Huang, Wei Xu, and Kai Yu. 2015. Bidirectional LSTM-CRF models for sequence tagging. arXiv preprint arXiv:1508.01991.
  • John Lafferty, Andrew McCallum, and Fernando C. N. Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of ICML.
  • Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. 2016. Neural architectures for named entity recognition. In Proceedings of NAACL-HLT.
  • Chenliang Li, Aixin Sun, Jianshu Weng, and Qi He. 2014. Tweet segmentation and its application to named entity recognition. IEEE Transactions on Knowledge and Data Engineering, 27(2):558–570.
  • Chenliang Li, Jianshu Weng, Qi He, Yuxia Yao, Anwitaman Datta, Aixin Sun, and Bu-Sung Lee. 2012. TwiNER: Named entity recognition in targeted Twitter stream. In Proceedings of SIGIR.
  • Jing Li, Aixin Sun, Jianglei Han, and Chenliang Li. 2018. A survey on deep learning for named entity recognition. arXiv preprint arXiv:1812.09449.
  • Nut Limsopatham and Nigel Collier. 2016. Bidirectional LSTM for named entity recognition in Twitter messages. In Proceedings of the 2nd Workshop on Noisy User-generated Text (WNUT).
  • Bill Yuchen Lin, Frank F. Xu, Zhiyi Luo, and Kenny Zhu. 2017. Multi-channel BiLSTM-CRF model for emerging named entity recognition in social media. In Proceedings of the 3rd Workshop on Noisy User-generated Text.
  • Dekang Lin and Xiaoyun Wu. 2009. Phrase clustering for discriminative learning. In Proceedings of ACL-IJCNLP.
  • Di Lu, Leonardo Neves, Vitor Carvalho, Ning Zhang, and Heng Ji. 2018. Visual attention model for name tagging in multimodal social media. In Proceedings of ACL, pages 1990–1999.
  • Gang Luo, Xiaojiang Huang, Chin-Yew Lin, and Zaiqing Nie. 2015. Joint entity recognition and disambiguation. In Proceedings of EMNLP.
  • Xuezhe Ma and Eduard Hovy. 2016. End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF. In Proceedings of ACL.
  • Diana Maynard, Kalina Bontcheva, and Dominic Rout. 2012. Challenges in developing opinion mining tools for social media. In Proceedings of the LREC workshop "@NLP can u tag #usergeneratedcontent?!".
  • Seungwhan Moon, Leonardo Neves, and Vitor Carvalho. 2018. Multimodal named entity recognition for short social media posts. In Proceedings of NAACL.
  • Alexandre Passos, Vineet Kumar, and Andrew McCallum. 2014. Lexicon infused phrase embeddings for named entity resolution. In Proceedings of CoNLL.
  • Lev Ratinov and Dan Roth. 2009. Design challenges and misconceptions in named entity recognition. In Proceedings of CoNLL.
  • Alan Ritter, Sam Clark, Mausam, and Oren Etzioni. 2011. Named entity recognition in tweets: An experimental study. In Proceedings of EMNLP.
  • Alan Ritter, Oren Etzioni, and Sam Clark. 2012. Open domain event extraction from Twitter. In Proceedings of SIGKDD.
  • Erik F. Sang and Jorn Veenstra. 1999. Representing text chunks. In Proceedings of EACL.
  • Kentaro Torisawa et al. 2007. Exploiting Wikipedia as external knowledge for named entity recognition. In Proceedings of EMNLP-CoNLL.
  • Yao-Hung Hubert Tsai, Shaojie Bai, Paul Pu Liang, J. Zico Kolter, Louis-Philippe Morency, and Ruslan Salakhutdinov. 2019. Multimodal transformer for unaligned multimodal language sequences. In Proceedings of ACL.
  • Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proceedings of NIPS, pages 5998–6008.
  • Vikas Yadav and Steven Bethard. 2018. A survey on recent advances in named entity recognition from deep learning models. In Proceedings of COLING.
  • Jie Yang, Shuailong Liang, and Yue Zhang. 2018. Design challenges and misconceptions in neural sequence labeling. In Proceedings of COLING.
  • Qi Zhang, Jinlan Fu, Xiaoyu Liu, and Xuanjing Huang. 2018. Adaptive co-attention network for named entity recognition in tweets. In Proceedings of AAAI, pages 5674–5681.
  • GuoDong Zhou and Jian Su. 2002. Named entity recognition using an HMM-based chunk tagger. In Proceedings of ACL.