Integrating Multimodal Information in Large Pretrained Transformers

Md Kamrul Hasan
Sangwu Lee
Ehsan Hoque

ACL, pp. 2359-2369, 2020.


Abstract:

Recent Transformer-based contextual word representations, including BERT and XLNet, have shown state-of-the-art performance in multiple disciplines within NLP. Fine-tuning the trained contextual models on task-specific datasets has been the key to achieving superior performance downstream. While fine-tuning these pre-trained models is straightforward for lexical applications (applications with only the language modality), it is not trivial for multimodal language. [...]

Introduction
  • Human face-to-face communication flows as a seamless integration of language, acoustic, and vision modalities.
  • These modalities are used jointly to convey speaker intentions and emotions.
  • Understanding this face-to-face communication falls within a growing NLP research area called multimodal language analysis (Zadeh et al., 2018b).
  • The biggest challenge in this area is to efficiently model all three pillars of communication together.
  • This gives artificial intelligence systems the capability to comprehend multi-sensory information without disregarding nonverbal factors.
  • In many applications, such as dialogue systems and virtual reality, this capability is crucial for maintaining high-quality user interaction.
Highlights
  • Human face-to-face communication flows as a seamless integration of language, acoustic, and vision modalities
  • In all metrics on the CMU-MOSI dataset, we observe that Multimodal Adaptation Gate-BERT (MAG-BERT) outperforms state-of-the-art multimodal models that use BERT word embeddings
  • MAG-BERT also outperforms fine-tuned BERT, which shows that the Multimodal Adaptation Gate component allows the BERT model to adapt to multimodal information during fine-tuning, achieving superior performance
  • We train MulT (Multimodal Transformer for Unaligned Multimodal Language Sequences) using the fine-tuned XLNet embeddings and obtain the following performance: 83.6/85.3, 82.6/84.2, 0.810, 0.759, which is lower than both MAG-XLNet and XLNet
  • Using the proposed Multimodal Adaptation Gate (MAG), BERT and XLNet were successfully fine-tuned in the presence of vision and acoustic modalities
  • Our experiments demonstrated the superior performance of MAG-BERT and MAG-XLNet
Methods
  • The authors outline the experiments in this paper, starting with a description of the datasets, followed by the extracted features, baselines, and experimental setup.

    5.1 CMU-MOSI Dataset

    CMU-MOSI (CMU Multimodal Opinion Sentiment Intensity) is a dataset of multimodal language focused on multimodal sentiment analysis (Zadeh et al., 2016).
  • The following computational descriptors are available: Language: The authors transcribe the videos using the YouTube API, followed by manual correction.
  • Visual: For the visual modality, the Facet library is used to extract a set of visual features including facial action units, facial landmarks, head pose, gaze tracking, and HOG features (a word-level alignment sketch follows this list).
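
    As a minimal illustration of how frame-level Facet descriptors can be turned into one visual vector per spoken word, here is a hedged Python sketch. The function name, input format, and zero-vector fallback are assumptions made for illustration, not the authors' released pipeline.

    import numpy as np

    def align_visual_to_words(frame_feats, frame_times, word_spans):
        """Average frame-level visual descriptors over each word's time span.

        frame_feats: (num_frames, d_visual) array of Facet descriptors.
        frame_times: (num_frames,) array of frame timestamps in seconds.
        word_spans: list of (start, end) times, one per word token.
        """
        aligned = []
        for start, end in word_spans:
            in_span = (frame_times >= start) & (frame_times < end)
            if in_span.any():
                aligned.append(frame_feats[in_span].mean(axis=0))
            else:
                # no frame falls inside this word span: fall back to zeros
                aligned.append(np.zeros(frame_feats.shape[1]))
        return np.stack(aligned)  # (num_words, d_visual)
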
Results
  • Results and Discussion

    Table 1 shows the results of the experiments in this paper. The authors summarize the observations from this table as follows:

    6.1 Performance of MAG-BERT

    In all metrics on the CMU-MOSI dataset, the authors observe that MAG-BERT outperforms state-of-the-art multimodal models that use BERT word embeddings.
  • A similar performance trend is observed for MAG-XLNet. In addition to outperforming the baselines and fine-tuned XLNet, MAG-XLNet achieves near-human-level performance on the CMU-MOSI dataset.
Conclusion
  • The authors introduced a method for efficiently fine-tuning large pre-trained Transformer models for multimodal language.
  • MAG essentially poses the nonverbal behavior as a vector with a trajectory and magnitude, which is subsequently used to shift lexical representations within the pre-trained Transformer model (a sketch of this gating-and-shift mechanism follows this list).
  • A unique characteristic of MAG is that it makes no change to the original structure of BERT or XLNet, but rather comes as an attachment to both models.
  • The code for both MAG-BERT and MAG-XLNet is publicly available.
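
    The gating-and-shift idea described above can be sketched as follows. This is a minimal PyTorch rendering based on the description in this summary, not the authors' released implementation; the layer shapes, the beta_shift cap, and the dropout value are assumptions.

    import torch
    import torch.nn as nn

    class MAG(nn.Module):
        """Multimodal Adaptation Gate: shifts lexical vectors using nonverbal cues."""

        def __init__(self, d_text, d_visual, d_acoustic, beta_shift=1.0, dropout=0.5):
            super().__init__()
            self.W_gv = nn.Linear(d_text + d_visual, d_text)    # visual gate
            self.W_ga = nn.Linear(d_text + d_acoustic, d_text)   # acoustic gate
            self.W_v = nn.Linear(d_visual, d_text)                # project visual into text space
            self.W_a = nn.Linear(d_acoustic, d_text)              # project acoustic into text space
            self.beta_shift = beta_shift
            self.layer_norm = nn.LayerNorm(d_text)
            self.dropout = nn.Dropout(dropout)

        def forward(self, text, visual, acoustic):
            # gates decide how much each nonverbal stream should influence each token
            g_v = torch.relu(self.W_gv(torch.cat([text, visual], dim=-1)))
            g_a = torch.relu(self.W_ga(torch.cat([text, acoustic], dim=-1)))
            # displacement vector: the "trajectory and magnitude" of the shift
            h = g_v * self.W_v(visual) + g_a * self.W_a(acoustic)
            # scale the shift so it cannot overwhelm the lexical embedding
            alpha = torch.clamp(
                self.beta_shift * text.norm(dim=-1, keepdim=True)
                / (h.norm(dim=-1, keepdim=True) + 1e-6),
                max=1.0)
            return self.layer_norm(self.dropout(text + alpha * h))

    Because such a layer only rewrites the hidden vectors it receives, it can be attached between existing BERT or XLNet layers (Table 2 suggests early layers work best) without modifying the Transformer's weights or structure.
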
Tables
  • Table 1: Sentiment prediction results on the CMU-MOSI dataset. Best results are highlighted in bold. MAG-BERT and MAG-XLNet outperform the baselines and their language-only fine-tuned counterparts. BA denotes binary accuracy (higher is better, same for F1), MAE denotes mean absolute error (lower is better), and Corr is Pearson correlation (higher is better). For BA and F1, two numbers are reported: the number on the left side of "/" is calculated based on (Zadeh et al., 2018c) and the right side is calculated based on (Tsai et al., 2019). Human performance for CMU-MOSI is reported from (Zadeh et al., 2018a). A sketch of how these metrics can be computed follows this list.
  • Table 2: Results of variations of the XLNet model: MAG applied at different layers of the XLNet model, and input-level concatenation and addition of all modalities. "E" denotes application of MAG immediately after the embedding layer of XLNet, and "A" denotes applying MAG after the embedding layer and all subsequent encoding layers. ⊕ and ⊙ denote input-level addition and concatenation of all modalities, respectively. MAG applied at the initial layers performs better overall.
  • Table 3: Examples from the CMU-MOSI dataset. The ground-truth sentiment labels range from strongly negative (-3) to strongly positive (+3). For each example, the ground truth and the prediction output of both MAG-XLNet and XLNet are shown. XLNet seems mostly to replicate the language modality, while MAG-XLNet integrates the non-verbal information successfully.
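
    For readers who want to recompute the Table 1 metrics, the sketch below shows one common way to obtain BA, F1, MAE, and Corr from continuous sentiment predictions in [-3, +3]. The handling of zero-valued (neutral) labels is exactly what differs between the Zadeh et al. (2018c) and Tsai et al. (2019) conventions; the variant shown here (dropping neutral examples) is an assumption, not the paper's exact protocol.

    import numpy as np
    from scipy.stats import pearsonr
    from sklearn.metrics import accuracy_score, f1_score

    def mosi_metrics(preds, labels):
        """Compute MAE, Pearson correlation, binary accuracy, and F1 for CMU-MOSI."""
        preds, labels = np.asarray(preds, float), np.asarray(labels, float)
        mae = np.mean(np.abs(preds - labels))   # lower is better
        corr, _ = pearsonr(preds, labels)       # higher is better
        # binarize sentiment; neutral (0) labels are dropped first in this variant
        nonzero = labels != 0
        bin_preds = preds[nonzero] > 0
        bin_labels = labels[nonzero] > 0
        ba = accuracy_score(bin_labels, bin_preds)
        f1 = f1_score(bin_labels, bin_preds, average="weighted")
        return {"BA": ba, "F1": f1, "MAE": mae, "Corr": corr}
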
Related work
  • The studies in this paper are related to the following research area:

    2.1 Multimodal Language Analyses

    Multimodal language analysis is a recent research trend in natural language processing (Zadeh et al., 2018b) that helps us understand language from the text, vision, and acoustic modalities. These analyses have particularly focused on the tasks of sentiment analysis (Poria et al., 2018), emotion recognition (Zadeh et al., 2018d), and personality traits recognition (Park et al., 2014). Works in this area often focus on novel multimodal neural architectures (Pham et al., 2019; Hazarika et al., 2018) and multimodal fusion approaches (Liang et al., 2018; Tsai et al., 2018).

    Related to the content in this paper, several models in this domain are discussed, including TFN, MARN, MFN, RMFN, and MulT. Tensor Fusion Network (TFN) (Zadeh et al., 2017) creates a multi-dimensional tensor to explicitly capture all possible interactions between the three modalities: unimodal, bimodal, and trimodal (a sketch of this fusion follows below). Multi-attention Recurrent Network (MARN) (Zadeh et al., 2018c) uses three separate hybrid LSTM memories that can propagate cross-modal interactions. Memory Fusion Network (MFN) (Zadeh et al., 2018a) synchronizes the information from three separate LSTMs through a multi-view gated memory. Recurrent Memory Fusion Network (RMFN) (Liang et al., 2018) captures the nuanced interactions among the modalities in a multi-stage manner, giving each stage the ability to focus on a subset of signals. Multimodal Transformer for Unaligned Multimodal Language Sequences (MulT) (Tsai et al., 2019) deploys three Transformers, one for each modality, to capture the interactions with the other two modalities in a self-attentive manner. The information from the three Transformers is aggregated through late fusion.
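
    To make the tensor-fusion idea concrete, here is a minimal sketch of the outer-product trick described for TFN: appending a constant 1 to each modality embedding before taking the outer product makes the resulting tensor contain unimodal, bimodal, and trimodal interaction terms at once. The dimensions and the einsum formulation are illustrative, not the original implementation.

    import torch

    def tensor_fusion(z_l, z_a, z_v):
        """z_l, z_a, z_v: (batch, d_l), (batch, d_a), (batch, d_v) modality embeddings."""
        append_one = lambda z: torch.cat([z, torch.ones(z.size(0), 1)], dim=1)
        zl, za, zv = append_one(z_l), append_one(z_a), append_one(z_v)
        # three-way outer product over the augmented modality vectors
        fused = torch.einsum('bi,bj,bk->bijk', zl, za, zv)
        return fused.flatten(start_dim=1)  # (batch, (d_l+1)*(d_a+1)*(d_v+1))

    # e.g. d_l=32, d_a=16, d_v=16 gives a 33*17*17 = 9537-dimensional fused vector
    fused = tensor_fusion(torch.randn(4, 32), torch.randn(4, 16), torch.randn(4, 16))
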
Funding
  • This research was supported in part by grants W911NF-15-1-0542 and W911NF-19-1-0029 from the US Defense Advanced Research Projects Agency (DARPA) and the Army Research Office (ARO).
  • Authors AZ and LM were supported by the National Science Foundation (Awards #1750439 and #1722822) and the National Institutes of Health.
Reference
  • Minghai Chen, Sen Wang, Paul Pu Liang, Tadas Baltrusaitis, Amir Zadeh, and Louis-Philippe Morency. 2017. Multimodal sentiment analysis with word-level fusion and reinforcement learning. In Proceedings of the 19th ACM International Conference on Multimodal Interaction, pages 163–171. ACM.
  • Andy Coenen, Emily Reif, Ann Yuan, Been Kim, Adam Pearce, Fernanda Viegas, and Martin Wattenberg. 2019. Visualizing and measuring the geometry of BERT. arXiv preprint arXiv:1906.02715.
  • Zihang Dai, Zhilin Yang, Yiming Yang, William W. Cohen, Jaime Carbonell, Quoc V. Le, and Ruslan Salakhutdinov. 2019. Transformer-XL: Attentive language models beyond a fixed-length context. arXiv preprint arXiv:1901.02860.
  • Gilles Degottex, John Kane, Thomas Drugman, Tuomo Raitio, and Stefan Scherer. 2014. COVAREP: A collaborative voice analysis repository for speech technologies. In 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 960–964. IEEE.
  • Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
  • Devamanyu Hazarika, Soujanya Poria, Amir Zadeh, Erik Cambria, Louis-Philippe Morency, and Roger Zimmermann. 2018. Conversational memory network for emotion recognition in dyadic dialogue videos. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2122–2132.
  • iMotions. 2017. Facial expression analysis.
  • Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
  • Paul Pu Liang, Ziyin Liu, Amir Zadeh, and Louis-Philippe Morency. 2018. Multimodal language analysis with recurrent multistage fusion. arXiv preprint arXiv:1808.03920.
  • Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119.
  • Sunghyun Park, Han Suk Shim, Moitreya Chatterjee, Kenji Sagae, and Louis-Philippe Morency. 2014. Computational analysis of persuasiveness in social multimedia: A novel dataset and multimodal prediction approach. In Proceedings of the 16th International Conference on Multimodal Interaction, pages 50–57. ACM.
  • Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.
  • Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. arXiv preprint arXiv:1802.05365.
  • Hai Pham, Paul Pu Liang, Thomas Manzini, Louis-Philippe Morency, and Barnabas Poczos. 2019. Found in translation: Learning robust joint representations by cyclic translations between modalities. arXiv preprint arXiv:1812.07809.
  • Soujanya Poria, Amir Hussain, and Erik Cambria. 2018. Multimodal Sentiment Analysis, volume 8. Springer.
  • Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. URL https://s3us-west-2.amazonaws.com/openai-assets/researchcovers/languageunsupervised/language understanding paper.pdf.
  • Chen Sun, Austin Myers, Carl Vondrick, Kevin Murphy, and Cordelia Schmid. 2019. VideoBERT: A joint model for video and language representation learning. arXiv preprint arXiv:1904.01766.
  • Yao-Hung Hubert Tsai, Shaojie Bai, Paul Pu Liang, J. Zico Kolter, Louis-Philippe Morency, and Ruslan Salakhutdinov. 2019. Multimodal transformer for unaligned multimodal language sequences. arXiv preprint arXiv:1906.00295.
  • Yao-Hung Hubert Tsai, Paul Pu Liang, Amir Zadeh, Louis-Philippe Morency, and Ruslan Salakhutdinov. 2018. Learning factorized multimodal representations. arXiv preprint arXiv:1806.06176.
  • Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.
  • Yansen Wang, Ying Shen, Zhun Liu, Paul Pu Liang, Amir Zadeh, and Louis-Philippe Morency. 2018. Words can shift: Dynamically adjusting word representations using nonverbal behaviors. arXiv preprint arXiv:1811.09362.
  • Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. arXiv preprint arXiv:1906.08237.
  • Jiahong Yuan and Mark Liberman. 2008. Speaker identification on the SCOTUS corpus. Journal of the Acoustical Society of America, 123(5):3878.
  • Amir Zadeh, Minghai Chen, Soujanya Poria, Erik Cambria, and Louis-Philippe Morency. 2017. Tensor fusion network for multimodal sentiment analysis. arXiv preprint arXiv:1707.07250.
  • Amir Zadeh, Paul Pu Liang, Navonil Mazumder, Soujanya Poria, Erik Cambria, and Louis-Philippe Morency. 2018a. Memory fusion network for multi-view sequential learning. In Thirty-Second AAAI Conference on Artificial Intelligence.
  • Amir Zadeh, Paul Pu Liang, Louis-Philippe Morency, Soujanya Poria, Erik Cambria, and Stefan Scherer. 2018b. Proceedings of Grand Challenge and Workshop on Human Multimodal Language (Challenge-HML).
  • Amir Zadeh, Paul Pu Liang, Soujanya Poria, Prateek Vij, Erik Cambria, and Louis-Philippe Morency. 2018c. Multi-attention recurrent network for human communication comprehension. In Thirty-Second AAAI Conference on Artificial Intelligence.
  • Amir Zadeh, Rowan Zellers, Eli Pincus, and Louis-Philippe Morency. 2016. MOSI: Multimodal corpus of sentiment intensity and subjectivity analysis in online opinion videos. arXiv preprint arXiv:1606.06259.
  • AmirAli Bagher Zadeh, Paul Pu Liang, Soujanya Poria, Erik Cambria, and Louis-Philippe Morency. 2018d. Multimodal language analysis in the wild: CMU-MOSEI dataset and interpretable dynamic fusion graph. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2236–2246.