Multi-modal Multi-label Emotion Detection with Modality and Label Dependence

EMNLP 2020, pp. 3584–3593.

Weibo:
We propose a multi-modal sequence-to-set approach to simultaneously handle the modality and label dependence in multi-modal multi-label emotion detection.

Abstract:

As an important research issue in the natural language processing community, multi-label emotion detection has been drawing more and more attention in the last few years. However, almost all existing studies focus on one modality (e.g., textual modality). In this paper, we focus on multi-label emotion detection in a multi-modal scenario. ...

Introduction
  • Emotion detection aims to predict the emotion categories, such as angry, happy, and surprise, expressed in a speaker's utterance, and underpins a variety of applications such as online chatting (Galik and Rank, 2012; Zhang et al., 2019c), news analysis (Li et al., 2015; Zhu et al., 2019) and dialogue systems (Ghosal et al., 2019; Zhang et al., 2019d).
  • In the utterance shown in Figure 1, the Sad and Disgust emotions are likely to co-occur, whereas the conflicting pair of Sad and Happy is not.
  • Recent studies, such as Yang et al. (2019) and Xiao et al. (2019), have begun to address this challenge.
  • A residual connection (He et al., 2016) is employed around each sub-layer, followed by layer normalization (Add & Norm); a minimal sketch of this pattern follows this list.
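As referenced in the last bullet, the Add & Norm pattern follows the standard Transformer design (Vaswani et al., 2017): a residual connection around a sub-layer, followed by layer normalization (Ba et al., 2016). Below is a minimal PyTorch sketch of this pattern with a generic feed-forward sub-layer; it is illustrative only, not the authors' implementation.

```python
import torch
import torch.nn as nn

class AddNorm(nn.Module):
    """Residual connection (He et al., 2016) followed by layer normalization."""
    def __init__(self, d_model: int, dropout: float = 0.1):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor, sublayer: nn.Module) -> torch.Tensor:
        # Add & Norm: LayerNorm(x + SubLayer(x))
        return self.norm(x + self.dropout(sublayer(x)))

# Example with a position-wise feed-forward sub-layer.
d_model = 128
ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                    nn.Linear(4 * d_model, d_model))
block = AddNorm(d_model)
x = torch.randn(2, 10, d_model)   # (batch, time steps, features)
y = block(x, ffn)                 # same shape as x
```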
Highlights
  • Emotion detection aims to predict the emotion categories, such as angry, happy, and surprise, expressed in a speaker's utterance, and underpins a variety of applications such as online chatting (Galik and Rank, 2012; Zhang et al., 2019c), news analysis (Li et al., 2015; Zhu et al., 2019) and dialogue systems (Ghosal et al., 2019; Zhang et al., 2019d).
  • We address the above challenges in multi-modal multi-label emotion detection by proposing a multi-modal seq2set (MMS2S) approach to model both the modality and label dependence simultaneously (sketched after this list).
  • DRS2S outperforms RAkLA by 19.4%, 16.1% and 12.6% with respect to Acc, Hamming Loss (HL) and F1, respectively.
  • Among all the approaches, our proposed MMS2S performs best in terms of all metrics.
  • The detailed evaluation demonstrates that our proposed model significantly outperforms several state-of-the-art baselines.
  • We propose a multi-modal sequence-to-set approach to simultaneously handle the modality and label dependence in multi-modal multi-label emotion detection.
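For intuition, the seq2set idea highlighted above can be sketched as follows: modality-specific utterance vectors are fused, and a decoder emits one emotion label per step, never repeating a label, until it produces a stop symbol, so the output is a set rather than a fixed-order sequence. The module choices, feature dimensions, and greedy decoding scheme below are illustrative assumptions, not the authors' MMS2S architecture.

```python
import torch
import torch.nn as nn

EMOTIONS = ["happy", "sad", "angry", "fear", "disgust", "surprise"]
STOP = len(EMOTIONS)  # index of the stop symbol

class ToySeq2Set(nn.Module):
    """Illustrative only: fuse text/audio/vision vectors, then decode a label set."""
    def __init__(self, d_text=300, d_audio=74, d_vision=35, d_hid=128):
        super().__init__()
        self.fuse = nn.Linear(d_text + d_audio + d_vision, d_hid)
        self.label_emb = nn.Embedding(len(EMOTIONS) + 1, d_hid)   # +1 for stop
        self.cell = nn.GRUCell(d_hid, d_hid)
        self.out = nn.Linear(d_hid, len(EMOTIONS) + 1)

    @torch.no_grad()
    def decode(self, text, audio, vision):
        # Fuse the three modality vectors into the initial decoder state.
        h = torch.tanh(self.fuse(torch.cat([text, audio, vision], dim=-1)))
        prev = self.label_emb(torch.tensor([STOP]))   # start token reuses the stop embedding
        predicted = set()
        for _ in range(len(EMOTIONS)):
            h = self.cell(prev, h)
            scores = self.out(h).squeeze(0)
            if predicted:                              # a set never repeats a label
                scores[list(predicted)] = float("-inf")
            idx = int(scores.argmax())
            if idx == STOP:
                break
            predicted.add(idx)
            prev = self.label_emb(torch.tensor([idx]))
        return {EMOTIONS[i] for i in predicted}

model = ToySeq2Set()
print(model.decode(torch.randn(1, 300), torch.randn(1, 74), torch.randn(1, 35)))
```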
Results
  • Table 2 shows the results of different approaches to multi-modal multi-label emotion detection.
  • From this table, the authors can see that: (1) the classical multi-label approaches BR, CC and RAkLA perform much worse than the deep learning baselines AC, LSAN and DRS2S.
  • DRS2S outperforms RAkLA by 19.4%, 16.1% and 12.6% with respect to Acc, HL and F1, respectively (the metrics are sketched after this list).
  • This indicates that approaches with deep representations indeed have advantages over the classical approaches on the multi-label problem.
  • A t-test demonstrates that the approach significantly outperforms LSAN, DRS2S and MulT (p-value < 0.05).
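For reference, here is a minimal numpy sketch of the multi-label metrics mentioned above, computed from binary label matrices. The exact definitions used in the paper (e.g., whether Acc is Jaccard-style multi-label accuracy or subset accuracy) should be taken from the paper itself; the conventions below are assumptions for illustration.

```python
import numpy as np

def multilabel_metrics(y_true: np.ndarray, y_pred: np.ndarray):
    """y_true, y_pred: (n_samples, n_labels) binary matrices."""
    # Hamming Loss (HL): fraction of wrongly predicted label slots (lower is better).
    hl = float(np.mean(y_true != y_pred))
    # Jaccard-style multi-label accuracy: |Y ∩ Y_hat| / |Y ∪ Y_hat|, averaged over samples.
    inter = np.logical_and(y_true, y_pred).sum(axis=1)
    union = np.logical_or(y_true, y_pred).sum(axis=1)
    acc = float(np.mean(np.where(union > 0, inter / np.maximum(union, 1), 1.0)))
    # Micro-averaged F1 over all (sample, label) slots.
    tp = np.logical_and(y_true == 1, y_pred == 1).sum()
    fp = np.logical_and(y_true == 0, y_pred == 1).sum()
    fn = np.logical_and(y_true == 1, y_pred == 0).sum()
    f1 = float(2 * tp / max(2 * tp + fp + fn, 1))
    return acc, hl, f1

y_true = np.array([[1, 0, 1, 0], [0, 1, 0, 0]])
y_pred = np.array([[1, 0, 0, 0], [0, 1, 1, 0]])
print(multilabel_metrics(y_true, y_pred))   # ≈ (0.5, 0.25, 0.667)
```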
Conclusion
  • The authors propose a multi-modal sequence-to-set approach to simultaneously handle the modality and label dependence in multi-modal multi-label emotion detection.
  • The authors will extend the approach to more multi-modal multi-label scenarios, such as intention detection in video conversations and aspect analysis in multi-modal reviews.
  • The authors would like to investigate other approaches to better model the modality and label dependence in multi-modal multi-label emotion detection.
Summary
  • Objectives:

    The research community has become increasingly aware of the need for multi-modal emotion detection (Zadeh et al., 2018b) due to its wide potential applications, e.g., given the massively growing importance of analyzing conversations in speech (Gu et al., 2019) and video (Majumder et al., 2019).
  • The authors aim to tackle multi-modal multi-label emotion detection.
Tables
  • Table 1: The statistics of the CMU-MOSEI dataset
  • Table 2: Performance of different approaches to multi-modal multi-label emotion detection
  • Table 3: The impact of random label order as ground truth. ⇓: significant decrease, ↓: insignificant decrease, −: no decrease
  • Table 4: Performance of single-modal and multi-modal seq2set approaches
Related work
  • As an interdisciplinary research field, emotion detection has been drawing more and more attention in both natural language processing and multi-modal communication (Zadeh et al., 2018c). In the NLP community, almost all existing studies of multi-label emotion detection rely on special knowledge of emotion, such as context information (Li et al., 2015), cross-domain transfer (Yu et al., 2018) and external resources (Ying et al., 2019). When no such special knowledge is available (Kim et al., 2018), the task is normally handled by multi-label text classification approaches. In the multi-modal community, related studies normally focus on the single-label emotion task; studies on the multi-label emotion task are far fewer and are limited to transforming it into multiple binary classification problems (Zadeh et al., 2018b; Wang et al., 2019; Akhtar et al., 2019; Chauhan et al., 2019). In the following, we give an overview of multi-label emotion/text classification and multi-modal emotion detection.
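The binary-classification transformation mentioned above corresponds to the classical Binary Relevance strategy, while Classifier Chains (Read et al., 2011) additionally feed earlier label predictions into later classifiers to capture label dependence. Below is a minimal scikit-learn sketch of both on toy data; it is illustrative only and does not reproduce the exact baseline configurations used in the experiments.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import ClassifierChain, MultiOutputClassifier

texts = ["i am so happy today", "this is sad and disgusting", "what a nice surprise"]
labels = np.array([[1, 0, 0, 0],    # columns: happy, sad, disgust, surprise
                   [0, 1, 1, 0],
                   [1, 0, 0, 1]])
features = TfidfVectorizer().fit_transform(texts)

# Binary Relevance: one independent binary classifier per emotion label.
br = MultiOutputClassifier(LogisticRegression()).fit(features, labels)

# Classifier Chain: each classifier also conditions on the previously predicted labels.
cc = ClassifierChain(LogisticRegression(), order="random", random_state=0).fit(features, labels)

print(br.predict(features))
print(cc.predict(features))
```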

    Multi-label Emotion/Text Classification. Recent studies normally cast the multi-label emotion detection task as a classification problem and leverage special knowledge as auxiliary information (Yu et al., 2018; Ying et al., 2019). Such approaches cannot easily be extended to tasks without external knowledge; in that case, multi-label text classification approaches can be readily applied to emotion detection. There have been a number of representative studies along this line: Kant et al. (2018) leverage pre-trained BERT to perform the multi-label emotion task, and Kim et al. (2018) propose an attention-based classifier that predicts multiple emotions for a given sentence. More recently, Yang et al. (2018) propose a sequence generation model, and Yang et al. (2019) leverage a reinforced approach to find a better label sequence than a baseline sequence, although the latter still relies on a pre-trained seq2seq model with a pre-defined order of the ground-truth labels.
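The "pre-defined order of the ground-truth labels" mentioned above refers to linearizing each ground-truth label set into a target sequence before seq2seq training. A frequent convention, assumed here purely for illustration and not necessarily the one used by these papers, is to order labels from most to least frequent in the training data:

```python
from collections import Counter

def build_label_sequences(label_sets, eos="<eos>"):
    """Turn ground-truth label sets into ordered target sequences for seq2seq training."""
    freq = Counter(label for labels in label_sets for label in labels)
    # More frequent labels come first; ties are broken alphabetically for determinism.
    key = lambda label: (-freq[label], label)
    return [sorted(labels, key=key) + [eos] for labels in label_sets]

train_sets = [{"happy", "surprise"}, {"sad", "disgust"}, {"happy"}]
print(build_label_sequences(train_sets))
# [['happy', 'surprise', '<eos>'], ['disgust', 'sad', '<eos>'], ['happy', '<eos>']]
```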
Funding
  • This work was supported by three NSFC grants, i.e., No. 61672366, No. 61876120 and No. 61836007.
  • This work was also supported by a project funded by the Priority Academic Program Development of Jiangsu Higher Education Institutions (PAPD).
Reference
  • Muhammad Abdul-Mageed and Lyle Ungar. 2017. EmoNet: Fine-grained emotion detection with gated recurrent neural networks. In Proceedings of ACL 2017, pages 718–728.
  • Md. Shad Akhtar, Dushyant Singh Chauhan, Deepanway Ghosal, Soujanya Poria, Asif Ekbal, and Pushpak Bhattacharyya. 2019. Multi-task learning for multi-modal emotion recognition and sentiment analysis. In Proceedings of NAACL-HLT 2019, pages 370–379.
  • Rami Aly, Steffen Remus, and Chris Biemann. 2019. Hierarchical multi-label classification of text with capsule networks. In Proceedings of ACL 2019, pages 323–330.
  • Lei Jimmy Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. 2016. Layer normalization. CoRR, abs/1607.06450.
  • Dushyant Singh Chauhan, Md Shad Akhtar, Asif Ekbal, and Pushpak Bhattacharyya. 2019. Context-aware interactive attention for multi-modal sentiment and emotion analysis. In Proceedings of EMNLP 2019, pages 5651–5661.
  • Gilles Degottex, John Kane, Thomas Drugman, Tuomo Raitio, and Stefan Scherer. 2014. COVAREP - A collaborative voice analysis repository for speech technologies. In Proceedings of IEEE ICASSP 2014, pages 960–964.
  • Thomas Drugman and Abeer Alwan. 2011. Joint robust voicing detection and pitch estimation based on residual harmonics. In Proceedings of INTERSPEECH 2011, pages 1973–1976.
  • Thomas Drugman, Mark R. P. Thomas, Jon Gudnason, Patrick A. Naylor, and Thierry Dutoit. 2012. Detection of glottal closure instants from speech signals: A quantitative review. IEEE TASLP, 20(3):994–1006.
  • Maros Galik and Stefan Rank. 2012. Modelling emotional trajectories of individuals in an online chat. In Proceedings of MATES 2012, pages 96–105.
  • Deepanway Ghosal, Navonil Majumder, Soujanya Poria, Niyati Chhaya, and Alexander Gelbukh. 2019. DialogueGCN: A graph convolutional neural network for emotion recognition in conversation. In Proceedings of EMNLP 2019, pages 154–164.
  • Yue Gu, Xinyu Lyu, Weijia Sun, Weitian Li, Shuhong Chen, Xinyu Li, and Ivan Marsic. 2019. Mutual correlation attentive factors in dyadic fusion networks for speech emotion recognition. In Proceedings of ACM MM 2019, pages 157–166.
  • Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of CVPR 2016, pages 770–778.
  • John Kane and Christer Gobl. 2013. Wavelet maxima dispersion for breathy to tense voice discrimination. IEEE TASLP, 21(6):1170–1179.
  • Neel Kant, Raul Puri, Nikolai Yakovenko, and Bryan Catanzaro. 2018. Practical text classification with large pre-trained language models. arXiv preprint arXiv:1812.01207.
  • Yanghoon Kim, Hwanhee Lee, and Kyomin Jung. 2018. AttnConvNet at SemEval-2018 Task 1: Attention-based convolutional neural networks for multi-label emotion classification. In Proceedings of SemEval@NAACL-HLT 2018, pages 141–145.
  • Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. CoRR, abs/1412.6980.
  • Shoushan Li, Lei Huang, Rong Wang, and Guodong Zhou. 2015. Sentence-level emotion classification with label and context dependence. In Proceedings of ACL 2015, pages 1045–1053.
  • Navonil Majumder, Soujanya Poria, Devamanyu Hazarika, Rada Mihalcea, Alexander F. Gelbukh, and Erik Cambria. 2019. DialogueRNN: An attentive RNN for emotion detection in conversations. In Proceedings of AAAI 2019, pages 6818–6825.
  • Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of EMNLP 2014, pages 1532–1543.
  • Kechen Qin, Cheng Li, Virgil Pavlu, and Javed A. Aslam. 2019. Adapting RNN sequence prediction model to multi-label set prediction. In Proceedings of NAACL-HLT 2019, pages 3181–3190.
  • Jesse Read, Bernhard Pfahringer, Geoff Holmes, and Eibe Frank. 2011. Classifier chains for multi-label classification. Machine Learning, 85(3):333–359.
  • Xipeng Shen, Matthew R. Boutell, Jiebo Luo, and Christopher M. Brown. 2004. Multilabel machine learning and its application to semantic scene classification. In Proceedings of SPIESR 2004, pages 188–199.
  • Yao-Hung Hubert Tsai, Shaojie Bai, Paul Pu Liang, J. Zico Kolter, Louis-Philippe Morency, and Ruslan Salakhutdinov. 2019. Multimodal transformer for unaligned multimodal language sequences. In Proceedings of ACL 2019, pages 6558–6569.
  • Grigorios Tsoumakas and Ioannis Katakis. 2007. Multi-label classification: An overview. IJDWM, 3(3):1–13.
  • Grigorios Tsoumakas, Ioannis Katakis, and Ioannis P. Vlahavas. 2011. Random k-labelsets for multilabel classification. IEEE Trans. Knowl. Data Eng., 23(7):1079–1089.
  • Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proceedings of NIPS 2017, pages 5998–6008.
  • Yansen Wang, Ying Shen, Zhun Liu, Paul Pu Liang, Amir Zadeh, and Louis-Philippe Morency. 2019. Words can shift: Dynamically adjusting word representations using nonverbal behaviors. In Proceedings of AAAI 2019, pages 7216–7223.
  • Jiawei Wu, Wenhan Xiong, and William Yang Wang. 2019. Learning to learn and predict: A meta-learning approach for multi-label classification. In Proceedings of EMNLP 2019, pages 4345–4355.
  • Lin Xiao, Xin Huang, Boli Chen, and Liping Jing. 2019. Label-specific document representation for multi-label text classification. In Proceedings of EMNLP 2019, pages 466–475.
  • Pengcheng Yang, Fuli Luo, Shuming Ma, Junyang Lin, and Xu Sun. 2019. A deep reinforced sequence-to-set model for multi-label classification. In Proceedings of ACL 2019, pages 5252–5258.
  • Pengcheng Yang, Xu Sun, Wei Li, Shuming Ma, Wei Wu, and Houfeng Wang. 2018. SGM: Sequence generation model for multi-label classification. In Proceedings of COLING 2018, pages 3915–3926.
  • Wenhao Ying, Rong Xiang, and Qin Lu. 2019. Improving multi-label emotion classification by integrating both general and domain-specific knowledge. In Proceedings of W-NUT 2019, pages 316–321.
  • Jianfei Yu, Luís Marujo, Jing Jiang, Pradeep Karuturi, and William Brendel. 2018. Improving multi-label emotion classification via sentiment classification with dual attention transfer network. In Proceedings of EMNLP 2018, pages 1097–1102.
  • Jiahong Yuan and Mark Liberman. 2008. Speaker identification on the SCOTUS corpus. Journal of the Acoustical Society of America, 123:3878.
  • Amir Zadeh, Paul Pu Liang, Navonil Mazumder, Soujanya Poria, Erik Cambria, and Louis-Philippe Morency. 2018a. Memory fusion network for multi-view sequential learning. In Proceedings of AAAI 2018, pages 5634–5641.
  • Amir Zadeh, Paul Pu Liang, Soujanya Poria, Erik Cambria, and Louis-Philippe Morency. 2018b. Multimodal language analysis in the wild: CMU-MOSEI dataset and interpretable dynamic fusion graph. In Proceedings of ACL 2018, pages 2236–2246.
  • Amir Zadeh, Paul Pu Liang, Soujanya Poria, Prateek Vij, Erik Cambria, and Louis-Philippe Morency. 2018c. Multi-attention recurrent network for human communication comprehension. In Proceedings of AAAI 2018, pages 5642–5649.
  • Dong Zhang, Shoushan Li, Qiaoming Zhu, and Guodong Zhou. 2019a. Effective sentiment-relevant word selection for multi-modal sentiment analysis in spoken language. In Proceedings of ACM MM 2019, pages 148–156.
  • Dong Zhang, Shoushan Li, Qiaoming Zhu, and Guodong Zhou. 2019b. Modeling the clause-level structure to multimodal sentiment analysis via reinforcement learning. In Proceedings of IEEE ICME 2019, pages 730–735.
  • Dong Zhang, Liangqing Wu, Shoushan Li, Qiaoming Zhu, and Guodong Zhou. 2019c. Multi-modal language analysis with hierarchical interaction-level and selection-level attentions. In Proceedings of IEEE ICME 2019, pages 724–729.
  • Dong Zhang, Liangqing Wu, Changlong Sun, Shoushan Li, Qiaoming Zhu, and Guodong Zhou. 2019d. Modeling both context- and speaker-sensitive dependence for emotion detection in multi-speaker conversations. In Proceedings of IJCAI 2019, pages 5415–5421.
  • Xiabing Zhou, Zhongqing Wang, Shoushan Li, Guodong Zhou, and Min Zhang. 2019. Emotion detection with neural personal discrimination. In Proceedings of EMNLP 2019, pages 5502–5510.
  • Suyang Zhu, Shoushan Li, and Guodong Zhou. 2019. Adversarial attention modeling for multi-dimensional emotion regression. In Proceedings of ACL 2019, pages 471–480.