Multimodal Routing: Improving Local and Global Interpretability of Multimodal Language Analysis

EMNLP 2020.

Keywords
multimodal language, relative importance, trimodal explanatory feature, Early Fusion LSTM, explanatory feature (12+ more)
TL;DR
We presented Multimodal Routing to identify the contributions of unimodal, bimodal, and trimodal explanatory features to predictions in a local manner.

Abstract

Human language can be expressed through multiple sources of information known as modalities, including tones of voice, facial gestures, and spoken language. Recent multimodal learning models with strong performance on human-centric tasks such as sentiment analysis and emotion recognition are often black boxes, with very limited interpretability...

Introduction
  • Human language contains multimodal cues from the textual, visual, and acoustic modalities.
  • Interpretability allows the authors to identify crucial explanatory features for predictions.
  • Such interpretability knowledge could be used to provide insights into multimodal learning, improve the model design, or debug a dataset.
  • Local interpretation is arguably harder, but it gives high-resolution insight into feature importance for each individual sample during training and inference.
  • These two levels of interpretability provide an understanding of unimodal, bimodal, and trimodal explanatory features.
Highlights
  • Human language contains multimodal cues from the textual, visual, and acoustic modalities.
  • In this paper we address both local and global interpretability of unimodal, bimodal, and trimodal explanatory features by presenting Multimodal Routing.
  • Our experiments focus on sentiment analysis and emotion recognition using two benchmark multimodal language datasets, IEMOCAP (Busso et al., 2008) and CMU-MOSEI (Zadeh et al., 2018).
  • We provide two ablation studies of interpretable methods as baselines: the first is based on a Generalized Additive Model (GAM) (Hastie, 2017), which directly sums over unimodal, bimodal, and trimodal features and applies a linear transformation to obtain a prediction (a sketch follows this list).
  • We presented Multimodal Routing to identify the contributions of unimodal, bimodal, and trimodal explanatory features to predictions in a local manner.
  • We conduct global interpretation over the whole datasets and show that the acoustic features are crucial for predicting negative sentiment or emotions, and that the acoustic-visual interactions are crucial for predicting the emotion “angry”.
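To make the GAM-style ablation concrete, the sketch below encodes each explanatory feature, sums the encodings, and applies a single linear transformation to obtain a prediction. This is a minimal sketch assuming PyTorch; the dimensions, the count of seven explanatory features, and the class name AdditiveBaseline are illustrative assumptions, not the authors' implementation.

    # Minimal sketch (not the authors' code) of the GAM-style additive baseline:
    # encoded unimodal, bimodal, and trimodal features are summed and passed
    # through one linear layer. Dimensions and the feature count are assumptions.
    import torch
    import torch.nn as nn

    class AdditiveBaseline(nn.Module):
        def __init__(self, d_feat=64, n_outputs=7, n_features=7):
            super().__init__()
            # one encoder per explanatory feature
            # (e.g. 3 unimodal + 3 bimodal + 1 trimodal)
            self.encoders = nn.ModuleList(
                [nn.Linear(d_feat, d_feat) for _ in range(n_features)]
            )
            self.head = nn.Linear(d_feat, n_outputs)  # single linear transformation

        def forward(self, feats):
            # feats: list of n_features tensors, each of shape [batch, d_feat]
            encoded = [enc(f) for enc, f in zip(self.encoders, feats)]
            summed = torch.stack(encoded, dim=0).sum(dim=0)  # additive combination
            return self.head(summed)

    # usage with dummy inputs
    model = AdditiveBaseline()
    feats = [torch.randn(8, 64) for _ in range(7)]
    print(model(feats).shape)  # torch.Size([8, 7])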
Methods
  • Among interpretable methods, Multimodal Routing outperforms the EF-LSTM, LF-LSTM, and RAVEN models and performs competitively with MulT (Tsai et al., 2019a).
  • Compared with all the baselines, Multimodal Routing again performs competitively on most of the metrics.
  • The authors note that the distribution of labels is skewed.
  • This skewness leads all models to predict “not surprise”, which yields the same accuracy for “surprise” across all approaches.
Results
  • The authors trained the model on a single RTX 2080 GPU.
  • The model is trained with an initial learning rate of 10⁻⁴ and the Adam optimizer (see the sketch after this list).
  • The authors first compare all the interpretable methods.
  • The authors observe that Multimodal Routing improves over both GAM (Hastie, 2017), a linear model on encoded features, and Multimodal Routing∗, a non-iterative feed-forward net with the same parameters as Multimodal Routing.
  • When comparing to the non-interpretable baselines, Multimodal Routing remains competitive on most metrics.
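For reference, the reported optimization setup (Adam with an initial learning rate of 1e-4 on a single GPU) corresponds to a setup along the lines of the sketch below; the stand-in linear model and dummy batch are placeholders for illustration, not the authors' Multimodal Routing implementation.

    # Sketch of the reported training hyperparameters: Adam with an initial
    # learning rate of 1e-4. The linear model and the dummy batch are
    # placeholders, not the authors' Multimodal Routing implementation.
    import torch
    import torch.nn as nn

    model = nn.Linear(64, 7)                        # stand-in model
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

    x = torch.randn(8, 64)                          # dummy batch of features
    y = torch.randint(0, 7, (8,))                   # dummy labels
    loss = nn.functional.cross_entropy(model(x), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()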
Conclusion
  • The authors presented Multimodal Routing to identify the contributions of unimodal, bimodal, and trimodal explanatory features to predictions in a local manner.
  • The authors conduct global interpretation over the whole datasets and show that the acoustic features are crucial for predicting negative sentiment or emotions, and that the acoustic-visual interactions are crucial for predicting the emotion “angry”.
  • These observations align with prior work in psychological research.
  • The authors believe that this work sheds light on the advantages of understanding human behaviors from a multimodal perspective and makes a step towards more interpretable multimodal language models.
Summary
  • Objectives:

    The authors' goal is to find the relative importance of the contributions from unimodal, bimodal, and trimodal features to the model prediction y (a hedged sketch of the quantities involved follows below).
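The tables further below report routing coefficients r_ij, activation probabilities p_i, and their products p_i r_ij. As a hedged sketch of how these quantities could enter the prediction, written only to make the notation of the table captions concrete (it is not a verbatim reproduction of the paper's equations), the routing can be viewed as an additive combination of the seven explanatory features into concept representations:

    % Sketch, assuming: f_i are the seven explanatory features (three unimodal,
    % three bimodal, one trimodal), p_i their activation probabilities, r_{ij}
    % the routing coefficient from feature i to concept j (uniform routing is
    % 1/7 per Table 3), and W_{ij} a learned linear map into concept space.
    \[
      c_j \;=\; \sum_{i=1}^{7} p_i \, r_{ij} \, \big( f_i W_{ij} \big)
    \]
    % Local interpretation reads off p_i r_{ij} for each sample; global
    % interpretation aggregates these weights over the whole dataset, as in
    % Tables 3 to 7.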
Tables
  • Table 1: Left: CMU-MOSEI sentiment prediction. Right: IEMOCAP emotion recognition. Multimodal Routing∗ denotes our method without iterative routing. Our results are better than or close to the state of the art (Tsai et al., 2019a). We bold our results when they are SOTA or close to SOTA (≤ 1%).
  • Table 2: CMU-MOSEI emotion recognition. Multimodal Routing∗ denotes our method without iterative routing. We bold our results when they are the best or close to the best (≤ 1%).
  • Table 3: Global interpretation (quantitative results) for Multimodal Routing. Confidence intervals of r_ij, sampled from the CMU-MOSEI sentiment task (top) and emotion task (bottom). We bold the values that have the largest mean for each emotion and are significantly larger than uniform routing (1/J = 1/7 = 0.143).
  • Table 4: Global interpretation (quantitative results) for Multimodal Routing. Confidence intervals of p_i r_ij, sampled from the CMU-MOSEI sentiment task.
  • Table 5: Global interpretation (quantitative results) for Multimodal Routing. Confidence intervals of p_i r_ij, sampled from the CMU-MOSEI emotion task.
  • Table 6: Global interpretation (quantitative results) for Multimodal Routing. Confidence intervals of p_i, sampled from the CMU-MOSEI sentiment task.
  • Table 7: Global interpretation (quantitative results) for Multimodal Routing. Confidence intervals of p_i, sampled from the CMU-MOSEI emotion task.
Related Work
  • Multimodal language learning is based on the fact that humans integrate multiple sources such as acoustic, textual, and visual information to learn language (McGurk and MacDonald, 1976; Ngiam et al., 2011; Baltrusaitis et al., 2018). Recent advances in modeling multimodal language using deep neural networks are not interpretable (Wang et al., 2019; Tsai et al., 2019a). Linear methods such as Generalized Additive Models (GAMs) (Hastie, 2017) do not offer local interpretability. Even though post hoc methods (which interpret predictions given an arbitrary model) such as LIME (Ribeiro et al., 2016), SHAP (Lundberg and Lee, 2017), and L2X (Chen et al., 2018) could be used to interpret these black-box models, these interpretation methods are designed to detect contributions only from unimodal features, not from bimodal or trimodal explanatory features. It has been shown that in human communication, modality interactions are more important than the individual modalities (Engle, 1998).

    Two recent methods, Graph-MFN (Zadeh et al., 2018) and the Multimodal Factorized Model (MFM) (Tsai et al., 2019b), attempted to interpret the relationships between modality interactions and learning for human language. Nonetheless, Graph-MFN did not separate the contributions among unimodal and multimodal explanatory features, and MFM only provided an analysis of the trimodal interaction feature. Neither of them can interpret how single modalities and modality interactions contribute to the final prediction at the same time.
Funding
  • This work was supported in part by DARPA grants FA875018C0150 and HR00111990016, NSF IIS1763562, NSF Awards #1750439 and #1722822, the National Institutes of Health, and Apple.
References
  • Sami Abu-El-Haija, Nisarg Kothari, Joonseok Lee, Paul Natsev, George Toderici, Balakrishnan Varadarajan, and Sudheendra Vijayanarasimhan. 2016. YouTube-8M: A large-scale video classification benchmark. arXiv preprint arXiv:1609.08675.
  • Tadas Baltrusaitis, Chaitanya Ahuja, and Louis-Philippe Morency. 2018. Multimodal machine learning: A survey and taxonomy. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(2):423–443.
  • Christian Buchel, Cathy Price, and Karl Friston. 1998. A multimodal language region in the ventral visual pathway. Nature, 394(6690):274.
  • Carlos Busso, Murtaza Bulut, Chi-Chun Lee, Abe Kazemzadeh, Emily Mower, Samuel Kim, Jeannette N Chang, Sungbok Lee, and Shrikanth S Narayanan. 2008. IEMOCAP: Interactive emotional dyadic motion capture database. Language Resources and Evaluation, 42(4):335.
  • Jianbo Chen, Le Song, Martin J Wainwright, and Michael I Jordan. 2018. Learning to explain: An information-theoretic perspective on model interpretation. arXiv preprint arXiv:1802.07814.
  • Gilles Degottex, John Kane, Thomas Drugman, Tuomo Raitio, and Stefan Scherer. 2014. COVAREP—a collaborative voice analysis repository for speech technologies. In 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 960–964. IEEE.
  • Jean-Baptist du Prel, Gerhard Hommel, Bernd Rohrig, and Maria Blettner. 2009. Confidence interval or p-value?: Part 4 of a series on evaluation of scientific publications. Deutsches Arzteblatt International, 106(19):335.
  • Randi A Engle. 1998. Not channels but composite signals: Speech, gesture, diagrams and object demonstrations are integrated in multimodal explanations. In Proceedings of the Twentieth Annual Conference of the Cognitive Science Society, pages 321–326.
  • Trevor J Hastie. 2017. Generalized additive models. In Statistical Models in S, pages 249–307. Routledge.
  • Geoffrey E Hinton, Sara Sabour, and Nicholas Frosst. 2018. Matrix capsules with EM routing.
  • Sepp Hochreiter and Jurgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.
  • iMotions. 2019. iMotions: Unpack human behavior.
  • Rachael E Jack, Oliver GB Garrod, and Philippe G Schyns. 2014. Dynamic facial expressions of emotion transmit an evolving hierarchy of signals over time. Current Biology, 24(2):187–192.
  • Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
  • Cesar F Lima, Sao Luís Castro, and Sophie K Scott. 2013. When voices get emotional: A corpus of nonverbal vocalizations for research on emotion processing. Behavior Research Methods, 45(4):1234–1245.
  • Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollar, and C Lawrence Zitnick. 2014. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer.
  • Zhun Liu, Ying Shen, Varun Bharadhwaj Lakshminarasimhan, Paul Pu Liang, Amir Zadeh, and Louis-Philippe Morency. 2018. Efficient low-rank multimodal fusion with modality-specific factors. arXiv preprint arXiv:1806.00064.
  • Steven R Livingstone and Frank A Russo. 2018. The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS): A dynamic, multimodal set of facial and vocal expressions in North American English. PLoS ONE, 13(5).
  • Scott M Lundberg and Su-In Lee. 2017. A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems, pages 4765–4774.
  • Jiayuan Mao, Chuang Gan, Pushmeet Kohli, Joshua B Tenenbaum, and Jiajun Wu. 2019. The neuro-symbolic concept learner: Interpreting scenes, words, and sentences from natural supervision. arXiv preprint arXiv:1904.12584.
  • Harry McGurk and John MacDonald. 1976. Hearing lips and seeing voices. Nature, 264(5588):746–748.
  • Jiquan Ngiam, Aditya Khosla, Mingyu Kim, Juhan Nam, Honglak Lee, and Andrew Y Ng. 2011. Multimodal deep learning.
  • Juan DS Ortega, Mohammed Senoussaoui, Eric Granger, Marco Pedersoli, Patrick Cardinal, and Alessandro L Koerich. 2019. Multimodal fusion with deep neural networks for audio-video emotion recognition. arXiv preprint arXiv:1907.03196.
  • Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.
  • Jonas Ranstam. 2012. Why the p-value culture is bad and confidence intervals a better alternative.
  • Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2016. "Why should I trust you?": Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1135–1144. ACM.
  • Sara Sabour, Nicholas Frosst, and Geoffrey E Hinton. 2017. Dynamic routing between capsules. In Advances in Neural Information Processing Systems, pages 3856–3866.