Does my multimodal model learn cross modal interactions? It’s harder to tell than you might think!
EMNLP 2020, pp.861-877, (2020)
Modeling expressive cross-modal interactions seems crucial in multimodal tasks, such as visual question answering. However, sometimes high-performing black-box algorithms turn out to be mostly exploiting unimodal signals in the data. We propose a new diagnostic tool, empirical multimodally-additive function projection (EMAP), for isolating whether or not cross-modal interactions improve performance for a given model on a given task.
- Given the presumed importance of reasoning across modalities in multimodal machine learning tasks, the authors argue that we should evaluate a model’s ability to leverage cross-modal interactions.
- Such evaluation is not straightforward; for example, an early Visual Question-Answering (VQA) challenge was later “broken” by a high-performing method that ignored the image entirely (Jabri et al, 2016).
- Our goal is to explore what additional insights empirical multimodally-additive function projection (EMAP) can add on top of standard model comparisons
- The last question on our FAQ list in §6 leaves us with the following conundrum: 1) Additive models are incapable of most cross-modal reasoning; but 2) for most of the unbalanced tasks we consider, EMAP finds an additive approximation that makes nearly identical predictions to the full, interactive model
- Hypothesis 1: These unbalanced tasks don’t require complex cross-modal reasoning. This purported conclusion cannot account for gaps between human and machine performance: if an additive model underperforms relative to human judgment, the gap could plausibly be explained by cross-modal feature interactions
- Without sufficiently expressive single-modal processing, opportunities to learn cross-modal interaction patterns may not arise during training
- We first achieve state-of-the-art performance for all of the datasets using a linear model
- To demonstrate the potential utility of EMAP in qualitative examinations, we identified the individual instances in T-VIS for which EMAP changes the test-time predictions of the LXMERT + Linear Logits model
- The authors' main prediction results are summarized in Table 4.
- The performance of the baseline additive linear model is strong, but the authors are usually able to find an interactive model that outperforms this linear baseline, e.g., in the case of TST2, a polynomial kernel SVM outperforms the linear model by 4 accuracy points.
- This observation alone seems to provide evidence that the interactive models benefit from cross-modal interactions.
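One structural point worth making explicit: a linear model over concatenated unimodal feature vectors is multimodally additive by construction, because its score decomposes exactly into a text-only term plus an image-only term. A minimal sketch (pure Python; the function name, weights, and features are illustrative, not from the paper's code):

```python
def linear_score(w_text, w_img, text_feats, img_feats):
    """Score of a linear model over concatenated [text; image] features.
    The score decomposes exactly into unimodal terms, so the model cannot
    represent cross-modal interactions."""
    text_term = sum(w * x for w, x in zip(w_text, text_feats))
    img_term = sum(w * x for w, x in zip(w_img, img_feats))
    return text_term + img_term

# Swapping the image changes the score only through the image term;
# the text term is untouched, and vice versa.
s1 = linear_score([1.0, -2.0], [0.5], [1.0, 1.0], [2.0])
s2 = linear_score([1.0, -2.0], [0.5], [1.0, 1.0], [4.0])
```

This is why a strong linear baseline is informative here: any accuracy it reaches is attainable without cross-modal reasoning at all.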
- Conclusion and Future Work
The last question on the FAQ list in §6 leaves the authors with the following conundrum: 1) Additive models are incapable of most cross-modal reasoning; but 2) for most of the unbalanced tasks the authors consider, EMAP finds an additive approximation that makes nearly identical predictions to the full, interactive model.
- Improvements in unimodal modeling could feasibly improve feature interaction learning
- Table1: Prediction accuracy on synthetic dataset using additive (A) models, interactive (I) models, and their EMAP projections. Random guessing achieves 50% accuracy. Under EMAP, the interactive models degrade to (close to) random, as desired. See §5 for training details
- Table2: As expected, for VQA2 and GQA, the mean accuracy of LXMERT is substantially higher than its empirical multimodally additive projection (EMAP). Shown are averages over k = 15 random subsamples of 500 dev-set instances
- Table3: The tasks we consider are not specifically balanced to force the learning of cross-modal interactions
- Table4: Prediction results for 7 multimodal classification tasks. First block: the evaluation metric, setup, constant prediction performance, and previous state-of-the-art results (we outperform these baselines mostly because we use RoBERTa). Second block: the performance of our image only/text only linear models. Third block: the predictive performance of our (I)nteractive models. Fourth block: comparison of the performance of the best (I)nteractive model to the (A)dditive linear baseline. Crucially, we also report the EMAP of the best interactive model, which reveals whether or not the performance gains of the (I)nteractive model are due to modeling cross-modal interactions. Italics = computed using 15-fold cross-validation over each cross-validation split (see footnote 5). Bolded values are within half a point of the best model
- Table5: Consistency results. The first block provides details about the task and the model that performed best on it. The second block gives the performance (italicized results represent cross-validation EMAP computation results; see footnote 5). The third block gives the percent of time the original model’s prediction is the same as for EMAP, and, for comparison, the percent of time the original model’s predictions match the identical model trained with a different random seed: in all cases except for T-VIS, the original model and the EMAP make the same prediction in more than 95% of cases. The final row gives the percent of instances (among instances for which the original model and the EMAP disagree) that the original model is correct. Except for T-VIS, when the EMAP and the original model disagree, each is right around half the time
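The consistency statistics described in the Table 5 caption (the agreement rate between the full model and its EMAP, and how often the full model is correct on the instances where the two disagree) can be sketched as follows (pure Python; the function name and the toy predictions are illustrative):

```python
def consistency_stats(full_preds, emap_preds, labels):
    """Return (agreement rate between full model and EMAP,
    fraction of disagreements on which the full model is correct)."""
    n = len(labels)
    agree = sum(f == e for f, e in zip(full_preds, emap_preds)) / n
    disagree = [i for i in range(n) if full_preds[i] != emap_preds[i]]
    if disagree:
        full_wins = sum(full_preds[i] == labels[i] for i in disagree) / len(disagree)
    else:
        full_wins = None  # the two models never disagree
    return agree, full_wins
```

A `full_wins` value near 0.5, as reported for all tasks except T-VIS, means the interactive model's extra flexibility is not reliably helping on the instances where it departs from its additive projection.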
- Constructed multimodal classification tasks. In addition to image question answering/reasoning datasets already mentioned in §1, other multimodal tasks have been constructed, e.g., video QA (Lei et al, 2018; Zellers et al, 2019), visual entailment (Xie et al, 2018), hateful multimodal meme detection (Kiela et al, 2020), and tasks related to visual dialog (de Vries et al, 2017). In these cases, unimodal baselines are shown to achieve lower performance relative to their expressive multimodal counterparts.
- Collected multimodal corpora. Recent computational work has examined diverse multimodal corpora collected from in-vivo social processes, e.g., visual/textual advertisements (Hussain et al, 2017; Ye and Kovashka, 2018; Zhang et al, 2018), images with non-literal captions in news articles (Weiland et al, 2018), and image/text instructions in cooking how-to documents (Alikhani et al, 2019). In these cases, multimodal classification tasks are often proposed over these corpora as a means of testing different theories from semiotics (Barthes, 1988; O’Toole, 1994; Lemke, 1998; O’Halloran, 2004, inter alia); unlike many VQA-style datasets, they are generally not specifically balanced to force models to learn cross-modal interactions.
- Partial support for this work was kindly provided by a grant from The Zillow Corporation, and a Google Focused Research Award
Study subjects and analysis
The raw images are not available, so we queried the Twitter API for them. The corpus originally contains 4472 tweets, but we were only able to re-collect 3905 tweets (87%) when we re-queried the API. Tweets can be missing for a variety of reasons, e.g., the tweet being permanently deleted, or the account’s owner having made their account private at the time of the API request
T-ST1. This data is available from http://www.ee.columbia.edu/ln/dvmm/vso/download/twitter_dataset.html and consists of 603 tweets (470 positive, 133 negative). The authors distribute data with 5 folds pre-specified for cross-validation performance reporting
Across the 10 cross-validation splits, EMAP incorrectly maps the original model’s correct prediction of INTR → IDTN 255 times. For reference, there are 165 cases where EMAP maps the incorrect INTR prediction of the original model to the correct IDTN label. So, when EMAP makes the change INTR → IDTN, in 60% of cases the full model is correct
Worked example over three input pairs. Finally, we can sum these two results to compute [f̃11, f̃22, f̃33] = [−.8, 2.1, .5]. These predictions are the closest approximations to the full evaluations [f11, f22, f33] = [−1.3, 3, .7] for which the generating function obeys the additivity constraint over the three input pairs.
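The additive projection in the worked example above can be sketched in a few lines (pure Python; `emap_grid` is an illustrative name and the synthetic grids below are not from the paper): given model outputs f(t_i, v_j) over the full cross product of text and image inputs, EMAP replaces each entry with its row mean plus column mean minus the grand mean, which is the closest multimodally-additive function in the least-squares sense.

```python
def emap_grid(grid):
    """Project a grid of model outputs f(t_i, v_j) onto the nearest
    multimodally-additive function: f~(t_i, v_j) = row_i + col_j - mean."""
    n, m = len(grid), len(grid[0])
    row = [sum(r) / m for r in grid]  # mean over image inputs, per text
    col = [sum(grid[i][j] for i in range(n)) / n for j in range(m)]  # mean over texts
    mu = sum(row) / n  # grand mean
    return [[row[i] + col[j] - mu for j in range(m)] for i in range(n)]

# An additive function is unchanged by the projection ...
additive = [[i + 10 * j for j in range(3)] for i in range(3)]
assert emap_grid(additive) == additive

# ... while a multiplicative (interactive) one is not: its diagonal
# predictions move away from the full evaluations, as in the example above.
interactive = [[i * j for j in range(3)] for i in range(3)]
projected = emap_grid(interactive)
```

In practice the projection is applied to the model's logits over the cross product of observed inputs, and only the "diagonal" entries (the actually-observed text/image pairs) are used as the EMAP predictions.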
- Malihe Alikhani, Sreyasi Nag Chowdhury, Gerard de Melo, and Matthew Stone. 2019. CITE: A corpus of image–text discourse relations. In NAACL.
- Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. 2018. Bottom-up and top-down attention for image captioning and visual question answering. In CVPR.
- Roland Barthes. 1988. Image-music-text. Macmillan.
- John Bateman. 2014. Text and image: A critical introduction to the visual/verbal divide. Routledge.
- Mathieu Blondel and Fabian Pedregosa. 2016. Lightning: large-scale linear classification, regression and ranking in Python.
- Damian Borth, Rongrong Ji, Tao Chen, Thomas Breuel, and Shih-Fu Chang. 2013. Large-scale visual sentiment ontology and detectors using adjective noun pairs. In ACM MM.
- Leo Breiman. 2001. Random forests. Machine Learning, 45:5–32.
- Qian Chen, Xiaodan Zhu, Zhenhua Ling, Si Wei, Hui Jiang, and Diana Inkpen. 2017. Enhanced LSTM for natural language inference. In ACL.
- Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL.
- Jesse Dodge, Suchin Gururangan, Dallas Card, Roy Schwartz, and Noah A. Smith. 2019. Show your work: Improved reporting of experimental results. In EMNLP.
- Yoav Freund and Robert E. Schapire. 1995. A decision-theoretic generalization of on-line learning and an application to boosting. In European conference on computational learning theory. Springer.
- Jerome H. Friedman. 2001. Greedy function approximation: A gradient boosting machine. Annals of statistics, 29(5).
- Jerome H. Friedman and Bogdan E. Popescu. 2008. Predictive learning via rule ensembles. The Annals of Applied Statistics, 2(3):916–954.
- Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. 2017. Making the V in VQA matter: Elevating the role of image understanding in visual question answering. In CVPR.
- Trevor Hastie and Robert Tibshirani. 1987. Generalized additive models: some applications. Journal of the American Statistical Association, 82(398):371– 386.
- Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In CVPR.
- Jack Hessel and Lillian Lee. 2019. Something’s brewing! Early prediction of controversy-causing posts from discussion features. In NAACL.
- Jack Hessel, Lillian Lee, and David Mimno. 2017. Cats and captions vs. creators and the clock: Comparing multimodal content to context in predicting relative popularity. In The Web Conference.
- Giles Hooker. 2004. Discovering additive structure in black box functions. In KDD.
- Drew A. Hudson and Christopher D. Manning. 2019. GQA: A new dataset for real-world visual reasoning and compositional question answering. CVPR.
- Zaeem Hussain, Mingda Zhang, Xiaozhong Zhang, Keren Ye, Christopher Thomas, Zuha Agha, Nathan Ong, and Adriana Kovashka. 2017. Automatic understanding of image and video advertisements. In CVPR.
- Allan Jabri, Armand Joulin, and Laurens Van Der Maaten. 2016. Revisiting visual question answering baselines. In ECCV. Springer.
- Siddhant M. Jayakumar, Wojciech M. Czarnecki, Jacob Menick, Jonathan Schwarz, Jack Rae, Simon Osindero, Yee Whye Teh, Tim Harley, and Razvan Pascanu. 2020. Multiplicative interactions and where to find them. In ICLR.
- James M. Jones, Gary Alan Fine, and Robert G. Brust. 1979. Interaction effects of picture and caption on humor ratings of cartoons. The Journal of Social Psychology, 108(2):193–198.
- Douwe Kiela, Hamed Firooz, Aravind Mohan, Vedanuj Goswami, Amanpreet Singh, Pratik Ringshia, and Davide Testuggine. 2020. The hateful memes challenge: Detecting hate speech in multimodal memes. arXiv preprint arXiv:2005.04790.
- Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In ICLR.
- Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A. Shamma, Michael S. Bernstein, and Li Fei-Fei. 2017. Visual genome: Connecting language and vision using crowdsourced dense image annotations. IJCV, 123(1):32–73.
- Julia Kruk, Jonah Lubin, Karan Sikka, Xiao Lin, Dan Jurafsky, and Ajay Divakaran. 2019. Integrating text and image: Determining multimodal document intent in instagram posts. In EMNLP.
- Himabindu Lakkaraju, Stephen H. Bach, and Jure Leskovec. 2016. Interpretable decision sets: A joint framework for description and prediction. In KDD.
- Jie Lei, Licheng Yu, Mohit Bansal, and Tamara L. Berg. 2018. TVQA: Localized, compositional video question answering. In EMNLP.
- Jay Lemke. 1998. Multiplying meaning. Reading science: Critical and functional perspectives on discourses of science, pages 87–113.
- Ruixue Liu and Art B. Owen. 2006. Estimating mean dimensionality of analysis of variance decompositions. Journal of the American Statistical Association, 101(474):712–721.
- Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.
- Emily E Marsh and Marilyn Domas White. 2003. A taxonomy of relationships between images and text. Journal of Documentation, 59(6):647–672.
- Radan Martinec and Andrew Salway. 2005. A system for image–text relations in new (and old) media. Visual communication, 4(3):337–371.
- Teng Niu, Shiai Zhu, Lei Pang, and Abdulmotaleb ElSaddik. 2016. Sentiment analysis on multi-view social data. In MultiMedia Modeling, page 15–27.
- Kay O’Halloran. 2004. Multimodal discourse analysis: Systemic functional perspectives. A&C Black.
- Michael O’Toole. 1994. The language of displayed art. Fairleigh Dickinson Univ Press.
- Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2015. Faster R-CNN: Towards real-time object detection with region proposal networks. In NeurIPS.
- Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2016. Model-agnostic interpretability of machine learning. In Human Interpretability in Machine Learning Workshop at ICML.
- Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2018. Anchors: High-precision model-agnostic explanations. In AAAI.
- Cees G.M. Snoek, Marcel Worring, and Arnold W.M. Smeulders. 2005. Early versus late fusion in semantic video analysis. In ACM Multimedia.
- Ilya M Sobol. 2001. Global sensitivity indices for nonlinear mathematical models and their Monte Carlo estimates. Mathematics and computers in simulation, 55(1-3):271–280.
- Erik Strumbelj and Igor Kononenko. 2010. An efficient explanation of individual classifications using game theory. JMLR.
- Sanjay Subramanian, Sameer Singh, and Matt Gardner. 2019. Analyzing compositionality of visual question answering. In NeurIPS Workshop on Visually Grounded Interaction and Language (ViGIL).
- Alane Suhr, Stephanie Zhou, Ally Zhang, Iris Zhang, Huajun Bai, and Yoav Artzi. 2019. A corpus for reasoning about natural language grounded in photographs. In ACL.
- Hao Tan and Mohit Bansal. 2019. LXMERT: Learning cross-modality encoder representations from transformers. In EMNLP.
- Mingxing Tan and Quoc V. Le. 2019. EfficientNet: Rethinking model scaling for convolutional neural networks. arXiv preprint arXiv:1905.11946.
- Berk Ustun and Cynthia Rudin. 2016. Supersparse linear integer models for optimized medical scoring systems. Machine Learning.
- Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In NeurIPS.
- Alakananda Vempala and Daniel Preotiuc-Pietro. 2019. Categorizing and inferring the relationship between the text and image of Twitter posts. In ACL.
- Harm de Vries, Florian Strub, Sarath Chandar, Olivier Pietquin, Hugo Larochelle, and Aaron Courville. 2017. GuessWhat?! Visual object discovery through multi-modal dialogue. In CVPR.
- Lydia Weiland, Ioana Hulpus, Simone Paolo Ponzetto, Wolfgang Effelsberg, and Laura Dietz. 2018. Knowledge-rich image gist understanding beyond literal meaning. Data & Knowledge Engineering, 117:114–132.
- Ning Xie, Farley Lai, Derek Doran, and Asim Kadav. 2018. Visual entailment task for visually-grounded language learning. arXiv preprint arXiv:1811.10582.
- Keren Ye and Adriana Kovashka. 2018. Advise: Symbolism and external knowledge for decoding advertisements. In ECCV.
- Dani Yogatama and Noah A. Smith. 2015. Bayesian optimization of text representations. In EMNLP.
- Rowan Zellers, Yonatan Bisk, Ali Farhadi, and Yejin Choi. 2019. From recognition to cognition: Visual commonsense reasoning. In CVPR.
- Mingda Zhang, Rebecca Hwa, and Adriana Kovashka. 2018. Equal but not the same: Understanding the implicit relationship between persuasive images and text. In The British Machine Vision Conference (BMVC).
- This data is available from https://github.com/karansikka1/documentIntent_emnlp19. We use the same 5 random splits provided by the authors for evaluation. The authors provide ResNet18 features, which we use for our non-LXMERT experiments instead of EfficientNet-B4 features. After contacting the authors, they extracted bottom-up-top-down FasterRCNN features for us, so we were able to compare to LXMERT. State of the art performance numbers are derived from the above github repo; these differ slightly from the values reported in the original paper because the github versions are computed without image data augmentation.
- This data is available from http://www.cs.cornell.edu/~jhessel/cats/cats.html. We just use the pics subreddit data. We attempted to rescrape the pics images from the imgur urls. We were able to re-collect 87215/88686 of the images (98%). Images can be missing if they have been, e.g., deleted from imgur. We removed any pairs with missing images from the ranking task; we trained on 42864/44343 (97%) of the original pairs. The data is distributed with training/test splits. From the training set for each split, we reserve 3K pairs for validation. The state of the art performance numbers are taken from the original releasing work.
- This data is available from http://www.ee.columbia.edu/ln/dvmm/vso/download/twitter_dataset.html and consists of 603 tweets (470 positive, 133 negative). The authors distribute data with 5 folds pre-specified for cross-validation performance reporting. However, we note that the original paper’s best model achieves 72% accuracy in this setting, but a constant prediction baseline achieves higher performance: 470/(470+133) ≈ 78%. Note that the constant prediction baseline likely performs worse according to metrics other than accuracy, but only accuracy is reported. We attempted to contact the authors of this study but did not receive a reply. We also searched for additional baselines for this dataset, but were unable to find additional work that uses this dataset in the same fashion. Thus, given the small size of the dataset, lack of a reliable measure of SOTA performance, and label imbalance, we decided to report ROC AUC prediction performance.
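The baseline arithmetic above, and the reason for preferring ROC AUC on an imbalanced dataset, can be sanity-checked with a short sketch (pure Python; the pairwise-comparison AUC below is the standard definition, not the paper's code). A constant predictor attains 470/603 ≈ 78% accuracy on this label distribution but only 0.5 ROC AUC:

```python
def majority_accuracy(pos, neg):
    """Accuracy of always predicting the majority class."""
    return max(pos, neg) / (pos + neg)

def roc_auc(scores, labels):
    """Pairwise-comparison ROC AUC: fraction of (positive, negative) pairs
    ranked correctly by score, with ties counting one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

acc = majority_accuracy(470, 133)        # ≈ 0.779
auc = roc_auc([0.5, 0.5, 0.5], [1, 0, 1])  # constant scores give 0.5
```

Because every positive/negative pair is tied under a constant predictor, its AUC is exactly 0.5 regardless of the label imbalance, which is what makes AUC the more informative metric here.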
- This data is available from https://www.mcrlab.net/research/mvsasentiment-analysis-on-multiview-social-data/. We use the MVSA-Single dataset because human annotators examine both the text and image simultaneously; we chose not to use MVSA-Multiple because human annotators do not see the tweet’s image and text at the same time. However, the dataset download link only comes with 4870 labels, instead of the 5129 described in the original paper. We contacted the authors of the original work about the missing data, but did not receive a reply.
- We follow the preprocessing steps detailed in Xu and Mao (2017) to derive a training dataset. After preprocessing, we are left with 4041 data points, whereas prior work compares with 4511 points after preprocessing. The preprocessing consists of removing points that are (positive, negative), (negative, positive), or (neutral, neutral), which we believe matches the description of the preprocessing in that work. We contacted the authors for details, but did not receive a reply. The state-of-the-art performance number for this dataset is from Xu et al. (2018).