Visual question answering requires answering a given question-image pair. Prior work observed that the original split of the VQAv2 dataset allows models to leverage language priors.
Removing Bias in Multi-modal Classifiers: Regularization by Maximizing Functional Entropies
NeurIPS 2020
Many recent datasets contain a variety of different data modalities, for instance, image, question, and answer data in visual question answering (VQA). When training deep net classifiers on those multi-modal datasets, the modalities get exploited at different scales, i.e., some modalities can more easily contribute to the classification...
- Multi-modal data is ubiquitous and commonly used in many real-world applications. For instance, discriminative visual question answering systems take into account the question, the image and a variety of answers.
- Training of discriminative classifiers on multi-modal datasets like discriminative visual question answering almost always follows the classical machine learning paradigm: use a common loss function like cross-entropy and employ a standard 2-norm regularizer (a.k.a. weight decay).
- The regularizer favors ‘simple’ classifiers over more complex ones
- These classical regularizers are suitable in traditional machine learning settings that predominantly use a single data modality.
- E.g., answering ‘how many...?’ questions with ‘2’ regardless of the image
- Another popular example consists of colored images whose label is correlated with their color modality and their shape modality.
- The cross-entropy loss between two distributions pw(y|x) and q(y) is −∑y q(y) log pw(y|x)
- To address the computational challenges of computing the functional entropy, we develop a method based on the log-Sobolev inequality, which bounds the functional entropy by the functional Fisher information
- Classical regularizers applied on multi-modal datasets lead to models which may ignore one or more of the modalities. This is sub-optimal as we expect all modalities to contribute to classification. To alleviate this concern we study regularization via the functional entropy
- The authors evaluate the proposed regularization on four different datasets.
- One of the datasets is a synthetic dataset (Colored MNIST), which permits studying whether a classifier leverages the wrong features.
- The authors show that adding the discussed regularization improves the generalization of a given classifier.
- The authors briefly describe each dataset and discuss the results of the proposed method.
- Adding the proposed regularization encourages the model to exploit information from both the shape and color modalities.
- The authors evaluated the method by adding functional Fisher information regularization to the current state-of-the-art.
- Functional Fisher information regularization results in 70% accuracy on the train set and improves validation set accuracy to 67.93%.
- Regularization via the functional entropy encourages the model to more uniformly exploit the available modalities.
- Table 1: Comparison between our proposed regularization terms on the Colored MNIST (multi-modal settings, gray-scale test set), SocialIQ [7] and Dogs & Cats [6] datasets. We report the maximum accuracy observed and the accuracy after convergence of the model (Convg.). We compare the 4 regularizers specified by the equation numbers. We underline the highest maximum accuracy and bold the highest results after convergence. Using functional Fisher information regularization (Eq. (12)) leads to a smaller difference between the maximum accuracy and the accuracy after convergence. * refers to results we achieve without using our proposed regularization. ** denotes training with weight decay (ℓ2 regularization)
- Table 2: Comparison between the state-of-the-art on the VQA-CPv2 test set. The best results for each category are in bold. * denotes models that make use of external data
- Multi-modal datasets. Over the years, the amount and variety of data that has been used across tasks has grown significantly. Unsurprisingly, present-day tasks are increasingly sophisticated and combine multiple data modalities like vision, text, and audio. In particular, in the past few years, many large-scale multi-modal datasets have been proposed [2, 3, 7, 8, 9, 10]. Subsequently, multiple works developed strong models to address these datasets [11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26]. However, recent work also suggests that many of these advanced models predict by leveraging one of the modalities more than the others, e.g., utilizing question type to determine the answer in VQA problems [1, 27, 28, 29]. This property is undesirable, since multi-modal tasks are designed so that all modalities are essential to solve the challenge without overfitting to the dataset.
- This work is supported in part by NSF under Grants #1718221 and #2008387, NIFA award 2020-67021-32799, and BSF under Grant #2019783
We illustrate the efficacy of the proposed approach on the three challenging multi-modal datasets Colored MNIST, VQA-CPv2, and SocialIQ. We find that our regularization maximizes the utilization of essential information.
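The two quantities underlying the regularization can be made explicit. The following is a sketch based on standard log-Sobolev references, not copied from the paper; here f is a positive function (e.g., the prediction as a function of one modality's input) and μ is assumed to be a Gaussian measure, for which the constant 1/2 holds:

```latex
% Functional entropy of a positive function f with respect to a measure \mu:
\mathrm{Ent}_{\mu}(f) \;=\; \mathbb{E}_{\mu}\!\left[f \log f\right]
  \;-\; \mathbb{E}_{\mu}[f]\,\log \mathbb{E}_{\mu}[f].

% Log-Sobolev inequality (Gaussian \mu): the functional entropy is bounded
% by the functional Fisher information, which is cheaper to compute:
\mathrm{Ent}_{\mu}(f) \;\le\; \tfrac{1}{2}\,
  \mathbb{E}_{\mu}\!\left[\frac{\|\nabla f\|^{2}}{f}\right]
  \;=\; \tfrac{1}{2}\,\mathrm{I}_{\mu}(f).
```

The bound explains the design choice reported above: instead of maximizing the functional entropy directly, one can regularize with the functional Fisher information that bounds it.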
Dataset: Colored MNIST [5, 6] is a synthetic dataset based on MNIST. The train and validation sets consist of 60,000 and 10,000 samples, respectively. Each sample is biased with a color that correlates with its digit.
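To make the bias concrete, here is a minimal sketch of how label-correlated colors can be injected into grayscale digits. This is an illustrative assumption, not the authors' construction; `palette`, `colorize`, and the noise scale are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

# Each digit class is paired with a fixed mean RGB color; train images are
# tinted with that color (plus small noise), so color alone predicts the label.
palette = rng.uniform(0.2, 1.0, size=(10, 3))  # one RGB mean per digit class

def colorize(gray_images, labels, sigma=0.05):
    """Tint (N, 28, 28) grayscale images with label-correlated colors."""
    colors = palette[labels] + rng.normal(0, sigma, size=(len(labels), 3))
    colors = colors.clip(0.0, 1.0)
    # Broadcast: (N, 28, 28, 1) * (N, 1, 1, 3) -> (N, 28, 28, 3)
    return gray_images[..., None] * colors[:, None, None, :]

gray = rng.uniform(0, 1, size=(8, 28, 28))  # stand-in for MNIST digits
labels = rng.integers(0, 10, size=8)
colored = colorize(gray, labels)
```

A gray-scale test set (as in Table 1) then removes the color cue, so a classifier that relied only on color collapses.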
To challenge models not to use these priors, the question-type distributions of the train and validation sets were changed to differ from one another. VQA-CPv2 consists of 438,183 samples in the train set and 219,928 samples in the test set. Results: We evaluated our method by adding functional Fisher information regularization to the current state-of-the-art.
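The quantity being maximized can be illustrated numerically. This is a hedged sketch of an empirical estimate of the functional entropy Ent(f) = E[f log f] − E[f] log E[f]; `functional_entropy` is an illustrative helper, not the authors' code:

```python
import numpy as np

def functional_entropy(f):
    """Empirical Ent(f) = E[f log f] - E[f] log E[f] for a positive f,
    with expectations taken as sample averages."""
    f = np.asarray(f, dtype=float)
    assert (f > 0).all(), "functional entropy is defined for positive f"
    mean = f.mean()
    return float((f * np.log(f)).mean() - mean * np.log(mean))

# Ent(f) is non-negative (x log x is convex) and zero iff f is constant,
# so maximizing the functional entropy of the prediction as a function of
# a modality's input pushes the prediction to actually vary with -- i.e.,
# use -- that modality, rather than ignoring it.
low = functional_entropy([0.5, 0.5, 0.5, 0.5])   # constant f: entropy 0
high = functional_entropy([0.9, 0.1, 0.1, 0.1])  # varying f: entropy > 0
```

In practice the paper bounds this quantity by the functional Fisher information via the log-Sobolev inequality, which is the regularizer evaluated here.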
The task is to predict whether the answer is correct or not given this tuple. The dataset is split into 37,191 training samples and 5,320 validation samples. Note that an inherent bias exists in this dataset: specifically, the sentiment of the answer provides a good cue.
Dataset: Following the settings of Kim et al., we evaluate our models on the biased “Dogs and Cats” dataset. This dataset comes in two splits: the TB1 set consists of bright dogs and dark cats and contains 10,047 samples; the TB2 set consists of dark dogs and bright cats and contains 6,738 samples. We use the image as a single modality.
- Aishwarya Agrawal, Dhruv Batra, Devi Parikh, and Aniruddha Kembhavi. Don’t just assume; look and answer: Overcoming priors for visual question answering. In CVPR, 2018.
- Justin Johnson, Bharath Hariharan, Laurens van der Maaten, Li Fei-Fei, C. Lawrence Zitnick, and Ross B. Girshick. CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning. In CVPR, 2017.
- Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the V in VQA matter: Elevating the role of image understanding in Visual Question Answering. In CVPR, 2017.
- Aishwarya Agrawal, Dhruv Batra, and Devi Parikh. Analyzing the behavior of visual question answering models. In EMNLP, 2016.
- Yi Li and Nuno Vasconcelos. Repair: Removing representation bias by dataset resampling. In CVPR, 2019.
- Byungju Kim, Hyunwoo Kim, Kyungsu Kim, Sungjin Kim, and Junmo Kim. Learning not to learn: Training deep neural networks with biased data. In CVPR, 2019.
- Amir Zadeh, Michael Chan, Paul Pu Liang, Edmund Tong, and Louis-Philippe Morency. Social-iq: A question answering benchmark for artificial social intelligence. In CVPR, 2019.
- Rowan Zellers, Yonatan Bisk, Ali Farhadi, and Yejin Choi. From recognition to cognition: Visual commonsense reasoning. In CVPR, 2019.
- Drew A. Hudson and Christopher D. Manning. Gqa: A new dataset for real-world visual reasoning and compositional question answering. In CVPR, 2019.
- Abhishek Das, Satwik Kottur, Khushi Gupta, Avi Singh, Deshraj Yadav, José M.F. Moura, Devi Parikh, and Dhruv Batra. Visual Dialog. In CVPR, 2017.
- Idan Schwartz, Alexander Schwing, and Tamir Hazan. High-order attention models for visual question answering. In NIPS, 2017.
- Jin-Hwa Kim, Jaehyun Jun, and Byoung-Tak Zhang. Bilinear Attention Networks. In NeurIPS, 2018.
- Idan Schwartz, Seunghak Yu, Tamir Hazan, and Alexander G. Schwing. Factor graph attention. In CVPR, 2019.
- Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In NeurIPS, 2019.
- Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. Bottom-up and top-down attention for image captioning and visual question answering. In CVPR, 2018.
- Hao Tan and Mohit Bansal. Lxmert: Learning cross-modality encoder representations from transformers. In EMNLP, 2019.
- Jiasen Lu, Jianwei Yang, Dhruv Batra, and Devi Parikh. Hierarchical question-image coattention for visual question answering. In NeurIPS, 2016.
- Akira Fukui, Dong Huk Park, Daylen Yang, Anna Rohrbach, Trevor Darrell, and Marcus Rohrbach. Multimodal compact bilinear pooling for visual question answering and visual grounding. In EMNLP, 2016.
- Drew A Hudson and Christopher D Manning. Compositional attention networks for machine reasoning. In ICLR, 2018.
- J. Aneja∗, H. Agrawal∗, D. Batra, and A. G. Schwing. Sequential Latent Spaces for Modeling the Intention During Diverse Image Captioning. In Proc. ICCV, 2019. ∗ equal contribution.
- U. Jain∗, Z. Zhang∗, and A. G. Schwing. Creativity: Generating Diverse Questions using Variational Autoencoders. In Proc. CVPR, 2017. ∗ equal contribution.
- U. Jain, S. Lazebnik, and A. G. Schwing. Two can play this Game: Visual Dialog with Discriminative Question Generation and Answering. In Proc. CVPR, 2018.
- M. Chatterjee and A. G. Schwing. Diverse and Coherent Paragraph Generation from Images. In Proc. ECCV, 2018.
- M. Narasimhan, S. Lazebnik, and A. G. Schwing. Out of the Box: Reasoning with Graph Convolution Nets for Factual Visual Question Answering. In Proc. NeurIPS, 2018.
- M. Narasimhan and A. G. Schwing. Straight to the Facts: Learning Knowledge Base Retrieval for Factual Visual Question Answering. In Proc. ECCV, 2018.
- J. Lin, U. Jain, and A. G. Schwing. TAB-VCR: Tags and Attributes based VCR Baselines. In Proc. NeurIPS, 2019.
- Idan Schwartz, Alexander G Schwing, and Tamir Hazan. A simple baseline for audio-visual scene-aware dialog. In CVPR, 2019.
- Suchin Gururangan, Swabha Swayamdipta, Omer Levy, Roy Schwartz, Samuel Bowman, and Noah A. Smith. Annotation artifacts in natural language inference data. In NAACL (Short Papers), 2018.
- Roy Schwartz, Maarten Sap, Ioannis Konstas, Leila Zilles, Yejin Choi, and Noah A. Smith. The effect of different writing tasks on linguistic style: A case study of the ROC story cloze task. In CoNLL, 2017.
- Martin Arjovsky, Léon Bottou, Ishaan Gulrajani, and David Lopez-Paz. Invariant risk minimization. https://arxiv.org/abs/1907.02893, 2019.
- Weiyao Wang, Du Tran, and Matt Feiszli. What makes training multi-modal classification networks hard? In CVPR, 2020.
- Christopher Clark, Mark Yatskar, and Luke Zettlemoyer. Don’t take the easy way out: Ensemble based methods for avoiding known dataset biases. In EMNLP, 2019.
- Remi Cadene, Corentin Dancette, Hedi Ben younes, Matthieu Cord, and Devi Parikh. Rubi: Reducing unimodal biases for visual question answering. In NeurIPS, 2019.
- Michael Jordan, Zoubin Ghahramani, Tommi Jaakkola, and Lawrence Saul. An introduction to variational methods for graphical models. Learning in Graphical Models, 1999.
- Naftali Tishby and Noga Zaslavsky. Deep learning and the information bottleneck principle. In ITW, 2015.
- Ravid Shwartz-Ziv and Naftali Tishby. Opening the black box of deep neural networks via information. https://arxiv.org/abs/1703.00810, 2017.
- Ranjay Krishna, Michael Bernstein, and Li Fei-Fei. Information maximizing visual question generation. In CVPR, 2019.
- Sainandan Ramakrishnan, Aishwarya Agrawal, and Stefan Lee. Overcoming language priors in visual question answering with adversarial regularization. In NeurIPS, 2018.
- Zhibin Liao, Tom Drummond, Ian Reid, and Gustavo Carneiro. Approximate fisher information matrix to characterize the training of deep neural networks. TPAMI, 2020.
- Dominique Bakry, Ivan Gentil, and Michel Ledoux. Analysis and geometry of Markov diffusion operators. SBM, 2013.
- OS Rothaus. Analytic inequalities, isoperimetric inequalities and logarithmic sobolev inequalities. JFA, 1985.
- Michel Ledoux. The Concentration of Measure Phenomenon. AMS, 2001.
- Ramprasaath R Selvaraju, Stefan Lee, Yilin Shen, Hongxia Jin, Shalini Ghosh, Larry Heck, Dhruv Batra, and Devi Parikh. Taking a hint: Leveraging explanations to make vision and language models more grounded. In ICCV, 2019.
- Jialin Wu and Raymond Mooney. Self-critical reasoning for robust visual question answering. In NeurIPS, 2019.
- Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 1998.
- Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. VQA: Visual Question Answering. In ICCV, 2015.
- Robik Shrestha, Kushal Kafle, and Christopher Kanan. A negative case analysis of visual grounding methods for VQA. https://arxiv.org/abs/2004.05704, 2020.
- Gabriel Grand and Yonatan Belinkov. Adversarial regularization for visual question answering: Strengths, shortcomings, and side effects. In NAACL, 2019.
- Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.