Removing Bias in Multi-modal Classifiers: Regularization by Maximizing Functional Entropies

NeurIPS 2020

Abstract

Many recent datasets contain a variety of different data modalities, for instance, image, question, and answer data in visual question answering (VQA). When training deep net classifiers on those multi-modal datasets, the modalities get exploited at different scales, i.e., some modalities can more easily contribute to the classification...
Introduction
  • Multi-modal data is ubiquitous and commonly used in many real-world applications. For instance, discriminative visual question answering systems take into account the question, the image and a variety of answers.
  • Training discriminative classifiers on multi-modal datasets, such as those for discriminative visual question answering, almost always follows the classical machine learning paradigm: use a common loss function like cross-entropy and employ a standard 2-norm regularizer (a.k.a. weight decay).
  • The regularizer favors ‘simple’ classifiers over more complex ones.
  • These classical regularizers are suitable in traditional machine learning settings that predominantly use a single data modality.
  • When applied to multi-modal data, however, such regularizers permit classifiers to rely on a single, easy-to-exploit modality, e.g., answering ‘how many...?’ questions with ‘2’ regardless of the image.
  • Another popular example consists of colored images whose label is correlated with their color modality and their shape modality.
  • The cross-entropy loss between two distributions pw(y|x) and q(y) is −∑_y q(y) log pw(y|x).
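To make this classical paradigm concrete, here is a minimal PyTorch-style sketch of a two-branch (image + question) classifier trained with cross-entropy and weight decay. The fusion architecture, feature dimensions, answer-vocabulary size, and optimizer settings are illustrative assumptions, not the authors' model.

```python
import torch
import torch.nn as nn

# Illustrative sketch only: a late-fusion two-modality classifier trained the
# "classical" way, i.e., cross-entropy plus a standard 2-norm regularizer.
class LateFusionClassifier(nn.Module):
    def __init__(self, img_dim=2048, txt_dim=768, hidden=512, num_answers=3129):
        super().__init__()
        self.img_branch = nn.Sequential(nn.Linear(img_dim, hidden), nn.ReLU())
        self.txt_branch = nn.Sequential(nn.Linear(txt_dim, hidden), nn.ReLU())
        self.classifier = nn.Linear(2 * hidden, num_answers)

    def forward(self, img_feat, txt_feat):
        fused = torch.cat([self.img_branch(img_feat), self.txt_branch(txt_feat)], dim=-1)
        return self.classifier(fused)  # logits over candidate answers

model = LateFusionClassifier()
# weight_decay adds the 2-norm penalty on the parameters ("simple" classifiers
# are favored), but it says nothing about how the two modalities are used.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=1e-4)
loss_fn = nn.CrossEntropyLoss()  # -sum_y q(y) log pw(y|x) with one-hot q

def training_step(img_feat, txt_feat, answer_idx):
    # img_feat: (B, 2048), txt_feat: (B, 768), answer_idx: (B,) long tensor
    optimizer.zero_grad()
    logits = model(img_feat, txt_feat)
    loss = loss_fn(logits, answer_idx)
    loss.backward()
    optimizer.step()
    return loss.item()
```

The weight-decay term penalizes large parameters as a whole; it does not encourage a balanced use of the two branches, which is exactly the gap the functional-entropy regularization studied in this paper targets.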
Highlights
  • Multi-modal data is ubiquitous and commonly used in many real-world applications
  • Discriminative visual question answering systems take into account the question, the image and a variety of answers
  • To address the computational challenge of computing the functional entropy, we develop a method based on the log-Sobolev inequality, which bounds the functional entropy by the functional Fisher information (see the definitions sketched after this list).
  • Visual question answering (VQA) requires producing an answer for a given question-image pair. [1] observed that the original split of the VQAv2 dataset permits models to leverage language priors.
  • VQA-CPv2 consists of 438,183 samples in the train set and 219,928 samples in the test set.
  • Classical regularizers applied to multi-modal datasets lead to models which may ignore one or more of the modalities. This is sub-optimal, as we expect all modalities to contribute to classification. To alleviate this concern, we study regularization via the functional entropy.
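For reference, the quantities named above have standard forms; the sketch below states them in their textbook version, for a non-negative function f and a probability measure μ, with the log-Sobolev inequality written for the standard Gaussian measure γ. The measure, constants, and the exact regularizer (Eq. (12)) used by the authors may differ from this generic statement.

```latex
% Functional entropy of a non-negative function f under a probability measure \mu:
\mathrm{Ent}_{\mu}(f) = \mathbb{E}_{\mu}[f \log f] - \mathbb{E}_{\mu}[f]\,\log \mathbb{E}_{\mu}[f]

% Functional Fisher information:
\mathcal{I}_{\mu}(f) = \mathbb{E}_{\mu}\!\left[ \frac{\|\nabla f\|^{2}}{f} \right]

% Log-Sobolev inequality (textbook form for the standard Gaussian measure \gamma):
\mathrm{Ent}_{\gamma}(f) \le \tfrac{1}{2}\, \mathcal{I}_{\gamma}(f)
```

Bounding the entropy by the Fisher information in this way is what makes the regularizer tractable, since the gradient term can be estimated with standard automatic differentiation.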
Methods
  • The authors evaluate the proposed regularization on four different datasets.
  • One of the datasets is a synthetic dataset (Colored MNIST), which permits studying whether a classifier leverages the wrong features.
  • The authors show that adding the discussed regularization improves the generalization of a given classifier.
  • The authors briefly describe each dataset and discuss the results of the proposed method.
Results
  • Adding the proposed regularization encourages the model to exploit information from both the shape and color modalities.
  • The authors evaluated the method by adding functional Fisher information regularization to the current state-of-the-art [32].
  • Functional Fisher information regularization results in 70% accuracy on the train set and improves validation set accuracy to 67.93%.
  • Dataset: Following the settings of Kim et al. [6], the authors evaluate the models on the biased “Dogs and Cats” dataset.
  • This dataset comes in two splits: The TB1 set consists of bright dogs and dark cats and contains 10,047 samples.
Conclusion
  • Classical regularizers applied on multi-modal datasets lead to models which may ignore one or more of the modalities.
  • This is sub-optimal as the authors expect all modalities to contribute to classification.
  • To alleviate this concern the authors study regularization via the functional entropy.
  • It encourages the model to more uniformly exploit the available modalities.
Tables
  • Table 1: Comparison between our proposed regularization terms on the Colored MNIST (multi-modal settings, gray-scale test set), SocialIQ [7] and Dogs & Cats [6] datasets. We report the maximum accuracy observed and the accuracy after convergence of the model (Convg). We compare the 4 regularizers specified by the equation numbers. We underline the highest maximum accuracy and bold the highest results after convergence. Using functional Fisher information regularization (Eq. (12)) leads to a smaller difference between the maximum accuracy and the accuracy after convergence. * refers to results we achieve without using our proposed regularization. ** denotes training with weight decay (ℓ2 regularization).
  • Table 2: Comparison with the state of the art on the VQA-CPv2 test set. The best results for each category are in bold. * denotes models that make use of external data.
Related Work
  • Multi-modal datasets. Over the years, the amount and variety of data used across tasks have grown significantly. Unsurprisingly, present-day tasks are increasingly sophisticated and combine multiple data modalities like vision, text, and audio. In particular, in the past few years, many large-scale multi-modal datasets have been proposed [2, 3, 7, 8, 9, 10]. Subsequently, multiple works developed strong models to address these datasets [11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26]. However, recent work also suggests that many of these advanced models predict by leveraging one of the modalities more than the others, e.g., utilizing the question type to determine the answer in VQA problems [1, 27, 28, 29]. This property is undesirable, since multi-modal tasks treat all modalities as essential for solving the challenge without overfitting to the dataset.
Funding
  • This work is supported in part by NSF under Grants #1718221 and #2008387, NIFA award 2020-67021-32799, and BSF under Grant #2019783.
Study Subjects and Analysis
challenging multi-modal datasets: 3
We illustrate the efficacy of the proposed approach on the three challenging multi-modal datasets Colored MNIST, VQA-CPv2, and SocialIQ. We find that our regularization maximizes the utilization of essential information.

samples: 10000
Dataset: Colored MNIST [5, 6] is a synthetic dataset based on MNIST [45]. The train and validation sets consist of 60,000 and 10,000 samples, respectively. Each sample is biased with a color that correlates with its digit.
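To make this bias construction concrete, here is a small Python sketch of one way to build such a color-biased MNIST variant; the digit-to-color palette, the bias probability, and the colorization scheme are illustrative assumptions and need not match the exact protocol of [5, 6].

```python
import numpy as np
import torch
from torchvision import datasets, transforms

# Each digit class gets a "preferred" RGB color; with probability `bias` a
# sample is tinted with its class color, otherwise with a random one, so color
# becomes a spurious cue that correlates with the label.
PALETTE = np.array([
    [255, 0, 0], [0, 255, 0], [0, 0, 255], [255, 255, 0], [255, 0, 255],
    [0, 255, 255], [255, 128, 0], [128, 0, 255], [0, 128, 128], [128, 128, 0],
], dtype=np.float32)

def colorize(gray_img, label, bias=0.9, rng=None):
    """gray_img: (28, 28) tensor in [0, 1]; returns a (3, 28, 28) colored image."""
    rng = rng or np.random.default_rng()
    color_idx = label if rng.random() < bias else rng.integers(10)
    color = torch.tensor(PALETTE[color_idx] / 255.0).view(3, 1, 1)
    return gray_img.unsqueeze(0) * color

mnist = datasets.MNIST(root="./data", train=True, download=True,
                       transform=transforms.ToTensor())
img, label = mnist[0]                 # img: (1, 28, 28)
colored = colorize(img.squeeze(0), label)
print(colored.shape)                  # torch.Size([3, 28, 28])
```

A gray-scale test set (as used for Table 1) then removes the color cue, so a classifier that relied on color alone degrades, while one that also uses shape does not.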

samples: 438183
To challenge models to not use these priors, the question type distributions of the train and validation set were changed to differ from one another. VQA-CPv2 consists of 438,183 samples in the train set and 219,928 samples in the test set. Results: We evaluated our method by adding functional Fisher information regularization to the current state-of-the-art [32].

training samples: 37191
The task is to predict whether the answer is correct or not given this tuple. The dataset is split into 37,191 training samples and 5,320 validation set samples. Note that an inherent bias exists in this dataset: specifically, the sentiment of the answer provides a good cue.

samples: 10047
Dataset: Following the settings of Kim et al. [6], we evaluate our models on the biased “Dogs and Cats” dataset. This dataset comes in two splits: The TB1 set consists of bright dogs and dark cats and contains 10,047 samples. The TB2 set consists of dark dogs and bright cats and contains 6,738 samples.

samples: 6738
This dataset comes in two splits: The TB1 set consists of bright dogs and dark cats and contains 10,047 samples. The TB2 set consists of dark dogs and bright cats and contains 6,738 samples. We use the image as a single modality.

References
  • Aishwarya Agrawal, Dhruv Batra, Devi Parikh, and Aniruddha Kembhavi. Don’t just assume; look and answer: Overcoming priors for visual question answering. In CVPR, 2018.
  • Justin Johnson, Bharath Hariharan, Laurens van der Maaten, Li Fei-Fei, C. Lawrence Zitnick, and Ross B. Girshick. CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning. In CVPR, 2017.
  • Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the V in VQA matter: Elevating the role of image understanding in Visual Question Answering. In CVPR, 2017.
  • Aishwarya Agrawal, Dhruv Batra, and Devi Parikh. Analyzing the behavior of visual question answering models. In EMNLP, 2016.
  • Yi Li and Nuno Vasconcelos. REPAIR: Removing representation bias by dataset resampling. In CVPR, 2019.
  • Byungju Kim, Hyunwoo Kim, Kyungsu Kim, Sungjin Kim, and Junmo Kim. Learning not to learn: Training deep neural networks with biased data. In CVPR, 2018.
  • Amir Zadeh, Michael Chan, Paul Pu Liang, Edmund Tong, and Louis-Philippe Morency. Social-IQ: A question answering benchmark for artificial social intelligence. In CVPR, 2019.
  • Rowan Zellers, Yonatan Bisk, Ali Farhadi, and Yejin Choi. From recognition to cognition: Visual commonsense reasoning. In CVPR, 2019.
  • Drew A. Hudson and Christopher D. Manning. GQA: A new dataset for real-world visual reasoning and compositional question answering. In CVPR, 2019.
  • Abhishek Das, Satwik Kottur, Khushi Gupta, Avi Singh, Deshraj Yadav, José M.F. Moura, Devi Parikh, and Dhruv Batra. Visual Dialog. In CVPR, 2017.
  • Idan Schwartz, Alexander Schwing, and Tamir Hazan. High-order attention models for visual question answering. In NIPS, 2017.
  • Jin-Hwa Kim, Jaehyun Jun, and Byoung-Tak Zhang. Bilinear Attention Networks. In NeurIPS, 2018.
  • Idan Schwartz, Seunghak Yu, Tamir Hazan, and Alexander G. Schwing. Factor graph attention. In CVPR, 2019.
  • Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In NeurIPS, 2019.
  • Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. Bottom-up and top-down attention for image captioning and visual question answering. In CVPR, 2018.
  • Hao Tan and Mohit Bansal. LXMERT: Learning cross-modality encoder representations from transformers. In EMNLP, 2019.
  • Jiasen Lu, Jianwei Yang, Dhruv Batra, and Devi Parikh. Hierarchical question-image co-attention for visual question answering. In NeurIPS, 2016.
  • Akira Fukui, Dong Huk Park, Daylen Yang, Anna Rohrbach, Trevor Darrell, and Marcus Rohrbach. Multimodal compact bilinear pooling for visual question answering and visual grounding. CVPR, 2016.
  • Drew A. Hudson and Christopher D. Manning. Compositional attention networks for machine reasoning. In ICLR, 2018.
  • J. Aneja∗, H. Agrawal∗, D. Batra, and A. G. Schwing. Sequential Latent Spaces for Modeling the Intention During Diverse Image Captioning. In Proc. ICCV, 2019. ∗ equal contribution.
  • U. Jain∗, Z. Zhang∗, and A. G. Schwing. Creativity: Generating Diverse Questions using Variational Autoencoders. In Proc. CVPR, 2017. ∗ equal contribution.
  • U. Jain, S. Lazebnik, and A. G. Schwing. Two can play this Game: Visual Dialog with Discriminative Question Generation and Answering. In Proc. CVPR, 2018.
  • M. Chatterjee and A. G. Schwing. Diverse and Coherent Paragraph Generation from Images. In Proc. ECCV, 2018.
  • M. Narasimhan, S. Lazebnik, and A. G. Schwing. Out of the Box: Reasoning with Graph Convolution Nets for Factual Visual Question Answering. In Proc. NeurIPS, 2018.
  • M. Narasimhan and A. G. Schwing. Straight to the Facts: Learning Knowledge Base Retrieval for Factual Visual Question Answering. In Proc. ECCV, 2018.
  • J. Lin, U. Jain, and A. G. Schwing. TAB-VCR: Tags and Attributes based VCR Baselines. In Proc. NeurIPS, 2019.
  • Idan Schwartz, Alexander G. Schwing, and Tamir Hazan. A simple baseline for audio-visual scene-aware dialog. In CVPR, 2019.
  • Suchin Gururangan, Swabha Swayamdipta, Omer Levy, Roy Schwartz, Samuel Bowman, and Noah A. Smith. Annotation artifacts in natural language inference data. In NAACL (Short Papers), 2018.
  • Roy Schwartz, Maarten Sap, Ioannis Konstas, Leila Zilles, Yejin Choi, and Noah A. Smith. The effect of different writing tasks on linguistic style: A case study of the ROC story cloze task. In CoNLL, 2017.
  • Martin Arjovsky, Léon Bottou, Ishaan Gulrajani, and David Lopez-Paz. Invariant risk minimization. https://arxiv.org/abs/1907.02893, 2019.
  • Weiyao Wang, Du Tran, and Matt Feiszli. What makes training multi-modal classification networks hard? In CVPR, 2020.
  • Christopher Clark, Mark Yatskar, and Luke Zettlemoyer. Don’t take the easy way out: Ensemble based methods for avoiding known dataset biases. In EMNLP, 2019.
  • Remi Cadene, Corentin Dancette, Hedi Ben-younes, Matthieu Cord, and Devi Parikh. RUBi: Reducing unimodal biases for visual question answering. In NeurIPS, 2019.
  • Michael Jordan, Zoubin Ghahramani, Tommi Jaakkola, and Lawrence Saul. An introduction to variational methods for graphical models. Learning in Graphical Models, 1999.
  • Naftali Tishby and Noga Zaslavsky. Deep learning and the information bottleneck principle. In ITW, 2015.
  • Ravid Shwartz-Ziv and Naftali Tishby. Opening the black box of deep neural networks via information. https://arxiv.org/abs/1703.00810, 2017.
  • Ranjay Krishna, Michael Bernstein, and Li Fei-Fei. Information maximizing visual question generation. In CVPR, 2019.
  • Sainandan Ramakrishnan, Aishwarya Agrawal, and Stefan Lee. Overcoming language priors in visual question answering with adversarial regularization. In NeurIPS, 2018.
  • Zhibin Liao, Tom Drummond, Ian Reid, and Gustavo Carneiro. Approximate Fisher information matrix to characterize the training of deep neural networks. TPAMI, 2020.
  • Dominique Bakry, Ivan Gentil, and Michel Ledoux. Analysis and geometry of Markov diffusion operators. SBM, 2013.
  • O. S. Rothaus. Analytic inequalities, isoperimetric inequalities and logarithmic Sobolev inequalities. JFA, 1985.
  • Michel Ledoux. The Concentration of Measure Phenomenon. AMS, 2001.
  • Ramprasaath R. Selvaraju, Stefan Lee, Yilin Shen, Hongxia Jin, Shalini Ghosh, Larry Heck, Dhruv Batra, and Devi Parikh. Taking a hint: Leveraging explanations to make vision and language models more grounded. In ICCV, 2019.
  • Jialin Wu and Raymond Mooney. Self-critical reasoning for robust visual question answering. In NeurIPS, 2019.
  • Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 1998.
  • Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. VQA: Visual Question Answering. In ICCV, 2015.
  • Robik Shrestha, Kushal Kafle, and Christopher Kanan. A negative case analysis of visual grounding methods for VQA. https://arxiv.org/abs/2004.05704, 2020.
  • Gabriel Grand and Yonatan Belinkov. Adversarial regularization for visual question answering: Strengths, shortcomings, and side effects. In NAACL, 2019.
  • Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
Authors
Itai Gat
Idan Schwartz