GFSNet: Gaussian Fourier with Sparse Attention Network for Visual Question Answering

Xiang Shen, Dezhi Han, Chin-Chen Chang, Ammar Oad, Huafeng Wu


Abstract A profound understanding of, and reasoning about, the relationship between images and questions is crucial in Visual Question Answering (VQA) tasks. However, traditional self-attention mechanisms exhibit limitations: they are primarily confined to spatial-domain modeling of images and lack the capability to adequately model and analyze visual information at different scales in the frequency domain. Additionally, traditional self-attention-based image feature modeling introduces noise when capturing long-distance dependencies, causing the model to focus excessively on irrelevant details and thereby reducing robustness. To address these issues, this paper proposes a novel Gaussian Fourier with Sparse Attention Network (GFSNet). GFSNet uses the Fourier transform to represent the image attention weights obtained through self-attention in the frequency domain, where analyzing those weights facilitates effective modeling of information at different scales. Because information at different scales in images often manifests as distinct frequency components, the model can better capture and adapt to the complex structures and correlations of these various scale details. To mitigate high-frequency noise in the frequency domain, we design an adaptive Gaussian filter to suppress noise in the images. Finally, a novel sparse attention mechanism is introduced to select optimized key frequency-domain features. This enables the model to focus more effectively on critical image regions, reducing the processing of irrelevant or redundant information while enhancing interpretability and robustness. The proposed GFSNet model aims to achieve effective modeling of visual information at different scales without increasing model parameters or altering computational complexity.
Extensive experiments on the VQAv2 and GQA benchmark datasets unequivocally demonstrate the superiority and effectiveness of the GFSNet approach. Source code is available at
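The pipeline described in the abstract (self-attention weights, Fourier transform, Gaussian filtering, sparse selection) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the fixed `sigma` (the paper's filter is adaptive, whereas here it is a plain parameter), and the per-row top-k rule used as the sparse step are all assumptions made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax; entries equal to -inf receive weight 0.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def gaussian_lowpass(n, sigma):
    # Gaussian response over the n FFT frequency bins (cycles per sample).
    # Hypothetical stand-in for the adaptive filter: sigma is fixed here.
    freqs = np.fft.fftfreq(n)
    return np.exp(-(freqs ** 2) / (2.0 * sigma ** 2))


def gfs_attention_sketch(q, k, v, sigma=0.15, top_k=4):
    # q, k, v: arrays of shape (n_regions, d).
    d = q.shape[-1]

    # 1. Standard scaled dot-product attention weights, shape (n, n).
    attn = softmax(q @ k.T / np.sqrt(d))

    # 2. Move the attention map to the frequency domain.
    spectrum = np.fft.fft2(attn)

    # 3. Attenuate high frequencies with a separable Gaussian mask.
    mask = np.outer(
        gaussian_lowpass(attn.shape[0], sigma),
        gaussian_lowpass(attn.shape[1], sigma),
    )
    filtered = np.fft.ifft2(spectrum * mask).real

    # 4. Sparse step (assumed rule): keep the top_k entries of each row,
    #    mask out the rest, and renormalize with a softmax.
    idx = np.argpartition(filtered, -top_k, axis=-1)[:, -top_k:]
    sparse = np.full_like(filtered, -np.inf)
    kept = np.take_along_axis(filtered, idx, axis=-1)
    np.put_along_axis(sparse, idx, kept, axis=-1)
    weights = softmax(sparse)

    return weights @ v, weights
```

The sketch adds no trainable parameters beyond those that produce `q`, `k`, and `v`, which is in the spirit of the abstract's claim about parameter count. It makes no claim about matching the paper's exact filter or selection rule.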