DctViT: Discrete Cosine Transform meet vision transformers

Keke Su, Lihua Cao, Botong Zhao, Ning Li, Di Wu, Xiyu Han, Yangfan Liu

Neural Networks (2024)

Abstract

Vision transformers (ViTs) have become one of the dominant frameworks for vision tasks in recent years because self-attention lets them capture long-range dependencies efficiently in image recognition. Both CNNs and ViTs have advantages and disadvantages in vision tasks, and some studies suggest that combining the two may be an effective way to balance performance and computational cost. In this paper, we propose a new hybrid network based on CNNs and transformers, using the CNN to extract local features and the transformer to capture long-range dependencies. We also propose a new method for reducing feature map resolution based on the Discrete Cosine Transform and self-attention, named DCT-Attention Down-sample (DAD). Our DctViT-L achieves 84.8% top-1 accuracy on ImageNet-1K, far outperforming CMT, Next-ViT, SpectFormer and other state-of-the-art models, at lower computational cost. With DctViT-B as the backbone, RetinaNet achieves 46.8% mAP on COCO val2017, an improvement of 2.5% and 1.1% mAP over the CMT-S and SpectFormer backbones, respectively, at lower computational cost.
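The abstract does not describe how DAD works internally, so the following is only a minimal sketch of the general idea of reducing feature map resolution in the DCT domain: transform each channel, keep the low-frequency block of coefficients, and invert at the smaller size. It is not the authors' DAD module, and the self-attention part is omitted entirely. The function names `dct_matrix` and `dct_downsample`, the `factor` parameter, and the rescaling choice are all assumptions made for illustration.

```python
import numpy as np


def dct_matrix(n: int) -> np.ndarray:
    """Return the orthonormal DCT-II matrix of size n x n."""
    k = np.arange(n)[:, None]  # frequency index (rows)
    i = np.arange(n)[None, :]  # sample index (columns)
    mat = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    mat[0, :] = np.sqrt(1.0 / n)  # DC row uses a different normalisation
    return mat


def dct_downsample(x: np.ndarray, factor: int = 2) -> np.ndarray:
    """Shrink a (C, H, W) feature map by keeping low-frequency DCT coefficients.

    Illustrative only: a hypothetical stand-in, not the paper's DAD layer.
    """
    c, h, w = x.shape
    h_out, w_out = h // factor, w // factor

    # Forward 2-D DCT per channel: Y = D_h X D_w^T (matmul broadcasts over C).
    coeffs = dct_matrix(h) @ x @ dct_matrix(w).T

    # Keep only the top-left (low-frequency) block of coefficients.
    low = coeffs[:, :h_out, :w_out]

    # Inverse 2-D DCT at the reduced size: X' = D_h'^T Y' D_w'.
    out = dct_matrix(h_out).T @ low @ dct_matrix(w_out)

    # Rescale so that a constant map keeps its value after down-sampling.
    return out * np.sqrt((h_out * w_out) / (h * w))


if __name__ == "__main__":
    feature_map = np.random.default_rng(0).normal(size=(3, 8, 8))
    print(dct_downsample(feature_map).shape)
```

Compared with strided convolution or pooling, truncating in the frequency domain discards high-frequency detail explicitly while retaining the smooth, low-frequency structure of each channel. How the paper combines this with self-attention is not recoverable from the abstract.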

Key words

Deep learning, Computer vision, Image classification, Vision transformer, Discrete cosine transform
