Token Pooling in Vision Transformers for Image Classification

2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)

Abstract
Pooling is commonly used to improve the computation-accuracy trade-off of convolutional networks. By aggregating neighboring feature values on the image grid, pooling layers downsample feature maps while maintaining accuracy. In standard vision transformers, however, tokens are processed individually and do not necessarily lie on regular grids. Pooling methods designed for image grids (e.g., average pooling) can thus be suboptimal for transformers, as our experiments show. In this paper, we propose Token Pooling to downsample token sets in vision transformers. We take a new perspective: instead of assuming tokens form a regular grid, we treat them as discrete (and irregular) samples of an implicit continuous signal. Given a target number of tokens, Token Pooling finds the set of tokens that best approximates the underlying continuous signal. We rigorously evaluate the proposed method on the standard transformer architecture (ViT/DeiT) and on image classification using ImageNet-1k. Our experiments show that Token Pooling significantly improves the computation-accuracy trade-off without any further modifications to the architecture. Token Pooling enables DeiT-Ti to achieve the same top-1 accuracy while using 42% fewer computations.
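The abstract frames downsampling as approximating an implicit continuous signal with fewer, well-chosen samples. Below is a minimal sketch of one natural instantiation of this idea: per-batch K-means clustering over token embeddings, where the cluster centers serve as the pooled token set. The function name `token_pool`, the random initialization, and the fixed iteration count are illustrative assumptions, not the paper's exact algorithm.

```python
# Hypothetical sketch: downsample a token set by clustering token embeddings
# and keeping the cluster centers as the pooled tokens. Each center summarizes
# the tokens (signal samples) assigned to it.
import torch


def token_pool(tokens: torch.Tensor, k: int, iters: int = 10) -> torch.Tensor:
    """Downsample (B, N, D) tokens to (B, k, D) via per-batch K-means."""
    B, N, D = tokens.shape
    # Initialize centers with a random subset of the input tokens.
    idx = torch.stack([torch.randperm(N, device=tokens.device)[:k] for _ in range(B)])
    centers = torch.gather(tokens, 1, idx.unsqueeze(-1).expand(B, k, D))  # (B, k, D)

    for _ in range(iters):
        # Assign each token to its nearest center (Euclidean distance).
        dists = torch.cdist(tokens, centers)                              # (B, N, k)
        assign = dists.argmin(dim=-1)                                     # (B, N)
        onehot = torch.nn.functional.one_hot(assign, k).to(tokens.dtype)  # (B, N, k)
        # Recompute each center as the mean of its assigned tokens.
        # clamp(min=1) is a simplification that leaves empty clusters at zero.
        counts = onehot.sum(dim=1).clamp(min=1).unsqueeze(-1)             # (B, k, 1)
        centers = torch.einsum("bnk,bnd->bkd", onehot, tokens) / counts

    return centers


if __name__ == "__main__":
    x = torch.randn(2, 197, 192)   # e.g., DeiT-Ti: 196 patch tokens + [CLS]
    pooled = token_pool(x, k=99)   # roughly halve the token count
    print(pooled.shape)            # torch.Size([2, 99, 192])
```

In practice the [CLS] token would typically be set aside rather than pooled, and the pooled set could be refined with task-specific weighting; the sketch omits both for brevity.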
Keywords
vision transformers, image classification