Token Pooling in Vision Transformers for Image Classification

2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)

Abstract
Pooling is commonly used to improve the computation-accuracy trade-off of convolutional networks. By aggregating neighboring feature values on the image grid, pooling layers downsample feature maps while maintaining accuracy. In standard vision transformers, however, tokens are processed individually and do not necessarily lie on regular grids. Pooling methods designed for image grids (e.g., average pooling) can thus be suboptimal for transformers, as our experiments show. In this paper, we propose Token Pooling to downsample token sets in vision transformers. We take a new perspective: instead of assuming tokens form a regular grid, we treat them as discrete (and irregular) samples of an implicit continuous signal. Given a target number of tokens, Token Pooling finds the set of tokens that best approximates the underlying continuous signal. We rigorously evaluate the proposed method on the standard transformer architecture (ViT/DeiT) and on image classification using ImageNet-1k. Our experiments show that Token Pooling significantly improves the computation-accuracy trade-off without any further modifications to the architecture. Token Pooling enables DeiT-Ti to achieve the same top-1 accuracy while using 42% fewer computations.
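The abstract frames downsampling as approximating an implicit continuous signal with fewer, well-chosen samples. Below is a minimal sketch of one natural instantiation of this idea: per-batch K-means clustering over token embeddings, where the cluster centers serve as the pooled token set. The function name `token_pool`, the random initialization, and the fixed iteration count are illustrative assumptions, not the paper's exact algorithm.

```python
# Hypothetical sketch: downsample a token set by clustering token embeddings
# and keeping the cluster centers as the pooled tokens. Each center summarizes
# the tokens (signal samples) assigned to it.
import torch


def token_pool(tokens: torch.Tensor, k: int, iters: int = 10) -> torch.Tensor:
    """Downsample (B, N, D) tokens to (B, k, D) via per-batch K-means."""
    B, N, D = tokens.shape
    # Initialize centers with a random subset of the input tokens.
    idx = torch.stack([torch.randperm(N, device=tokens.device)[:k] for _ in range(B)])
    centers = torch.gather(tokens, 1, idx.unsqueeze(-1).expand(B, k, D))  # (B, k, D)

    for _ in range(iters):
        # Assign each token to its nearest center (Euclidean distance).
        dists = torch.cdist(tokens, centers)                              # (B, N, k)
        assign = dists.argmin(dim=-1)                                     # (B, N)
        onehot = torch.nn.functional.one_hot(assign, k).to(tokens.dtype)  # (B, N, k)
        # Recompute each center as the mean of its assigned tokens.
        # clamp(min=1) is a simplification that leaves empty clusters at zero.
        counts = onehot.sum(dim=1).clamp(min=1).unsqueeze(-1)             # (B, k, 1)
        centers = torch.einsum("bnk,bnd->bkd", onehot, tokens) / counts

    return centers


if __name__ == "__main__":
    x = torch.randn(2, 197, 192)   # e.g., DeiT-Ti: 196 patch tokens + [CLS]
    pooled = token_pool(x, k=99)   # roughly halve the token count
    print(pooled.shape)            # torch.Size([2, 99, 192])
```

In practice the [CLS] token would typically be set aside rather than pooled, and the pooled set could be refined with task-specific weighting; the sketch omits both for brevity.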
Keywords
vision transformers, image classification