Global Vision, Local Focus: The Semantic Enhancement Transformer Network for Crowd Counting

Mingtao Wang, Xin Zhou, Yuanyuan Chen

Crossref (2024)

Abstract
Automatic crowd counting has made significant progress in recent years. However, because of multi-scale variations, convolutional neural networks (CNNs) with fixed-size kernels struggle to handle this challenge, which severely limits counting performance. To alleviate this issue, we propose a semantic enhancement Transformer crowd counting network (named SET) to improve the semantic encoding relationships in crowd scenes. SET integrates global attention from the Transformer, learnable local attention, and the inductive bias of CNNs into a single counting model. First, we introduce an efficient Transformer encoder to extract low-level global features of crowd scenes. Second, we propose a learnable ViTBlock that dynamically learns appropriate weights for different regions, enhancing the model's global visual understanding. Finally, to guide the model to focus better on crowd regions, we jointly employ a segmentation attention module and a feature aggregation module to aggregate semantic and spatial features at multiple levels, obtaining finer-grained features. We conduct extensive experiments on four challenging datasets, including ShanghaiTech Part A/B, UCF-QNRF, and JHU-CROWD++, on which SET achieves strong results.
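To make the described pipeline concrete, the sketch below illustrates one plausible way the components named in the abstract could fit together: a CNN stem for local inductive bias, a Transformer encoder for global context, a learnable region-attention block, and a segmentation-attention branch whose mask gates the features before density-map regression. This is a minimal, hypothetical PyTorch sketch; the module names (SETSketch, RegionAttention, SegmentationAttention), layer sizes, and wiring are assumptions, not the authors' implementation.

```python
# Hypothetical sketch of a SET-style crowd counter (not the authors' code).
import torch
import torch.nn as nn


class RegionAttention(nn.Module):
    """Learn per-region weights and re-weight the feature map (illustrative)."""
    def __init__(self, channels):
        super().__init__()
        self.score = nn.Sequential(
            nn.Conv2d(channels, channels // 4, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels // 4, 1, 1), nn.Sigmoid())

    def forward(self, x):
        return x * self.score(x)  # emphasize informative regions


class SegmentationAttention(nn.Module):
    """Predict a crowd/background mask used to gate features (illustrative)."""
    def __init__(self, channels):
        super().__init__()
        self.mask = nn.Sequential(nn.Conv2d(channels, 1, 1), nn.Sigmoid())

    def forward(self, x):
        m = self.mask(x)
        return x * m, m  # gated features + mask for an auxiliary seg loss


class SETSketch(nn.Module):
    def __init__(self, channels=256, heads=4, depth=2):
        super().__init__()
        # CNN stem: local inductive bias, 1/8-resolution features.
        self.stem = nn.Sequential(
            nn.Conv2d(3, 64, 3, 2, 1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 128, 3, 2, 1), nn.ReLU(inplace=True),
            nn.Conv2d(128, channels, 3, 2, 1), nn.ReLU(inplace=True))
        # Transformer encoder over flattened patches: global context.
        layer = nn.TransformerEncoderLayer(
            d_model=channels, nhead=heads,
            dim_feedforward=2 * channels, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.region_attn = RegionAttention(channels)
        self.seg_attn = SegmentationAttention(channels)
        # Regression head: per-pixel density; summing gives the count.
        self.head = nn.Sequential(
            nn.Conv2d(channels, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 1, 1), nn.ReLU(inplace=True))

    def forward(self, img):
        f = self.stem(img)                     # B x C x H x W
        b, c, h, w = f.shape
        tokens = f.flatten(2).transpose(1, 2)  # B x HW x C
        f = self.encoder(tokens).transpose(1, 2).reshape(b, c, h, w)
        f = self.region_attn(f)                # learnable local weighting
        f, mask = self.seg_attn(f)             # focus on crowd regions
        density = self.head(f)
        return density, mask


if __name__ == "__main__":
    model = SETSketch()
    density, mask = model(torch.randn(1, 3, 256, 256))
    print(density.shape, density.sum().item())  # estimated crowd count
```

In this sketch the predicted density map would be supervised against a Gaussian-smoothed point annotation map, and the segmentation mask against a binarized version of it as an auxiliary loss; those training details are assumptions inferred from common practice in crowd counting, not from the abstract.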