Crowd Counting using Deep Recurrent Spatial-Aware Network.

IJCAI (2018): 849–855

Cited: 145 | Views: 88

Abstract

Crowd counting from unconstrained scene images is a crucial task in many real-world applications like urban surveillance and management, but it is greatly challenged by the camera's perspective that causes huge appearance variations in people's scales and rotations. Conventional methods address such challenges by resorting to fixed …

Introduction
  • Crowd counting, which aims at estimating the total number of people in unconstrained crowded scenes, has received increasing research interest in recent years due to its potential applications in many real-world scenarios, such as video surveillance [Xiong et al, 2017] and traffic monitoring [Zhang et al, 2017].
  • All of these methods, without exception, obtain the density estimate of the whole image by merging the prediction results of a number of pre-designed, fixed subnetwork structures.
  • These methods either fuse the features from multiple convolutional neural networks with different receptive fields to handle the scale variation of people groups [Zhang et al, 2016], or directly divide the crowd scene into multiple non-overlapping patches and select a regressor from a pool of regression networks for each patch [Sam et al, 2017].
  • As shown in Figure 1, the camera viewpoints in various scenes create different perspective effects and may result in large variations in scale as well as in-plane and out-of-plane rotations of people.
Highlights
  • Crowd counting, which aims at estimating the total number of people in unconstrained crowded scenes, has received increasing research interest in recent years due to its potential applications in many real-world scenarios, such as video surveillance [Xiong et al, 2017] and traffic monitoring [Zhang et al, 2017].
  • Deep convolutional neural networks have been widely used in crowd counting and have made substantial progress [Sindagi and Patel, 2017a; Onoro-Rubio and Lopez-Sastre, 2016; Sam et al, 2017; Xiong et al, 2017; Zhang et al, 2015].
  • As illustrated in Figure 2, we propose a novel Deep Recurrent Spatial-Aware Network for crowd counting, which is composed of two modules: a Global Feature Embedding (GFE) module and a Recurrent Spatial-Aware Refinement (RSAR) module.
  • As discussed in Section 3.3, the crowd count p_i can be calculated by summing over the estimated crowd density map (see the formula after this list).
  • Our method achieves a significant improvement of 49.7% in mean absolute error (MAE) and 39.5% in mean squared error (MSE) over the existing best-performing algorithm CP-CNN on Part B.
  • We introduce a novel Deep Recurrent Spatial-Aware Network for crowd counting which simultaneously models the variations of crowd density as well as the pose changes in a unified learnable module.
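As a brief illustration of the fourth point above, the count is obtained by integrating (summing) the estimated density map; the notation below is illustrative, since Section 3.3 of the paper is not reproduced on this page:

$$\hat{p}_i \;=\; \sum_{(x,\,y)} \hat{D}_i(x, y),$$

where $\hat{D}_i$ denotes the estimated crowd density map of the $i$-th image and the sum runs over all spatial locations.
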
Methods
  • As illustrated in Figure 2, the authors propose a novel Deep Recurrent Spatial-Aware Network for crowd counting, which is composed of two modules: a Global Feature Embedding (GFE) module and a Recurrent Spatial-Aware Refinement (RSAR) module.
  • The GFE module takes the whole image as input for global feature extraction, and the extracted features are further used to estimate an initial crowd density map.
  • The RSAR module is applied to iteratively locate image regions with a spatial transformer-based attention mechanism and refine the attended density map region with residual learning.
  • The goal of the Global Feature Embedding module is to transform the input image into high-dimensional feature maps, which are further used to generate an initial crowd density map of the image.
  • As shown in Figure 3(a), the GFE module is composed of three columns of CNNs, each of which has seven convolutional layers with different kernel sizes and channel numbers as well as three max-pooling layers (a minimal sketch is given after this list).
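Below is a minimal PyTorch sketch of a multi-column feature-embedding module in the spirit of the GFE module described above. The kernel sizes (3, 5, 7), channel widths, and the positions of the three pooling layers are illustrative assumptions rather than the paper's exact configuration, and the RSAR refinement loop (spatial-transformer attention plus residual refinement of the attended density-map region) is omitted.

```python
import torch
import torch.nn as nn


def make_column(kernel_size, channels):
    """One GFE-style column: seven conv+ReLU layers with a fixed kernel size
    and three 2x2 max-pooling layers (pooling positions are an assumption)."""
    layers, in_ch = [], 3
    pool_after = {1, 3, 5}  # pool after the 2nd, 4th, and 6th conv layers
    for i, out_ch in enumerate(channels):
        layers += [
            nn.Conv2d(in_ch, out_ch, kernel_size, padding=kernel_size // 2),
            nn.ReLU(inplace=True),
        ]
        if i in pool_after:
            layers.append(nn.MaxPool2d(2))
        in_ch = out_ch
    return nn.Sequential(*layers)


class GlobalFeatureEmbedding(nn.Module):
    """Three parallel columns with different receptive fields; the concatenated
    feature maps are mapped to an initial density map by a 1x1 convolution."""

    def __init__(self):
        super().__init__()
        widths = [16, 32, 32, 64, 64, 32, 16]             # illustrative channel widths
        self.columns = nn.ModuleList(
            make_column(k, widths) for k in (3, 5, 7)      # illustrative kernel sizes
        )
        self.to_density = nn.Conv2d(3 * widths[-1], 1, kernel_size=1)

    def forward(self, x):
        feats = torch.cat([col(x) for col in self.columns], dim=1)
        return feats, self.to_density(feats)               # global features + initial map


# e.g. feats, init_density = GlobalFeatureEmbedding()(torch.randn(1, 3, 384, 512))
```

Because every column applies the same number of pooling operations, the three feature maps share a spatial resolution and can be concatenated along the channel dimension before the 1×1 projection to an initial density map.
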
Results
  • ShanghaiTech [Zhang et al, 2016].
  • This dataset contains 1,198 images of unconstrained scenes with a total of 330,165 annotated people.
  • It is divided into two parts: Part A with 482 images crawled from the Internet, and Part B with 716 images taken from busy shopping streets.
  • The authors' method achieves a significant improvement of 49.7% in MAE and 39.5% in MSE over the existing best-performing algorithm CP-CNN on Part B (MAE and MSE are defined after this list).
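For reference, the reported MAE and MSE follow the standard crowd-counting convention, with $N$ test images, ground-truth counts $p_i$, and estimated counts $\hat{p}_i$; note that in this literature MSE conventionally denotes the root of the mean squared error:

$$\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N}\left|\hat{p}_i - p_i\right|, \qquad \mathrm{MSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(\hat{p}_i - p_i\right)^{2}}.$$
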
Conclusion
  • The authors introduce a novel Deep Recurrent Spatial-Aware Network for crowd counting which simultaneously models the variations of crowd density as well as the pose changes in a unified learnable module.
  • It can be regarded as a general framework for crowd density map refinement.
  • The authors plan to explore incorporating the model into other existing crowd-flow prediction frameworks.
Tables
  • Table 1: Performance evaluation of different methods on the ShanghaiTech dataset. Our proposed method outperforms the existing state-of-the-art methods on both parts of the ShanghaiTech dataset by a margin.
  • Table 2: Performance evaluation of different methods on the UCF_CC_50 dataset.
  • Table 3: Performance evaluation on the MALL dataset.
  • Table 4: Mean Absolute Error of different methods on the WorldExpo’10 dataset. Our method achieves superior performance with respect to the average MAE of the five scenes. The best results and the second-best results are highlighted in red and blue, respectively. Best viewed in color.
  • Table 5: Comparison of the performance of our model with different constraints on the spatial transformer on the ShanghaiTech dataset. T, S, and R correspond to translation, scale, and rotation, respectively (a generic parameterization is sketched after this list).
  • Table 6: Effectiveness verification of the global context on the ShanghaiTech dataset.
  • Table 7: Experimental results on the ShanghaiTech dataset for variants of our model with different numbers of refinement steps. Our method performs best when the density map is refined with n = 30 steps.
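To make the T/S/R ablation in Table 5 concrete, one generic constrained affine parameterization of the spatial transformer (an illustrative form, not necessarily the paper's exact one) is

$$A(s, \theta, t_x, t_y) = \begin{bmatrix} s\cos\theta & -s\sin\theta & t_x \\ s\sin\theta & s\cos\theta & t_y \end{bmatrix},$$

where fixing $s = 1$ disables scaling (S), fixing $\theta = 0$ disables rotation (R), and fixing $t_x = t_y = 0$ disables translation (T); each row of Table 5 then corresponds to enabling a subset of these degrees of freedom.
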
Related work
  • Deep Learning Methods for Crowd Counting: Inspired by the significant progress of deep learning on various computer vision tasks (such as image classification [Zhu et al, 2017; Chen et al, 2017] and salient object detection [Chen et al, 2016; Li et al, 2017]), many researchers have also attempted to adapt deep neural networks to the task of crowd counting and have achieved great success. Most of the existing methods address the scale variation of people with multi-scale architectures. Boominathan et al. [Boominathan et al, 2016] proposed to tackle the issue of scale variation using a combination of shallow and deep networks along with extensive data augmentation by sampling patches from multi-scale image representations. HydraCNN [Onoro-Rubio and Lopez-Sastre, 2016] learns a non-linear regression model that uses a pyramid of image patches extracted at multiple scales to perform the final density prediction. A pioneering work was proposed by Zhang et al. [Zhang et al, 2016], in which they utilized multi-column convolutional neural networks with different receptive fields to learn scale-robust features. Sam et al. [Sam et al, 2017] proposed a Switching-CNN that maps a patch from the crowd scene to one of three regression networks, each of which handles a particular range of scale variation. CP-CNN [Sindagi and Patel, 2017b] is a Contextual Pyramid CNN that generates high-quality crowd density estimates by incorporating global and local contextual information into multi-column networks. However, the above-mentioned methods have two significant limitations. First, since a neural network with a fixed, static receptive field can only handle a limited range of scale variation, these methods do not scale well to large scale changes and cannot cope with all levels of scale variation in diverse scenarios. Second, they did not take the rotation variation of people into consideration, which limits the models' robustness to camera perspective variations. To the best of our knowledge, ours is the first work to simultaneously address the issue of large scale and rotation variations in an adaptively learnable manner for this task.
Funding
  • This work was supported by the State Key Development Program under Grant 2016YFB1001004, the National Natural Science Foundation of China under Grant 61622214 and Grant 61702565, and the Guangdong Natural Science Foundation Project for Research Teams under Grant 2017A030312006.
  • This work was also sponsored by the CCF-Tencent Open Research Fund.
  • Wanli Ouyang is supported by SenseTime Group Limited.
Study subjects and analysis
public challenging datasets: 4
We optimize our network's parameters with Adam optimization [Kingma and Ba, 2014] by minimizing the loss function in Eq. (8) (a minimal training-step sketch is given below). In this section, we first compare our method with recent state-of-the-art methods on the crowd counting task on four public challenging datasets. We further conduct extensive ablation studies to demonstrate the effectiveness of each component of our model.
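As a rough illustration of this training setup, the sketch below pairs Adam with an assumed pixel-wise L2 objective between predicted and ground-truth density maps; the paper's actual Eq. (8) and its hyper-parameters (e.g., the learning rate) are not reproduced here, and the model is the GFE sketch from the Methods section above.

```python
import torch

def density_loss(pred, gt):
    """Assumed pixel-wise L2 objective between predicted and ground-truth
    density maps; the paper's actual Eq. (8) is not reproduced here."""
    return 0.5 * ((pred - gt) ** 2).flatten(1).sum(dim=1).mean()

model = GlobalFeatureEmbedding()                            # from the earlier sketch
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)   # learning rate is an assumption

def train_step(images, gt_density):
    """One Adam update on a mini-batch; gt_density is assumed to be given at
    the same resolution as the network's prediction."""
    optimizer.zero_grad()
    _, pred = model(images)
    loss = density_loss(pred, gt_density)
    loss.backward()
    optimizer.step()
    return loss.item()
```
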

annotated people: 330165
ShanghaiTech [Zhang et al, 2016]. This dataset contains 1,198 images of unconstrained scenes with a total of 330,165 annotated people. It is divided into two parts: Part A with 482 images crawled from the Internet, and Part B with 716 images taken from busy shopping streets.

References
  • [Abadi et al., 2016] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, et al. TensorFlow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467, 2016.
  • [Boominathan et al., 2016] Lokesh Boominathan, Srinivas S. S. Kruthiventi, and R. Venkatesh Babu. CrowdNet: A deep convolutional network for dense crowd counting. In ACM MM, pages 640–644, 2016.
  • [Chan et al., 2008] Antoni B. Chan, Zhang-Sheng John Liang, and Nuno Vasconcelos. Privacy preserving crowd monitoring: Counting people without people models or tracking. In CVPR, pages 1–7, 2008.
  • [Chen et al., 2012] Ke Chen, Chen Change Loy, Shaogang Gong, and Tony Xiang. Feature mining for localised crowd counting. In BMVC, volume 1, page 3, 2012.
  • [Chen et al., 2013] Ke Chen, Shaogang Gong, Tao Xiang, and Chen Change Loy. Cumulative attribute space for age and crowd density estimation. In CVPR, pages 2467–2474, 2013.
  • [Chen et al., 2016] Tianshui Chen, Liang Lin, Lingbo Liu, Xiaonan Luo, and Xuelong Li. DISC: Deep image saliency computing via progressive representation learning. IEEE TNNLS, 27(6):1135–1149, 2016.
  • [Chen et al., 2017] Tianshui Chen, Zhouxia Wang, Guanbin Li, and Liang Lin. Recurrent attentional reinforcement learning for multi-label image recognition. arXiv preprint arXiv:1712.07465, 2017.
  • [Hochreiter and Schmidhuber, 1997] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
  • [Idrees et al., 2013] Haroon Idrees, Imran Saleemi, Cody Seibert, and Mubarak Shah. Multi-source multi-scale counting in extremely dense crowd images. In CVPR, pages 2547–2554, 2013.
  • [Jaderberg et al., 2015] Max Jaderberg, Karen Simonyan, Andrew Zisserman, et al. Spatial transformer networks. In NIPS, pages 2017–2025, 2015.
  • [Kingma and Ba, 2014] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • [Kuen et al., 2016] Jason Kuen, Zhenhua Wang, and Gang Wang. Recurrent attentional networks for saliency detection. arXiv preprint arXiv:1604.03227, 2016.
  • [Lempitsky and Zisserman, 2010] Victor Lempitsky and Andrew Zisserman. Learning to count objects in images. In NIPS, 2010.
  • [Li et al., 2017] Guanbin Li, Yuan Xie, Liang Lin, and Yizhou Yu. Instance-level salient object segmentation. In CVPR, pages 247–256, 2017.
  • [Onoro-Rubio and Lopez-Sastre, 2016] Daniel Oñoro-Rubio and Roberto J. López-Sastre. Towards perspective-free object counting with deep learning. In ECCV, 2016.
  • [Pham et al., 2015] Viet-Quoc Pham, Tatsuo Kozakaya, Osamu Yamaguchi, and Ryuzo Okada. COUNT Forest: Co-voting uncertain number of targets using random forest for crowd density estimation. In ICCV, pages 3253–3261, 2015.
  • [Sam et al., 2017] Deepak Babu Sam, Shiv Surya, and R. Venkatesh Babu. Switching convolutional neural network for crowd counting. In CVPR, volume 1, page 6, 2017.
  • [Shang et al., 2016] Chong Shang, Haizhou Ai, and Bo Bai. End-to-end crowd counting via joint learning local and global count. In ICIP, pages 1215–1219, 2016.
  • [Sindagi and Patel, 2017a] Vishwanath A. Sindagi and Vishal M. Patel. CNN-based cascaded multi-task learning of high-level prior and density estimation for crowd counting. In AVSS, pages 1–6, 2017.
  • [Sindagi and Patel, 2017b] Vishwanath A. Sindagi and Vishal M. Patel. Generating high-quality crowd density maps using contextual pyramid CNNs. In ICCV, pages 1879–1888, 2017.
  • [Walach and Wolf, 2016] Elad Walach and Lior Wolf. Learning to count with CNN boosting. In ECCV, pages 660–676, 2016.
  • [Wang et al., 2017] Zhouxia Wang, Tianshui Chen, Guanbin Li, Ruijia Xu, and Liang Lin. Multi-label image recognition by recurrently discovering attentional regions. In ICCV, 2017.
  • [Xiong et al., 2017] Feng Xiong, Xingjian Shi, and Dit-Yan Yeung. Spatiotemporal modeling for crowd counting in videos. In ICCV, pages 5161–5169, 2017.
  • [Zhang et al., 2015] Cong Zhang, Hongsheng Li, Xiaogang Wang, and Xiaokang Yang. Cross-scene crowd counting via deep convolutional neural networks. In CVPR, pages 833–841, 2015.
  • [Zhang et al., 2016] Yingying Zhang, Desen Zhou, Siqin Chen, Shenghua Gao, and Yi Ma. Single-image crowd counting via multi-column convolutional neural network. In CVPR, pages 589–597, 2016.
  • [Zhang et al., 2017] Shanghang Zhang, Guanhang Wu, Joao P. Costeira, and Jose M. F. Moura. FCN-rLSTM: Deep spatio-temporal neural networks for vehicle counting in city cameras. In ICCV, pages 3687–3696, 2017.
  • [Zhu et al., 2017] Feng Zhu, Hongsheng Li, Wanli Ouyang, Nenghai Yu, and Xiaogang Wang. Learning spatial regularization with image-level supervisions for multi-label image classification. In CVPR, pages 5513–5522, 2017.