# Spatially Adaptive Inference with Stochastic Feature Sampling and Interpolation

European Conference on Computer Vision, pp. 531–548 (2020)

Abstract

In the feature maps of CNNs, there commonly exists considerable spatial redundancy that leads to much repetitive processing. Towards reducing this superfluous computation, we propose to compute features only at sparsely sampled locations, which are probabilistically chosen according to activation responses, and then densely reconstruct the…

Introduction

- On many computer vision tasks, significant improvements in accuracy have been achieved through increasing model capacity in convolutional neural networks (CNNs) [12,32].
- A common approach to reducing the resulting computational cost is to prune weights and neurons that are not needed to maintain the network's performance [19,11,10,34,14,20,27,38]
- Orthogonal to these architectural changes are methods that eliminate computation at inference time conditioned on the input.
- As illustrated in Fig. 1(b), this approach deterministically samples predicted foreground areas while avoiding computational expenditure on the background

Highlights

- On many computer vision tasks, significant improvements in accuracy have been achieved through increasing model capacity in convolutional neural networks (CNNs) [12,32]
- A common approach to reducing the resulting computational cost is to prune weights and neurons that are not needed to maintain the network's performance [19,11,10,34,14,20,27,38]. Orthogonal to these architectural changes are methods that eliminate computation at inference time conditioned on the input. These techniques are typically based on feature map sparsity, where the locations of zero-valued activations are predicted so that the computation at those positions can be skipped [7,30,1]
- We present a stochastic sampling and interpolation scheme to avoid expensive computation at spatial locations that can be effectively interpolated
- To overcome the challenge of training binary decision variables for representing discrete sampling locations, Gumbel-Softmax is introduced to our sampling module
- The effectiveness of this approach is verified on a variety of computer vision tasks
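As a rough illustration of the Gumbel-Softmax relaxation mentioned above, the following is a minimal NumPy sketch (not the paper's code) of drawing a relaxed binary spatial mask from per-location logits; the function name, the logits input, and the temperature are illustrative assumptions:

```python
import numpy as np

def gumbel_softmax_mask(logits, tau=1.0, rng=None):
    """Relaxed binary sampling of a spatial mask (a sketch, not the paper's module).

    logits: (H, W) array of per-location sampling logits.
    Returns a soft mask in (0, 1); thresholding at 0.5 gives discrete samples.
    """
    rng = np.random.default_rng() if rng is None else rng
    # Gumbel(0, 1) noise for the "sample" and "skip" options at each location.
    g1 = -np.log(-np.log(rng.uniform(size=logits.shape)))
    g0 = -np.log(-np.log(rng.uniform(size=logits.shape)))
    # Two-class Gumbel-Softmax reduces to a sigmoid of the perturbed logit gap.
    return 1.0 / (1.0 + np.exp(-((logits + g1) - g0) / tau))
```

Because the sigmoid is differentiable in `logits`, gradients can flow through the sampling decision during training, which is the point of the trick.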

Methods

- Across (a) object detection, (b) semantic segmentation, and (c) image classification, the method achieves comparable accuracy with fewer FLOPs. Compared with SFP [13] and FPGM [14], the approach obtains a smaller accuracy drop at similar FLOPs. The authors further remove the interpolation module from the method and fill the features of unsampled points with 0.
- Results show that removing interpolation does not affect performance on the ImageNet validation set, in contrast to object detection and semantic segmentation.
- For the image classification task, reconstructing the features of unsampled points by interpolation is therefore not important
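The ablation above fills unsampled points with 0; the full model instead reconstructs them from sampled neighbours. A toy stand-in for that reconstruction, assuming nearest-sampled-neighbour filling rather than the paper's learned interpolation kernel (the function name and array shapes are illustrative):

```python
import numpy as np

def reconstruct_sparse(feat, mask):
    """Fill unsampled positions from their nearest sampled neighbour.

    feat: (H, W) feature values (only entries where mask is True are valid).
    mask: boolean (H, W) array marking the sampled locations (at least one True).
    """
    ys, xs = np.nonzero(mask)
    out = np.zeros_like(feat)
    for i in range(feat.shape[0]):
        for j in range(feat.shape[1]):
            # squared distance to every sampled point; copy the nearest feature
            k = np.argmin((ys - i) ** 2 + (xs - j) ** 2)
            out[i, j] = feat[ys[k], xs[k]]
    return out
```

Sampled positions are their own nearest neighbour, so their features pass through unchanged; only the unsampled positions are approximated.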

Results

Results are shown in Fig. 4(b), with the numerical results given in the Appendix. For the proposed method and deterministic Gumbel-Softmax, the authors draw curves for different sparse loss weights {0.3, 0.2, 0.1, 0.05}.

- Λ is quite large in most cases, which means the effect of the interpolation module is limited
- This phenomenon is consistent with the "w/o Interp" entry in Table 4, where the results without interpolation are almost identical to the full model, further indicating that interpolation is not important for image classification.
- The reason is still unclear, but the authors suspect this phenomenon may be related to the receptive field of the operators

Conclusion

- A method for reducing computation in convolutional networks was proposed that exploits the intrinsic sparsity and spatial redundancy in feature maps.
- The authors present a stochastic sampling and interpolation scheme to avoid expensive computation at spatial locations that can be effectively interpolated.
- To overcome the challenge of training binary decision variables for representing discrete sampling locations, Gumbel-Softmax is introduced to the sampling module.
- The effectiveness of this approach is verified on a variety of computer vision tasks


- Table1: Comparison of different interpolation kernels on COCO2017 validation
- Table2: Validation of the interpolation module on COCO2017 validation
- Table3: Comparison of different grid prior settings (s = 9, s = 11, s = 13, and w/o grid prior) on COCO2017 validation
- Table4: Performance comparison on the ImageNet validation set. All the methods are based on ResNet-34. Our models are trained with a loss weight of 0.01 and 0.015 to achieve accuracy or FLOPs similar to other methods for fair comparison. “w/o Interp” indicates removing the interpolation module and filling the features of unsampled positions with 0
- Table5: Comparison of theoretical and realistic speedups on E5-2650 and I7-6650U. Baseline model is trained and evaluated on images with a shorter side of 1000 pixels. The CPU run-time is calculated on the COCO2017 validation set
- Table6: The numerical results of Fig. 4 (a) in the main paper. Experiments are conducted on object detection (COCO2017 validation)
- Table7: Numerical results of Fig. 4 (b) in the main paper. Experiments are conducted on semantic segmentation (Cityscapes validation)
- Table8: Evaluation of inference stability on object detection (COCO2017 validation)
- Table9: Numerical results of Fig. 6. ResNeXt is chosen as the backbone model. Experiments are conducted on object detection (COCO2017 validation)

Related work

In this section, we briefly review related approaches for reducing computation in convolutional neural networks.

**Model pruning.** A widely investigated approach for improving network efficiency is to remove connections or filters that are unimportant for achieving high performance. The importance of these network elements has been approximated in various ways, including by connection weight magnitudes [11,10], filter norms [20,34,13], and filter redundancy within a layer [14]. To reflect network sensitivity to these elements, importance has also been measured based on their effects on the loss [19,27] or the reconstruction error of the final response layer [38] when removing them. Alternatively, sparsity learning techniques identify what to prune in conjunction with network training, through constraints that zero out some filters [34], cause some filters to become redundant and removable [6], scale some filter or block outputs to zero [22], or sparsify batch normalization scaling factors [25,36]. Model pruning techniques as well as other architecture-based acceleration schemes, such as low-rank factorizations of convolutional filters [17] and knowledge distillation of networks [16], are orthogonal to our approach and could potentially be employed in a complementary manner.

**Early stopping.** Rather than prune network elements, early stopping techniques reduce computation by skipping the processing at later stages whenever it is deemed to be unnecessary. In [8], an adaptive number of ResNet layers are skipped within a residual block for unimportant regions in object classification. The skipping mechanism is controlled by halting scores predicted at branches to the output of each residual unit. In [21], a deep model for semantic segmentation is turned into a cascade of sub-models where earlier sub-models handle easy regions and harder cases are progressively fed forward to the next sub-model for further processing. Like our method, these techniques spatially adapt the processing to the input content. However, they process all spatial positions at least to some degree, which limits the achievable computational savings.

**Activation sparsity.** The activations of rectified linear units (ReLUs) are commonly sparse. This property has been exploited for network acceleration by excluding the zero values from subsequent convolutions [31,28]. This approach has been extended by estimating the activation sparsity and skipping the computation for predicted insignificant activations. The sparsity has been predicted from prior knowledge of road and sidewalk locations in autonomous driving applications [30], from model-predicted foreground masks at low resolution [30], from a small auxiliary layer that supplements each convolutional layer [7], and from a highly quantized version of the convolutional layer [1]. Our work instead reconstructs activation maps by interpolation from a sparse set of samples selected in a content-aware fashion, thus avoiding computation at locations where features can be easily reconstructed. Moreover, our probabilistic sampling distributes computation among feature map locations with varying levels of predicted activation, providing greater robustness to activation prediction errors.

**Sparse sampling.** To reduce processing cost, PerforatedCNNs compute only sparse samples of a convolutional layer's outputs and interpolate the remaining values [9]. The sampling follows a predefined pattern, and the interpolation is done by nearest neighbors. Our method also takes a sparse sampling and interpolation approach, but in contrast to the input-independent sampling and generic interpolation of PerforatedCNNs, the sampling in our network is adaptively determined from the input such that the sampling density reflects predicted activation values, and the interpolation parameters are learned. As shown later in the experiments, this approach allows for much greater sparsity in the sampling.

**Gumbel-based selection.** Random selection based on the Gumbel distribution has been used in making discrete decisions for network acceleration. The Gumbel-Softmax trick was utilized in adaptively choosing network layers to apply on an input image [33] and in selecting channels or layers to skip [15]. In contrast to these techniques, which determine computation based on image-level semantics for image classification, our sampling is driven by the spatial organization of features and is geared towards accurately reconstructing positional content. As a result, our method is well-suited to spatial understanding tasks such as object detection and semantic segmentation.

Funding

- In our experiments, the sparsity of M can be greater than 70% on average
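The sparsity figure above gives a rough sense of the potential savings: a k × k convolution evaluated at only a fraction of its output positions scales its multiply-adds by that fraction. A back-of-the-envelope sketch, where the layer shape and the 30% density (i.e., ~70% of M zero) are illustrative assumptions and the sampling module's own overhead is ignored:

```python
def conv_flops(h, w, cin, cout, k, density=1.0):
    """Multiply-add count for a k x k convolution evaluated at only a
    `density` fraction of the h x w output positions (toy estimate)."""
    return int(h * w * density * cin * cout * k * k)

# Hypothetical ResNet-style layer: 56x56 output, 64 -> 64 channels, 3x3 kernel.
full = conv_flops(56, 56, 64, 64, 3)
sparse = conv_flops(56, 56, 64, 64, 3, density=0.3)  # ~70% of positions skipped
```

In practice, realizing this theoretical reduction requires sparse-convolution kernels, which is why the paper reports both theoretical and realistic speedups (Table 5).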

References

- Cao, S., Ma, L., Xiao, W., Zhang, C., Liu, Y., Zhang, L., Nie, L., Yang, Z.: Seernet: Predicting convolutional neural network feature-map sparsity through low-bit quantization. In: CVPR (2019)
- Chen, K., Pang, J., Wang, J., Xiong, Y., Li, X., Sun, S., Feng, W., Liu, Z., Shi, J., Ouyang, W., Loy, C.C., Lin, D.: mmdetection. https://github.com/open-mmlab/mmdetection (2018)
- Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR (2016)
- Dai, J., Qi, H., Xiong, Y., Li, Y., Zhang, G., Hu, H., Wei, Y.: Deformable convolutional networks. In: CVPR (2017)
- Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: CVPR (2009)
- Ding, X., Ding, G., Guo, Y., Han, J.: Centripetal sgd for pruning very deep convolutional networks with complicated structure. In: CVPR (2019)
- Dong, X., Huang, J., Yang, Y., Yan, S.: More is less: A more complicated network with less inference complexity. In: CVPR (2017)
- Figurnov, M., Collins, M.D., Zhu, Y., Zhang, L., Huang, J., Vetrov, D., Salakhutdinov, R.: Spatially adaptive computation time for residual networks. In: CVPR (2017)
- Figurnov, M., Ibraimova, A., Vetrov, D., Kohli, P.: Perforatedcnns: Acceleration through elimination of redundant convolutions. In: NIPS (2016)
- Guo, Y., Yao, A., Chen, Y.: Dynamic network surgery for efficient dnns. In: NIPS (2016)
- Han, S., Pool, J., Tran, J., Dally, W.J.: Learning both weights and connections for efficient neural network. In: NIPS (2015)
- He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR (2016)
- He, Y., Kang, G., Dong, X., Fu, Y., Yang, Y.: Soft filter pruning for accelerating deep convolutional neural networks. arXiv preprint arXiv:1808.06866 (2018)
- He, Y., Liu, P., Wang, Z., Hu, Z., Yang, Y.: Filter pruning via geometric median for deep convolutional neural networks acceleration. In: CVPR (2019)
- Herrmann, C., Bowen, R.S., Zabih, R.: An end-to-end approach for speeding up neural network inference. arXiv preprint arXiv:1812.04180v3 (2019)
- Hinton, G., Vinyals, O., Dean, J.: Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531 (2015)
- Jaderberg, M., Vedaldi, A., Zisserman, A.: Speeding up convolutional neural networks with low rank expansions. In: BMVC (2014)
- Jang, E., Gu, S., Poole, B.: Categorical reparameterization with gumbel-softmax. In: ICLR (2017)
- LeCun, Y., Denker, J.S., Solla, S.A.: Optimal brain damage. In: NIPS (1989)
- Li, H., Kadav, A., Durdanovic, I., Samet, H., Graf, H.P.: Pruning filters for efficient convnets. In: ICLR (2017)
- Li, X., Liu, Z., Luo, P., Loy, C.C., Tang, X.: Not all pixels are equal: Difficultyaware semantic segmentation via deep layer cascade. In: CVPR (2017)
- Lin, S., Ji, R., Yan, C., Zhang, B., Cao, L., Ye, Q., Huang, F., Doermann, D.: Towards optimal structured cnn pruning via generative adversarial learning. In: CVPR (2019)
- Lin, T.Y., Dollar, P., Girshick, R., He, K., Hariharan, B., Belongie, S.: Feature pyramid networks for object detection. In: CVPR (2017)
- Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollar, P., Zitnick, C.L.: Microsoft coco: Common objects in context. In: ECCV (2014)
- Liu, Z., Li, J., Shen, Z., Huang, G., Yan, S., Zhang, C.: Learning efficient convolutional networks through network slimming. In: ICCV (2017)
- Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR (2015)
- Molchanov, P., Mallya, A., Tyree, S., Frosio, I., Kautz, J.: Importance estimation for neural network pruning. In: CVPR (2019)
- Parashar, A., Rhu, M., Mukkara, A., Puglielli, A., Venkatesan, R., Khailany, B., Emer, J., Keckler, S.W., Dally, W.J.: Scnn: An accelerator for compressed-sparse convolutional neural networks. In: International Symposium on Computer Architecture (2017)
- Peng, C., Xiao, T., Li, Z., Jiang, Y., Zhang, X., Jia, K., Yu, G., Sun, J.: Megdet: A large mini-batch object detector. In: CVPR (2018)
- Ren, M., Pokrovsky, A., Yang, B., Urtasun, R.: Sbnet: Sparse blocks network for fast inference. In: CVPR (2018)
- Shi, S., Chu, X.: Speeding up convolutional neural networks by exploiting the sparsity of rectifier units. arXiv preprint arXiv:1704.07724 (2017)
- Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., Rabinovich, A.: Going deeper with convolutions. In: CVPR (2015)
- Veit, A., Belongie, S.: Convolutional networks with adaptive inference graphs. In: ECCV (2017)
- Wen, W., Wu, C., Wang, Y., Chen, Y., Li, H.: Learning structured sparsity in deep neural networks. In: NIPS (2016)
- Xie, S., Girshick, R., Dollar, P., Tu, Z., He, K.: Aggregated residual transformations for deep neural networks. In: CVPR (2017)
- Ye, J., Lu, X., Lin, Z., Wang, J.Z.: Rethinking the smaller-norm-less-informative assumption in channel pruning of convolution layers. In: ICLR (2018)
- You, A., Li, X., Zhu, Z., Tong, Y.: Torchcv: A pytorch-based framework for deep learning in computer vision. https://github.com/donnyyou/torchcv (2019)
- Yu, R., Li, A., Chen, C., Lai, J., Morariu, V.I., Han, X., Gao, M., Lin, C., Davis, L.S.: NISP: Pruning networks using neuron importance score propagation. In: CVPR (2018)
- Zhu, X., Hu, H., Lin, S., Dai, J.: Deformable convnets v2: More deformable, better results. In: CVPR (2019)
