# Deformable Convolutional Networks

ICCV, 2017.

Abstract:

Convolutional neural networks (CNNs) are inherently limited in modeling geometric transformations due to the fixed geometric structures in their building modules. In this work, we introduce two new modules to enhance the transformation modeling capability of CNNs, namely, deformable convolution and deformable RoI pooling. Both are based on the …

Summary

- A key challenge in visual recognition is how to accommodate geometric variations or model geometric transformations in object scale, pose, viewpoint, and part deformation.
- The offsets are learned from the preceding feature maps, via additional convolutional layers.
- Both the convolutional kernels for generating the output features and the offsets are learned simultaneously.
- Both deformable convolution and RoI pooling modules have the same input and output as their plain versions.
- This work is built on the idea of augmenting the spatial sampling locations in convolution and RoI pooling with additional offsets and learning the offsets from target tasks.
- The receptive field and the sampling locations in the standard convolution are fixed all over the top feature map.
- The offset learning in deformable convolution can be considered as an extremely light-weight spatial transformer in STN [23].
- The offsets in deformable convolution are dynamic model outputs that vary per image location.
- They model the dense spatial transformations in the images and are effective for dense prediction tasks such as object detection and semantic segmentation.
- Deformable convolution is capable of learning receptive fields adaptively, as shown in Figure 5, 6 and Table 2.
- Deformable RoI pooling is similar to Deformable Part Models (DPM) [10], because both methods learn the spatial deformation of object parts to maximize the classification score.
- Some works learn invariant CNN representations with respect to different types of transformations such as [45], scattering networks [2], convolutional jungles [28], and TI-pooling [29].
- In Faster R-CNN and R-FCN, 256 and 128 RoIs are sampled for the region proposal and the object detection networks, respectively.
- Faster R-CNN and R-FCN are trained without feature sharing between the region proposal and the object detection networks.
- Accuracy steadily improves when more deformable convolution layers are used, especially for DeepLab and class-aware RPN.
- We empirically observed that the learned offsets in the deformable convolution layers are highly adaptive to the image content, as illustrated in Figure 5 and Figure 6.
- It shows that: 1) accuracy increases for all tasks when using larger dilation values, indicating that the default networks have too small receptive fields; 2) the optimal dilation values vary for different tasks, e.g., 6 for DeepLab but 4 for Faster R-CNN; 3) deformable convolution has the best accuracy.
- The deformable versions of class-aware RPN, Faster R-CNN and R-FCN achieve mAP@[0.5:0.95] scores of 25.8%, 33.1%, and 34.5% respectively, which are relatively 11%, 13%, and 12% higher than their plain-ConvNets counterparts.
- By further testing on multiple image scales and performing iterative bounding box averaging [12], the mAP@[0.5:0.95] score is increased to 37.5% for the deformable version of R-FCN.
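The core mechanism the summary describes, sampling each kernel tap at a learned fractional offset via bilinear interpolation, can be sketched as follows. This is not the authors' implementation: it is a minimal single-channel NumPy illustration with stride 1; in the paper the offsets are produced by a parallel convolutional layer over multi-channel tensors, and gradients flow back through the bilinear kernel.

```python
import numpy as np

def bilinear_sample(feat, y, x):
    """Bilinearly sample feat (H x W) at fractional (y, x); zero outside the map."""
    H, W = feat.shape
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    val = 0.0
    for dy in (0, 1):
        for dx in (0, 1):
            yi, xi = y0 + dy, x0 + dx
            if 0 <= yi < H and 0 <= xi < W:
                # weights fall off linearly with distance from the sample point
                val += (1 - abs(y - yi)) * (1 - abs(x - xi)) * feat[yi, xi]
    return val

def deformable_conv2d(feat, kernel, offsets):
    """Single-channel deformable convolution (illustrative sketch).

    feat:    (H, W) input feature map
    kernel:  (k, k) weights
    offsets: (H, W, k*k, 2) per-location, per-tap (dy, dx) displacements;
             in the paper these are predicted by an extra conv layer.
    """
    H, W = feat.shape
    k = kernel.shape[0]
    r = k // 2
    out = np.zeros((H, W))
    for y in range(H):
        for x in range(W):
            s = 0.0
            for i in range(k):
                for j in range(k):
                    dy, dx = offsets[y, x, i * k + j]
                    # regular grid position plus the learned fractional offset
                    s += kernel[i, j] * bilinear_sample(
                        feat, y + i - r + dy, x + j - r + dx)
            out[y, x] = s
    return out
```

With all offsets zero this reduces to a plain convolution over the regular grid, which is why the deformable module can drop into a pretrained network with the same input and output as its plain version.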

Highlights

- Introduces two new modules, deformable convolution and deformable RoI pooling, that greatly enhance the geometric transformation modeling capability of CNNs
- Shows that learning dense spatial transformations in deep CNNs is effective for sophisticated vision tasks such as object detection and semantic segmentation
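The second module can be sketched in the same spirit. The following is a simplified NumPy illustration of deformable RoI pooling, not the paper's implementation: each pooling bin is shifted by a per-bin offset before average pooling; in the paper the offsets are predicted by a fully connected layer and normalized by the RoI size, and pooling uses bilinear sampling.

```python
import numpy as np

def deformable_roi_pool(feat, roi, k, offsets):
    """Average-pool an RoI into a k x k grid, shifting each bin by a learned offset.

    feat:    (H, W) feature map
    roi:     (y0, x0, y1, x1) region, half-open coordinates
    k:       output grid size
    offsets: (k, k, 2) per-bin (dy, dx), here already in pixels
             (the paper predicts normalized offsets from an fc layer).
    """
    y0, x0, y1, x1 = roi
    bin_h = (y1 - y0) / k
    bin_w = (x1 - x0) / k
    H, W = feat.shape
    out = np.zeros((k, k))
    for i in range(k):
        for j in range(k):
            dy, dx = offsets[i, j]
            # bin boundaries on the regular grid, displaced by the learned offset
            ys = max(int(round(y0 + i * bin_h + dy)), 0)
            ye = min(int(round(y0 + (i + 1) * bin_h + dy)), H)
            xs = max(int(round(x0 + j * bin_w + dx)), 0)
            xe = min(int(round(x0 + (j + 1) * bin_w + dx)), W)
            patch = feat[ys:ye, xs:xe]
            out[i, j] = patch.mean() if patch.size else 0.0
    return out
```

With zero offsets this is ordinary RoI average pooling, so, as with deformable convolution, the module keeps the same input/output interface as its plain counterpart while letting each part of the region shift to follow object deformation.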
