Do Convnets Learn Correspondence?
ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 27 (NIPS 2014), pp. 1601-1609, 2014.
Despite their large receptive fields and weak label training, we have found in all cases that convnet features are at least as useful as conventional ones for extracting local visual information.
Convolutional neural nets (convnets) trained from massive labeled datasets have substantially improved the state-of-the-art in image classification and object detection. However, visual understanding requires establishing correspondence on a finer level than object category. Given their large pooling regions and training from ...
- Recent advances in convolutional neural nets dramatically improved the state-of-the-art in image classification.
- The feature representations learned from large data sets have been found to generalize well to other image classification tasks and even to object detection [3, 21].
- We approach this difficult task in the style of SIFT flow: we retrieve near neighbors using a coarse similarity measure, and compute dense correspondences on which we impose an MRF smoothness prior which allows all images to be warped into alignment.
- Since we are testing the quality of alignment, we use the same nearest neighbors for convnet and conventional features, and we compute both types of features at the same locations: the grid of convnet receptive field centers in the response to a single image.
- Based on its performance we use conv4 as our convnet feature, and SIFT with descriptor radius 20 as our conventional feature.
- Figure 3 gives examples of alignment quality for a few different seed images, using both SIFT and convnet features.
- Convnet learned features are at least as capable as SIFT at alignment, and better than might have been expected given the size of their receptive fields.
- While the SIFT classifiers do not seem to be sensitive to the precise locations of the keypoints, in many cases the convnet ones seem to be capable of localization finer than their strides, not just their receptive field sizes.
- We have seen that, despite their large receptive field sizes, convnets work as well as the hand-engineered feature SIFT for alignment and slightly better than SIFT for keypoint classification.
- The mean of each Gaussian is taken to be the location of the keypoint in the nearest neighbor in the training set found using cosine similarity on pool5 features, and we use a fixed standard deviation of 22 pixels.
- We can see that local part detectors trained on the conv5 feature outperform SIFT by a large margin and that the prior information is helpful in both cases.
- Each set consists of rescaled bounding box images with ground truth keypoint annotations and predicted keypoints using SIFT and conv5 features, where each color corresponds to one keypoint.
- Through visualization, alignment, and keypoint prediction, we have studied the ability of the intermediate features implicitly learned in a state-of-the-art convnet classifier to understand specific, local correspondence.
- Despite their large receptive fields and weak label training, we have found in all cases that convnet features are at least as useful as conventional ones for extracting local visual information.
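
The keypoint prior mentioned in the highlights (a Gaussian whose mean is the keypoint location in the nearest training neighbor under cosine similarity, with a fixed standard deviation of 22 pixels) can be illustrated with a minimal sketch. This is not the authors' code. The function names, the plain-list feature vectors standing in for pool5 features, and the choice to add the log prior to a per-location classifier score are all assumptions made for illustration; the summary does not say how the prior and the classifier output are combined.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def nearest_neighbor(query_feat, train_feats):
    """Index of the training feature most similar to the query."""
    return max(
        range(len(train_feats)),
        key=lambda i: cosine_similarity(query_feat, train_feats[i]),
    )


def log_gaussian_prior(loc, mean, sigma=22.0):
    """Unnormalized log-density of an isotropic 2D Gaussian at loc."""
    dx = loc[0] - mean[0]
    dy = loc[1] - mean[1]
    return -(dx * dx + dy * dy) / (2.0 * sigma * sigma)


def predict_keypoint(candidates, scores, query_feat,
                     train_feats, train_keypoints, sigma=22.0):
    """Pick the candidate location maximizing score plus log prior.

    ASSUMPTION: scores are log-domain classifier outputs, one per
    candidate location, and are combined with the prior by addition.
    """
    nn = nearest_neighbor(query_feat, train_feats)
    mean = train_keypoints[nn]  # prior is centered on the neighbor's keypoint
    best = max(
        range(len(candidates)),
        key=lambda i: scores[i] + log_gaussian_prior(candidates[i], mean, sigma),
    )
    return candidates[best]
```

With two training images whose keypoints sit at (10, 10) and (100, 100), a query whose features resemble the first image will favor candidate locations near (10, 10), even if a distant location has a somewhat higher classifier score.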