OverFeat: Integrated Recognition, Localization and Detection using Convolutional Networks

International Conference on Learning Representations (ICLR), 2014.


Abstract:

We present an integrated framework for using Convolutional Networks for classification, localization and detection. We show how a multiscale and sliding window approach can be efficiently implemented within a ConvNet. We also introduce a novel deep learning approach to localization by learning to predict object boundaries. Bounding boxes are then accumulated rather than suppressed in order to increase detection confidence. We show that different tasks can be learned simultaneously using a single shared network. This integrated framework is the winner of the localization task of the ImageNet Large Scale Visual Recognition Challenge 2013 (ILSVRC2013) and obtained very competitive results for the detection and classification tasks. In post-competition work, we establish a new state of the art for the detection task. Finally, we release a feature extractor from our best model, called OverFeat.
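The sliding-window efficiency claimed in the abstract rests on viewing fully-connected layers as 1x1 convolutions (see the Table 1 caption below), so the trained network can be applied convolutionally to images larger than its training crop, producing a spatial map of predictions in one pass. A minimal PyTorch sketch of this conversion; the layer sizes here are toy values, not the paper's architecture:

```python
import torch
import torch.nn as nn

# Toy feature extractor (illustrative sizes, not the paper's fast model).
features = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=7, stride=2), nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(64, 128, kernel_size=3), nn.ReLU(),
    nn.MaxPool2d(2),
)

# "Fully-connected" classifier written as convolutions: a 6x6 conv absorbs
# the flatten step, then 1x1 convs stand in for the remaining FC layers.
classifier = nn.Sequential(
    nn.Conv2d(128, 512, kernel_size=6), nn.ReLU(),
    nn.Conv2d(512, 512, kernel_size=1), nn.ReLU(),
    nn.Conv2d(512, 1000, kernel_size=1),
)

net = nn.Sequential(features, classifier)

small = torch.randn(1, 3, 63, 63)   # training-size crop
large = torch.randn(1, 3, 95, 95)   # larger test image
print(net(small).shape)  # torch.Size([1, 1000, 1, 1]): a single prediction
print(net(large).shape)  # torch.Size([1, 1000, 5, 5]): one prediction per
                         # 63x63 window, at an effective stride of 8 pixels
```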

Introduction
  • Recognizing the category of the dominant object in an image is a task to which Convolutional Networks (ConvNets) [17] have been applied for many years, whether the objects were handwritten characters [16], house numbers [24], textureless toys [18], traffic signs [3, 26], objects from the Caltech-101 dataset [14], or objects from the 1000-category ImageNet dataset [15].
  • The main point of this paper is to show that training a convolutional network to simultaneously classify, locate and detect objects in images can boost the classification accuracy as well as the detection and localization accuracy of all tasks (a minimal sketch of this shared design follows after this list).
  • Not training on background lets the network focus solely on positive classes for higher accuracy.
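As a rough picture of the shared multi-task design referenced above, the sketch below (PyTorch, toy sizes) attaches a classification head and a box-regression head to one feature-extraction trunk. This is only illustrative: in the paper the regression network is trained on top of already-trained, frozen feature layers, with class-specific regressors rather than the single 4-output head shown here.

```python
import torch
import torch.nn as nn

class MultiTaskNet(nn.Module):
    """One shared trunk feeding a classifier head and a box-regression head."""
    def __init__(self, num_classes=1000):
        super().__init__()
        self.trunk = nn.Sequential(              # shared feature extraction base
            nn.Conv2d(3, 64, kernel_size=7, stride=2), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(64, 128, kernel_size=3), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.cls_head = nn.Linear(128, num_classes)  # per-class scores
        self.box_head = nn.Linear(128, 4)            # one (x1, y1, x2, y2) box

    def forward(self, x):
        f = self.trunk(x)
        return self.cls_head(f), self.box_head(f)

net = MultiTaskNet()
scores, boxes = net(torch.randn(2, 3, 63, 63))   # shapes (2, 1000) and (2, 4)
```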
Highlights
  • Recognizing the category of the dominant object in an image is a task to which Convolutional Networks (ConvNets) [17] have been applied for many years, whether the objects were handwritten characters [16], house numbers [24], textureless toys [18], traffic signs [3, 26], objects from the Caltech-101 dataset [14], or objects from the 1000-category ImageNet dataset [15]
  • In Fig. 11, we report the results of the ILSVRC 2013 competition, where our detection system ranked 3rd with 19.4% mean average precision (mAP)
  • We have presented a multi-scale, sliding window approach that can be used for classification, localization and detection
  • A second important contribution of our paper is explaining how ConvNets can be effectively used for detection and localization tasks
  • We have proposed an integrated pipeline that can perform different tasks while sharing a common feature extraction base, entirely learned directly from the pixels
Methods
  • Experiments are conducted on the ImageNet ILSVRC 2012 and 2013 datasets and establish state-of-the-art results on the ILSVRC 2013 localization and detection tasks.

    While images from the ImageNet classification dataset are largely chosen to contain a roughly centered object that fills much of the image, objects of interest sometimes vary significantly in size and position within the image.
  • Many viewing windows may contain a perfectly identifiable portion of the object, but not the entire object or even its center; this leads to decent classification but poor localization and detection.
  • The authors apply their network to the ImageNet 2012 validation set using the localization criterion specified for the competition.
  • The results for this are shown in Fig. 9.
  • Adding a third and fourth scale further improves performance to 30.0% error (a sketch of this multi-scale inference follows after this list).
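A hedged sketch of the multi-scale inference step referenced above: the fully convolutional network is run at several input scales, the spatial max is taken per class at each scale, and the resulting vectors are averaged across scales. The `net` interface and the scale factors are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def multi_scale_scores(net, image, scales=(1.0, 1.2, 1.44)):
    # image: (1, 3, H, W), at least the training crop size at every scale;
    # net: fully convolutional, returning a (1, C, H', W') map of class scores.
    per_scale = []
    for s in scales:
        size = (int(image.shape[-2] * s), int(image.shape[-1] * s))
        resized = F.interpolate(image, size=size, mode="bilinear",
                                align_corners=False)
        score_map = net(resized)
        per_scale.append(score_map.amax(dim=(-2, -1)))  # spatial max per class
    return torch.stack(per_scale).mean(dim=0)           # average across scales
```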
Results
  • In Table 2, the authors experiment with different approaches and compare them to the single-network model of Krizhevsky et al. [15] for reference.
  • The authors report the test set results of the 2013 competition in Fig. 4, where the model (OverFeat) obtained a 14.2% error rate by voting of 7 ConvNets (see the ensemble sketch after this list) and ranked 5th out of 18 teams.
  • In post-competition work, the authors improve the OverFeat results down to 13.6% error by using bigger models.
  • Because these bigger models are not yet fully trained, further improvements are expected in time.
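The competition figure above relies on "voting" several ConvNets; one standard reading (the exact combination rule is an assumption here) is to average the per-model class distributions and take the top predictions:

```python
import torch

def ensemble_predict(models, image, k=5):
    # models: independently trained ConvNets, each returning (N, C) class logits.
    probs = torch.stack([m(image).softmax(dim=-1) for m in models])
    mean_probs = probs.mean(dim=0)        # average the per-model distributions
    return mean_probs.topk(k, dim=-1)     # top-k scores and class indices
```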
Conclusion
  • The authors have presented a multi-scale, sliding window approach that can be used for classification, localization and detection.
  • The authors applied it to the ILSVRC 2013 datasets, and it currently ranks 4th in classification, 1st in localization and 1st in detection.
  • A second important contribution of the paper is explaining how ConvNets can be effectively used for detection and localization tasks (a sketch of the box-accumulation idea follows below).
  • The authors have proposed an integrated pipeline that can perform different tasks while sharing a common feature extraction base, entirely learned directly from the pixels.
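For the detection and localization tasks mentioned above, the paper accumulates bounding-box predictions rather than suppressing them: compatible boxes are repeatedly merged and their confidences summed. A self-contained plain-Python sketch; the IoU threshold and the confidence-weighted merge rule are illustrative stand-ins for the paper's match score, which also uses the distance between box centers:

```python
def iou(a, b):
    # a, b: boxes as (x1, y1, x2, y2); returns intersection over union.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def merge_boxes(boxes, confs, iou_thresh=0.5):
    # Greedily merge overlapping boxes, accumulating their confidences.
    boxes, confs = list(boxes), list(confs)
    merged = True
    while merged and len(boxes) > 1:
        merged = False
        for i in range(len(boxes)):
            for j in range(i + 1, len(boxes)):
                if iou(boxes[i], boxes[j]) > iou_thresh:
                    wi, wj = confs[i], confs[j]
                    boxes[i] = tuple((wi * bi + wj * bj) / (wi + wj)
                                     for bi, bj in zip(boxes[i], boxes[j]))
                    confs[i] = wi + wj       # accumulate, don't suppress
                    del boxes[j], confs[j]
                    merged = True
                    break
            if merged:
                break
    return boxes, confs
```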
Tables
  • Table1: Architecture specifics for the fast model. The spatial size of the feature maps depends on the input image size, which varies during our inference step (see Table 5 in the Appendix). Here we show training spatial sizes. Layer 5 is the top convolutional layer. Subsequent layers are fully connected, and applied in sliding window fashion at test time. The fully-connected layers can also be seen as 1x1 convolutions in a spatial setting. Similar sizes for the accurate model can be found in the Appendix
  • Table2: Classification experiments on validation set. Fine/coarse stride refers to the number of ∆ values used when applying the classifier. Fine: ∆ = 0, 1, 2; coarse: ∆ = 0
  • Table3: Architecture specifics for accurate model. It differs from the fast model mainly in the stride of the first convolution, the number of stages and the number of feature maps
  • Table4: Number of parameters and connections for different models
  • Table5: Spatial dimensions of our multi-scale approach. 6 different sizes of input images are used, resulting in layer 5 unpooled feature maps of differing spatial resolution (although not indicated in the table, all have 256 feature channels). The (3x3) results from our dense pooling operation with (∆x, ∆y) = {0, 1, 2}. See the text and Fig. 3 for details on how these are converted into output maps (a sketch of this offset pooling follows below)
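The (3x3) dense pooling in Table 5 can be pictured as running the same 3x3, stride-3 max pooling at each pixel offset (∆x, ∆y) ∈ {0, 1, 2}², so nine interleaved pooled maps are produced instead of one and the classifier sees a finer effective stride. A minimal PyTorch sketch (`offset_pool` and its arguments are illustrative names):

```python
import torch.nn.functional as F

def offset_pool(feature_map, kernel=3, stride=3, offsets=(0, 1, 2)):
    # feature_map: (N, C, H, W) unpooled layer-5 activations.
    pooled = {}
    for dy in offsets:
        for dx in offsets:
            shifted = feature_map[:, :, dy:, dx:]    # shift before pooling
            pooled[(dy, dx)] = F.max_pool2d(shifted, kernel, stride)
    return pooled   # nine pooled maps, one per (dy, dx) offset
```

In the paper, these offset maps are interleaved back into a single, denser output map before the classifier layers are applied.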
Funding
  • Presents an integrated framework for using Convolutional Networks for classification, localization and detection
  • Shows how a multiscale and sliding window approach can be efficiently implemented within a ConvNet
  • Introduces a novel deep learning approach to localization by learning to predict object boundaries
  • Shows that different tasks can be learned simultaneously using a single shared network
References
  • J. Carreira, F. Li, and C. Sminchisescu. Object recognition by sequential figure-ground ranking. International Journal of Computer Vision, 98(3):243–262, 2012.
  • J. Carreira and C. Sminchisescu. Constrained parametric min-cuts for automatic object segmentation, release 1. http://sminchisescu.ins.uni-bonn.de/code/cpmc/.
  • D. C. Ciresan, U. Meier, and J. Schmidhuber. Multi-column deep neural networks for image classification. In CVPR, 2012.
  • M. Delakis and C. Garcia. Text detection with convolutional neural networks. In International Conference on Computer Vision Theory and Applications (VISAPP 2008), 2008.
  • J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: a large-scale hierarchical image database. In CVPR, 2009.
  • I. Endres and D. Hoiem. Category independent object proposals. In Computer Vision – ECCV 2010, pages 575–588.
  • C. Farabet, C. Couprie, L. Najman, and Y. LeCun. Learning hierarchical features for scene labeling. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2013. In press.
  • C. Garcia and M. Delakis. Convolutional face finder: a neural architecture for fast and robust face detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2004.
  • A. Giusti, D. C. Ciresan, J. Masci, L. M. Gambardella, and J. Schmidhuber. Fast image scanning with deep max-pooling convolutional neural networks. In International Conference on Image Processing (ICIP), 2013.
  • R. Hadsell, P. Sermanet, M. Scoffier, A. Erkan, K. Kavukcuoglu, U. Muller, and Y. LeCun. Learning long-range vision for autonomous off-road driving. Journal of Field Robotics, 26(2):120–144, February 2009.
  • G. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. arXiv:1207.0580, 2012.
  • G. E. Hinton, A. Krizhevsky, and S. D. Wang. Transforming auto-encoders. In Artificial Neural Networks and Machine Learning – ICANN 2011, pages 44–51. Springer Berlin Heidelberg, 2011.
  • V. Jain, J. F. Murray, F. Roth, S. Turaga, V. Zhigulin, K. Briggman, M. Helmstaedter, W. Denk, and H. S. Seung. Supervised learning of image restoration with convolutional networks. In ICCV, 2007.
  • K. Jarrett, K. Kavukcuoglu, M. Ranzato, and Y. LeCun. What is the best multi-stage architecture for object recognition? In Proc. International Conference on Computer Vision (ICCV'09). IEEE, 2009.
  • A. Krizhevsky, I. Sutskever, and G. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
  • Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Handwritten digit recognition with a back-propagation network. In D. Touretzky, editor, Advances in Neural Information Processing Systems (NIPS 1989), volume 2, Denver, CO, 1990. Morgan Kaufmann.
  • Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, November 1998.
  • Y. LeCun, F.-J. Huang, and L. Bottou. Learning methods for generic object recognition with invariance to pose and lighting. In Proceedings of CVPR'04. IEEE Press, 2004.
  • S. Manen, M. Guillaumin, and L. Van Gool. Prime object proposals with randomized Prim's algorithm. In International Conference on Computer Vision (ICCV), 2013.
  • O. Matan, J. Bromley, C. Burges, J. Denker, L. Jackel, Y. LeCun, E. Pednault, W. Satterfield, C. Stenard, and T. Thompson. Reading handwritten digits: a zip code recognition system. IEEE Computer, 25(7):59–63, July 1992.
  • F. Ning, D. Delhomme, Y. LeCun, F. Piano, L. Bottou, and P. Barbano. Toward automatic phenotyping of developing embryos from videos. IEEE Transactions on Image Processing, 14(9):1360–1371, September 2005. Special issue on Molecular and Cellular Bioimaging.
  • S. Nowlan and J. Platt. A convolutional neural network hand tracker. Pages 901–908, San Mateo, CA, 1995. Morgan Kaufmann.
  • M. Osadchy, Y. LeCun, and M. Miller. Synergistic face detection and pose estimation with energy-based models. Journal of Machine Learning Research, 8:1197–1215, May 2007.
  • P. Sermanet, S. Chintala, and Y. LeCun. Convolutional neural networks applied to house numbers digit classification. In International Conference on Pattern Recognition (ICPR 2012), 2012.
  • P. Sermanet, K. Kavukcuoglu, S. Chintala, and Y. LeCun. Pedestrian detection with unsupervised multi-stage feature learning. In Proc. International Conference on Computer Vision and Pattern Recognition (CVPR'13). IEEE, June 2013.
  • P. Sermanet and Y. LeCun. Traffic sign recognition with multi-scale convolutional networks. In Proceedings of the International Joint Conference on Neural Networks (IJCNN'11), 2011.
  • G. Taylor, R. Fergus, G. Williams, I. Spiro, and C. Bregler. Pose-sensitive embedding by nonlinear NCA regression. In NIPS, 2011.
  • G. Taylor, I. Spiro, C. Bregler, and R. Fergus. Learning invariance through imitation. In CVPR, 2011.
  • J. R. R. Uijlings, K. E. A. van de Sande, T. Gevers, and A. W. M. Smeulders. Selective search for object recognition. International Journal of Computer Vision, 104(2):154–171, 2013.
  • R. Vaillant, C. Monrocq, and Y. LeCun. Original approach for the localisation of objects in images. IEE Proceedings on Vision, Image, and Signal Processing, 141(4):245–250, August 1994.