Scalable Object Detection Using Deep Neural Networks

IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014, pages 2155–2162. arXiv:1312.2249.

DOI: https://doi.org/10.1109/CVPR.2014.276

Abstract:

Deep convolutional neural networks have recently achieved state-of-the-art performance on a number of image recognition benchmarks, including the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC-2012). The winning model on the localization sub-task was a network that predicts a single bounding box and a confidence score for each …

Introduction
  • A common paradigm to address this problem is to train object detectors which operate on a sub-image and apply these detectors in an exhaustive manner across all locations and scales.
  • This paradigm was successfully used within a discriminatively trained Deformable Part Model (DPM) to achieve state-of-art results on detection tasks [6].
  • The proposed approach is quite different from traditional methods, which score features within predefined boxes; it has the advantage of expressing object detection in a very compact and efficient way.
Highlights
  • Object detection is one of the fundamental tasks in computer vision
  • In the plot in Fig. 1 we show results obtained by training on VOC2012
  • Inference is done as with the Visual Object Classes (VOC) setup: the number of predicted locations is K = 100; these are reduced by non-max suppression (Jaccard overlap criterion of 0.4) and then post-scored by the classifier, where the final score is the product of the localizer confidence for the given box and the score of the classifier evaluated on the minimal square region around the crop
  • We propose a novel method for localizing objects in an image, which predicts multiple bounding boxes at a time
  • We present results on two challenging benchmarks, VOC2007 and ILSVRC-2012, on which the proposed method is competitive
  • Our results show that the DeepMultiBox approach is scalable and can even generalize across datasets
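The inference step described in the highlights above (predict K candidate boxes, prune with non-max suppression at a Jaccard overlap of 0.4, then rescore) can be sketched in plain Python. This is our own illustrative code, not the authors'; `jaccard` and `nms` are assumed helper names, and boxes are `(x1, y1, x2, y2)` tuples:

```python
def jaccard(a, b):
    """Jaccard (intersection-over-union) similarity of two boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, overlap=0.4):
    """Greedy non-max suppression; returns indices of kept boxes,
    highest-scoring first."""
    order = sorted(range(len(boxes)), key=lambda i: -scores[i])
    keep = []
    for i in order:
        # keep a box only if it does not overlap a kept box too much
        if all(jaccard(boxes[i], boxes[j]) <= overlap for j in keep):
            keep.append(i)
    return keep
```

The surviving boxes would then each be rescored by multiplying the localizer confidence with the classifier score on the region around the box, as the bullet above describes.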
Methods
  • The authors trained the classifier on a data set of approximately 10 million crops, each overlapping some object with a Jaccard similarity of at least 0.5.
  • In addition to a localization model that is identical to the VOC model, the authors train a model on the ImageNet Classification challenge data, which will serve as the recognition model
  • This model is trained in a procedure that is substantially similar to that of [11] and achieves the same results on the classification challenge validation set; note that the authors train only a single model, instead of 7 – the latter brings substantial benefits in terms of classification accuracy, but is 7× more expensive, which is not a negligible factor.
  • The final scores are sorted in descending order and only the top scoring score/location pair is kept for a given class.
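The per-class selection in the last bullet can be made concrete with a small sketch. This is our own illustrative code; `best_per_class` and the detection-tuple format are assumptions, not from the paper:

```python
def best_per_class(detections):
    """detections: iterable of (class_label, score, box) tuples.
    Keeps only the highest-scoring (score, box) pair per class,
    mirroring the 'top scoring score/location pair' selection."""
    best = {}
    for label, score, box in detections:
        if label not in best or score > best[label][0]:
            best[label] = (score, box)
    return best
```

For example, given two "cat" detections scored 0.7 and 0.9, only the 0.9 pair survives for that class.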
Results
  • Network Architecture and Experiment Details.
  • The network architecture for the localization and classification models that the authors use is the same as the one used by [11].
  • The authors use priors in the localization loss – these are computed using k-means on the training set.
  • The localizer might output coordinates outside the crop area used for inference.
  • The authors' second model classifies each bounding box as objects of interest or “background”
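A rough sketch of how box priors might be computed with k-means over normalized ground-truth boxes, together with one natural way to clamp out-of-crop coordinates. All names here are ours, and the paper does not specify this implementation; it is a toy illustration of the two bullets above:

```python
import random

def kmeans_priors(boxes, k, iters=20, seed=0):
    """Cluster normalized (x1, y1, x2, y2) boxes into k prior boxes."""
    rng = random.Random(seed)
    centroids = rng.sample(boxes, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for b in boxes:
            # assign each box to its nearest centroid in squared-L2 distance
            j = min(range(k),
                    key=lambda i: sum((b[d] - centroids[i][d]) ** 2 for d in range(4)))
            clusters[j].append(b)
        # recompute centroids; keep the old one if a cluster went empty
        centroids = [
            tuple(sum(b[d] for b in c) / len(c) for d in range(4)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids

def clip_box(box):
    """Clamp predicted coordinates back into the unit crop, since the
    localizer may emit coordinates outside it."""
    return tuple(min(1.0, max(0.0, v)) for v in box)
```

The k centroids then serve as the priors referenced by the localization loss.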
Conclusion
  • The authors analyze the performance of the localizer in isolation: they plot the detection rate – the number of detected objects, as defined by the Pascal detection criterion – against the number of produced bounding boxes.
  • With a budget of 10 bounding boxes, 45.3% of the objects are localized with the first model, and 48% with the second model
  • This shows better performance than other reported results, such as the objectness algorithm achieving 42% [1].
  • The method uses a deep convolutional neural network as a base feature extraction and learning model
  • It formulates a multiple-box localization cost that can take advantage of a variable number of ground-truth locations of interest in a given image, and learns to predict such locations in unseen images.
  • The authors' results show that the DeepMultiBox approach is scalable and can even generalize across datasets
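To make the multiple-box cost concrete, here is a deliberately simplified sketch of such a matching loss. It greedily matches each ground-truth box to its nearest unused prediction (a real implementation would solve the assignment optimally), then sums an L2 localization term with a log-loss confidence term. The function name, tuple formats, and the weighting constant `alpha` are all our assumptions, not the paper's notation:

```python
import math

def match_cost(preds, confs, gts, alpha=0.3):
    """preds: predicted boxes (x1, y1, x2, y2); confs: their confidences
    in (0, 1); gts: ground-truth boxes. Assumes len(preds) >= len(gts).
    Returns a scalar cost: matched boxes should be close and confident,
    unmatched boxes should have low confidence."""
    used = set()
    loc, conf = 0.0, 0.0
    for g in gts:
        # greedy: nearest unused prediction in squared-L2 box distance
        j = min((i for i in range(len(preds)) if i not in used),
                key=lambda i: sum((preds[i][d] - g[d]) ** 2 for d in range(4)))
        used.add(j)
        loc += 0.5 * sum((preds[j][d] - g[d]) ** 2 for d in range(4))
        conf += -math.log(confs[j])            # matched: reward confidence
    for i in range(len(preds)):
        if i not in used:
            conf += -math.log(1.0 - confs[i])  # unmatched: penalize confidence
    return loc + alpha * conf
```

Because the matching is recomputed per image, the same fixed set of predictors can absorb any number of ground-truth boxes, which is the property the bullet above highlights.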
Tables
  • Table 1: Average Precision on VOC 2007 test of our method, called DeepMultiBox, and other competitive methods. DeepMultiBox was trained on VOC2012 training data, while the rest of the models were trained on VOC2007 data
  • Table 2: Performance of MultiBox (the proposed method) vs. classifying ground-truth boxes directly and predicting one box per class. Columns: Method, det@5, class@5
References
  • [1] B. Alexe, T. Deselaers, and V. Ferrari. What is an object? In CVPR. IEEE, 2010.
  • [2] J. Carreira and C. Sminchisescu. Constrained parametric min-cuts for automatic object segmentation. In CVPR, 2010.
  • [3] T. Dean, M. A. Ruzon, M. Segal, J. Shlens, S. Vijayanarasimhan, and J. Yagnik. Fast, accurate detection of 100,000 object classes on a single machine. In CVPR, 2013.
  • [4] I. Endres and D. Hoiem. Category independent object proposals. In ECCV, 2010.
  • [5] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman. The PASCAL visual object classes (VOC) challenge. International Journal of Computer Vision, 88(2):303–338, 2010.
  • [6] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part-based models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(9):1627–1645, 2010.
  • [7] M. A. Fischler and R. A. Elschlager. The representation and matching of pictorial structures. IEEE Transactions on Computers, 100(1):67–92, 1973.
  • [8] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014.
  • [9] R. B. Girshick, P. F. Felzenszwalb, and D. McAllester. Discriminatively trained deformable part models, release 5. http://people.cs.uchicago.edu/rbg/latent-release5/.
  • [10] C. Gu, J. J. Lim, P. Arbelaez, and J. Malik. Recognition using regions. In CVPR, 2009.
  • [11] A. Krizhevsky, I. Sutskever, and G. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25, pages 1106–1114, 2012.
  • [12] C. H. Lampert, M. B. Blaschko, and T. Hofmann. Beyond sliding windows: Object localization by efficient subwindow search. In CVPR, 2008.
  • [13] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun. OverFeat: Integrated recognition, localization and detection using convolutional networks. arXiv preprint arXiv:1312.6229, 2013.
  • [14] H. O. Song, S. Zickler, T. Althoff, R. Girshick, M. Fritz, C. Geyer, P. Felzenszwalb, and T. Darrell. Sparselet models for efficient multiclass object detection. In ECCV, 2012.
  • [15] C. Szegedy, A. Toshev, and D. Erhan. Deep neural networks for object detection. In NIPS, 2013.
  • [16] J. Uijlings, K. van de Sande, T. Gevers, and A. Smeulders. Selective search for object recognition. International Journal of Computer Vision, 104(2):154–171, 2013.
  • [17] K. E. van de Sande, J. R. Uijlings, T. Gevers, and A. W. Smeulders. Segmentation as selective search for object recognition. In ICCV, 2011.
  • [18] L. Zhu, Y. Chen, A. Yuille, and W. Freeman. Latent hierarchical structural learning for object detection. In CVPR, pages 1062–1069. IEEE, 2010.