Region-based Convolutional Networks for Accurate Object Detection and Segmentation

IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. PP, no. 99, 2015

DOI: https://doi.org/10.1109/TPAMI.2015.2437384

Abstract:

Object detection performance, as measured on the canonical PASCAL VOC Challenge datasets, plateaued in the final years of the competition. The best-performing methods were complex ensemble systems that typically combined multiple low-level image features with high-level context. In this paper, we propose a simple and scalable detection algorithm…

Introduction
  • Recognizing objects and localizing them in images is one of the most fundamental and challenging problems in computer vision.
  • The authors describe an object detection and segmentation system that uses multi-layer convolutional networks to compute highly discriminative, yet invariant, features.
  • The authors use these features to classify image regions, which can be output as detected bounding boxes or pixel-level segmentation masks.
  • The authors' approach scales well with the number of object categories, which is a longstanding challenge for existing methods
Highlights
  • Recognizing objects and localizing them in images is one of the most fundamental and challenging problems in computer vision
  • On the PASCAL detection benchmark, our system achieves a relative improvement of more than 50% mean average precision compared to the best methods based on low-level image features
  • After fine-tuning, our system achieves a mean average precision of 63% on VOC 2010 compared to 33% for the highly-tuned, HOG-based deformable part model (DPM) [18], [23]
  • Using TorontoNet, our R-CNN achieves a mean average precision of 31.4%, significantly ahead of the second-best result of 24.3% from OverFeat
  • Most of the competing submissions (OverFeat, NEC-MU, Toronto A, and UIUC-IFP) used convolutional networks, indicating that there is significant nuance in how convolutional networks can be applied to object detection, leading to greatly varying outcomes
  • This paper presents a simple and scalable object detection algorithm that gives more than a 50% relative improvement over the best previous results on PASCAL VOC 2012
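
Mean average precision (mAP), the metric quoted above, is the per-class area under the precision-recall curve, averaged over classes. Below is a minimal numpy sketch of the PASCAL-style computation; the function name and matching convention are our illustration, not the official evaluation code:

    import numpy as np

    def average_precision(scores, is_tp, num_gt):
        # scores: detection confidences for one class; is_tp: 1 if the
        # detection matched an unclaimed ground-truth box (IoU >= 0.5 in
        # PASCAL VOC), else 0; num_gt: number of ground-truth objects.
        order = np.argsort(-np.asarray(scores))
        tp = np.asarray(is_tp, dtype=float)[order]
        cum_tp = np.cumsum(tp)
        recall = cum_tp / num_gt
        precision = cum_tp / (np.arange(len(tp)) + 1)
        # Accumulate precision at each recall increment (area under the
        # precision-recall curve).
        ap, prev_recall = 0.0, 0.0
        for p, r, hit in zip(precision, recall, tp):
            if hit:  # recall only increases on true positives
                ap += p * (r - prev_recall)
                prev_recall = r
        return ap

    # mAP = np.mean([average_precision(...) for each class])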
Methods
  • The dominant approach to object detection has been based on sliding-window detectors.
  • This approach goes back to early face detectors [15] and continued with HOG-based pedestrian detection [2] and part-based generic object detection [18].
  • Multiple segmentation hypotheses were used by Hoiem et al [29] to estimate the rough geometric scene structure and by Russell et al [30] to automatically discover object classes in a set of images.
  • The authors' approach was inspired by the success of selective search region proposals [21]; the resulting test-time pipeline is sketched below
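
At test time the pipeline is: extract category-independent region proposals, warp each to the CNN's fixed input size, compute features, score with per-class linear SVMs, and apply non-maximum suppression. A minimal sketch, assuming hypothetical helpers selective_search, warp, and cnn_features (illustrative names, not the released R-CNN code):

    import numpy as np

    def iou(a, b):
        # Intersection over union of two (x1, y1, x2, y2) boxes.
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
        union = ((a[2] - a[0]) * (a[3] - a[1]) +
                 (b[2] - b[0]) * (b[3] - b[1]) - inter)
        return inter / float(union)

    def non_max_suppression(boxes, scores, threshold=0.3):
        # Greedily keep the highest-scoring box, discard heavy overlaps.
        order = list(np.argsort(-scores))
        keep = []
        while order:
            i = order.pop(0)
            keep.append(i)
            order = [j for j in order if iou(boxes[i], boxes[j]) < threshold]
        return keep

    def detect(image, svm_W, svm_b):
        # Proposals -> warped crops -> CNN features -> per-class SVM scores.
        boxes = selective_search(image)   # hypothetical: ~2000 proposals
        feats = np.stack([cnn_features(warp(image, box)) for box in boxes])
        scores = feats @ svm_W.T + svm_b  # one linear SVM per class
        return {c: [(boxes[i], scores[i, c])
                    for i in non_max_suppression(boxes, scores[:, c])]
                for c in range(scores.shape[1])}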
Results
  • Results on PASCAL VOC 2010-12: Following the PASCAL VOC best practices [3], the authors validated all design decisions and hyperparameters on the VOC 2007 dataset (Section 4.2).
  • R-CNNs achieve similar performance on VOC 2012 test (53.3% mAP with TorontoNet, 62.4% with OxfordNet).
  • The authors ran an R-CNN on the 200-class ILSVRC2013 detection dataset using the same system hyperparameters that the authors used for PASCAL VOC.
  • Table 6 shows the per-category segmentation accuracy on VOC 2011 val for each of the six segmentation methods in addition to the O2P method [59]
  • These results show which methods are strongest across each of the 20 PASCAL classes, plus the background class
Conclusion
  • Object detection performance had stagnated. The best-performing systems were complex ensembles combining multiple low-level image features with high-level context from object detectors and scene classifiers.
  • This paper presents a simple and scalable object detection algorithm that gives more than a 50% relative improvement over the best previous results on PASCAL VOC 2012.
  • The authors achieved this performance through two insights.
  • The authors conjecture that the “supervised pre-training/domain-specific fine-tuning” paradigm will be highly effective for a variety of data-scarce vision problems
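
The pre-train/fine-tune recipe is straightforward to reproduce in a modern framework. A minimal PyTorch sketch of the idea (our own illustration, not the paper's original Caffe setup): load ImageNet weights, replace the 1000-way classifier with an (N+1)-way head, and train at a reduced learning rate:

    import torch
    import torchvision

    NUM_CLASSES = 21  # e.g., 20 PASCAL VOC categories + background

    # Supervised pre-training: start from ImageNet weights.
    model = torchvision.models.alexnet(weights="IMAGENET1K_V1")

    # Domain-specific fine-tuning: swap in a randomly initialized head.
    model.classifier[6] = torch.nn.Linear(4096, NUM_CLASSES)

    # Use a learning rate well below the initial pre-training rate.
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
    criterion = torch.nn.CrossEntropyLoss()

    def finetune_step(region_crops, labels):
        # One SGD step on a batch of warped region crops and their labels.
        optimizer.zero_grad()
        loss = criterion(model(region_crops), labels)
        loss.backward()
        optimizer.step()
        return loss.item()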
Objectives
  • Although some of these hyperparameter choices are slightly suboptimal for ILSVRC, the goal of this work was to produce a preliminary R-CNN result on ILSVRC without extensive dataset tuning
Tables
  • Table1: Detection average precision (%) on VOC 2010 test. T-Net stands for TorontoNet and O-Net for OxfordNet (Section 3.1.2). R-CNNs are most directly comparable to UVA and Regionlets since all methods use selective search region proposals. Bounding-box regression (BB) is described in Section 7.3 and sketched after this table list. At publication time, SegDPM was the top performer on the PASCAL VOC leaderboard. DPM and SegDPM use context rescoring not used by the other methods. SegDPM and all R-CNNs use additional training data
  • Table2: Detection average precision (%) on VOC 2007 test. Rows 1-3 show R-CNN performance without fine-tuning. Rows 4-6 show results for the CNN pre-trained on ILSVRC 2012 and then fine-tuned (FT) on VOC 2007 trainval. Row 7 includes a simple bounding-box regression (BB) stage that reduces localization errors (Section 7.3). Rows 8-10 present DPM methods as a strong baseline. The first uses only HOG, while the next two use different feature learning approaches to augment or replace HOG. All R-CNN results use TorontoNet
  • Table3: Detection average precision (%) on VOC 2007 test for two different CNN architectures. The first two rows are results from Table 2 using Krizhevsky et al.’s TorontoNet architecture (T-Net). Rows three and four use the recently proposed 16-layer OxfordNet architecture (O-Net) from Simonyan and Zisserman [24]
  • Table4: ILSVRC2013 ablation study of data usage choices, fine-tuning, and bounding-box regression. All experiments use TorontoNet
  • Table5: Segmentation mean accuracy (%) on VOC 2011 validation. Column 1 presents O2P; 2-7 use our CNN pre-trained on ILSVRC 2012
  • Table6: Per-category segmentation accuracy (%) on the VOC 2011 validation set. These experiments use TorontoNet without fine-tuning
  • Table7: Segmentation accuracy (%) on VOC 2011 test. We compare against two strong baselines: the “Regions and Parts” (R&P) method of [68] and the second-order pooling (O2P) method of [59]. Without any fine-tuning, our CNN achieves top segmentation performance, outperforming R&P and roughly matching O2P. These experiments use TorontoNet without fine-tuning
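
For reference, the bounding-box regression (BB) stage mentioned in Tables 1 and 2 learns a transform from a proposal P = (P_x, P_y, P_w, P_h) (center coordinates, width, height) to its matched ground-truth box G. In the widely published R-CNN parameterization, the regression targets are

    t_x = (G_x - P_x) / P_w,      t_y = (G_y - P_y) / P_h,
    t_w = \log(G_w / P_w),        t_h = \log(G_h / P_h)

and at test time the predicted offsets are inverted to refine the proposal:

    \hat{G}_x = P_w t_x + P_x,    \hat{G}_y = P_h t_y + P_y,
    \hat{G}_w = P_w \exp(t_w),    \hat{G}_h = P_h \exp(t_h)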
Related work
  • Deep CNNs for object detection. There were several efforts [12], [13], [19] to use convolutional networks for PASCAL-style object detection concurrent with the development of R-CNNs. Szegedy et al. [12] model object detection as a regression problem. Given an image window, they use a CNN to predict foreground pixels over a coarse grid for the whole object as well as the object’s top, bottom, left, and right halves. A grouping process then converts the predicted masks into detected bounding boxes. Szegedy et al. train their model from a random initialization on VOC 2012 trainval and get a mAP of 30.5% on VOC 2007 test. In comparison, an R-CNN using the same network architecture gets a mAP of 58.5%, but uses supervised ImageNet pre-training. One hypothesis is that [12] performs worse because it does not use ImageNet pre-training. Recent work from Agrawal et al. [25] shows that this is not the case; they find that an R-CNN trained from a random initialization on VOC 2007 trainval (using the same network architecture as [12]) achieves a mAP of 40.7% on VOC 2007 test despite using half the amount of training data as [12].
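
To make the grouping step above concrete: the simplest way to turn a thresholded foreground mask on a coarse grid into a detection box is to take the extent of the foreground cells. This is a minimal sketch of that idea only; [12]'s actual grouping procedure is more elaborate:

    import numpy as np

    def mask_to_box(mask, threshold=0.5, cell_size=1):
        # mask: 2-D array of per-cell foreground probabilities on a
        # coarse grid; cell_size: pixels per grid cell (assumed).
        ys, xs = np.nonzero(mask >= threshold)
        if len(xs) == 0:
            return None  # no foreground predicted
        # Tight (x1, y1, x2, y2) box around the foreground cells,
        # scaled back to image coordinates by the grid cell size.
        return (int(xs.min()) * cell_size, int(ys.min()) * cell_size,
                (int(xs.max()) + 1) * cell_size, (int(ys.max()) + 1) * cell_size)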
Funding
  • This research was supported in part by DARPA Mind’s Eye and MSEE programs, by NSF awards IIS-0905647, IIS1134072, and IIS-1212798, MURI N000014-10-1-0933, and by support from Toyota
References
  • [1] D. Lowe, “Distinctive image features from scale-invariant keypoints,” IJCV, 2004.
  • [2] N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” in CVPR, 2005.
  • [3] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, “The PASCAL Visual Object Classes (VOC) Challenge,” IJCV, 2010.
  • [4] K. Fukushima, “Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position,” Biological Cybernetics, vol. 36, no. 4, pp. 193–202, 1980.
  • [5] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning internal representations by error propagation,” Parallel Distributed Processing, vol. 1, pp. 318–362, 1986.
  • [6] Y. LeCun, B. Boser, J. Denker, D. Henderson, R. Howard, W. Hubbard, and L. Jackel, “Backpropagation applied to handwritten zip code recognition,” Neural Comp., 1989.
  • [7] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proc. of the IEEE, 1998.
  • [8] A. Krizhevsky, I. Sutskever, and G. Hinton, “ImageNet classification with deep convolutional neural networks,” in NIPS, 2012.
  • [9] J. Deng, A. Berg, S. Satheesh, H. Su, A. Khosla, and L. Fei-Fei, “ImageNet Large Scale Visual Recognition Competition 2012 (ILSVRC2012),” http://www.image-net.org/challenges/LSVRC/2012/.
  • [10] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet: A large-scale hierarchical image database,” in CVPR, 2009.
  • [11] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation,” in CVPR, 2014.
  • [12] C. Szegedy, A. Toshev, and D. Erhan, “Deep neural networks for object detection,” in NIPS, 2013.
  • [13] D. Erhan, C. Szegedy, A. Toshev, and D. Anguelov, “Scalable object detection using deep neural networks,” in CVPR, 2014.
  • [14] H. A. Rowley, S. Baluja, and T. Kanade, “Neural network-based face detection,” TPAMI, 1998.
  • [15] R. Vaillant, C. Monrocq, and Y. LeCun, “Original approach for the localisation of objects in images,” IEE Proc. on Vision, Image, and Signal Processing, 1994.
  • [16] J. Platt and S. Nowlan, “A convolutional neural network hand tracker,” in NIPS, 1995.
  • [17] P. Sermanet, K. Kavukcuoglu, S. Chintala, and Y. LeCun, “Pedestrian detection with unsupervised multi-stage feature learning,” in CVPR, 2013.
  • [18] P. Felzenszwalb, R. Girshick, D. McAllester, and D. Ramanan, “Object detection with discriminatively trained part based models,” TPAMI, 2010.
  • [19] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun, “OverFeat: Integrated Recognition, Localization and Detection using Convolutional Networks,” in ICLR, 2014.
  • [20] C. Gu, J. J. Lim, P. Arbelaez, and J. Malik, “Recognition using regions,” in CVPR, 2009.
  • [21] J. Uijlings, K. van de Sande, T. Gevers, and A. Smeulders, “Selective search for object recognition,” IJCV, 2013.
  • [22] J. Carreira and C. Sminchisescu, “CPMC: Automatic object segmentation using constrained parametric min-cuts,” TPAMI, 2012.
  • [23] R. Girshick, P. Felzenszwalb, and D. McAllester, “Discriminatively trained deformable part models, release 5,” http://www.cs.berkeley.edu/~rbg/latent-v5/.
  • [24] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” in ICLR, 2015.
  • [25] P. Agrawal, R. Girshick, and J. Malik, “Analyzing the performance of multilayer neural networks for object recognition,” in ECCV, 2014.
  • [26] T. Dean, J. Yagnik, M. Ruzon, M. Segal, J. Shlens, and S. Vijayanarasimhan, “Fast, accurate detection of 100,000 object classes on a single machine,” in CVPR, 2013.
  • [27] K. He, X. Zhang, S. Ren, and J. Sun, “Spatial pyramid pooling in deep convolutional networks for visual recognition,” in ECCV, 2014.
  • [28] R. Girshick, “Fast R-CNN,” arXiv e-prints, vol. arXiv:1504.08083v1 [cs.CV], 2015.
  • [29] D. Hoiem, A. Efros, and M. Hebert, “Geometric context from a single image,” in CVPR, 2005.
  • [30] B. C. Russell, W. T. Freeman, A. A. Efros, J. Sivic, and A. Zisserman, “Using multiple segmentations to discover objects and their extent in image collections,” in CVPR, 2006.
  • [31] C. L. Zitnick and P. Dollar, “Edge boxes: Locating object proposals from edges,” in ECCV, 2014.
  • [32] M.-M. Cheng, Z. Zhang, W.-Y. Lin, and P. H. S. Torr, “BING: Binarized normed gradients for objectness estimation at 300fps,” in CVPR, 2014.
  • [33] J. Hosang, R. Benenson, P. Dollar, and B. Schiele, “What makes for effective detection proposals?” arXiv e-prints, vol. arXiv:1502.05082v1 [cs.CV], 2015.
  • [34] A. Humayun, F. Li, and J. M. Rehg, “RIGOR: Reusing Inference in Graph Cuts for generating Object Regions,” in CVPR, 2014.
  • [35] P. Arbelaez, J. Pont-Tuset, J. Barron, F. Marques, and J. Malik, “Multiscale combinatorial grouping,” in CVPR, 2014.
  • [36] P. Krahenbuhl and V. Koltun, “Geodesic object proposals,” in ECCV, 2014.
  • [37] S. J. Pan and Q. Yang, “A survey on transfer learning,” TPAMI, 2010.
  • [38] R. Caruana, “Multitask learning: A knowledge-based source of inductive bias,” in ICML, 1993.
  • [39] S. Thrun, “Is learning the n-th thing any easier than learning the first?” in NIPS, 1996.
  • [40] Y. Bengio, A. Courville, and P. Vincent, “Representation learning: A review and new perspectives,” TPAMI, 2013.
  • [41] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell, “DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition,” in ICML, 2014.
  • [42] J. Hoffman, S. Guadarrama, E. Tzeng, J. Donahue, R. Girshick, T. Darrell, and K. Saenko, “From large-scale object classifiers to large-scale object detectors: An adaptation approach,” in NIPS, 2014.
  • [43] A. Karpathy, A. Joulin, and L. Fei-Fei, “Deep fragment embeddings for bidirectional image sentence mapping,” in NIPS, 2014.
  • [44] G. Gkioxari, B. Hariharan, R. Girshick, and J. Malik, “R-CNNs for pose estimation and action detection,” arXiv e-prints, vol. arXiv:1406.5212v1 [cs.CV], 2014.
  • [45] B. Hariharan, P. Arbelaez, R. Girshick, and J. Malik, “Simultaneous detection and segmentation,” in ECCV, 2014.
  • [46] S. Gupta, R. Girshick, P. Arbelaez, and J. Malik, “Learning rich features from RGB-D images for object detection and segmentation,” in ECCV, 2014.
  • [47] H. O. Song, R. Girshick, S. Jegelka, J. Mairal, Z. Harchaoui, and T. Darrell, “On learning to localize objects with minimal supervision,” in ICML, 2014.
  • [48] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, “ImageNet large scale visual recognition challenge,” arXiv e-prints, vol. arXiv:1409.0575v1 [cs.CV], 2014.
  • [49] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” arXiv e-prints, vol. arXiv:1409.4842v1 [cs.CV], 2014.
  • [50] C. Szegedy, S. Reed, D. Erhan, and D. Anguelov, “Scalable, high-quality object detection,” arXiv e-prints, vol. arXiv:1412.1441v2 [cs.CV], 2015.
  • [51] B. Alexe, T. Deselaers, and V. Ferrari, “Measuring the objectness of image windows,” TPAMI, 2012.
  • [52] I. Endres and D. Hoiem, “Category independent object proposals,” in ECCV, 2010.
  • [53] D. Ciresan, A. Giusti, L. Gambardella, and J. Schmidhuber, “Mitosis detection in breast cancer histology images with deep neural networks,” in MICCAI, 2013.
  • [54] X. Wang, M. Yang, S. Zhu, and Y. Lin, “Regionlets for generic object detection,” in ICCV, 2013.
  • [55] Y. Jia, “Caffe: An open source convolutional architecture for fast feature embedding,” http://caffe.berkeleyvision.org/, 2013.
  • [56] T. Dean, M. A. Ruzon, M. Segal, J. Shlens, S. Vijayanarasimhan, and J. Yagnik, “Fast, accurate detection of 100,000 object classes on a single machine,” in CVPR, 2013.
  • [57] S. Fidler, R. Mottaghi, A. Yuille, and R. Urtasun, “Bottom-up segmentation for top-down detection,” in CVPR, 2013.
  • [58] K. Sung and T. Poggio, “Example-based learning for view-based human face detection,” Massachusetts Institute of Technology, Tech. Rep. A.I. Memo No. 1521, 1994.
  • [59] J. Carreira, R. Caseiro, J. Batista, and C. Sminchisescu, “Semantic segmentation with second-order pooling,” in ECCV, 2012.
  • [60] K. E. van de Sande, C. G. Snoek, and A. W. Smeulders, “Fisher and VLAD with FLAIR,” in CVPR, 2014.
  • [61] J. J. Lim, C. L. Zitnick, and P. Dollar, “Sketch tokens: A learned mid-level representation for contour and object detection,” in CVPR, 2013.
  • [62] X. Ren and D. Ramanan, “Histograms of sparse codes for object detection,” in CVPR, 2013.
  • [63] M. Zeiler, G. Taylor, and R. Fergus, “Adaptive deconvolutional networks for mid and high level feature learning,” in CVPR, 2011.
  • [64] D. Hoiem, Y. Chodpathumwan, and Q. Dai, “Diagnosing error in object detectors,” in ECCV, 2012.
  • [65] J. Deng, O. Russakovsky, J. Krause, M. Bernstein, A. C. Berg, and L. Fei-Fei, “Scalable multi-label annotation,” in CHI, 2014.
  • [66] H. Su, J. Deng, and L. Fei-Fei, “Crowdsourcing annotations for visual object detection,” in AAAI Technical Report, 4th Human Computation Workshop, 2012.
  • [67] C. Farabet, C. Couprie, L. Najman, and Y. LeCun, “Learning hierarchical features for scene labeling,” TPAMI, 2013.
  • [68] P. Arbelaez, B. Hariharan, C. Gu, S. Gupta, L. Bourdev, and J. Malik, “Semantic segmentation using regions and parts,” in CVPR, 2012.
  • [69] B. Hariharan, P. Arbelaez, L. Bourdev, S. Maji, and J. Malik, “Semantic contours from inverse detectors,” in ICCV, 2011.
  • [70] B. Hariharan, P. Arbelaez, R. Girshick, and J. Malik, “Hypercolumns for object segmentation and fine-grained localization,” in CVPR, 2015.
  • [71] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, “Semantic image segmentation with deep convolutional nets and fully connected CRFs,” in ICLR, 2015.
  • [72] S. Zheng, S. Jayasumana, B. Romera-Paredes, V. Vineet, Z. Su, D. Du, C. Huang, and P. Torr, “Conditional random fields as recurrent neural networks,” arXiv e-prints, vol. arXiv:1502.03240v2 [cs.CV], 2015.
  • [73] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in CVPR, 2015.
  • [74] P. Krahenbuhl and V. Koltun, “Efficient inference in fully connected CRFs with Gaussian edge potentials,” in NIPS, 2011.
  • [75] A. Oliva and A. Torralba, “Modeling the shape of the scene: A holistic representation of the spatial envelope,” IJCV, 2001.
  • [76] M. Douze, H. Jegou, H. Sandhawalia, L. Amsaleg, and C. Schmid, “Evaluation of GIST descriptors for web-scale image search,” in Proc. of the ACM International Conference on Image and Video Retrieval, 2009.