Region-based Convolutional Networks for Accurate Object Detection and Segmentation
IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. PP, no. 99, 2015
Abstract:
Object detection performance, as measured on the canonical PASCAL VOC Challenge datasets, plateaued in the final years of the competition. The best-performing methods were complex ensemble systems that typically combined multiple low-level image features with high-level context. In this paper, we propose a simple and scalable detection algorithm…
Introduction
- Recognizing objects and localizing them in images is one of the most fundamental and challenging problems in computer vision.
- The authors describe an object detection and segmentation system that uses multi-layer convolutional networks to compute highly discriminative, yet invariant, features.
- The authors use these features to classify image regions, which can be output as detected bounding boxes or pixel-level segmentation masks.
- The authors' approach scales well with the number of object categories, which is a longstanding challenge for existing methods.
Highlights
- Recognizing objects and localizing them in images is one of the most fundamental and challenging problems in computer vision.
- On the PASCAL detection benchmark, our system achieves a relative improvement of more than 50% in mean average precision (mAP; see the sketch after this list) compared to the best methods based on low-level image features.
- After fine-tuning, our system achieves a mean average precision of 63% on VOC 2010, compared to 33% for the highly tuned, HOG-based deformable part model (DPM) [18], [23].
- Using TorontoNet, our R-CNN achieves a mean average precision of 31.4%, significantly ahead of the second-best result of 24.3% from OverFeat.
- Most of the competing submissions (OverFeat, NEC-MU, Toronto A, and UIUC-IFP) used convolutional networks, indicating that there is significant nuance in how convolutional networks can be applied to object detection, leading to greatly varying outcomes.
- This paper presents a simple and scalable object detection algorithm that gives more than a 50% relative improvement over the best previous results on PASCAL VOC 2012.
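For readers unfamiliar with the metric quoted above, the following is a minimal Python sketch of how average precision is computed for a single object class. It uses the standard IoU ≥ 0.5 matching criterion from PASCAL VOC, but a simplified area-under-the-curve integration rather than the exact PASCAL devkit interpolation; the data-structure choices (tuples, dicts) are illustrative, not the evaluation server's.

```python
import numpy as np

def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def average_precision(detections, ground_truth, iou_thresh=0.5):
    """AP for one class. detections: list of (image_id, score, box);
    ground_truth: dict image_id -> list of boxes. Detections are matched
    greedily, highest score first; each ground-truth box matches at most once."""
    matched = {img: [False] * len(boxes) for img, boxes in ground_truth.items()}
    n_gt = sum(len(boxes) for boxes in ground_truth.values())
    tp, fp = [], []
    for img, score, box in sorted(detections, key=lambda d: -d[1]):
        best_iou, best_j = 0.0, -1
        for j, gt_box in enumerate(ground_truth.get(img, [])):
            overlap = iou(box, gt_box)
            if overlap > best_iou:
                best_iou, best_j = overlap, j
        if best_j >= 0 and best_iou >= iou_thresh and not matched[img][best_j]:
            matched[img][best_j] = True
            tp.append(1.0); fp.append(0.0)   # correct detection
        else:
            tp.append(0.0); fp.append(1.0)   # false positive: miss or duplicate
    tp, fp = np.cumsum(tp), np.cumsum(fp)
    recall = tp / max(n_gt, 1)
    precision = tp / np.maximum(tp + fp, 1e-12)
    # Area under the raw precision-recall curve (a simplification of the
    # PASCAL devkit's interpolated AP).
    return float(np.sum(precision * np.diff(np.concatenate([[0.0], recall]))))

# mAP is the mean of average_precision over all object classes.
```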
Methods
- The dominant approach to object detection has been based on sliding-window detectors.
- This approach goes back to early face detectors [15] and continued with HOG-based pedestrian detection [2] and part-based generic object detection [18].
- Multiple segmentation hypotheses were used by Hoiem et al. [29] to estimate the rough geometric scene structure and by Russell et al. [30] to automatically discover object classes in a set of images.
- The authors' approach was inspired by the success of selective search; a schematic of the resulting test-time pipeline is sketched below.
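The following is a hedged Python sketch of the test-time pipeline the paper describes: roughly 2000 selective-search proposals per image, each warped to the fixed 227×227 CNN input, featurized, and scored by per-class linear SVMs with greedy non-maximum suppression. The callables `selective_search`, `cnn_features`, and `svm_weights` are hypothetical stand-ins, not the authors' Caffe implementation, and the warp here is a crude nearest-neighbor resize rather than the paper's padded anisotropic warp.

```python
import numpy as np

def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    return inter / ((a[2]-a[0])*(a[3]-a[1]) + (b[2]-b[0])*(b[3]-b[1]) - inter)

def nms(boxes, scores, thresh=0.3):
    """Greedy non-maximum suppression: keep the highest-scoring box, discard
    lower-scoring boxes that overlap it by more than thresh, repeat."""
    order = list(np.argsort(-np.asarray(scores)))
    keep = []
    while order:
        i = order.pop(0)
        keep.append(i)
        order = [j for j in order if iou(boxes[i], boxes[j]) <= thresh]
    return keep

def warp(region, size=(227, 227)):
    """Nearest-neighbor anisotropic warp of an HxWxC crop to the CNN input size."""
    h, w = region.shape[:2]
    ys = np.minimum(np.arange(size[0]) * h // size[0], h - 1)
    xs = np.minimum(np.arange(size[1]) * w // size[1], w - 1)
    return region[ys][:, xs]

def detect(image, selective_search, cnn_features, svm_weights):
    """Schematic test-time R-CNN pipeline. Hypothetical inputs:
      selective_search(image) -> list of (x1, y1, x2, y2) proposals (~2000)
      cnn_features(crop)      -> fixed-length feature vector (e.g., 4096-d fc7)
      svm_weights             -> dict class_name -> (w, b) of a linear SVM
    """
    proposals = selective_search(image)
    feats = np.stack([cnn_features(warp(image[y1:y2, x1:x2]))
                      for (x1, y1, x2, y2) in proposals])
    detections = {}
    for cls, (w, b) in svm_weights.items():
        scores = feats @ w + b                # per-region linear SVM scores
        keep = nms(proposals, scores)         # class-wise NMS
        detections[cls] = [(proposals[i], float(scores[i])) for i in keep]
    return detections
```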
Results
- Results on PASCAL VOC 2010-12: Following the PASCAL VOC best practices [3], the authors validated all design decisions and hyperparameters on the VOC 2007 dataset (Section 4.2).
- R-CNNs achieve similar performance (53.3% / 62.4% mAP) on VOC 2012 test.
- The authors ran an R-CNN on the 200-class ILSVRC2013 detection dataset using the same system hyperparameters used for PASCAL VOC.
- Table 6 shows the per-category segmentation accuracy on VOC 2011 val for each of the six segmentation methods, in addition to the O2P method [59].
- These results show which methods are strongest across each of the 20 PASCAL classes, plus the background class.
Conclusion
- Object detection performance had stagnated. The best performing systems were complex ensembles combining multiple low-level image features with high-level context from object detectors and scene classifiers.
- This paper presents a simple and scalable object detection algorithm that gives more than a 50% relative improvement over the best previous results on PASCAL VOC 2012.
- The authors achieved this performance through two insights: applying high-capacity convolutional networks to bottom-up region proposals in order to localize and segment objects, and supervised pre-training on a large auxiliary dataset followed by domain-specific fine-tuning.
- The authors conjecture that the “supervised pre-training/domain-specific fine-tuning” paradigm will be highly effective for a variety of data-scarce vision problems; a minimal sketch of the paradigm follows.
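As a concrete illustration of this paradigm, here is a minimal sketch in PyTorch (an illustrative substitute; the authors used Caffe with TorontoNet/OxfordNet): start from a supervised ImageNet-pre-trained network, replace its 1000-way classifier with a randomly initialized (N + 1)-way layer for N object classes plus background, and fine-tune with SGD at 1/10th of the initial pre-training learning rate, as the paper describes.

```python
import torch
import torchvision

# Hypothetical setup: fine-tune an ImageNet-pre-trained network for (N + 1)-way
# region classification (N object classes plus background). AlexNet in
# torchvision is used here as the closest available stand-in for TorontoNet.
N_CLASSES = 20  # e.g., PASCAL VOC

model = torchvision.models.alexnet(weights="IMAGENET1K_V1")  # supervised pre-training
# Replace the 1000-way ImageNet classifier with a randomly initialized
# (N + 1)-way layer; all other weights are retained and fine-tuned.
model.classifier[6] = torch.nn.Linear(4096, N_CLASSES + 1)

# Fine-tune at 1/10th of the typical initial pre-training rate (0.01 -> 0.001),
# as in the paper, so pre-trained weights are adjusted rather than clobbered.
optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
criterion = torch.nn.CrossEntropyLoss()

def finetune_step(warped_regions, labels):
    """One SGD step on a mini-batch of warped region crops (B, 3, 224, 224)
    with integer labels in [0, N_CLASSES] (0 = background, by convention here)."""
    optimizer.zero_grad()
    loss = criterion(model(warped_regions), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```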
Objectives
- Although some of the hyperparameter choices carried over from PASCAL VOC are slightly suboptimal for ILSVRC, the goal of this work was to produce a preliminary R-CNN result on ILSVRC without extensive dataset tuning.
Tables
- Table 1: Detection average precision (%) on VOC 2010 test. T-Net stands for TorontoNet and O-Net for OxfordNet (Section 3.1.2). R-CNNs are most directly comparable to UVA and Regionlets since all methods use selective search region proposals. Bounding-box regression (BB) is described in Section 7.3; its regression targets are sketched after this list. At publication time, SegDPM was the top performer on the PASCAL VOC leaderboard. DPM and SegDPM use context rescoring not used by the other methods. SegDPM and all R-CNNs use additional training data.
- Table 2: Detection average precision (%) on VOC 2007 test. Rows 1-3 show R-CNN performance without fine-tuning. Rows 4-6 show results for the CNN pre-trained on ILSVRC 2012 and then fine-tuned (FT) on VOC 2007 trainval. Row 7 includes a simple bounding-box regression (BB) stage that reduces localization errors (Section 7.3). Rows 8-10 present DPM methods as a strong baseline. The first uses only HOG, while the next two use different feature learning approaches to augment or replace HOG. All R-CNN results use TorontoNet.
- Table 3: Detection average precision (%) on VOC 2007 test for two different CNN architectures. The first two rows are results from Table 2 using Krizhevsky et al.’s TorontoNet architecture (T-Net). Rows three and four use the recently proposed 16-layer OxfordNet architecture (O-Net) from Simonyan and Zisserman [24].
- Table 4: ILSVRC2013 ablation study of data usage choices, fine-tuning, and bounding-box regression. All experiments use TorontoNet.
- Table 5: Segmentation mean accuracy (%) on VOC 2011 validation. Column 1 presents O2P; columns 2-7 use our CNN pre-trained on ILSVRC 2012.
- Table 6: Per-category segmentation accuracy (%) on the VOC 2011 validation set. These experiments use TorontoNet without fine-tuning.
- Table 7: Segmentation accuracy (%) on VOC 2011 test. We compare against two strong baselines: the “Regions and Parts” (R&P) method of [68] and the second-order pooling (O2P) method of [59]. Without any fine-tuning, our CNN achieves top segmentation performance, outperforming R&P and roughly matching O2P. These experiments use TorontoNet without fine-tuning.
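Tables 1, 2, and 4 reference the bounding-box regression (BB) stage of Section 7.3. The sketch below shows the scale-invariant regression targets defined in the paper, with boxes parameterized by center and size; the helper names are illustrative. In the paper, a class-specific linear regression on pool5 features predicts these targets, which are then inverted to refine each proposal.

```python
import numpy as np

def bbox_targets(P, G):
    """Regression targets of R-CNN's bounding-box regression (Section 7.3).

    P, G: proposal and ground-truth boxes as (center_x, center_y, width, height).
    """
    px, py, pw, ph = P
    gx, gy, gw, gh = G
    return np.array([(gx - px) / pw,    # t_x: scale-invariant x translation
                     (gy - py) / ph,    # t_y: scale-invariant y translation
                     np.log(gw / pw),   # t_w: log-space width change
                     np.log(gh / ph)])  # t_h: log-space height change

def apply_deltas(P, t):
    """Invert the targets: refine proposal P with predicted deltas t."""
    px, py, pw, ph = P
    tx, ty, tw, th = t
    return (px + tx * pw, py + ty * ph, pw * np.exp(tw), ph * np.exp(th))
```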
Related work
- Deep CNNs for object detection. There were several efforts [12], [13], [19] to use convolutional networks for PASCAL-style object detection concurrent with the development of R-CNNs. Szegedy et al. [12] model object detection as a regression problem. Given an image window, they use a CNN to predict foreground pixels over a coarse grid for the whole object as well as the object’s top, bottom, left, and right halves. A grouping process then converts the predicted masks into detected bounding boxes. Szegedy et al. train their model from a random initialization on VOC 2012 trainval and get a mAP of 30.5% on VOC 2007 test. In comparison, an R-CNN using the same network architecture gets a mAP of 58.5%, but uses supervised ImageNet pre-training. One hypothesis is that [12] performs worse because it does not use ImageNet pre-training. Recent work from Agrawal et al. [25] shows that this is not the case; they find that an R-CNN trained from a random initialization on VOC 2007 trainval (using the same network architecture as [12]) achieves a mAP of 40.7% on VOC 2007 test despite using half the amount of training data as [12].
Funding
- This research was supported in part by the DARPA Mind’s Eye and MSEE programs, by NSF awards IIS-0905647, IIS-1134072, and IIS-1212798, by MURI N000014-10-1-0933, and by support from Toyota.
References
- D. Lowe, “Distinctive image features from scale-invariant keypoints,” IJCV, 2004.
- N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” in CVPR, 2005.
- M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, “The PASCAL Visual Object Classes (VOC) Challenge,” IJCV, 2010.
- K. Fukushima, “Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position,” Biological cybernetics, vol. 36, no. 4, pp. 193–202, 1980.
- D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning internal representations by error propagation,” Parallel Distributed Processing, vol. 1, pp. 318–362, 1986.
- Y. LeCun, B. Boser, J. Denker, D. Henderson, R. Howard, W. Hubbard, and L. Jackel, “Backpropagation applied to handwritten zip code recognition,” Neural Comp., 1989.
- Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proc. of the IEEE, 1998.
- A. Krizhevsky, I. Sutskever, and G. Hinton, “ImageNet classification with deep convolutional neural networks,” in NIPS, 2012.
- J. Deng, A. Berg, S. Satheesh, H. Su, A. Khosla, and L. Fei-Fei, “ImageNet Large Scale Visual Recognition Competition 2012 (ILSVRC2012),” http://www.image-net.org/challenges/LSVRC/2012/.
- J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet: A large-scale hierarchical image database,” in CVPR, 2009.
- R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation,” in CVPR, 2014.
- C. Szegedy, A. Toshev, and D. Erhan, “Deep neural networks for object detection,” in NIPS, 2013.
- D. Erhan, C. Szegedy, A. Toshev, and D. Anguelov, “Scalable object detection using deep neural networks,” in CVPR, 2014.
- H. A. Rowley, S. Baluja, and T. Kanade, “Neural network-based face detection,” TPAMI, 1998.
- R. Vaillant, C. Monrocq, and Y. LeCun, “Original approach for the localisation of objects in images,” IEE Proc on Vision, Image, and Signal Processing, 1994.
- J. Platt and S. Nowlan, “A convolutional neural network hand tracker,” in NIPS, 1995.
- P. Sermanet, K. Kavukcuoglu, S. Chintala, and Y. LeCun, “Pedestrian detection with unsupervised multi-stage feature learning,” in CVPR, 2013.
- P. Felzenszwalb, R. Girshick, D. McAllester, and D. Ramanan, “Object detection with discriminatively trained part based models,” TPAMI, 2010.
- P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun, “OverFeat: Integrated Recognition, Localization and Detection using Convolutional Networks,” in ICLR, 2014.
- C. Gu, J. J. Lim, P. Arbelaez, and J. Malik, “Recognition using regions,” in CVPR, 2009.
- J. Uijlings, K. van de Sande, T. Gevers, and A. Smeulders, “Selective search for object recognition,” IJCV, 2013.
- J. Carreira and C. Sminchisescu, “CPMC: Automatic object segmentation using constrained parametric min-cuts,” TPAMI, 2012.
- R. Girshick, P. Felzenszwalb, and D. McAllester, “Discriminatively trained deformable part models, release 5,” http://www.cs.berkeley.edu/~rbg/latent-v5/.
- K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” in ICLR, 2015.
- P. Agrawal, R. Girshick, and J. Malik, “Analyzing the performance of multilayer neural networks for object recognition,” in ECCV, 2014.
- T. Dean, J. Yagnik, M. Ruzon, M. Segal, J. Shlens, and S. Vijayanarasimhan, “Fast, accurate detection of 100,000 object classes on a single machine,” in CVPR, 2013.
- K. He, X. Zhang, S. Ren, and J. Sun, “Spatial pyramid pooling in deep convolutional networks for visual recognition,” in ECCV, 2014.
- R. Girshick, “Fast R-CNN,” arXiv e-prints, vol. arXiv:1504.08083v1 [cs.CV], 2015.
- D. Hoiem, A. Efros, and M. Hebert, “Geometric context from a single image,” in CVPR, 2005.
- B. C. Russell, W. T. Freeman, A. A. Efros, J. Sivic, and A. Zisserman, “Using multiple segmentations to discover objects and their extent in image collections,” in CVPR, 2006.
- C. L. Zitnick and P. Dollar, “Edge boxes: Locating object proposals from edges,” in ECCV, 2014.
- M.-M. Cheng, Z. Zhang, W.-Y. Lin, and P. H. S. Torr, “BING: Binarized normed gradients for objectness estimation at 300fps,” in CVPR, 2014.
- J. Hosang, R. Benenson, P. Dollar, and B. Schiele, “What makes for effective detection proposals?” arXiv e-prints, vol. arXiv:1502.05082v1 [cs.CV], 2015.
- A. Humayun, F. Li, and J. M. Rehg, “RIGOR: Reusing Inference in Graph Cuts for generating Object Regions,” in CVPR, 2014.
- P. Arbelaez, J. Pont-Tuset, J. Barron, F. Marques, and J. Malik, “Multiscale combinatorial grouping,” in CVPR, 2014.
- P. Krahenbuhl and V. Koltun, “Geodesic object proposals,” in ECCV, 2014.
- S. J. Pan and Q. Yang, “A survey on transfer learning,” TPAMI, 2010.
- R. Caruana, “Multitask learning: A knowledge-based source of inductive bias,” in ICML, 1993.
- S. Thrun, “Is learning the n-th thing any easier than learning the first?” NIPS, 1996.
- Y. Bengio, A. Courville, and P. Vincent, “Representation learning: A review and new perspectives,” TPAMI, 2013.
- J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell, “DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition,” in ICML, 2014.
- J. Hoffman, S. Guadarrama, E. Tzeng, J. Donahue, R. Girshick, T. Darrell, and K. Saenko, “From large-scale object classifiers to large-scale object detectors: An adaptation approach,” in NIPS, 2014.
- A. Karpathy, A. Joulin, and L. Fei-Fei, “Deep fragment embeddings for bidirectional image sentence mapping,” in NIPS, 2014.
- G. Gkioxari, B. Hariharan, R. Girshick, and J. Malik, “R-CNNs for pose estimation and action detection,” arXiv e-prints, vol. arXiv:1406.5212v1 [cs.CV], 2014.
- B. Hariharan, P. Arbelaez, R. Girshick, and J. Malik, “Simultaneous detection and segmentation,” in ECCV, 2014.
- S. Gupta, R. Girshick, P. Arbelaez, and J. Malik, “Learning rich features from RGB-D images for object detection and segmentation,” in ECCV, 2014.
- H. O. Song, R. Girshick, S. Jegelka, J. Mairal, Z. Harchaoui, and T. Darrell, “On learning to localize objects with minimal supervision,” in ICML, 2014.
- O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, “ImageNet large scale visual recognition challenge,” arXiv e-prints, vol. arXiv:1409.0575v1 [cs.CV], 2014.
- C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” arXiv e-prints, vol. arXiv:1409.4842v1 [cs.CV], 2014.
- C. Szegedy, S. Reed, D. Erhan, and D. Anguelov, “Scalable, highquality object detection,” arXiv e-prints, vol. arXiv:1412.1441v2 [cs.CV], 2015.
- B. Alexe, T. Deselaers, and V. Ferrari, “Measuring the objectness of image windows,” TPAMI, 2012.
- I. Endres and D. Hoiem, “Category independent object proposals,” in ECCV, 2010.
- D. Ciresan, A. Giusti, L. Gambardella, and J. Schmidhuber, “Mitosis detection in breast cancer histology images with deep neural networks,” in MICCAI, 2013.
- X. Wang, M. Yang, S. Zhu, and Y. Lin, “Regionlets for generic object detection,” in ICCV, 2013.
- Y. Jia, “Caffe: An open source convolutional architecture for fast feature embedding,” http://caffe.berkeleyvision.org/, 2013.
- T. Dean, M. A. Ruzon, M. Segal, J. Shlens, S. Vijayanarasimhan, and J. Yagnik, “Fast, accurate detection of 100,000 object classes on a single machine,” in CVPR, 2013.
- S. Fidler, R. Mottaghi, A. Yuille, and R. Urtasun, “Bottom-up segmentation for top-down detection,” in CVPR, 2013.
- K. Sung and T. Poggio, “Example-based learning for view-based human face detection,” Massachussets Institute of Technology, Tech. Rep. A.I. Memo No. 1521, 1994.
- J. Carreira, R. Caseiro, J. Batista, and C. Sminchisescu, “Semantic segmentation with second-order pooling,” in ECCV, 2012.
- K. E. van de Sande, C. G. Snoek, and A. W. Smeulders, “Fisher and VLAD with FLAIR,” in CVPR, 2014.
- J. J. Lim, C. L. Zitnick, and P. Dollar, “Sketch tokens: A learned mid-level representation for contour and object detection,” in CVPR, 2013.
- X. Ren and D. Ramanan, “Histograms of sparse codes for object detection,” in CVPR, 2013.
- M. Zeiler, G. Taylor, and R. Fergus, “Adaptive deconvolutional networks for mid and high level feature learning,” in CVPR, 2011.
- D. Hoiem, Y. Chodpathumwan, and Q. Dai, “Diagnosing error in object detectors,” in ECCV, 2012.
- J. Deng, O. Russakovsky, J. Krause, M. Bernstein, A. C. Berg, and L. Fei-Fei, “Scalable multi-label annotation,” in CHI, 2014.
- H. Su, J. Deng, and L. Fei-Fei, “Crowdsourcing annotations for visual object detection,” in AAAI Technical Report, 4th Human Computation Workshop, 2012.
- C. Farabet, C. Couprie, L. Najman, and Y. LeCun, “Learning hierarchical features for scene labeling,” TPAMI, 2013.
- P. Arbelaez, B. Hariharan, C. Gu, S. Gupta, L. Bourdev, and J. Malik, “Semantic segmentation using regions and parts,” in CVPR, 2012.
- B. Hariharan, P. Arbelaez, L. Bourdev, S. Maji, and J. Malik, “Semantic contours from inverse detectors,” in ICCV, 2011.
- B. Hariharan, P. Arbelaez, R. Girshick, and J. Malik, “Hypercolumns for object segmentation and fine-grained localization,” in CVPR, 2015.
- L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, “Semantic image segmentation with deep convolutional nets and fully connected CRFs,” in ICLR, 2015.
- S. Zheng, S. Jayasumana, B. Romera-Paredes, V. Vineet, Z. Su, D. Du, C. Huang, and P. Torr, “Conditional random fields as recurrent neural networks,” arXiv e-print, vol. arXiv:1502.03240v2 [cs.CV], 2015.
- J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in CVPR, 2015.
- P. Krahenbuhl and V. Koltun, “Efficient inference in fully connected CRFs with gaussian edge potentials,” in NIPS, 2011.
- A. Oliva and A. Torralba, “Modeling the shape of the scene: A holistic representation of the spatial envelope,” IJCV, 2001.
- M. Douze, H. Jegou, H. Sandhawalia, L. Amsaleg, and C. Schmid, “Evaluation of gist descriptors for web-scale image search,” in Proc. of the ACM International Conference on Image and Video Retrieval, 2009.