Supervised Transformer Network for Efficient Face Detection

ECCV, 2016.

We proposed a new Supervised Transformer Network for face detection

Abstract:

Large pose variations remain a challenge for real-world face detection. We propose a new cascaded Convolutional Neural Network, dubbed the Supervised Transformer Network, to address this challenge. The first stage is a multi-task Region Proposal Network (RPN), which simultaneously predicts candidate face regions along...

Introduction
  • Among the various factors that confront real-world face detection, large pose variations remain a major challenge.
  • Many earlier works handle pose by training a separate detector for each pose category; some other works [5,6,7] instead first estimate the face pose and then run the cascade for that pose to verify the detection.
  • The complexity of the former approach increases with the number of pose categories, while the accuracy of the latter suffers from mistakes in pose estimation.
Highlights
  • Many earlier works tackled large pose variations within the boosting cascade framework advocated by Viola and Jones [1].
  • Our contributions are: 1) a new cascaded network, the Supervised Transformer Network, trained end-to-end for efficient face detection; 2) the supervised transformer layer, which learns the optimal canonical pose to best differentiate face/non-face patterns; 3) a Non-top K suppression scheme, which achieves better recall without sacrificing precision; 4) a ROI convolution scheme.
  • We propose to learn both the canonical positions and the prediction of the facial landmarks end-to-end, with additional supervision from the classification objective of the RCNN propagated back through the network.
  • By combining feature maps from both stages of the network, we achieve state-of-the-art detection accuracies on several public benchmarks.
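
The highlights name Non-top K suppression without spelling it out. As a minimal sketch, assuming the scheme keeps only the K highest-confidence candidate regions and suppresses the rest, it could look like the following. The function name, the (score, box) tuple layout and the value of K are illustrative assumptions, not the authors' code.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]          # (x1, y1, x2, y2), assumed layout
Candidate = Tuple[float, Box]            # (confidence score, box)


def non_top_k_suppression(candidates: List[Candidate], k: int) -> List[Candidate]:
    """Keep the k candidates with the highest confidence and drop the others.

    Assumption: candidates are ranked by score alone, with no overlap test.
    """
    ranked = sorted(candidates, key=lambda c: c[0], reverse=True)
    return ranked[:k]


# Hypothetical first-stage output: four scored candidate regions.
proposals = [
    (0.20, (0, 0, 10, 10)),
    (0.95, (5, 5, 40, 40)),
    (0.60, (8, 6, 42, 44)),
    (0.10, (50, 50, 60, 60)),
]
kept = non_top_k_suppression(proposals, k=2)
```

Under this assumption the number of regions passed to the second stage is bounded by K regardless of how the scores are distributed, which is what distinguishes it from a fixed score threshold.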
Methods
  • The authors collected about 400K face images from the web, covering a wide range of variations, as positive training samples.
  • These images are disjoint from the FDDB [29], AFW [8] and PASCAL [30] datasets.
  • For the negative training samples, the authors use the COCO database [31].
  • This dataset has pixel-level annotations of various objects, including people.
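
One plausible use of the pixel-level people annotations is to keep negative samples away from people, so that no unlabeled face ends up as a negative. The sketch below assumes exactly that. The mask layout and the function name are hypothetical, and the summary does not state how the authors actually sample negatives.

```python
from typing import List


def patch_is_safe_negative(person_mask: List[List[bool]],
                           x: int, y: int, w: int, h: int) -> bool:
    """Return True if the w-by-h patch with top-left corner (x, y)
    contains no pixel annotated as a person.

    person_mask[row][col] is True where the annotation marks a person
    (assumed layout, derived from a pixel-level segmentation).
    """
    for row in range(y, y + h):
        for col in range(x, x + w):
            if person_mask[row][col]:
                return False
    return True


# Hypothetical 4x4 image with a single person pixel at row 1, column 1.
mask = [[False] * 4 for _ in range(4)]
mask[1][1] = True
```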
Results
  • By combining feature maps from both stages of the network, the authors achieve state-of-the-art detection accuracies on several public benchmarks.
  • In a typical DNN, the convolutional layers are the most computationally expensive and often account for more than 90% of the runtime.
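
Because the convolutional layers dominate runtime, skipping convolution outside regions that may contain a face is a natural way to save time. The sketch below illustrates that idea in plain Python, assuming a precomputed boolean ROI mask. How the mask is produced, and the function name roi_conv2d, are assumptions rather than details given here.

```python
from typing import List, Tuple

Grid = List[List[float]]


def roi_conv2d(image: Grid, kernel: Grid,
               roi_mask: List[List[bool]]) -> Tuple[Grid, int]:
    """Valid-mode 2D correlation evaluated only where roi_mask is True.

    roi_mask has the shape of the output. Positions outside the ROI are
    left at 0.0 and cost no multiply-adds. Returns the output and the
    number of output positions that were actually computed.
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = [[0.0] * out_w for _ in range(out_h)]
    evaluated = 0
    for i in range(out_h):
        for j in range(out_w):
            if not roi_mask[i][j]:
                continue  # skip work outside the region of interest
            acc = 0.0
            for di in range(kh):
                for dj in range(kw):
                    acc += image[i + di][j + dj] * kernel[di][dj]
            out[i][j] = acc
            evaluated += 1
    return out, evaluated


# Hypothetical example: 4x4 image of ones, 2x2 kernel of ones,
# and an ROI that covers only the centre of the 3x3 output.
image = [[1.0] * 4 for _ in range(4)]
kernel = [[1.0] * 2 for _ in range(2)]
roi = [[False, False, False],
       [False, True, False],
       [False, False, False]]
result, n_evaluated = roi_conv2d(image, kernel, roi)
```

In this toy case one of nine output positions is computed, so the saving scales with the fraction of the image that falls outside the ROI.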
Conclusion
  • The authors proposed a new Supervised Transformer Network for face detection.
  • The superior performance on three challenging datasets shows its ability to learn the optimal canonical positions to best distinguish face/non-face patterns.
  • The authors introduced a ROI convolution, which speeds up the detector 3x on CPU with little recall drop.
  • The authors' future work will explore how to enhance the ROI convolution so that it does not incur additional drops in recall.
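
To make "canonical positions" concrete, one way to warp a candidate face is to fit a transform that maps its predicted landmarks onto the canonical landmark positions. The sketch below fits a least-squares similarity transform (rotation, uniform scale, translation). The choice of a similarity transform and the function name are assumptions, and the canonical positions are passed in as fixed values here, whereas the network learns them.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def fit_similarity(src: List[Point],
                   dst: List[Point]) -> Tuple[float, float, float, float]:
    """Least-squares (a, b, tx, ty) such that
        u = a * x - b * y + tx
        v = b * x + a * y + ty
    maps each src point (x, y) as closely as possible onto dst point (u, v).
    """
    n = len(src)
    mx = sum(p[0] for p in src) / n
    my = sum(p[1] for p in src) / n
    mu = sum(q[0] for q in dst) / n
    mv = sum(q[1] for q in dst) / n
    num_a = num_b = denom = 0.0
    for (x, y), (u, v) in zip(src, dst):
        xc, yc, uc, vc = x - mx, y - my, u - mu, v - mv
        num_a += xc * uc + yc * vc
        num_b += xc * vc - yc * uc
        denom += xc * xc + yc * yc
    if denom == 0.0:
        raise ValueError("source landmarks must not all coincide")
    a, b = num_a / denom, num_b / denom
    tx = mu - (a * mx - b * my)
    ty = mv - (b * mx + a * my)
    return a, b, tx, ty


# Hypothetical landmarks: dst is src rotated by 90 degrees, scaled by 2,
# then shifted by (3, 4).
predicted = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
canonical = [(3.0, 4.0), (3.0, 6.0), (1.0, 4.0)]
a, b, tx, ty = fit_similarity(predicted, canonical)
```

The fitted parameters can then be used to resample the candidate region into the canonical frame before face/non-face classification.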
Tables
  • Table 1: RPN network structure
  • Table 2: Evaluation of the effect of three parts in training architecture
  • Table 3: Various results demonstrating the effects of ROI convolution
Study subjects and analysis
  • Training: once the predicted facial landmarks are largely correct, the RCNN network is added and both stages are trained together end-to-end.
  • Evaluation: three challenging public datasets are used, i.e., FDDB [29], AFW [8] and PASCAL faces [30], all of which are widely used face detection benchmarks. The evaluation metric is Intersection over Union (IoU), with the IoU threshold fixed at 0.5.
  • Comparison with the state of the art (Section 4.4): face detection experiments are run on all three benchmark datasets. On the FDDB dataset, the method is compared with all public methods [33, 8, 34, 35, 9, 36, 37, 38, 39, 40, 10, 41, 42].
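
The IoU criterion used for evaluation can be sketched as follows. The box format (x1, y1, x2, y2) and the function names are assumptions. Only the 0.5 threshold comes from the text, and since it does not say whether the comparison is strict, "at least 0.5" is assumed.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed layout


def iou(a: Box, b: Box) -> float:
    """Intersection over Union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_match(detection: Box, ground_truth: Box, threshold: float = 0.5) -> bool:
    """Count a detection as correct when its IoU reaches the threshold."""
    return iou(detection, ground_truth) >= threshold
```

For example, two 2x2 boxes offset by one pixel in each direction overlap in a 1x1 square, giving an IoU of 1/7, which falls below the 0.5 threshold.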

References
  • Viola, P., Jones, M.: Rapid object detection using a boosted cascade of simple features. In: Computer Vision and Pattern Recognition, 2001. CVPR 2001. Proceedings of the 2001 IEEE Computer Society Conference on. Volume 1., IEEE (2001) 511–518
  • Li, S.Z., Zhu, L., Zhang, Z., Blake, A., Zhang, H., Shum, H.: Statistical learning of multi-view face detection. In: European Conference on Computer Vision. (2002) 67–81
  • Wu, B., Ai, H., Huang, C., Lao, S.: Fast rotation invariant multi-view face detection based on real adaboost. In: Automatic Face and Gesture Recognition. (2004) 79–84
  • Mathias, M., Benenson, R., Pedersoli, M., Van Gool, L.: Face detection without bells and whistles. In: European Conference on Computer Vision. (2014) 720–735
  • Viola, M., Jones, M.J., Viola, P.: Fast multi-view face detection. In: TR2003-96. (2003)
  • Huang, C., Ai, H., Li, Y., Lao, S.: Vector boosting for rotation invariant multi-view face detection. In: Computer Vision, 2005. ICCV 2005. Tenth IEEE International Conference on. Volume 1. (Oct 2005) 446–453 Vol. 1
  • Huang, C., Ai, H., Li, Y., Lao, S.: High-performance rotation invariant multiview face detection. Pattern Analysis and Machine Intelligence, IEEE Transactions on 29(4) (April 2007) 671–686
  • Zhu, X., Ramanan, D.: Face detection, pose estimation, and landmark localization in the wild. In: Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, IEEE (2012) 2879–2886
  • Li, H., Hua, G., Lin, Z., Brandt, J., Yang, J.: Probabilistic elastic part model for unsupervised face detector adaptation. In: The IEEE International Conference on Computer Vision (ICCV). (2013)
  • Yang, S., Luo, P., Loy, C.C., Tang, X.: From facial parts responses to face detection: A deep learning approach. In: Proceedings of the IEEE International Conference on Computer Vision. (2015) 3676–3684
  • Dollar, P., Appel, R., Belongie, S., Perona, P.: Fast feature pyramids for object detection. Pattern Analysis and Machine Intelligence, IEEE Transactions on 36(8) (Aug 2014) 1532–1545
  • Yang, B., Yan, J., Lei, Z., Li, S.Z.: Convolutional channel features for pedestrian, face and edge detection. CoRR abs/1504.07339 (2015)
  • Shen, X., Lin, Z., Brandt, J., Wu, Y.: Detecting and aligning faces by image retrieval. In: Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on. (June 2013) 3460–3467
  • Farfade, S.S., Saberian, M.J., Li, L.J.: Multi-view face detection using deep convolutional neural networks. In: Proceedings of the 5th ACM on International Conference on Multimedia Retrieval. ICMR ’15, New York, NY, USA, ACM (2015) 643–650
  • Li, H., Lin, Z., Shen, X., Brandt, J., Hua, G.: A convolutional neural network cascade for face detection. In: Computer Vision and Pattern Recognition (CVPR), 2015 IEEE Conference on. (June 2015) 5325–5334
  • Chen, D., Ren, S., Wei, Y., Cao, X., Sun, J.: Joint cascade face detection and alignment. In: Proceedings of the European Conference on Computer Vision (ECCV). (2014)
  • Girshick, R., Donahue, J., Darrell, T., Malik, J.: Rich feature hierarchies for accurate object detection and semantic segmentation. In: Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on. (June 2014) 580–587
  • Bell, S., Zitnick, C.L., Bala, K., Girshick, R.: Inside-outside net: Detecting objects in context with skip pooling and recurrent neural networks. arXiv preprint arXiv:1512.04143 (2015)
  • Jaderberg, M., Simonyan, K., Zisserman, A., et al.: Spatial transformer networks. In: Advances in Neural Information Processing Systems. (2015) 2008–2016
  • Ren, S., He, K., Girshick, R., Sun, J.: Faster r-cnn: Towards real-time object detection with region proposal networks. In: Advances in Neural Information Processing Systems. (2015) 91–99
  • Vanhoucke, V., Senior, A., Mao, M.Z.: Improving the speed of neural networks on cpus. In: Deep Learning and Unsupervised Feature Learning Workshop, NIPS 2011. (2011)
  • Liu, B., Wang, M., Foroosh, H., Tappen, M., Pensky, M.: Sparse convolutional neural networks. In: Computer Vision and Pattern Recognition (CVPR), 2015 IEEE Conference on. (June 2015) 806–814
  • Jaderberg, M., Vedaldi, A., Zisserman, A.: Speeding up convolutional neural networks with low rank expansions. In: Proceedings of the British Machine Vision Conference, BMVA Press (2014)
  • Zhang, X., Zou, J., Ming, X., He, K., Sun, J.: Efficient and accurate approximations of nonlinear convolutional networks. In: Computer Vision and Pattern Recognition (CVPR), 2015 IEEE Conference on. (June 2015) 1984–1992
  • Zhang, C., Zhang, Z.: Improving multiview face detection with multi-task deep convolutional neural networks. In: Applications of Computer Vision (WACV), 2014 IEEE Winter Conference on. (March 2014) 1036–1041
  • Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., Darrell, T.: Caffe: Convolutional architecture for fast feature embedding. In: Proceedings of the ACM International Conference on Multimedia, ACM (2014) 675–678
  • Ozuysal, M., Fua, P., Lepetit, V.: Fast keypoint recognition in ten lines of code. In: Computer Vision and Pattern Recognition, 2007. CVPR’07. IEEE Conference on, IEEE (2007) 1–8
  • Chellapilla, K., Puri, S., Simard, P.: High performance convolutional neural networks for document processing. In: Tenth International Workshop on Frontiers in Handwriting Recognition, Suvisoft (2006)
  • Jain, V., Learned-Miller, E.: Fddb: A benchmark for face detection in unconstrained settings. Technical Report UM-CS-2010-009, University of Massachusetts, Amherst (2010)
  • Yan, J., Zhang, X., Lei, Z., Li, S.Z.: Face detection by structural models. Image and Vision Computing 32(10) (2014) 790–799
  • Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollar, P., Zitnick, C.L.: Microsoft coco: Common objects in context. In: Computer Vision– ECCV 2014. Springer (2014) 740–755
  • Schroff, F., Kalenichenko, D., Philbin, J.: Facenet: A unified embedding for face recognition and clustering. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2015) 815–823
  • Wu, B., Ai, H., Huang, C., Lao, S.: Fast rotation invariant multi-view face detection based on real adaboost. In: Automatic Face and Gesture Recognition, 2004. Proceedings. Sixth IEEE International Conference on. (May 2004) 79–84
  • Shen, X., Lin, Z., Brandt, J., Wu, Y.: Detecting and aligning faces by image retrieval. In: Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on. (June 2013) 3460–3467
  • Li, H., Lin, Z., Brandt, J., Shen, X., Hua, G.: Efficient boosted exemplar-based face detection. In: Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on. (June 2014) 1843–1850
  • Li, J., Zhang, Y.: Learning surf cascade for fast and accurate object detection. In: Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on. (June 2013) 3468–3475
  • Jain, V., Learned-Miller, E.: Online domain adaptation of a pre-trained cascade of classifiers. In: Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, IEEE (2011) 577–584
  • Subburaman, V.B., Marcel, S.: Fast bounding box estimation based face detection. In: ECCV, Workshop on Face Detection: Where we are, and what next? (2010)
  • Mikolajczyk, K., Schmid, C., Zisserman, A.: Human detection based on a probabilistic assembly of robust part detectors. In: Computer Vision-ECCV 2004. Springer (2004) 69–82
  • Yan, J., Lei, Z., Wen, L., Li, S.: The fastest deformable part model for object detection. In: Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on. (June 2014) 2497–2504
  • Ranjan, R., Patel, V.M., Chellappa, R.: A deep pyramid deformable part model for face detection. In: Biometrics Theory, Applications and Systems (BTAS), 2015 IEEE 7th International Conference on, IEEE (2015) 1–8
  • Farfade, S.S., Saberian, M.J., Li, L.J.: Multi-view face detection using deep convolutional neural networks. In: Proceedings of the 5th ACM on International Conference on Multimedia Retrieval, ACM (2015) 643–650