Supervised Transformer Network for Efficient Face Detection
ECCV, 2016.
Abstract:
Large pose variations remain a challenge for real-world face detection. We propose a new cascaded Convolutional Neural Network, dubbed the Supervised Transformer Network, to address this challenge. The first stage is a multi-task Region Proposal Network (RPN), which simultaneously predicts candidate face regions along...
Introduction
- Among the various factors that confront real-world face detection, large pose variations remain a major challenge.
- Some other works [5,6,7] proposed to first estimate the face pose and then run the cascade trained for that pose to verify the detection.
- The complexity of the former approach increases with the number of pose categories, while the accuracy of the latter is vulnerable to pose-estimation errors.
Highlights
- Among the various factors that confront real-world face detection, large pose variations remain a major challenge
- Many works have attempted to handle large pose variations within the boosting cascade framework advocated by Viola and Jones [1]
- Our contributions are: 1) we propose a new cascaded network, the Supervised Transformer Network, trained end-to-end for efficient face detection; 2) we introduce the supervised transformer layer, which learns the optimal canonical pose to best differentiate face/non-face patterns; 3) we introduce a Non-top K suppression scheme, which achieves better recall without sacrificing precision; 4) we introduce an ROI convolution scheme
- We propose to learn both the canonical positions and the facial landmark prediction end-to-end, using back propagation with additional supervision from the classification objective of the RCNN
- By combining feature maps from both stages of the network, we achieve state-of-the-art detection accuracies on several public benchmarks
- We propose a new Supervised Transformer Network for face detection
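The Non-top K suppression scheme named in the contributions can be illustrated with a rough sketch. This summary does not spell out the procedure, so the code below rests on an assumption: candidates are grouped greedily by overlap as in standard non-maximum suppression, but the K highest-scoring boxes in each group are kept rather than only the best one. The function names, the `(x1, y1, x2, y2)` box format, and the default values of `k` and `iou_thresh` are all hypothetical.

```python
def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def non_top_k_suppression(boxes, scores, k=3, iou_thresh=0.5):
    """Return indices of kept boxes (assumed behaviour, not the authors' code).

    Boxes are visited in descending score order. Each unvisited box opens a
    group containing every remaining box that overlaps it by at least
    iou_thresh; the k best-scoring members of the group survive.
    With k=1 this reduces to ordinary greedy NMS.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    used = [False] * len(boxes)
    kept = []
    for i in order:
        if used[i]:
            continue
        used[i] = True
        group = [i]  # stays sorted by score because `order` is sorted
        for j in order:
            if not used[j] and iou(boxes[i], boxes[j]) >= iou_thresh:
                used[j] = True
                group.append(j)
        kept.extend(group[:k])
    return kept
```

Keeping several candidates per location gives the second-stage classifier more than one chance to confirm a face, which is one plausible reading of how recall could improve without hurting precision.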
Methods
- The authors collected about 400K face images with diverse variations from the web as positive training samples.
- These images do not overlap with the FDDB [29], AFW [8] and PASCAL [30] datasets.
- For the negative training samples, the authors use the COCO database [31]
- This dataset has pixel-level annotations of various objects, including people.
Results
- By combining feature maps from both stages of the network, the authors achieve state-of-the-art detection accuracies on several public benchmarks.
- In a typical DNN, the convolutional layers are the most computationally expensive and often account for more than 90% of the runtime
Conclusion
- The authors proposed a new Supervised Transformer Network for face detection.
- The superior performance on three challenging datasets shows its ability to learn the optimal canonical positions to best distinguish face/non-face patterns.
- The authors introduced an ROI convolution, which speeds up the detector 3x on CPU with little drop in recall.
- The authors' future work will explore how to enhance the ROI convolution so that it does not incur additional drops in recall
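The idea behind ROI convolution, as far as this summary describes it, is to avoid spending convolution time on image regions that cannot contain a face. The sketch below is only an illustration under stated assumptions: it handles a single channel, takes the ROI mask as given (how the authors derive the mask is not covered here), and the name `roi_conv2d` and its return values are hypothetical.

```python
def roi_conv2d(image, kernel, roi_mask):
    """Valid 2D cross-correlation evaluated only where roi_mask is true.

    image    : list of rows of floats (H x W)
    kernel   : list of rows of floats (kh x kw)
    roi_mask : list of rows of booleans with the output shape
               (H - kh + 1) x (W - kw + 1); assumed to be given
    Returns the output map (zero outside the ROI) and the number of
    output positions that were actually computed.
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = [[0.0] * out_w for _ in range(out_h)]
    evaluated = 0
    for y in range(out_h):
        for x in range(out_w):
            if not roi_mask[y][x]:
                continue  # skip work outside the region of interest
            acc = 0.0
            for dy in range(kh):
                for dx in range(kw):
                    acc += image[y + dy][x + dx] * kernel[dy][dx]
            out[y][x] = acc
            evaluated += 1
    return out, evaluated
```

The cost scales with the number of masked-in positions rather than with the full image area. A face that falls outside the mask is never scored, which is consistent with the small recall drop the authors report.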
Tables
- Table 1: RPN network structure
- Table 2: Evaluation of the effect of three parts in training architecture
- Table 3: Various results demonstrating the effects of ROI convolution
Study subjects and analysis
challenging public datasets: 3
After the predicted facial landmarks are largely correct, we add the RCNN network and perform end-to-end training together. For evaluation, we use three challenging public datasets, i.e., FDDB [29], AFW [8] and PASCAL faces [30]. All three datasets are widely used as face detection benchmarks
datasets: 3
For evaluation, we use three challenging public datasets, i.e., FDDB [29], AFW [8] and PASCAL faces [30]. All three datasets are widely used as face detection benchmarks. We employ the Intersection over Union (IoU) as the evaluation metric and fix the IoU threshold at 0.5
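For reference, the IoU criterion mentioned above can be written in a few lines. This is a generic sketch rather than the authors' evaluation code: the `(x1, y1, x2, y2)` box format and the helper names `box_iou` and `is_match` are assumptions, and only the 0.5 threshold comes from the text.

```python
def box_iou(a, b):
    """Intersection over Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_match(detection, ground_truth, threshold=0.5):
    """A detection counts as correct when its IoU reaches the threshold."""
    return box_iou(detection, ground_truth) >= threshold
```

For example, two 10x10 boxes shifted horizontally by 5 pixels share an area of 50 out of a union of 150, so their IoU is 1/3 and the detection would not count as a match at the 0.5 threshold.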
benchmark datasets: 3
4.4 Comparing with state-of-the-art. We conduct face detection experiments on three benchmark datasets. On the FDDB dataset, we compare with all public methods [33, 8, 34, 35, 9, 36, 37, 38, 39, 40, 35, 10, 41, 42]
challenge datasets: 3
In this paper, we proposed a new Supervised Transformer Network for face detection. The superior performance on three challenging datasets shows its ability to learn the optimal canonical positions to best distinguish face/non-face patterns. We also introduced an ROI convolution, which speeds up our detector 3x on CPU with little drop in recall
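To make the notion of canonical positions more concrete, the sketch below fits a similarity transform (rotation, uniform scale, translation) that maps predicted facial landmarks onto a set of canonical landmark positions in the least-squares sense. It is only an illustration of this kind of alignment, not the authors' layer: in the paper the canonical positions are learned end-to-end, whereas here they are fixed inputs, and the function names are hypothetical.

```python
def fit_similarity(src, dst):
    """Least-squares similarity transform mapping src points onto dst points.

    The transform is x' = a*x - b*y + tx, y' = b*x + a*y + ty.
    src, dst : equal-length lists of (x, y) tuples with at least two
               distinct source points.
    Returns (a, b, tx, ty).
    """
    n = len(src)
    sx = sum(p[0] for p in src) / n
    sy = sum(p[1] for p in src) / n
    ux = sum(q[0] for q in dst) / n
    uy = sum(q[1] for q in dst) / n
    num_a = num_b = den = 0.0
    for (x, y), (u, v) in zip(src, dst):
        px, py = x - sx, y - sy  # centered source point
        qx, qy = u - ux, v - uy  # centered target point
        num_a += px * qx + py * qy
        num_b += px * qy - py * qx
        den += px * px + py * py
    a = num_a / den
    b = num_b / den
    tx = ux - (a * sx - b * sy)
    ty = uy - (b * sx + a * sy)
    return a, b, tx, ty


def apply_similarity(params, point):
    """Apply the (a, b, tx, ty) transform to one (x, y) point."""
    a, b, tx, ty = params
    x, y = point
    return (a * x - b * y + tx, b * x + a * y + ty)
```

In use, `src` would hold the landmarks predicted by the first stage and `dst` the canonical positions; the fitted parameters would then warp the candidate face region before it is passed to the second-stage classifier.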
Reference
- Viola, P., Jones, M.: Rapid object detection using a boosted cascade of simple features. In: Computer Vision and Pattern Recognition, 2001. CVPR 2001. Proceedings of the 2001 IEEE Computer Society Conference on. Volume 1., IEEE (2001) 511–518
- Li, S.Z., Zhu, L., Zhang, Z., Blake, A., Zhang, H., Shum, H.: Statistical learning of multi-view face detection. In: European Conference on Computer Vision. (2002) 67–81
- Wu, B., Ai, H., Huang, C., Lao, S.: Fast rotation invariant multi-view face detection based on real adaboost. In: Automatic Face and Gesture Recognition. (2004) 79–84
- Mathias, M., Benenson, R., Pedersoli, M., Van Gool, L.: Face detection without bells and whistles. In: European Conference on Computer Vision. (2014) 720–735
- Jones, M.J., Viola, P.: Fast multi-view face detection. In: TR2003-96. (2003)
- Huang, C., Ai, H., Li, Y., Lao, S.: Vector boosting for rotation invariant multi-view face detection. In: Computer Vision, 2005. ICCV 2005. Tenth IEEE International Conference on. Volume 1. (Oct 2005) 446–453
- Huang, C., Ai, H., Li, Y., Lao, S.: High-performance rotation invariant multiview face detection. Pattern Analysis and Machine Intelligence, IEEE Transactions on 29(4) (April 2007) 671–686
- Zhu, X., Ramanan, D.: Face detection, pose estimation, and landmark localization in the wild. In: Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, IEEE (2012) 2879–2886
- Li, H., Hua, G., Lin, Z., Brandt, J., Yang, J.: Probabilistic elastic part model for unsupervised face detector adaptation. In: The IEEE International Conference on Computer Vision (ICCV). (2013)
- Yang, S., Luo, P., Loy, C.C., Tang, X.: From facial parts responses to face detection: A deep learning approach. In: Proceedings of the IEEE International Conference on Computer Vision. (2015) 3676–3684
- Dollar, P., Appel, R., Belongie, S., Perona, P.: Fast feature pyramids for object detection. Pattern Analysis and Machine Intelligence, IEEE Transactions on 36(8) (Aug 2014) 1532–1545
- Yang, B., Yan, J., Lei, Z., Li, S.Z.: Convolutional channel features for pedestrian, face and edge detection. CoRR abs/1504.07339 (2015)
- Shen, X., Lin, Z., Brandt, J., Wu, Y.: Detecting and aligning faces by image retrieval. In: Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on. (June 2013) 3460–3467
- Farfade, S.S., Saberian, M.J., Li, L.J.: Multi-view face detection using deep convolutional neural networks. In: Proceedings of the 5th ACM on International Conference on Multimedia Retrieval. ICMR ’15, New York, NY, USA, ACM (2015) 643–650
- Li, H., Lin, Z., Shen, X., Brandt, J., Hua, G.: A convolutional neural network cascade for face detection. In: Computer Vision and Pattern Recognition (CVPR), 2015 IEEE Conference on. (June 2015) 5325–5334
- Chen, D., Ren, S., Wei, Y., Cao, X., Sun, J.: Joint cascade face detection and alignment. In: Proceedings of the European Conference on Computer Vision (ECCV). (2014)
- Girshick, R., Donahue, J., Darrell, T., Malik, J.: Rich feature hierarchies for accurate object detection and semantic segmentation. In: Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on. (June 2014) 580–587
- Bell, S., Zitnick, C.L., Bala, K., Girshick, R.: Inside-outside net: Detecting objects in context with skip pooling and recurrent neural networks. arXiv preprint arXiv:1512.04143 (2015)
- Jaderberg, M., Simonyan, K., Zisserman, A., et al.: Spatial transformer networks. In: Advances in Neural Information Processing Systems. (2015) 2008–2016
- Ren, S., He, K., Girshick, R., Sun, J.: Faster r-cnn: Towards real-time object detection with region proposal networks. In: Advances in Neural Information Processing Systems. (2015) 91–99
- Vanhoucke, V., Senior, A., Mao, M.Z.: Improving the speed of neural networks on cpus. In: Deep Learning and Unsupervised Feature Learning Workshop, NIPS 2011. (2011)
- Liu, B., Wang, M., Foroosh, H., Tappen, M., Pensky, M.: Sparse convolutional neural networks. In: Computer Vision and Pattern Recognition (CVPR), 2015 IEEE Conference on. (June 2015) 806–814
- Jaderberg, M., Vedaldi, A., Zisserman, A.: Speeding up convolutional neural networks with low rank expansions. In: Proceedings of the British Machine Vision Conference, BMVA Press (2014)
- Zhang, X., Zou, J., Ming, X., He, K., Sun, J.: Efficient and accurate approximations of nonlinear convolutional networks. In: Computer Vision and Pattern Recognition (CVPR), 2015 IEEE Conference on. (June 2015) 1984–1992
- Zhang, C., Zhang, Z.: Improving multiview face detection with multi-task deep convolutional neural networks. In: Applications of Computer Vision (WACV), 2014 IEEE Winter Conference on. (March 2014) 1036–1041
- Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., Darrell, T.: Caffe: Convolutional architecture for fast feature embedding. In: Proceedings of the ACM International Conference on Multimedia, ACM (2014) 675–678
- Ozuysal, M., Fua, P., Lepetit, V.: Fast keypoint recognition in ten lines of code. In: Computer Vision and Pattern Recognition, 2007. CVPR’07. IEEE Conference on, IEEE (2007) 1–8
- Chellapilla, K., Puri, S., Simard, P.: High performance convolutional neural networks for document processing. In: Tenth International Workshop on Frontiers in Handwriting Recognition, Suvisoft (2006)
- Jain, V., Learned-Miller, E.: Fddb: A benchmark for face detection in unconstrained settings. Technical Report UM-CS-2010-009, University of Massachusetts, Amherst (2010)
- Yan, J., Zhang, X., Lei, Z., Li, S.Z.: Face detection by structural models. Image and Vision Computing 32(10) (2014) 790–799
- Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollar, P., Zitnick, C.L.: Microsoft coco: Common objects in context. In: Computer Vision–ECCV 2014. Springer (2014) 740–755
- Schroff, F., Kalenichenko, D., Philbin, J.: Facenet: A unified embedding for face recognition and clustering. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2015) 815–823
- Wu, B., Ai, H., Huang, C., Lao, S.: Fast rotation invariant multi-view face detection based on real adaboost. In: Automatic Face and Gesture Recognition, 2004. Proceedings. Sixth IEEE International Conference on. (May 2004) 79–84
- Shen, X., Lin, Z., Brandt, J., Wu, Y.: Detecting and aligning faces by image retrieval. In: Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on. (June 2013) 3460–3467
- Li, H., Lin, Z., Brandt, J., Shen, X., Hua, G.: Efficient boosted exemplar-based face detection. In: Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on. (June 2014) 1843–1850
- Li, J., Zhang, Y.: Learning surf cascade for fast and accurate object detection. In: Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on. (June 2013) 3468–3475
- Jain, V., Learned-Miller, E.: Online domain adaptation of a pre-trained cascade of classifiers. In: Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, IEEE (2011) 577–584
- Subburaman, V.B., Marcel, S.: Fast bounding box estimation based face detection. In: ECCV, Workshop on Face Detection: Where we are, and what next? (2010)
- Mikolajczyk, K., Schmid, C., Zisserman, A.: Human detection based on a probabilistic assembly of robust part detectors. In: Computer Vision-ECCV 2004. Springer (2004) 69–82
- Yan, J., Lei, Z., Wen, L., Li, S.: The fastest deformable part model for object detection. In: Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on. (June 2014) 2497–2504
- Ranjan, R., Patel, V.M., Chellappa, R.: A deep pyramid deformable part model for face detection. In: Biometrics Theory, Applications and Systems (BTAS), 2015 IEEE 7th International Conference on, IEEE (2015) 1–8
- Farfade, S.S., Saberian, M.J., Li, L.J.: Multi-view face detection using deep convolutional neural networks. In: Proceedings of the 5th ACM on International Conference on Multimedia Retrieval, ACM (2015) 643–650