Joint Training of Cascaded CNN for Face Detection
2016 IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), pp. 3456-3465, 2016.
Abstract:
Cascades have been widely used in face detection, where a classifier with low computational cost is first used to reject most of the background while preserving recall. The cascade in detection was popularized by the seminal Viola-Jones framework and has since been widely used in other pipelines, such as DPM and CNN. However, to the best of our knowledge, mos...
Introduction
- Face detection plays an important role in face-based image analysis and is one of the fundamental problems in computer vision.
- Recent work in face detection focuses on faces in uncontrolled settings, which is challenging because of variations at the subject, category, and image levels.
- The number of detected faces N varies from image to image.
- Considering that each bounding box b_i can appear at any scale and position, the face detection problem has an output space of size (w∗h)^2/2.
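To give a sense of scale for this output space, the sketch below counts candidate boxes under one simple assumption that is not spelled out in this summary: a candidate box is identified with an unordered pair of distinct pixel positions in a w×h image, used as opposite corners. Pairs that share a row or column give degenerate boxes, so the number is a rough upper bound rather than an exact count. The function name `box_output_space` is hypothetical.

```python
from math import comb


def box_output_space(w: int, h: int) -> int:
    """Count unordered pairs of distinct pixels in a w x h image.

    Assumption (illustration only): each pair of pixels is treated as the
    two opposite corners of one candidate bounding box.
    """
    return comb(w * h, 2)


if __name__ == "__main__":
    w, h = 640, 480
    exact = box_output_space(w, h)
    approx = (w * h) ** 2 // 2  # leading term of n * (n - 1) / 2 with n = w * h
    print(f"pairs of pixels: {exact}")
    print(f"(w*h)^2 / 2:     {approx}")
```

For n = w∗h pixels the exact pair count is n(n−1)/2, which differs from n^2/2 only by n/2, so the quadratic term dominates for any realistic image size.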
Highlights
- Face detection plays an important role in face-based image analysis and is one of the fundamental problems in computer vision.
- We show that the back-propagation algorithm used to train CNNs can be naturally used to train a CNN cascade.
- To train the joint cascaded CNNs, we use Annotated Facial Landmarks in the Wild (AFLW) [15] and our own dataset, called 3R. 3R contains about 26000 images that have faces and 27000 images that have no faces.
- We have presented joint training as a novel way of training cascaded CNNs.
- We evaluate joint training on face detection datasets.
- Joint training extends to general cascaded CNNs; as an example, we show how to jointly train a region proposal network (RPN) and Fast R-CNN.
Methods
- The authors carry out experiments on a face detection dataset to evaluate the joint training pipeline.
- To train the joint cascaded CNNs, the authors use Annotated Facial Landmarks in the Wild (AFLW) [15] and a dataset called 3R.
- Images from PASCAL VOC2012 [4] that do not contain persons are used as background images.
- The dataset contains 47211 images with 82987 faces and about 32000 background images.
Results
- Joint training achieves a recall of 88.2% at 1000 false positives, which is comparable to the state of the art.
- This is better than the Cascaded CNN result (85.7%) reported in [16].
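A figure such as "recall at 1000 false positives" can be read as follows: rank all detections by score, walk down the ranking until the false-positive budget is used up, and report the fraction of annotated faces found by then. The sketch below is a minimal illustration, assuming each detection has already been matched against the ground truth (the matching rule and the benchmark's exact protocol are not described in this summary). The helper name `recall_at_false_positives` and the toy data are hypothetical.

```python
def recall_at_false_positives(detections, num_faces, max_false_positives):
    """Recall reached before the false-positive budget is exceeded.

    detections: iterable of (score, is_true_positive) pairs, one per detection.
    num_faces: total number of annotated faces in the evaluation set.
    max_false_positives: number of false positives that may be accepted.
    """
    if num_faces <= 0:
        raise ValueError("num_faces must be positive")
    true_positives = 0
    false_positives = 0
    # Visit detections from the most to the least confident.
    for _score, is_true_positive in sorted(
        detections, key=lambda d: d[0], reverse=True
    ):
        if is_true_positive:
            true_positives += 1
        else:
            if false_positives == max_false_positives:
                break  # accepting this detection would exceed the budget
            false_positives += 1
    return true_positives / num_faces


if __name__ == "__main__":
    # Toy ranking: (score, matched a ground-truth face?)
    toy = [(0.9, True), (0.8, False), (0.7, True), (0.6, False), (0.5, True)]
    for budget in (0, 1, 2):
        print(budget, recall_at_false_positives(toy, num_faces=4, max_false_positives=budget))
```

Sweeping the budget from zero upward traces one point of the recall curve per value, which is how two detectors can be compared at the same number of false positives.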
Conclusion
- The authors have presented joint training as a novel way of training cascaded CNNs. With joint training, a CNN cascade can be optimized end to end.
- By jointly optimizing the cascaded stages, the whole network achieves improved performance with a smaller model, because the stages share convolutions.
- The authors evaluate joint training on face detection datasets.
- Joint training extends to general cascaded CNNs, and the authors show how to jointly train RPN and Fast R-CNN as an example.
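The idea of optimizing all stages end to end through shared computation can be illustrated with a deliberately tiny model. The sketch below is not the authors' network: it assumes a single scalar "shared feature" in place of shared convolutions, two logistic "stages" in place of CNN branches, and a joint loss equal to the plain sum of the per-stage cross-entropy losses. All names (`joint_loss_and_grads`, `train`) and the toy data are hypothetical.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def bce(p: float, y: int) -> float:
    """Binary cross-entropy; eps only guards against log(0)."""
    eps = 1e-12
    return -(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps))


def joint_loss_and_grads(w, stage_weights, x, y):
    """Joint loss of all stages and its gradients for one sample.

    w is the shared parameter (stand-in for shared convolutions);
    stage_weights holds one parameter per cascade stage.
    """
    shared = w * x  # feature computed once and reused by every stage
    loss, grad_w, grad_stages = 0.0, 0.0, []
    for a in stage_weights:
        p = sigmoid(a * shared)
        loss += bce(p, y)
        d_logit = p - y  # derivative of the cross-entropy w.r.t. the logit
        grad_stages.append(d_logit * shared)
        grad_w += d_logit * a * x  # every stage contributes to the shared gradient
    return loss, grad_w, grad_stages


def train(data, w=0.5, stage_weights=(0.5, 0.5), lr=0.1, epochs=50):
    """Plain per-sample gradient descent on the joint loss."""
    stage_weights = list(stage_weights)
    history = []
    for _ in range(epochs):
        total = 0.0
        for x, y in data:
            loss, grad_w, grad_stages = joint_loss_and_grads(w, stage_weights, x, y)
            total += loss
            w -= lr * grad_w
            stage_weights = [a - lr * g for a, g in zip(stage_weights, grad_stages)]
        history.append(total / len(data))
    return w, stage_weights, history


if __name__ == "__main__":
    toy_data = [(-2.0, 0), (-1.0, 0), (1.0, 1), (2.0, 1)]
    _, _, history = train(toy_data)
    print(f"mean joint loss: first epoch {history[0]:.4f}, last epoch {history[-1]:.4f}")
```

The point of the sketch is the line that accumulates `grad_w`: because the stages read the same feature, one backward pass sums their gradients into the shared parameter, which is what distinguishes joint training from training each stage separately.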
Tables
- Table 1: Comparison of training methods for RPN + F-RCNN
Related work
- Numerous works have been proposed for face detection, and some of them have been deployed in real applications. As in many other computer vision tasks, the leading face detection algorithms were based on convolutional neural networks in the 1990s, then on hand-crafted features and models, and recently on convolutional neural networks again. In this part, we briefly review the three kinds of methods and refer the reader to the surveys [33, 37, 35] for more detail.
2.1. Early CNN based methods
Face detection and MNIST OCR recognition are two tasks where CNN-based approaches achieved success in the 1990s. In [26], a CNN is applied in a sliding-window manner to traverse different locations and scales and to separate faces from the background. In [22], a CNN is used for frontal face detection and shows quite good performance. In [23], CNNs trained on faces in different poses are used for rotation-invariant face detection. These methods are quite similar to modern CNN methods and achieve relatively good performance on easy datasets.
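The sliding-window scheme mentioned above can be sketched as a generator that enumerates square windows over all positions and a geometric series of sizes; each window would then be cropped, resized, and passed to a face/background classifier. The window size, stride, and scale step below are illustrative assumptions rather than values taken from the cited works, and the name `sliding_windows` is hypothetical.

```python
def sliding_windows(width, height, window=12, stride=4, scale_step=2.0):
    """Yield (x, y, side) square windows in image coordinates.

    The window side starts at `window` pixels and is multiplied by
    `scale_step` until it no longer fits in the image. The stride grows
    in proportion to the side, which mirrors scanning a fixed-size
    window over a progressively downscaled image.
    """
    if scale_step <= 1.0:
        raise ValueError("scale_step must be greater than 1")
    stride_ratio = stride / window
    size = float(window)
    while size <= min(width, height):
        side = int(round(size))
        step = max(1, int(round(side * stride_ratio)))
        for y in range(0, height - side + 1, step):
            for x in range(0, width - side + 1, step):
                yield x, y, side
        size *= scale_step


if __name__ == "__main__":
    windows = list(sliding_windows(64, 48))
    print(f"{len(windows)} candidate windows")
    print("first:", windows[0], "last:", windows[-1])
```

Even this coarse grid produces many candidates per image, which is why cascades place a cheap classifier first to reject most background windows before more expensive stages run.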
Funding
- This work was partly supported by National Natural Science Foundation of China (Grant No 71171121), National 863 High Technology Research and Development Program of China (Grant No 2012AA09A408), and Shenzhen Science and Technology Project (Grant No JCYJ20151117173236192 and No CXZZ20140902110505864)
Reference
- [1] L. Bourdev and J. Brandt. Robust object detection via soft cascade. In Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, volume 2, pages 236–243. IEEE, 2005.
- [2] D. Chen, S. Ren, Y. Wei, X. Cao, and J. Sun. Joint cascade face detection and alignment. In Computer Vision–ECCV 2014, pages 109–122. Springer, 2014.
- [3] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 248–255. IEEE, 2009.
- [4] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2012 (VOC2012) Results. http://www.pascal-network.org/challenges/VOC/voc2012/workshop/index.html.
- [5] S. S. Farfade, M. Saberian, and L.-J. Li. Multi-view face detection using deep convolutional neural networks. arXiv preprint arXiv:1502.02766, 2015.
- [6] P. F. Felzenszwalb, R. B. Girshick, and D. McAllester. Cascade object detection with deformable part models. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pages 2241–2248. IEEE, 2010.
- [7] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part-based models. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 32(9):1627–1645, 2010.
- [8] G. Ghiasi and C. C. Fowlkes. Occlusion coherence: Detecting and localizing occluded faces. arXiv preprint arXiv:1506.08347, 2015.
- [9] R. Girshick. Fast r-cnn. arXiv preprint arXiv:1504.08083, 2015.
- [10] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on, pages 580–587. IEEE, 2014.
- [11] C. Huang, H. Ai, Y. Li, and S. Lao. High-performance rotation invariant multiview face detection. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 29(4):671–686, 2007.
- [12] L. Huang, Y. Yang, Y. Deng, and Y. Yu. Densebox: Unifying landmark localization with end to end object detection. arXiv preprint arXiv:1509.04874, 2015.
- [13] V. Jain and E. G. Learned-Miller. Fddb: A benchmark for face detection in unconstrained settings. UMass Amherst Technical Report, 2010.
- [14] M. Jones and P. Viola. Fast multi-view face detection. Mitsubishi Electric Research Lab TR-20003-96, 3:14, 2003.
- [15] M. Kostinger, P. Wohlhart, P. M. Roth, and H. Bischof. Annotated facial landmarks in the wild: A large-scale, real-world database for facial landmark localization. In Computer Vision Workshops (ICCV Workshops), 2011 IEEE International Conference on, pages 2144–2151. IEEE, 2011.
- [16] H. Li, Z. Lin, X. Shen, J. Brandt, and G. Hua. A convolutional neural network cascade for face detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5325–5334, 2015.
- [17] S. Z. Li, L. Zhu, Z. Zhang, A. Blake, H. Zhang, and H. Shum. Statistical learning of multi-view face detection. In Computer Vision–ECCV 2002, pages 67–81. Springer, 2002.
- [18] M. Mathias, R. Benenson, M. Pedersoli, and L. Van Gool. Face detection without bells and whistles. In Computer Vision–ECCV 2014, pages 720–735. Springer, 2014.
- [19] E. Osuna, R. Freund, and F. Girosi. Training support vector machines: an application to face detection. In Computer Vision and Pattern Recognition, 1997. Proceedings., 1997 IEEE Computer Society Conference on, pages 130–136. IEEE, 1997.
- [20] R. Ranjan, V. M. Patel, and R. Chellappa. A deep pyramid deformable part model for face detection. arXiv preprint arXiv:1508.04389, 2015.
- [21] S. Ren, K. He, R. Girshick, and J. Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. arXiv preprint arXiv:1506.01497, 2015.
- [22] H. Rowley, S. Baluja, T. Kanade, et al. Neural network-based face detection. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 20(1):23–38, 1998.
- [23] H. Rowley, S. Baluja, T. Kanade, et al. Rotation invariant neural network-based face detection. In Computer Vision and Pattern Recognition, 1998. Proceedings. 1998 IEEE Computer Society Conference on, pages 38–44. IEEE, 1998.
- [24] H. Schneiderman and T. Kanade. A statistical method for 3d object detection applied to faces and cars. In Computer Vision and Pattern Recognition, 2000. Proceedings. IEEE Conference on, volume 1, pages 746–751. IEEE, 2000.
- [25] X. Shen, Z. Lin, J. Brandt, and Y. Wu. Detecting and aligning faces by image retrieval. In Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on, pages 3460–3467. IEEE, 2013.
- [26] R. Vaillant, C. Monrocq, and Y. Le Cun. Original approach for the localisation of objects in images. IEE Proceedings-Vision, Image and Signal Processing, 141(4):245–250, 1994.
- [27] P. Viola and M. J. Jones. Robust real-time face detection. International Journal of Computer Vision, 57(2):137–154, 2004.
- [28] R. Xiao, L. Zhu, and H.-J. Zhang. Boosting chain learning for object detection. In Computer Vision, 2003. Proceedings. Ninth IEEE International Conference on, pages 709–715. IEEE, 2003.
- [29] J. Yan, Z. Lei, L. Wen, and S. Z. Li. The fastest deformable part model for object detection. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on, pages 2497–2504. IEEE, 2014.
- [30] J. Yan, X. Zhang, Z. Lei, and S. Z. Li. Face detection by structural models. Image and Vision Computing, 32(10):790–799, 2014.
- [31] B. Yang, J. Yan, Z. Lei, and S. Z. Li. Aggregate channel features for multi-view face detection. In Biometrics (IJCB), 2014 IEEE International Joint Conference on, pages 1–8. IEEE, 2014.
- [32] B. Yang, J. Yan, Z. Lei, and S. Z. Li. Convolutional channel features for pedestrian, face and edge detection. arXiv preprint arXiv:1504.07339, 2015.
- [33] M.-H. Yang, D. J. Kriegman, and N. Ahuja. Detecting faces in images: A survey. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 24(1):34–58, 2002.
- [34] S. Yang, P. Luo, C. C. Loy, and X. Tang. From facial parts responses to face detection: A deep learning approach. arXiv preprint arXiv:1509.06451, 2015.
- [35] S. Zafeiriou, C. Zhang, and Z. Zhang. A survey on face detection in the wild: past, present and future. Computer Vision and Image Understanding, 2015.
- [36] C. Zhang, J. C. Platt, and P. A. Viola. Multiple instance boosting for object detection. In Advances in Neural Information Processing Systems, pages 1417–1424, 2005.
- [37] C. Zhang and Z. Zhang. A survey of recent advances in face detection. Technical report, Microsoft Research, 2010.
- [38] L. Zhang, R. Chu, S. Xiang, S. Liao, and S. Z. Li. Face detection based on multi-block lbp representation. In Advances in Biometrics, pages 11–18. Springer, 2007.
- [39] X. Zhu and D. Ramanan. Face detection, pose estimation, and landmark localization in the wild. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 2879–2886. IEEE, 2012.