Scale-Aware Face Detection
CVPR, 2017.
Abstract:
Convolutional neural network (CNN) based face detectors are inefficient in handling faces of diverse scales. They rely on either fitting a large single model to faces across a large scale range or multi-scale testing. Both are computationally expensive. We propose Scale-aware Face Detector (SAFD) to handle scale explicitly using CNN, an...
Introduction
- Face detection is one of the most widely used computer vision applications. Popular face detectors have been proposed, including the Viola-Jones detector [34] and its extensions, the part model [9] and its successors, and the convolutional neural network (CNN) based approaches [33].
- For CNN-based face detectors, the variance in pose and appearance can be handled by the large capacity of convolutional neural networks.
- The popularity of CNNs in the computer vision domain largely comes from their translation invariance property, which significantly reduces computation and model size compared to fully-connected neural networks.
- Multi-scale testing leads to heavy computation cost.
- Another way to handle scale variation is to fit a single CNN model to multiple scales.
- This may lead to an increase in model size and computation.
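The cost of multi-scale testing noted above can be illustrated with simple arithmetic. The sketch below assumes a fully convolutional detector whose cost grows with the number of input pixels, so a pass at scale factor s costs roughly s² times a pass at the original resolution. The pyramid levels and the single 0.5x resize are hypothetical values chosen for illustration, not figures from the paper.

```python
def relative_cost(scales):
    """Cost of running a detector once per scale, relative to one pass at the
    original resolution, assuming cost is proportional to pixel count (s * s)."""
    return sum(s * s for s in scales)

# Hypothetical pyramid: half-octave steps from 2x upsampling down to 1/4 size.
pyramid = [2.0 ** (k / 2) for k in range(2, -5, -1)]
multi_scale_cost = relative_cost(pyramid)

# Hypothetical scale-aware case: only one resize (0.5x) is needed.
single_resize_cost = relative_cost([0.5])

print(f"pyramid levels: {len(pyramid)}")
print(f"multi-scale cost: {multi_scale_cost:.4f}x")
print(f"single-resize cost: {single_resize_cost:.4f}x")
```

Under these assumptions the cost is dominated by the upsampled levels, which is why avoiding unnecessary scales matters most when small faces force the image to be enlarged.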
Highlights
- Face detection is one of the most widely used computer vision applications
- For convolutional neural network-based face detectors, the variance in pose and appearance can be handled by the large capacity of convolutional neural networks
- The popularity of convolutional neural networks in the computer vision domain largely comes from their translation invariance property, which significantly reduces computation and model size compared to fully-connected neural networks
- With respect to scale invariance, convolutional neural networks face a limitation similar to the one fully-connected networks face with respect to translation invariance
- Multi-scale testing leads to heavy computation cost
- The authors propose Scale-aware Face Detection (SAFD), a two-stage face detection pipeline
Methods
- The authors report the performance of SPN using Recall-Average Scale Proposals Per Image curves, as shown in Figure 7.
- The authors benchmark the method on FDDB, MALF and AFW following the evaluation procedure provided by each dataset.
- The authors' method achieves the best performance on FDDB and the best accuracy in high-confidence regions on MALF.
- The MALF dataset contains many challenging faces, with large face size diversity and a high proportion of small faces, which affect the recall rate of SPN and reduce the maximum possible recall of the SAFD pipeline.
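One point on a recall versus average scale proposals per image curve could be computed roughly as follows. This is a minimal sketch, not the authors' evaluation code: the input layout, the rule that a face counts as recalled when a kept proposal lies within `tolerance_octaves` of its true size, and the toy data are all assumptions made for illustration.

```python
import math

def recall_and_avg_proposals(images, score_threshold, tolerance_octaves=0.5):
    """images: list of (proposals, face_sizes) pairs, where proposals is a list
    of (face_size_in_pixels, score) and face_sizes lists ground-truth sizes.
    Returns (recall, average number of kept proposals per image)."""
    recalled = total_faces = total_proposals = 0
    for proposals, face_sizes in images:
        kept = [size for size, score in proposals if score >= score_threshold]
        total_proposals += len(kept)
        for gt in face_sizes:
            total_faces += 1
            # Assumed matching rule: proposal within a log2 distance of the true size.
            if any(abs(math.log2(p / gt)) <= tolerance_octaves for p in kept):
                recalled += 1
    recall = recalled / total_faces if total_faces else 0.0
    avg = total_proposals / len(images) if images else 0.0
    return recall, avg

# Toy data (hypothetical): two images, four faces in total.
toy_images = [
    ([(32, 0.9), (128, 0.4)], [30, 120]),
    ([(64, 0.8)], [60, 200]),
]
strict = recall_and_avg_proposals(toy_images, score_threshold=0.5)
loose = recall_and_avg_proposals(toy_images, score_threshold=0.3)
print(strict, loose)
```

Sweeping `score_threshold` traces the curve: lowering it keeps more proposals per image and can only raise recall.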
Results
- Evaluation of scale proposal stage.
- The authors first evaluate the performance of SPN separately from the whole pipeline.
- Since the scale proposal stage and detection stage essentially form a cascaded structure, any face that is missed by this stage will not be recalled by the detector.
- It is crucial to make sure that the scale proposal stage is not the performance bottleneck of the whole pipeline.
- The authors expect a high recall from this stage while keeping the average number of resizes per image low.
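The cascade argument above can be written as a simple bound. As a sketch, let A be the event that the scale proposal stage proposes a usable scale for a given face and D the event that the face is finally detected. These symbols are introduced here for illustration and are not the paper's notation. Assuming the detector runs only at proposed scales, D implies A, so:

```latex
\[
R_{\text{pipeline}} = P(D) = P(A)\,P(D \mid A)
  = R_{\text{SPN}} \cdot P(D \mid A) \le R_{\text{SPN}}
\]
```

The recall of the scale proposal stage is therefore an upper limit on the recall of the whole pipeline, whatever the quality of the detector.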
Conclusion
- The authors proposed SAFD, a two-stage face detection pipeline.
- It contains a scale proposal stage which automatically normalizes face sizes prior to detection.
- This enables a computationally cheap single-scale face detector to handle large scale variation without using computationally expensive multi-scale pyramid testing.
- The SPN is designed to generate scale proposals.
- The SPN can share convolution layers with the RPN to further reduce model size.
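The two-stage flow summarized above can be sketched in a few lines. Everything here is hypothetical scaffolding rather than the authors' implementation: `propose_scales`, `resize` and `detect` are placeholder callables, `canonical_size=64` is an assumed target face size, and the stub functions only exist to make the example runnable.

```python
def scale_aware_detect(image, propose_scales, resize, detect, canonical_size=64):
    """Run a single-scale detector only at the scales a proposal stage suggests.

    propose_scales(image) -> estimated face sizes in pixels (hypothetical)
    resize(image, factor) -> resized image (hypothetical)
    detect(image) -> (x, y, w, h, score) boxes in resized coordinates (hypothetical)
    """
    detections = []
    for face_size in propose_scales(image):
        # Resize so that faces of the proposed size reach the canonical size.
        factor = canonical_size / face_size
        resized = resize(image, factor)
        for x, y, w, h, score in detect(resized):
            # Map each box back to the coordinates of the original image.
            detections.append((x / factor, y / factor, w / factor, h / factor, score))
    return detections

# Stub components standing in for the real networks (assumptions for the demo).
def stub_propose(image):
    return [128]  # one proposed face size, in pixels

def stub_resize(image, factor):
    return {"w": image["w"] * factor, "h": image["h"] * factor}

def stub_detect(image):
    return [(10, 20, 64, 64, 0.9)]  # one box in resized coordinates

boxes = scale_aware_detect({"w": 640, "h": 480}, stub_propose, stub_resize, stub_detect)
print(boxes)
```

A full pipeline would also merge overlapping boxes found at different proposed scales, for example with non-maximum suppression, which this sketch leaves out.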
Tables
- Table 1: Architectures and computation analysis for Scale Proposal Network (1/4 GoogleNet) and Region Proposal Network (full GoogleNet). All the data assume an input size of 224 × 224 × 3. Batch Normalization layers are not shown and can be removed at test time. Auxiliary convolution layers are not shown for clarity.
- Table 2: Comparison of Scale-aware RPN (SA-RPN), multi-scale testing RPN (MST-RPN) and standard single-shot multi-anchor RPN (RPN) on computation requirements. The reported data are the average result for a single image.
Related work
- The CNN-based face detection approaches emerged in the 1990s [33]. Some of their modules are still widely used, such as the sliding window, multi-scale testing and the CNN-based classifier that distinguishes faces from background. [31] shows that a CNN achieves good performance for frontal face detection, and [32] extends it to rotation-invariant face detection by training on faces of different poses. Despite their good performance, these detectors were too slow for the hardware of the time.
One breakthrough in face detection is the Viola-Jones framework [34], which combines Haar features, AdaBoost and a cascade structure. It became very popular due to its advantages in both speed and accuracy. Many works have improved the Viola-Jones framework further, in areas such as local features [41, 20, 36], boosting algorithms [40, 21, 11], cascade structure [2] and multi-pose detection [22, 17, 12].
Reference
- S. Bell, C. L. Zitnick, K. Bala, and R. Girshick. Inside-outside net: Detecting objects in context with skip pooling and recurrent neural networks. Computer Vision and Pattern Recognition (CVPR), 2016. 2
- L. Bourdev and J. Brandt. Robust object detection via soft cascade. In Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, volume 2, pages 236–243. IEEE, 2005. 2
- Z. Cai, Q. Fan, R. Feris, and N. Vasconcelos. A unified multi-scale deep convolutional neural network for fast object detection. In ECCV, 2016. 2
- D. Chen, G. Hua, F. Wen, and J. Sun. Supervised transformer network for efficient face detection. In European Conference on Computer Vision, pages 122–138. Springer, 2016. 1, 2
- J. Dai, Y. Li, K. He, and J. Sun. R-fcn: Object detection via region-based fully convolutional networks. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 379–387. Curran Associates, Inc., 2016. 2
- N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), volume 1, pages 886–893. IEEE, 2005. 2
- P. Dollar, R. Appel, S. Belongie, and P. Perona. Fast feature pyramids for object detection. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 36(8):1532–1545, 2014. 2
- S. S. Farfade, M. Saberian, and L.-J. Li. Multi-view face detection using deep convolutional neural networks. arXiv preprint arXiv:1502.02766, 2015. 2
- P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part-based models. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 32(9):1627–1645, 2010. 1, 2
- G. Ghiasi and C. C. Fowlkes. Occlusion coherence: Detecting and localizing occluded faces. arXiv preprint arXiv:1506.08347, 2015. 2
- C. Huang, H. Ai, Y. Li, and S. Lao. Vector boosting for rotation invariant multi-view face detection. In Tenth IEEE International Conference on Computer Vision (ICCV’05) Volume 1, volume 1, pages 446–453. IEEE, 2005. 2
- C. Huang, H. Ai, Y. Li, and S. Lao. High-performance rotation invariant multiview face detection. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 29(4):671–686, 2007. 2
- L. Huang, Y. Yang, Y. Deng, and Y. Yu. Densebox: Unifying landmark localization with end to end object detection. arXiv preprint arXiv:1509.04874, 2015. 1, 2
- S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015. 6
- V. Jain and E. Learned-Miller. Fddb: A benchmark for face detection in unconstrained settings. Technical Report UMCS-2010-009, University of Massachusetts, Amherst, 2010. 6
- V. Jain and E. G. Learned-Miller. Fddb: A benchmark for face detection in unconstrained settings. UMass Amherst Technical Report, 2010. 2
- M. Jones and P. Viola. Fast multi-view face detection. Mitsubishi Electric Research Lab TR-20003-96, 3:14, 2003. 2
- M. Kostinger, P. Wohlhart, P. M. Roth, and H. Bischof. Annotated facial landmarks in the wild: A large-scale, real-world database for facial landmark localization. In Computer Vision Workshops (ICCV Workshops), 2011 IEEE International Conference on, pages 2144–2151. IEEE, 2011. 6
- H. Li, Z. Lin, X. Shen, J. Brandt, and G. Hua. A convolutional neural network cascade for face detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5325–5334, 2015. 2
- J. Li and Y. Zhang. Learning surf cascade for fast and accurate object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3468–3475, 2013. 2
- S. Z. Li and Z. Zhang. Floatboost learning and statistical face detection. IEEE Transactions on pattern analysis and machine intelligence, 26(9):1112–1123, 2004. 2
- S. Z. Li, L. Zhu, Z. Zhang, A. Blake, H. Zhang, and H. Shum. Statistical learning of multi-view face detection. In ECCV 2002, pages 67–81. Springer, 2002. 2
- T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft coco: Common objects in context. In European Conference on Computer Vision (ECCV), Zürich, 2014. Oral. 6
- W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg. Ssd: Single shot multibox detector. In ECCV, 2016. 2
- M. Mathias, R. Benenson, M. Pedersoli, and L. Van Gool. Face detection without bells and whistles. In Computer Vision–ECCV 2014, pages 720–735. Springer, 2014. 2
- H. Qin, J. Yan, X. Li, and X. Hu. Joint training of cascaded cnn for face detection. In Computer Vision and Pattern Recognition (CVPR), 2016 IEEE Conference on, 2016. 2
- D. Ramanan. Face detection, pose estimation, and landmark localization in the wild. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), CVPR ’12, pages 2879–2886, Washington, DC, USA, 2012. IEEE Computer Society. 6
- R. Ranjan, V. M. Patel, and R. Chellappa. A deep pyramid deformable part model for face detection. In BTAS, pages 1–8. IEEE, 2015. 2
- J. Redmon, S. Divvala, R. Girshick, and A. Farhadi. You only look once: Unified, real-time object detection. arXiv preprint arXiv:1506.02640, 2015. 2
- S. Ren, K. He, R. Girshick, and J. Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 91–99. Curran Associates, Inc., 2015. 2, 5
- H. Rowley, S. Baluja, T. Kanade, et al. Neural network-based face detection. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 20(1):23–38, 1998. 2
- H. Rowley, S. Baluja, T. Kanade, et al. Rotation invariant neural network-based face detection. In Computer Vision and Pattern Recognition, 1998. Proceedings. 1998 IEEE Computer Society Conference on, pages 38–44. IEEE, 1998. 2
- R. Vaillant, C. Monrocq, and Y. Le Cun. Original approach for the localisation of objects in images. IEE Proceedings - Vision, Image and Signal Processing, 141(4):245–250, 1994. 1, 2
- P. Viola and M. J. Jones. Robust real-time face detection. International journal of computer vision, 57(2):137–154, 2004. 1, 2
- J. Yan, Z. Lei, L. Wen, and S. Li. The fastest deformable part model for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2497–2504, 2014. 2
- B. Yang, J. Yan, Z. Lei, and S. Z. Li. Aggregate channel features for multi-view face detection. In Biometrics (IJCB), 2014 IEEE International Joint Conference on, pages 1–8. IEEE, 2014. 2
- B. Yang, J. Yan, Z. Lei, and S. Z. Li. Convolutional channel features for pedestrian, face and edge detection. arXiv preprint arXiv:1504.07339, 2015. 2
- B. Yang, J. Yan, Z. Lei, and S. Z. Li. Fine-grained evaluation on face detection in the wild. In Automatic Face and Gesture Recognition (FG), 11th IEEE International Conference on. IEEE, 2015. 6
- S. Yang, P. Luo, C.-C. Loy, and X. Tang. From facial parts responses to face detection: A deep learning approach. In Proceedings of International Conference on Computer Vision (ICCV), 2015. 1, 2
- C. Zhang, J. C. Platt, and P. A. Viola. Multiple instance boosting for object detection. In Advances in neural information processing systems, pages 1417–1424, 2005. 2
- L. Zhang, R. Chu, S. Xiang, S. Liao, and S. Z. Li. Face detection based on multi-block lbp representation. In Advances in biometrics, pages 11–18. Springer, 2007. 2
- X. Zhu and D. Ramanan. Face detection, pose estimation, and landmark localization in the wild. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 2879–2886. IEEE, 2012. 2