Scale-Aware Face Detection

CVPR, 2017.


Abstract:

Convolutional neural network (CNN) based face detectors are inefficient in handling faces of diverse scales. They rely on either fitting a large single model to faces across a large scale range or multi-scale testing. Both are computationally expensive. We propose Scale-aware Face Detector (SAFD) to handle scale explicitly using CNN, an...

Introduction
  • Face detection is one of the most widely used computer vision applications. Popular face detectors include the Viola-Jones framework [34] and its extensions, the part model [9] and its successors, and convolutional neural network (CNN) based approaches [33].
  • For CNN-based face detectors, variance in pose and appearance can be handled by the large capacity of the network.
  • The popularity of CNNs in computer vision largely comes from their translation invariance, which significantly reduces computation and model size compared to fully-connected neural networks.
  • Multi-scale testing leads to heavy computation cost.
  • Another way to handle scale variation is to fit a single CNN model to multiple scales.
  • This may lead to an increase in model size and computation.
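To make the cost of multi-scale testing concrete, here is a rough sketch. It assumes that the cost of a fully convolutional detector is proportional to the number of input pixels, so running at resize factor f costs about f² of one full-resolution pass. The function name, the pyramid range, and the example scale proposal are illustrative assumptions, not values from the paper.

```python
from typing import Iterable


def relative_cost(resize_factors: Iterable[float]) -> float:
    """Total cost of running a detector once per resize factor, relative to
    one pass at the original resolution (assumes cost ~ pixel count)."""
    total = 0.0
    for factor in resize_factors:
        if factor <= 0:
            raise ValueError("resize factors must be positive")
        total += factor * factor  # both image sides scale by `factor`
    return total


# Hypothetical pyramid: 0.125x to 2x in steps of sqrt(2).
pyramid = [2.0 ** (k / 2) for k in range(-6, 3)]
print(relative_cost(pyramid))   # close to 8x one full-resolution pass

# Hypothetical single scale proposal: shrink the image to half size.
print(relative_cost([0.5]))     # 0.25
```

Under these assumptions, most of the pyramid cost comes from the upsampled levels needed to find small faces, which is the cost a scale proposal step tries to avoid paying on every image.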
Highlights
  • Face detection is one of the most widely used computer vision applications.
  • For CNN-based face detectors, variance in pose and appearance can be handled by the large capacity of the network.
  • The popularity of CNNs in computer vision largely comes from their translation invariance, which significantly reduces computation and model size compared to fully-connected neural networks.
  • For scale invariance, however, CNNs face a limitation similar to the one fully-connected networks face for translation invariance.
  • Multi-scale testing leads to heavy computation cost.
  • We propose Scale-aware Face Detection (SAFD), a two-stage face detection pipeline.
Methods
  • The authors report the performance of the Scale Proposal Network (SPN) using Recall-Average Scale Proposals Per Image curves, as shown in Figure 7.
  • The authors benchmark the method on FDDB, MALF and AFW, following the evaluation procedure provided by each dataset.
  • The authors' method achieves the best performance on FDDB and the best accuracy in high-confidence regions on MALF.
  • The MALF dataset contains many challenging faces, with large face size diversity and a high proportion of small faces, which lowers the recall of the SPN and so reduces the maximum possible recall of the SAFD pipeline.
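A recall versus average-scale-proposals curve can be sketched as follows. This is a simplified illustration, not the authors' evaluation code: it assumes each image comes with a histogram of per-scale-bin confidences, that every bin at or above a threshold becomes a proposal, and that a face counts as recalled only when its own bin is proposed. Any smoothing, suppression, or tolerance between neighbouring bins is left out, and all names are hypothetical.

```python
from typing import List, Sequence, Tuple


def recall_vs_avg_proposals(
    histograms: Sequence[Sequence[float]],
    gt_bins: Sequence[Sequence[int]],
    thresholds: Sequence[float],
) -> List[Tuple[float, float, float]]:
    """Return (threshold, average proposals per image, recall) triples.

    histograms[i][b] is the confidence that image i contains a face in
    scale bin b; gt_bins[i] lists the scale bin of each face in image i.
    """
    if len(histograms) == 0 or len(histograms) != len(gt_bins):
        raise ValueError("need one histogram and one ground-truth list per image")
    total_faces = sum(len(bins) for bins in gt_bins)
    if total_faces == 0:
        raise ValueError("need at least one ground-truth face")

    curve = []
    for threshold in thresholds:
        proposed = 0
        recalled = 0
        for hist, bins in zip(histograms, gt_bins):
            # Every bin that clears the threshold is one scale proposal.
            selected = {b for b, conf in enumerate(hist) if conf >= threshold}
            proposed += len(selected)
            recalled += sum(1 for b in bins if b in selected)
        curve.append((threshold, proposed / len(histograms), recalled / total_faces))
    return curve
```

Sweeping the threshold from high to low traces the curve: more proposals per image (more resizes, more computation) in exchange for higher recall.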
Results
  • Evaluation of scale proposal stage: the authors first evaluate the performance of SPN separately from the whole pipeline.
  • Since the scale proposal stage and detection stage essentially form a cascaded structure, any face that is missed by this stage will not be recalled by the detector.
  • It is crucial to make sure that the scale proposal stage is not the performance bottleneck of the whole pipeline.
  • The authors expect a high recall from this stage while keeping average resizes per image low.
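The cascade argument can be written as a simple bound. The notation here is introduced only for illustration and is not the paper's: let F be the set of ground-truth faces, S ⊆ F the faces whose scale the proposal stage recovers (assumed non-empty), and D ⊆ S the faces the detector then finds.

```latex
\[
\mathrm{recall}_{\mathrm{pipeline}}
  = \frac{|D|}{|F|}
  = \frac{|S|}{|F|} \cdot \frac{|D|}{|S|}
  \le \frac{|S|}{|F|}
  = \mathrm{recall}_{\mathrm{SPN}}
\]
```

Because |D| ≤ |S|, the recall of the scale proposal stage is an upper limit on the recall of the whole pipeline, which is why that stage is evaluated on its own first.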
Conclusion
  • The authors proposed SAFD, a two-stage face detection pipeline.
  • It contains a scale proposal stage which automatically normalizes face sizes prior to detection.
  • This enables a computationally cheap single-scale face detector to handle large scale variation without computationally expensive multi-scale pyramid testing.
  • The SPN is designed to generate scale proposals.
  • The SPN can share convolution layers with the Region Proposal Network (RPN) to further reduce model size.
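The two-stage flow described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `resize` and `detect` are caller-supplied stand-ins for an image resampler and a single-scale detector, `target_face_size` is a hypothetical value for the face size the detector is tuned to, and merging overlapping detections across scales is omitted.

```python
from typing import Any, Callable, List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def resize_factors(proposed_face_sizes: Sequence[float],
                   target_face_size: float = 48.0) -> List[float]:
    """Factor that brings each proposed face size to the detector's size."""
    if any(size <= 0 for size in proposed_face_sizes):
        raise ValueError("proposed face sizes must be positive")
    return [target_face_size / size for size in proposed_face_sizes]


def map_boxes_to_original(boxes: Sequence[Box], factor: float) -> List[Box]:
    """Undo the resize so boxes are in original-image coordinates."""
    return [(x1 / factor, y1 / factor, x2 / factor, y2 / factor)
            for (x1, y1, x2, y2) in boxes]


def detect_at_proposed_scales(
    image: Any,
    proposed_face_sizes: Sequence[float],
    resize: Callable[[Any, float], Any],
    detect: Callable[[Any], Sequence[Box]],
    target_face_size: float = 48.0,
) -> List[Box]:
    """Run the single-scale detector once per scale proposal."""
    results: List[Box] = []
    for factor in resize_factors(proposed_face_sizes, target_face_size):
        resized = resize(image, factor)   # normalize face size
        boxes = detect(resized)           # single-scale detection
        results.extend(map_boxes_to_original(boxes, factor))
    return results
```

The detector runs only as many times as there are scale proposals, rather than once per level of a full image pyramid.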
Tables
  • Table 1: Architectures and computation analysis for the Scale Proposal Network (1/4 GoogleNet) and the Region Proposal Network (full GoogleNet). All the data assume an input size of 224 × 224 × 3. Batch Normalization layers are not shown and can be removed at test time. Auxiliary convolution layers are not shown for clarity.
  • Table 2: Comparison of Scale-aware RPN (SA-RPN), multi-scale testing RPN (MST-RPN) and standard single-shot multi-anchor RPN (RPN) on computation requirements. The reported data are the average result for a single image.
Related work
  • CNN-based face detection approaches emerged in the 1990s [33]. Some of their modules are still widely used, such as sliding windows, multi-scale testing, and a CNN-based classifier to distinguish faces from background. [31] shows that a CNN achieves good performance for frontal face detection, and [32] extends it to rotation-invariant face detection by training on faces of different poses. Despite their good performance, these detectors were too slow for the hardware of the time.

    One breakthrough in face detection is the Viola-Jones framework [34], which combines Haar features, AdaBoost, and a cascade. It became very popular due to its advantages in both speed and accuracy. Many works have improved on the Viola-Jones framework, in areas such as local features [41, 20, 36], boosting algorithms [40, 21, 11], cascade structure [2], and multi-pose detection [22, 17, 12].
Reference
  • [1] S. Bell, C. L. Zitnick, K. Bala, and R. Girshick. Inside-outside net: Detecting objects in context with skip pooling and recurrent neural networks. Computer Vision and Pattern Recognition (CVPR), 2016.
  • [2] L. Bourdev and J. Brandt. Robust object detection via soft cascade. In Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, volume 2, pages 236–243. IEEE, 2005.
  • [3] Z. Cai, Q. Fan, R. Feris, and N. Vasconcelos. A unified multi-scale deep convolutional neural network for fast object detection. In ECCV, 2016.
  • [4] D. Chen, G. Hua, F. Wen, and J. Sun. Supervised transformer network for efficient face detection. In European Conference on Computer Vision, pages 122–138. Springer, 2016.
  • [5] J. Dai, Y. Li, K. He, and J. Sun. R-fcn: Object detection via region-based fully convolutional networks. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 379–387. Curran Associates, Inc., 2016.
  • [6] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), volume 1, pages 886–893. IEEE, 2005.
  • [7] P. Dollar, R. Appel, S. Belongie, and P. Perona. Fast feature pyramids for object detection. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 36(8):1532–1545, 2014.
  • [8] S. S. Farfade, M. Saberian, and L.-J. Li. Multi-view face detection using deep convolutional neural networks. arXiv preprint arXiv:1502.02766, 2015.
  • [9] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part-based models. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 32(9):1627–1645, 2010.
  • [10] G. Ghiasi and C. C. Fowlkes. Occlusion coherence: Detecting and localizing occluded faces. arXiv preprint arXiv:1506.08347, 2015.
  • [11] C. Huang, H. Ai, Y. Li, and S. Lao. Vector boosting for rotation invariant multi-view face detection. In Tenth IEEE International Conference on Computer Vision (ICCV’05) Volume 1, volume 1, pages 446–453. IEEE, 2005.
  • [12] C. Huang, H. Ai, Y. Li, and S. Lao. High-performance rotation invariant multiview face detection. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 29(4):671–686, 2007.
  • [13] L. Huang, Y. Yang, Y. Deng, and Y. Yu. Densebox: Unifying landmark localization with end to end object detection. arXiv preprint arXiv:1509.04874, 2015.
  • [14] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
  • [15] V. Jain and E. Learned-Miller. Fddb: A benchmark for face detection in unconstrained settings. Technical Report UMCS-2010-009, University of Massachusetts, Amherst, 2010.
  • [16] V. Jain and E. G. Learned-Miller. Fddb: A benchmark for face detection in unconstrained settings. UMass Amherst Technical Report, 2010.
  • [17] M. Jones and P. Viola. Fast multi-view face detection. Mitsubishi Electric Research Lab TR-20003-96, 3:14, 2003.
  • [18] M. Kostinger, P. Wohlhart, P. M. Roth, and H. Bischof. Annotated facial landmarks in the wild: A large-scale, real-world database for facial landmark localization. In Computer Vision Workshops (ICCV Workshops), 2011 IEEE International Conference on, pages 2144–2151. IEEE, 2011.
  • [19] H. Li, Z. Lin, X. Shen, J. Brandt, and G. Hua. A convolutional neural network cascade for face detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5325–5334, 2015.
  • [20] J. Li and Y. Zhang. Learning surf cascade for fast and accurate object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3468–3475, 2013.
  • [21] S. Z. Li and Z. Zhang. Floatboost learning and statistical face detection. IEEE Transactions on pattern analysis and machine intelligence, 26(9):1112–1123, 2004.
  • [22] S. Z. Li, L. Zhu, Z. Zhang, A. Blake, H. Zhang, and H. Shum. Statistical learning of multi-view face detection. In ECCV 2002, pages 67–81. Springer, 2002.
  • [23] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft coco: Common objects in context. In European Conference on Computer Vision (ECCV), Zürich, 2014. Oral.
  • [24] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg. Ssd: Single shot multibox detector. In ECCV, 2016.
  • [25] M. Mathias, R. Benenson, M. Pedersoli, and L. Van Gool. Face detection without bells and whistles. In Computer Vision–ECCV 2014, pages 720–735. Springer, 2014.
  • [26] H. Qin, J. Yan, X. Li, and X. Hu. Joint training of cascaded cnn for face detection. In Computer Vision and Pattern Recognition (CVPR), 2016 IEEE Conference on, 2016.
  • [27] D. Ramanan. Face detection, pose estimation, and landmark localization in the wild. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), CVPR ’12, pages 2879–2886, Washington, DC, USA, 2012. IEEE Computer Society.
  • [28] R. Ranjan, V. M. Patel, and R. Chellappa. A deep pyramid deformable part model for face detection. In BTAS, pages 1–8. IEEE, 2015.
  • [29] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi. You only look once: Unified, real-time object detection. arXiv preprint arXiv:1506.02640, 2015.
  • [30] S. Ren, K. He, R. Girshick, and J. Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 91–99. Curran Associates, Inc., 2015.
  • [31] H. Rowley, S. Baluja, T. Kanade, et al. Neural network-based face detection. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 20(1):23–38, 1998.
  • [32] H. Rowley, S. Baluja, T. Kanade, et al. Rotation invariant neural network-based face detection. In Computer Vision and Pattern Recognition, 1998. Proceedings. 1998 IEEE Computer Society Conference on, pages 38–44. IEEE, 1998.
  • [33] R. Vaillant, C. Monrocq, and Y. Le Cun. Original approach for the localisation of objects in images. IEE Proceedings - Vision, Image and Signal Processing, 141(4):245–250, 1994.
  • [34] P. Viola and M. J. Jones. Robust real-time face detection. International journal of computer vision, 57(2):137–154, 2004.
  • [35] J. Yan, Z. Lei, L. Wen, and S. Li. The fastest deformable part model for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2497–2504, 2014.
  • [36] B. Yang, J. Yan, Z. Lei, and S. Z. Li. Aggregate channel features for multi-view face detection. In Biometrics (IJCB), 2014 IEEE International Joint Conference on, pages 1–8. IEEE, 2014.
  • [37] B. Yang, J. Yan, Z. Lei, and S. Z. Li. Convolutional channel features for pedestrian, face and edge detection. arXiv preprint arXiv:1504.07339, 2015.
  • [38] B. Yang, J. Yan, Z. Lei, and S. Z. Li. Fine-grained evaluation on face detection in the wild. In Automatic Face and Gesture Recognition (FG), 11th IEEE International Conference on. IEEE, 2015.
  • [39] S. Yang, P. Luo, C.-C. Loy, and X. Tang. From facial parts responses to face detection: A deep learning approach. In Proceedings of International Conference on Computer Vision (ICCV), 2015.
  • [40] C. Zhang, J. C. Platt, and P. A. Viola. Multiple instance boosting for object detection. In Advances in neural information processing systems, pages 1417–1424, 2005.
  • [41] L. Zhang, R. Chu, S. Xiang, S. Liao, and S. Z. Li. Face detection based on multi-block lbp representation. In Advances in biometrics, pages 11–18. Springer, 2007.
  • [42] X. Zhu and D. Ramanan. Face detection, pose estimation, and landmark localization in the wild. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 2879–2886. IEEE, 2012.