# FaceNet: A Unified Embedding for Face Recognition and Clustering

IEEE Conference on Computer Vision and Pattern Recognition, 2015.

Abstract:

Despite significant recent advances in the field of face recognition, implementing face verification and recognition efficiently at scale presents serious challenges to current approaches. In this paper we present a system, called FaceNet, that directly learns a mapping from face images to a compact Euclidean space where distances directly correspond to a measure of face similarity. Once this space has been produced, tasks such as face recognition, verification, and clustering can be easily implemented using standard techniques with FaceNet embeddings as feature vectors.


Introduction

- In this paper the authors present a unified system for face verification, recognition, and clustering.
- The network is trained such that the squared L2 distances in the embedding space directly correspond to face similarity: faces of the same person have small distances and faces of distinct people have large distances.
- Once this embedding has been produced, the aforementioned tasks become straightforward: face verification involves thresholding the distance between two embeddings; recognition becomes a k-NN classification problem; and clustering can be achieved with off-the-shelf techniques such as k-means or agglomerative clustering.
- Some recent work [15] has reduced this dimensionality using PCA, but this is a linear transformation that can be learnt in one layer of the network.
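The downstream tasks above can be sketched in a few lines of numpy. This is an illustrative sketch, not the paper's code: `verify`, `recognize`, and the distance threshold `1.1` are our assumptions.

```python
import numpy as np

def verify(emb_a, emb_b, threshold=1.1):
    """Face verification: same identity iff the squared L2 distance
    between the two embeddings is below a threshold.
    The value 1.1 is a placeholder, not the paper's tuned threshold."""
    return np.sum((emb_a - emb_b) ** 2) <= threshold

def recognize(query, gallery_embs, gallery_labels, k=1):
    """Recognition as k-NN in embedding space: find the k nearest
    gallery embeddings and return the majority label."""
    dists = np.sum((gallery_embs - query) ** 2, axis=1)
    nearest = np.argsort(dists)[:k]
    votes = [gallery_labels[i] for i in nearest]
    return max(set(votes), key=votes.count)
```

Clustering would follow the same pattern: run k-means or agglomerative clustering directly on the embedding vectors.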

Highlights

- In this paper we present a unified system for face verification, recognition, and clustering.
- Our method is based on learning a Euclidean embedding per image using a deep convolutional network.
- Previous face recognition approaches based on deep networks use a classification layer [15, 17] trained over a set of known face identities and take an intermediate bottleneck layer as a representation used to generalize recognition beyond the set of identities used in training.
- In this paper we explore two different deep network architectures that have recently been used to great success in the computer vision community.
- We provide a method to directly learn an embedding into a Euclidean space for face verification.
- This sets it apart from other methods [15, 17], which use the CNN bottleneck layer or require additional post-processing such as concatenation of multiple models and PCA, as well as SVM classification.

Methods

- FaceNet uses a deep convolutional network.
- The authors discuss two different core architectures: The Zeiler&Fergus [22] style networks and the recent Inception [16] type networks.
- The details of these networks are described in section 3.3.
- Model structure: a batch of face images is fed through the deep architecture, followed by L2 normalization, yielding the face embedding, which is trained with the triplet loss.
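The final two stages of that pipeline can be sketched in numpy. The loss formula follows the paper, max(0, ||f(x_a) − f(x_p)||² − ||f(x_a) − f(x_n)||² + α) with margin α = 0.2 as reported there; the function names are our own.

```python
import numpy as np

def l2_normalize(x):
    """Project embeddings onto the unit hypersphere, as FaceNet does
    before applying the triplet loss."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def triplet_loss(anchor, positive, negative, alpha=0.2):
    """Triplet loss on L2-normalized embeddings: pull the anchor toward
    the positive (same identity) and push it away from the negative
    (different identity) by at least the margin alpha."""
    d_pos = np.sum((anchor - positive) ** 2, axis=-1)
    d_neg = np.sum((anchor - negative) ** 2, axis=-1)
    return np.maximum(0.0, d_pos - d_neg + alpha)
```

In training, the paper selects hard positives and semi-hard negatives within each mini-batch rather than averaging over all possible triplets.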

Results

- The benefit of the approach is much greater representational efficiency: the authors achieve state-of-the-art face recognition performance using only 128 bytes per face.
- The authors achieve a classification accuracy of 98.87% ± 0.15 when using the fixed center crop described in (1), and a record-breaking 99.63% ± 0.09 standard error of the mean when using the extra face alignment (2).

Conclusion

- The authors provide a method to directly learn an embedding into a Euclidean space for face verification.
- Their end-to-end training both simplifies the setup and shows that directly optimizing a loss relevant to the task at hand improves performance.
- Another strength of the model is that it only requires minimal alignment.
- The authors experimented with a similarity transform alignment and noticed that this can improve performance slightly.
- It is not clear whether it is worth the extra complexity.


- Table1: NN1. This table shows the structure of our Zeiler&Fergus [22] based model with 1×1 convolutions inspired by [9]. The input and output sizes are described in rows × cols × #filters. The kernel is specified as rows × cols, stride, and the maxout [6] pooling size as p = 2.
- Table2: NN2. Details of the NN2 Inception incarnation. This model is almost identical to the one described in [16]. The two major differences are the use of L2 pooling instead of max pooling (m), where specified. The pooling is always 3×3 (aside from the final average pooling) and in parallel to the convolutional modules inside each Inception module. If there is a dimensionality reduction after the pooling it is denoted with p. 1×1, 3×3, and 5×5 pooling are then concatenated to get the final output.
- Table3: Network Architectures. This table compares the performance of our model architectures on the hold-out test set (see section 4.1). Reported is the mean validation rate VAL at 10E-3 false accept rate. Also shown is the standard error of the mean across the five test splits.
- Table4: Image Quality. The table on the left shows the effect on the validation rate at 10E-3 precision with varying JPEG quality. The one on the right shows how the image size in pixels affects the validation rate at 10E-3 precision. This experiment was done with NN1 on the first split of our test hold-out dataset.
- Table5: Embedding Dimensionality. This table compares the effect of the embedding dimensionality of model NN1.
- Table6: Training Data Size. This table compares the performance after 700h of training for a smaller model with 96×96 pixel inputs. The model architecture is similar to NN2, but without the 5×5 convolutions in the Inception modules.
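The VAL-at-fixed-FAR metric used in Tables 3–5 can be sketched as follows. VAL is the fraction of same-identity pairs accepted at a distance threshold d, and FAR is the fraction of different-identity pairs wrongly accepted at that threshold; the function names here are our illustrative choices.

```python
import numpy as np

def val_far(dist_same, dist_diff, d):
    """VAL: fraction of same-identity pair distances <= d.
       FAR: fraction of different-identity pair distances <= d."""
    val = float(np.mean(dist_same <= d))
    far = float(np.mean(dist_diff <= d))
    return val, far

def val_at_far(dist_same, dist_diff, target_far=1e-3):
    """Report the best VAL over all thresholds whose FAR stays at or
    below target_far (e.g. 10E-3 in the paper's tables)."""
    thresholds = np.sort(np.concatenate([dist_same, dist_diff]))
    best_val = 0.0
    for d in thresholds:
        val, far = val_far(dist_same, dist_diff, d)
        if far <= target_far:
            best_val = max(best_val, val)
    return best_val
```

In practice the distances come from the squared L2 metric on the learned embeddings, computed over the same/different pairs of each test split.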

Related work

- Similarly to other recent works which employ deep networks [15, 17], our approach is a purely data driven method which learns its representation directly from the pixels of the face. Rather than using engineered features, we use a large dataset of labelled faces to attain the appropriate invariances to pose, illumination, and other variational conditions.

In this paper we explore two different deep network architectures that have recently been used to great success in the computer vision community. Both are deep convolutional networks [8, 11]. The first architecture is based on the Zeiler&Fergus [22] model, which consists of multiple interleaved layers of convolutions, non-linear activations, local response normalizations, and max pooling layers. We additionally add several 1×1×d convolution layers inspired by the work of [9]. The second architecture is based on the Inception model of Szegedy et al., which was recently used as the winning approach for ImageNet 2014 [16]. These networks use mixed layers that run several different convolutional and pooling layers in parallel and concatenate their responses. We have found that these models can reduce the number of parameters by up to 20 times and have the potential to reduce the number of FLOPS required for comparable performance.
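The parameter savings from Inception-style 1×1 reductions can be illustrated with a quick weight count. The channel sizes below are illustrative, not the paper's exact configuration:

```python
def conv_params(k, c_in, c_out):
    """Weight count of a k x k convolution mapping c_in to c_out
    channels (bias terms ignored)."""
    return k * k * c_in * c_out

# Direct 5x5 convolution: 256 -> 64 channels.
direct = conv_params(5, 256, 64)

# Inception-style: 1x1 reduce 256 -> 16 channels, then 5x5 from 16 -> 64.
reduced = conv_params(1, 256, 16) + conv_params(5, 16, 64)

print(direct, reduced, direct / reduced)  # the reduction is roughly 14x here
```

The same bottlenecking applies to FLOPS, since each convolution's cost scales with its weight count times the spatial output size.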

Reference

- [1] Y. Bengio, J. Louradour, R. Collobert, and J. Weston. Curriculum learning. In Proc. of ICML, New York, NY, USA, 2009.
- [2] D. Chen, X. Cao, L. Wang, F. Wen, and J. Sun. Bayesian face revisited: A joint formulation. In Proc. ECCV, 2012.
- [3] D. Chen, S. Ren, Y. Wei, X. Cao, and J. Sun. Joint cascade face detection and alignment. In Proc. ECCV, 2014.
- [4] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, M. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, Q. V. Le, and A. Y. Ng. Large scale distributed deep networks. In NIPS, pages 1232–1240, 2012.
- [5] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res., 12:2121–2159, July 2011.
- [6] I. J. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville, and Y. Bengio. Maxout networks. In ICML, 2013.
- [7] G. B. Huang, M. Ramesh, T. Berg, and E. Learned-Miller. Labeled faces in the wild: A database for studying face recognition in unconstrained environments. Technical Report 07-49, University of Massachusetts, Amherst, October 2007.
- [8] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1(4):541–551, Dec. 1989.
- [9] M. Lin, Q. Chen, and S. Yan. Network in network. CoRR, abs/1312.4400, 2013.
- [10] C. Lu and X. Tang. Surpassing human-level face verification performance on LFW with GaussianFace. CoRR, abs/1404.3840, 2014.
- [11] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning representations by back-propagating errors. Nature, 1986.
- [12] M. Schultz and T. Joachims. Learning a distance metric from relative comparisons. In NIPS, pages 41–48. MIT Press, 2004.
- [13] T. Sim, S. Baker, and M. Bsat. The CMU pose, illumination, and expression (PIE) database. In Proc. FG, 2002.
- [14] Y. Sun, X. Wang, and X. Tang. Deep learning face representation by joint identification-verification. CoRR, abs/1406.4773, 2014.
- [15] Y. Sun, X. Wang, and X. Tang. Deeply learned face representations are sparse, selective, and robust. CoRR, abs/1412.1265, 2014.
- [16] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. CoRR, abs/1409.4842, 2014.
- [17] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf. DeepFace: Closing the gap to human-level performance in face verification. In IEEE Conf. on CVPR, 2014.
- [18] J. Wang, Y. Song, T. Leung, C. Rosenberg, J. Wang, J. Philbin, B. Chen, and Y. Wu. Learning fine-grained image similarity with deep ranking. CoRR, abs/1404.4661, 2014.
- [19] K. Q. Weinberger, J. Blitzer, and L. K. Saul. Distance metric learning for large margin nearest neighbor classification. In NIPS. MIT Press, 2006.
- [20] D. R. Wilson and T. R. Martinez. The general inefficiency of batch training for gradient descent learning. Neural Networks, 16(10):1429–1451, 2003.
- [21] L. Wolf, T. Hassner, and I. Maoz. Face recognition in unconstrained videos with matched background similarity. In IEEE Conf. on CVPR, 2011.
- [22] M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. CoRR, abs/1311.2901, 2013.
- [23] Z. Zhu, P. Luo, X. Wang, and X. Tang. Recover canonical-view faces in the wild with deep neural networks. CoRR, abs/1404.3543, 2014.
