Self-Learning Transformations for Improving Gaze and Head Redirection
NeurIPS 2020
Many computer vision tasks rely on labeled data. Rapid progress in generative modeling has led to the ability to synthesize photorealistic images. However, controlling specific aspects of the generation process such that the data can be used for supervision of downstream tasks remains challenging. In this paper we propose a novel generative…
- Extracting information from images of human faces is one of the core problems in artificial intelligence and computer vision.
- Previous methods only work with eye-region inputs, require high-quality images for training, and in many cases fail to faithfully preserve the gaze direction
- The authors advance this task by generating high-fidelity face images with target gaze and head orientation along with control over many other independent factors
- Domain adaptation approaches can be sensitive to changes in the underlying distribution of gaze directions, producing unfaithful images that do not help in improving gaze estimator performance 
- We propose a novel architecture based on the concept of transforming encoder-decoders (T-ED), where labeled factors of image variation are learned via rotationally equivariant mappings between independent latent embedding spaces and the image space (Fig. 1 bottom-left)
- Our Self-Transforming Encoder-Decoder (ST-ED) architecture is able to align to all predicted conditions from the target image or just the gaze direction and head orientation conditions (g + h in Tab. 1) and as such we report both scores
- Our proposed functional loss penalizes perceptual differences between images with an emphasis on task-relevant features, which can be useful for various problems with an image reconstruction objective, e.g., auto-encoding and neural rendering
- A novel evaluation scheme shows that our method improves upon the state-of-the-art in redirection accuracy and disentanglement between gaze direction and head orientation changes
- The gaze direction and head orientation apparent in the output video sequences more faithfully reflect the given inputs, with promising results at extreme angles which go beyond the range of the training dataset
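The rotationally equivariant mapping idea in the highlights above can be illustrated with a minimal sketch. This is an illustration of the general T-ED principle, not the authors' exact implementation: a latent code is transformed by the rotation corresponding to the relative change between the input and target gaze/head angles, so that decoding the rotated code yields the redirected image.

```python
import numpy as np

def rotation_matrix(pitch, yaw):
    """3D rotation built from pitch (about x) and yaw (about y), in radians."""
    cp, sp = np.cos(pitch), np.sin(pitch)
    cy, sy = np.cos(yaw), np.sin(yaw)
    Rx = np.array([[1, 0, 0], [0, cp, -sp], [0, sp, cp]])
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    return Ry @ Rx

def redirect_code(z, cond_in, cond_tgt):
    """Rotate a (3, k) latent code from the input to the target condition.

    z        -- latent code, shape (3, k): k 3D sub-vectors that rotate rigidly
    cond_in  -- (pitch, yaw) predicted for the input image
    cond_tgt -- (pitch, yaw) requested for the output image
    """
    R_in = rotation_matrix(*cond_in)
    R_tgt = rotation_matrix(*cond_tgt)
    return R_tgt @ R_in.T @ z  # undo the input pose, then apply the target pose

# Equivariance check: redirecting back to the input condition is the identity.
z = np.random.randn(3, 16)
z_fwd = redirect_code(z, (0.1, -0.2), (0.4, 0.3))
z_back = redirect_code(z_fwd, (0.4, 0.3), (0.1, -0.2))
print(np.allclose(z_back, z))  # True
```

Because the rotation matrices are orthogonal, the round trip composes to the identity, which is exactly the equivariance property that lets the decoder treat the rotated code as if it had been encoded from the redirected image.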
- 3.1 Problem Setting
The authors' goal is to train a conditional generative network, such that given an input image Xi and a set of target conditions ct, it generates an output image Xt by learning the mapping: (Xi, ct) → Xt (Fig. 1 top-left).
- As shown in Table 2, augmenting the real data samples using the method from He et al. generally results in performance degradation compared to the baseline, despite having used more labeled data for training the redirection network.
- This difference in implementation was necessary, as with very few samples the redirection network of He et al. could not be trained successfully.
- The approach of He et al. cannot be trained in a semi-supervised manner
- 4.1 Implementation details
The authors parameterize Genc and Gdec with a DenseNet-based architecture, as done in FAZE [20].
- The authors implement the external gaze direction and head orientation estimation network Fd with a VGG-16-based architecture which outputs its predictions in spherical coordinates.
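Predictions in spherical coordinates (pitch, yaw) are conventionally converted to 3D unit vectors so that redirection accuracy can be measured as an angular error. The sketch below uses a pitch/yaw convention common in appearance-based gaze estimation; the exact axis convention is an assumption, not taken from the paper.

```python
import numpy as np

def angles_to_vector(pitch, yaw):
    """Unit gaze vector from spherical (pitch, yaw) in radians.

    Convention (an assumption, common in appearance-based gaze work):
    x = -cos(pitch)*sin(yaw), y = -sin(pitch), z = -cos(pitch)*cos(yaw).
    """
    return np.array([
        -np.cos(pitch) * np.sin(yaw),
        -np.sin(pitch),
        -np.cos(pitch) * np.cos(yaw),
    ])

def angular_error_deg(a, b):
    """Angle between two gaze directions, in degrees."""
    cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

# A 5-degree pitch offset yields a 5-degree angular error, as expected.
g_true = angles_to_vector(0.0, 0.0)
g_pred = angles_to_vector(np.radians(5.0), 0.0)
print(round(angular_error_deg(g_true, g_pred), 3))  # 5.0
```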
- Each dataset exhibits different distributions of head orientation and gaze direction, as well as differences in the present extraneous factors.
- This cross-dataset experiment allows for a better characterization of the approach in comparison to the state-of-the-art approaches.
- Figure panels (comparisons against a target image): (a) eyeglasses and expressions, (b) person-specific face shape, (c) finer details
- The authors note that the architecture is motivated in general terms, and they are optimistic about its potential application to other conditional image generation tasks for which a small subset of labels is available and many more factors of variation need to be identified and separated without explicit supervision.
- The authors can discover and match the misalignment of extraneous variations with the self-learned transformations, and improve the learning of task-relevant factors.
- The authors leave further exploration of different application domains for future work.
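The functional loss mentioned in the highlights can be sketched as a feature-matching objective computed through a fixed task network. This is a generic perceptual/feature-matching formulation; the toy network, layer choices, and weights below are illustrative, not the authors' exact settings.

```python
import numpy as np

def feature_maps(x, layers):
    """Run x through a stack of (weight, bias) 'layers', returning all activations."""
    feats = []
    for W, b in layers:
        x = np.tanh(W @ x + b)
        feats.append(x)
    return feats

def functional_loss(x, x_hat, task_net, weights=None):
    """Sum of per-layer feature distances under a fixed task network.

    Errors that change task-relevant features (as seen by the task network)
    cost more than errors in features the task network ignores.
    """
    fx, fx_hat = feature_maps(x, task_net), feature_maps(x_hat, task_net)
    weights = weights or [1.0] * len(task_net)
    return sum(w * np.mean((a - b) ** 2)
               for w, a, b in zip(weights, fx, fx_hat))

# Toy check: a perfect reconstruction incurs zero loss.
rng = np.random.default_rng(0)
net = [(rng.standard_normal((8, 8)), rng.standard_normal(8)) for _ in range(3)]
x = rng.standard_normal(8)
print(functional_loss(x, x, net))  # 0.0 for a perfect reconstruction
```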
- Table1: Ablation study (lower is better). Our FAZE [20]-like T-ED base model learns only explicit factors. Sensitivity to loss term weights: our method is robust to different loss term weight combinations
- Table2: State-of-the-art comparisons. We compare our best model against StarGAN [21] and He et al. [17] on the task of full-face gaze and head redirection, evaluated on four gaze datasets. Our approach not only generates gaze direction and head orientation more faithfully, but also achieves better disentanglement for separately controlling the two properties. Furthermore, our model allows for the manipulation of extraneous factors, enabling us to outperform in terms of perceptual image quality as well (for the row Ours, we calculate LPIPS after aligning all factors to a target image using its pseudo-labels ct). Downstream estimation error with 2.5k real training samples: we show the results from He et al. [17], a supervised baseline method, and our method for both gaze and head pose estimation tasks. The redirector is trained on the whole GazeCapture training set for He et al. [17], and on only 2.5k real samples for our method (in a semi-supervised fashion with unlabeled samples from the rest of the dataset)
- Table3: Architecture of the PatchGAN discriminator used to train ST-ED
- Table4: Architecture of the external gaze direction and head orientation estimation network, Fd
- Table6: Global discriminator network with a regression branch for gaze direction and head orientation, as used in the re-implementation of the He et al. [17] and StarGAN [21] approaches
- Acknowledgments and Disclosure of Funding This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme grant agreement No StG-2016-717054
• high-fidelity gaze and head orientation manipulation of the generated face images, and
• demonstration of performance improvements on four datasets in the real-world downstream task of cross-dataset gaze estimation, by augmenting real training data via redirection.
We train ST-ED using the GazeCapture training subset, the largest publicly available gaze dataset. It consists of 1474 participants and over two million frames taken in unconstrained settings, which makes it challenging to train with. As such, to the best of our knowledge, we are the first to demonstrate that photo-realistic gaze redirection models can be learned from such noisy data
The value of learning to robustly isolate and control explicit factors from in-the-wild data lies in its potential to improve performance of downstream computer vision tasks, such as in training gaze or head orientation estimation models. Therefore, we perform experiments on semi-supervised, person-independent, cross-dataset estimation on four popular gaze datasets. We show that even with small amounts of training data, our gaze redirector can extract and understand the variation of the dataset's factors sufficiently to augment it with new samples without introducing errors
[Figure panels: (a) Input image, (b) StarGAN, (c) He et al., (d) Ours (g + h), (e) Ours (all), (f) Target image]
Lastly, we train a new gaze and head orientation estimation network (with the same architecture as Fd), but with this augmented set, and compare its performance to the "baseline" version trained only with the smaller labeled dataset. Fig. 3 shows that the gaze and head orientation estimation networks trained with both labeled data and augmented data (via redirection with a semi-supervised ST-ED) yield consistently improved performance on all four evaluation datasets. This is particularly true for cases with very few labeled samples (2,500), where the largest gains in performance are found
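The augmentation procedure described above can be sketched as follows. The `redirector` callable, the label conventions, and the ±40/±20 degree sampling ranges are all hypothetical stand-ins for a trained ST-ED and its condition space; the point is only the shape of the loop, in which sampled target conditions become the labels of the synthetic samples.

```python
import random

def augment_with_redirection(labeled_set, redirector, n_new, seed=0):
    """Grow a labeled training set with redirected samples.

    labeled_set -- list of (image, gaze, head) tuples
    redirector  -- hypothetical callable: (image, gaze, head) -> new image
                   (stands in for a trained gaze/head redirection model)
    n_new       -- number of synthetic samples to generate
    """
    rng = random.Random(seed)
    augmented = list(labeled_set)
    for _ in range(n_new):
        image, _, _ = rng.choice(labeled_set)
        # Sample target angles in degrees; these ranges are illustrative only.
        gaze = (rng.uniform(-40, 40), rng.uniform(-40, 40))
        head = (rng.uniform(-20, 20), rng.uniform(-20, 20))
        # The sampled target conditions become the labels of the new sample.
        augmented.append((redirector(image, gaze, head), gaze, head))
    return augmented

# Toy usage with an identity "redirector" standing in for the real model.
real = [(f"img{i}", (0.0, 0.0), (0.0, 0.0)) for i in range(5)]
aug = augment_with_redirection(real, lambda im, g, h: im, n_new=10)
print(len(aug))  # 15
```

The estimation network is then trained on `augmented` in place of the original labeled set, exactly as the baseline would be trained on `labeled_set` alone.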
- Catharine Oertel, Kenneth A Funes Mora, Joakim Gustafson, and Jean-Marc Odobez. Deciphering the silent participant: On the use of audio-visual cues for the classification of listener categories in group discussions. In Proceedings of the 2015 ACM on International Conference on Multimodal Interaction, pages 107–114, 2015.
- Lifeng Fan, Wenguan Wang, Siyuan Huang, Xinyu Tang, and Song-Chun Zhu. Understanding human gaze communication by spatio-temporal graph reasoning. In ICCV, pages 5724–5733, 2019.
- Guy Thomas Buswell. How people look at pictures: a study of the psychology and perception in art. 1935.
- Constantin A Rothkopf, Dana H Ballard, and Mary M Hayhoe. Task and context determine where you look. Journal of vision, 7(14):16–16, 2007.
- Lex Fridman, Bryan Reimer, Bruce Mehler, and William T. Freeman. Cognitive load estimation in the wild. In ACM CHI, 2018.
- Michael Xuelin Huang, Jiajia Li, Grace Ngai, and Hong Va Leong. Stressclick: Sensing stress from gaze-click patterns. In ACM MM, 2016.
- Anna Maria Feit, Shane Williams, Arturo Toledo, Ann Paradiso, Harish Kulkarni, Shaun K. Kane, and Meredith Ringel Morris. Toward everyday gaze input: Accuracy and precision of eye tracking and implications for design. In ACM CHI, pages 1118–1130, 2017.
- B.A. Smith, Q. Yin, S.K. Feiner, and S.K. Nayar. Gaze Locking: Passive Eye Contact Detection for Human-Object Interaction. In ACM UIST, pages 271–280, Oct 2013.
- Xucong Zhang, Yusuke Sugano, and Andreas Bulling. Everyday eye contact detection using unsupervised gaze target discovery. In ACM UIST, pages 193–203, 2017.
- Anjul Patney, Joohwan Kim, Marco Salvi, Anton Kaplanyan, Chris Wyman, Nir Benty, Aaron Lefohn, and David Luebke. Perceptually-based foveated virtual reality. In SIGGRAPH, 2016.
- Erroll Wood, Tadas Baltrušaitis, Xucong Zhang, Yusuke Sugano, Peter Robinson, and Andreas Bulling. Rendering of eyes for eye-shape registration and gaze estimation. In ICCV, pages 3756–3764, 2015.
- Erroll Wood, Tadas Baltrušaitis, Louis-Philippe Morency, Peter Robinson, and Andreas Bulling. Learning an appearance-based gaze estimator from one million synthesised images. In ACM ETRA, page 131–138, New York, NY, USA, 2016. Association for Computing Machinery.
- Ashish Shrivastava, Tomas Pfister, Oncel Tuzel, Joshua Susskind, Wenda Wang, and Russell Webb. Learning from simulated and unsupervised images through adversarial training. In CVPR, July 2017.
- Kangwook Lee, Hoon Kim, and Changho Suh. Simulated+unsupervised learning with adaptive data generation and bidirectional mappings. In ICLR, 2018.
- Yaroslav Ganin, Daniil Kononenko, Diana Sungatullina, and Victor Lempitsky. Deepwarp: Photorealistic image resynthesis for gaze manipulation. In ECCV, pages 311–326, 2016.
- Yu Yu, Gang Liu, and Jean-Marc Odobez. Improving few-shot user-specific gaze adaptation via gaze redirection synthesis. In CVPR, June 2019.
- Zhe He, Adrian Spurr, Xucong Zhang, and Otmar Hilliges. Photo-realistic monocular gaze redirection using generative adversarial networks. In ICCV. IEEE, 2019.
- Kyle Krafka, Aditya Khosla, Petr Kellnhofer, Harini Kannan, Suchendra Bhandarkar, Wojciech Matusik, and Antonio Torralba. Eye Tracking for Everyone. In CVPR, 2016.
- Xucong Zhang, Yusuke Sugano, Mario Fritz, and Andreas Bulling. Appearance-based gaze estimation in the wild. In CVPR, 2015.
- Seonwook Park, Shalini De Mello, Pavlo Molchanov, Umar Iqbal, Otmar Hilliges, and Jan Kautz. Few-shot adaptive gaze estimation. In ICCV, 2019.
- Yunjey Choi, Minje Choi, Munyoung Kim, Jung-Woo Ha, Sunghun Kim, and Jaegul Choo. Stargan: Unified generative adversarial networks for multi-domain image-to-image translation. In CVPR, pages 8789–8797, 2018.
- Zhenliang He, Wangmeng Zuo, Meina Kan, Shiguang Shan, and Xilin Chen. Attgan: Facial attribute editing by only changing what you want. TIP, 28(11):5464–5478, 2019.
- Albert Pumarola, Antonio Agudo, Aleix M Martinez, Alberto Sanfeliu, and Francesc Moreno-Noguer. Ganimation: Anatomically-aware facial animation from a single image. In ECCV, pages 818–833, 2018.
- Po-Wei Wu, Yu-Jing Lin, Che-Han Chang, Edward Y Chang, and Shih-Wei Liao. Relgan: Multi-domain image-to-image translation via relative attributes. In ICCV, pages 5914–5922, 2019.
- Erroll Wood, Tadas Baltrušaitis, Louis-Philippe Morency, Peter Robinson, and Andreas Bulling. Learning an appearance-based gaze estimator from one million synthesised images. In ACM ETRA, pages 131–138, 2016.
- Justin Johnson, Alexandre Alahi, and Li Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In ECCV, pages 694–711, 2016.
- Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In ICCV, pages 2223–2232, 2017.
- Erroll Wood, Tadas Baltrušaitis, Louis-Philippe Morency, Peter Robinson, and Andreas Bulling. Gazedirector: Fully articulated eye gaze redirection in video. In Computer Graphics Forum, volume 37, pages 217–225. Wiley Online Library, 2018.
- Xucong Zhang, Yusuke Sugano, Mario Fritz, and Andreas Bulling. Mpiigaze: Real-world dataset and deep appearance-based gaze estimation. TPAMI, 2019.
- Kang Wang, Rui Zhao, and Qiang Ji. A hierarchical generative model for eye image synthesis and eye gaze estimation. In CVPR, 2018.
- Seonwook Park, Xucong Zhang, Andreas Bulling, and Otmar Hilliges. Learning to find eye region landmarks for remote gaze estimation in unconstrained settings. In ACM ETRA, 2018.
- Yifei Huang, Minjie Cai, Zhenqiang Li, and Yoichi Sato. Predicting gaze in egocentric video by learning task-dependent attention transition. In ECCV, 2018.
- Taesung Park, Ming-Yu Liu, Ting-Chun Wang, and Jun-Yan Zhu. Semantic image synthesis with spatiallyadaptive normalization. In CVPR, pages 2337–2346, 2019.
- Hsin-Ying Lee, Hung-Yu Tseng, Jia-Bin Huang, Maneesh Singh, and Ming-Hsuan Yang. Diverse imageto-image translation via disentangled representations. In ECCV, pages 35–51, 2018.
- Youssef Alami Mejjati, Christian Richardt, James Tompkin, Darren Cosker, and Kwang In Kim. Unsupervised attention-guided image-to-image translation. In NeurIPS, pages 3693–3703, 2018.
- Marek Kowalski, Stephan J. Garbin, Virginia Estellers, Tadas Baltrušaitis, Matthew Johnson, and Jamie Shotton. Config: Controllable neural face image generation. In European Conference on Computer Vision (ECCV), 2020.
- Mehdi Mirza and Simon Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.
- Geoffrey E Hinton, Alex Krizhevsky, and Sida D Wang. Transforming auto-encoders. In ICANN, 2011.
- Xu Chen, Jie Song, and Otmar Hilliges. Monocular neural image based rendering with continuous view control. In ICCV, October 2019.
- Siva Karthik Mustikovela, Varun Jampani, Shalini De Mello, Sifei Liu, Umar Iqbal, Carsten Rother, and Jan Kautz. Self-supervised viewpoint learning from image collections. In CVPR, 2020.
- Thu Nguyen-Phuoc, Chuan Li, Lucas Theis, Christian Richardt, and Yong-Liang Yang. Hologan: Unsupervised learning of 3d representations from natural images. In ICCV, pages 7588–7597, 2019.
- Daniel E Worrall, Stephan J Garbin, Daniyar Turmukhambetov, and Gabriel J Brostow. Interpretable transformations with encoder-decoder networks. In ICCV, 2017.
- Helge Rhodin, Mathieu Salzmann, and Pascal Fua. Unsupervised geometry-aware representation for 3d human pose estimation. In ECCV, 2018.
- Alexey Dosovitskiy and Thomas Brox. Generating images with perceptual similarity metrics based on deep networks. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, NeurIPS, pages 658–666. 2016.
- Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In NeurIPS, pages 2672–2680, 2014.
- P. Isola, J. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial networks. In CVPR, 2017.
- Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2014.
- Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. CoRR, abs/1512.03385, 2015.
- Yusuke Sugano, Yasuyuki Matsushita, and Yoichi Sato. Learning-by-Synthesis for Appearance-based 3D Gaze Estimation. In CVPR, 2014.
- Xucong Zhang, Yusuke Sugano, and Andreas Bulling. Revisiting data normalization for appearance-based gaze estimation. In ETRA, 2018.
- Xucong Zhang, Yusuke Sugano, Mario Fritz, and Andreas Bulling. It’s written all over your face: Full-face appearance-based gaze estimation. In CVPRW, 2017.
- Kenneth Alberto Funes Mora, Florent Monay, and Jean-Marc Odobez. Eyediap: A database for the development and evaluation of gaze estimation algorithms from rgb and rgb-d cameras. In ACM ETRA. ACM, March 2014.
- Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. beta-vae: Learning basic visual concepts with a constrained variational framework. In ICLR, 2017.
- Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, pages 586–595, 2018.
- G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger. Densely connected convolutional networks. In CVPR, pages 2261–2269, 2017.
- Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
- J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR, 2009.
- Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C Courville. Improved training of wasserstein gans. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, NeurIPS, 2017.
- Peiyun Hu and Deva Ramanan. Finding tiny faces. In CVPR, 2017.
- J. Deng, Y. Zhou, S. Cheng, and S. Zaferiou. Cascade multi-view hourglass model for robust 3d face alignment. In FG, 2018.
- Patrik Huber, Guosheng Hu, Rafael Tena, Pouria Mortazavian, P Koppen, William J Christmas, Matthias Ratsch, and Josef Kittler. A multiresolution 3d morphable face model and fitting framework. In VISIGRAPP, 2016.