Self-Learning Transformations for Improving Gaze and Head Redirection

NeurIPS 2020 (2020)

Abstract

Many computer vision tasks rely on labeled data. Rapid progress in generative modeling has led to the ability to synthesize photorealistic images. However, controlling specific aspects of the generation process such that the data can be used for supervision of downstream tasks remains challenging. In this paper we propose a novel genera…
Introduction
  • Extracting information from images of human faces is one of the core problems in artificial intelligence and computer vision.
  • Previous methods work only with eye-region inputs, require high-quality images for training, and in many cases fail to preserve gaze faithfully
  • The authors advance this task by generating high-fidelity face images with target gaze and head orientation, along with control over many other independent factors
Highlights
  • Extracting information from images of human faces is one of the core problems in artificial intelligence and computer vision
  • Domain adaptation approaches can be sensitive to changes in the underlying distribution of gaze directions, producing unfaithful images that do not help in improving gaze estimator performance [14]
  • We propose a novel architecture based on the concept of transforming encoder-decoders (T-ED), where labeled factors of image variation are learned via rotationally equivariant mappings between independent latent embedding spaces and the image space (Fig. 1 bottom-left); see the sketch after this list
  • Our Self-Transforming Encoder-Decoder (ST-ED) architecture is able to align to all predicted conditions from the target image or just the gaze direction and head orientation conditions (g + h in Tab. 1) and as such we report both scores
  • Our proposed functional loss penalizes perceptual differences between images with an emphasis on task-relevant features, which can be useful for various problems with an image reconstruction objective, e.g., auto-encoding and neural rendering
  • A novel evaluation scheme shows that our method improves upon the state-of-the-art in redirection accuracy and disentanglement between gaze direction and head orientation changes
  • The gaze direction and head orientation apparent in the output video sequences more faithfully reflect the given inputs, with promising results at extreme angles which go beyond the range of the training dataset
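
To make the equivariance idea above concrete, here is a tiny numerical sketch. It is our own illustration under stated assumptions, not the authors' code: the names `rotation_2d` and `redirect` and the dictionary layout are ours. Each labeled factor owns an embedding, and changing that factor, for example rotating the gaze direction by some angle, acts on the embedding as the corresponding rotation matrix, which is what makes the mapping rotationally equivariant.

```python
import numpy as np

def rotation_2d(delta):
    """2x2 rotation matrix for an angular change `delta` (in radians)."""
    c, s = np.cos(delta), np.sin(delta)
    return np.array([[c, -s], [s, c]])

def redirect(z_factors, deltas):
    """Rotate the embedding of each named factor by the requested amount.

    z_factors: dict mapping a factor name to an (n, 2) embedding, arranged
               as n two-dimensional sub-spaces on which rotations act.
    deltas:    dict mapping a factor name to the angular change to apply;
               factors that are not listed are left untouched.
    """
    out = {}
    for name, z in z_factors.items():
        R = rotation_2d(deltas.get(name, 0.0))
        # Equivariance: a rotation of the factor is a rotation of its embedding.
        out[name] = z @ R.T
    return out

# Toy usage: redirect "gaze" by 10 degrees while keeping "head" fixed.
z = {"gaze": np.random.randn(8, 2), "head": np.random.randn(8, 2)}
z_target = redirect(z, {"gaze": np.deg2rad(10.0)})
```

In the full model the embeddings come from Genc applied to the input image and are rendered back into an image by Gdec; the sketch only shows how a condition change acts on a single factor's embedding.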
Methods
  • 3.1 Problem Setting: The authors' goal is to train a conditional generative network, such that given an input image Xi and a set of target conditions ct, it generates an output image Xt by learning the mapping (Xi, ct) → Xt (Fig. 1 top-left); see the interface sketch after this list.
  • As shown in Table 2, augmenting the real data samples using the method from He et al [17] generally results in performance degradation compared to the baseline, despite having used more labeled data for training the redirection network.
  • This difference in implementation was necessary, as with very few samples the redirection network of He et al could not be trained successfully.
  • The approach of He et al [17] cannot be trained in a semi-supervised manner
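
In code, the problem setting amounts to a conditional generator with the signature (Xi, ct) → Xt. The sketch below is an interface illustration under assumed names (`redirect_image`, `g_enc`, and `g_dec` are placeholders, not the released implementation); it also reflects the choice noted earlier of overwriting either all predicted factors or only the gaze and head ones.

```python
from typing import Callable, Dict

import numpy as np

# "Embeddings" stands in for the factorized latent representation; in the
# real model the encoder and decoder are deep networks (DenseNet-based).
Embeddings = Dict[str, np.ndarray]

def redirect_image(x_i: np.ndarray,
                   c_t: Embeddings,
                   g_enc: Callable[[np.ndarray], Embeddings],
                   g_dec: Callable[[Embeddings], np.ndarray]) -> np.ndarray:
    """Conditional redirection: apply the mapping (X_i, c_t) -> X_t.

    Only the factors named in c_t are overwritten (e.g. just 'gaze' and
    'head', or all predicted factors); every other factor is copied from
    the input image's encoding.
    """
    z = dict(g_enc(x_i))  # factorized latent embeddings of the input
    z.update(c_t)         # impose the target conditions
    return g_dec(z)       # decode back to an image

# Toy usage with identity-like stand-ins for the encoder/decoder.
g_enc = lambda x: {"appearance": x, "gaze": np.zeros(2)}
g_dec = lambda z: z["appearance"]
x_t = redirect_image(np.zeros((64, 64, 3)), {"gaze": np.array([0.1, -0.2])}, g_enc, g_dec)
```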
Results
  • 4.1 Implementation details: The authors parameterize Genc and Gdec with a DenseNet-based architecture, as done in [20].
  • The authors implement the external gaze direction and head orientation estimation network Fd with a VGG-16 [47] based architecture which outputs its predictions in spherical coordinates [29]; see the sketch after this list.
  • Each dataset exhibits different distributions of head orientation and gaze direction, as well as differences in the present extraneous factors.
  • This cross-dataset experiment allows for a better characterization of the approach in comparison to the state-of-the-art approaches.
  • [Figure panels: Target; (a) Eyeglasses and expressions; (b) Person-specific face shape; (c) Finer details]
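
Because Fd predicts gaze and head orientation as (pitch, yaw) angles, the redirection errors reported later are angular differences between such predictions. The helper below sketches the conversion and angular-error computation under a common pitch/yaw convention; it is an assumption-based illustration (other codebases may flip signs or axes), not the authors' evaluation code.

```python
import numpy as np

def pitchyaw_to_vector(py):
    """Convert (pitch, yaw) angles in radians to a 3D unit direction vector.

    Uses a common gaze-estimation convention: pitch rotates up/down,
    yaw rotates left/right.
    """
    pitch, yaw = py[..., 0], py[..., 1]
    return np.stack([
        -np.cos(pitch) * np.sin(yaw),
        -np.sin(pitch),
        -np.cos(pitch) * np.cos(yaw),
    ], axis=-1)

def angular_error_deg(py_a, py_b):
    """Angle in degrees between two directions given as (pitch, yaw)."""
    a, b = pitchyaw_to_vector(py_a), pitchyaw_to_vector(py_b)
    cos = np.clip(np.sum(a * b, axis=-1), -1.0, 1.0)
    return np.degrees(np.arccos(cos))

# Example: a 5-degree yaw difference yields roughly a 5-degree angular error.
print(angular_error_deg(np.array([0.0, 0.0]), np.array([0.0, np.deg2rad(5.0)])))
```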
Conclusion
  • The authors note that the architecture is motivated in a general sense and the authors are optimistic in its potential application to other conditional image generation tasks for which a small subset of labels is available and many more factors of variation need to be identified and separated without explicit supervision.
  • The authors can discover and match the misalignment of extraneous variations with the self-learned transformations, and improve the learning of task-relevant factors.
  • The authors leave further exploration of different application domains for future work.
Tables
  • Table1: Ablation study (lower is better). Our FAZE [20]-like T-ED base model learns only explicit factors. Sensitivity to loss term weights. Our method is robust to different loss term weight combinations.
  • Table2: State-of-the-art comparisons. We compare our best model against StarGAN [21] and He et al [17] on the task of full-face gaze and head redirection, evaluated on four gaze datasets. Our approach not only generates gaze direction and head orientation more faithfully, but also achieves better disentanglement for separately controlling the two properties. Furthermore, our model allows for the manipulation of extraneous factors, enabling us to outperform in terms of perceptual image quality as well (for the row Ours, we calculate LPIPS after aligning all factors to a target image using its pseudo-labels ct; see the snippet after this list). Downstream estimation error with 2.5k real training samples. We show the results from He et al [17], a supervised baseline method, and our method for both gaze and head pose estimation tasks. The redirector is trained on the whole GazeCapture training set for He et al [17], and only 2.5k real samples for our method (in a semi-supervised fashion with unlabeled samples from the rest of the dataset).
  • Table3: Architecture of the PatchGAN discriminator used to train ST-ED
  • Table4: Architecture of the external gaze direction and head orientation estimation network, Fd
  • Table5: Table 5
  • Table6: Global discriminator network with a regression branch for gaze direction and head orientation, as used in the re-implementation of the He et al [17] and StarGAN [21] approaches.
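
The perceptual-quality numbers referenced in Table 2 are LPIPS distances (Zhang et al.'s learned perceptual metric). A generic way to compute such a distance is the `lpips` PyTorch package; the snippet below is a usage sketch with random stand-in tensors (image loading and the alignment to the target's pseudo-labels are omitted), not the authors' evaluation script.

```python
import torch
import lpips  # pip install lpips

# LPIPS expects RGB tensors in [-1, 1] of shape (N, 3, H, W).
loss_fn = lpips.LPIPS(net='alex')  # AlexNet backbone, the common default

with torch.no_grad():
    generated = torch.rand(1, 3, 128, 128) * 2 - 1  # stand-in for a redirected image
    target = torch.rand(1, 3, 128, 128) * 2 - 1     # stand-in for the target image
    distance = loss_fn(generated, target)            # lower means more similar

print(float(distance))
```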
Funding
  • Acknowledgments and Disclosure of Funding: This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement no. StG-2016-717054).
Study Subjects and Analysis
datasets: 4
• high-fidelity gaze and head orientation manipulation on the generated face images, and
• demonstration of performance improvements on four datasets in the real-world downstream tasks of cross-dataset gaze estimation, by augmenting real training data via redirection.

participants: 1474
We train ST-ED using the GazeCapture training subset [18], the largest publicly available gaze dataset. It consists of 1474 participants and over two million frames taken in unconstrained settings, which makes it challenging to train with. As such, to the best of our knowledge, we are the first to demonstrate that photo-realistic gaze redirection models can be learned from such noisy data

popular gaze datasets: 4
The value of learning to robustly isolate and control explicit factors from in-the-wild data lies in its potential to improve performance of downstream computer vision tasks, such as in training gaze or head orientation estimation models. Therefore, we perform experiments on semi-supervised person-independent cross-dataset estimation on four popular gaze datasets. We show that even with small amounts of training data, our gaze redirector can extract and understand the variation of the dataset’s factors sufficiently to augment it with new samples without introducing errors.
[Figure panels: (a) Input image, (b) StarGAN, (c) He et al., (d) Ours (g + h), (e) Ours (all), (f) Target image]

evaluation datasets: 4
Lastly, we train a new gaze and head orientation estimation network (with the same architecture as Fd), but with this augmented set, and compare its performance to the "baseline" version trained only with the smaller labeled dataset. Fig. 3 shows that the gaze and head orientation estimation networks trained with both labeled data and augmented data (via redirection with a semi-supervised ST-ED) yield consistently improved performance on all four evaluation datasets. This is particularly true for cases with very few labeled samples (2,500), where the largest gains in performance are found.
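
A minimal sketch of that augmentation protocol follows, under assumed names: `augment_with_redirection` and `redirect_fn` are placeholders for a trained ST-ED-style redirector (not the released API), and how target conditions are sampled here is a guess on our part.

```python
import numpy as np

def augment_with_redirection(images, labels, redirect_fn, n_aug, rng=None):
    """Create extra (image, label) training pairs by redirecting real images.

    images:      labeled face images, shape (N, H, W, 3)
    labels:      gaze and head angles per image, e.g. (pitch, yaw) pairs, shape (N, 4)
    redirect_fn: callable (image, target_label) -> redirected image; a stand-in
                 for a trained redirector (placeholder, not the released API)
    n_aug:       number of synthetic samples to generate
    """
    rng = rng or np.random.default_rng(0)
    aug_images, aug_labels = [], []
    for _ in range(n_aug):
        i = rng.integers(len(images))
        # Assumed target sampling: perturb the source label by a small
        # random angular offset.
        target = labels[i] + rng.normal(scale=np.deg2rad(10.0), size=labels.shape[1])
        aug_images.append(redirect_fn(images[i], target))
        aug_labels.append(target)
    return np.stack(aug_images), np.stack(aug_labels)
```

The downstream estimator (same architecture as Fd) is then trained on the union of the real labeled pairs and the generated pairs, and evaluated on the four held-out gaze datasets.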


References
  • Catharine Oertel, Kenneth A Funes Mora, Joakim Gustafson, and Jean-Marc Odobez. Deciphering the silent participant: On the use of audio-visual cues for the classification of listener categories in group discussions. In Proceedings of the 2015 ACM on International Conference on Multimodal Interaction, pages 107–114, 2015.
  • Lifeng Fan, Wenguan Wang, Siyuan Huang, Xinyu Tang, and Song-Chun Zhu. Understanding human gaze communication by spatio-temporal graph reasoning. In ICCV, pages 5724–5733, 2019.
  • Guy Thomas Buswell. How people look at pictures: a study of the psychology and perception in art. 1935.
  • Constantin A Rothkopf, Dana H Ballard, and Mary M Hayhoe. Task and context determine where you look. Journal of Vision, 7(14):16–16, 2007.
  • Lex Fridman, Bryan Reimer, Bruce Mehler, and William T. Freeman. Cognitive load estimation in the wild. In ACM CHI, 2018.
  • Michael Xuelin Huang, Jiajia Li, Grace Ngai, and Hong Va Leong. Stressclick: Sensing stress from gaze-click patterns. In ACM MM, 2016.
  • Anna Maria Feit, Shane Williams, Arturo Toledo, Ann Paradiso, Harish Kulkarni, Shaun K. Kane, and Meredith Ringel Morris. Toward everyday gaze input: Accuracy and precision of eye tracking and implications for design. In ACM CHI, pages 1118–1130, 2017.
  • B.A. Smith, Q. Yin, S.K. Feiner, and S.K. Nayar. Gaze Locking: Passive Eye Contact Detection for Human-Object Interaction. In ACM UIST, pages 271–280, 2013.
  • Xucong Zhang, Yusuke Sugano, and Andreas Bulling. Everyday eye contact detection using unsupervised gaze target discovery. In ACM UIST, pages 193–203, 2017.
  • Anjul Patney, Joohwan Kim, Marco Salvi, Anton Kaplanyan, Chris Wyman, Nir Benty, Aaron Lefohn, and David Luebke. Perceptually-based foveated virtual reality. In SIGGRAPH, 2016.
  • Erroll Wood, Tadas Baltrušaitis, Xucong Zhang, Yusuke Sugano, Peter Robinson, and Andreas Bulling. Rendering of eyes for eye-shape registration and gaze estimation. In ICCV, pages 3756–3764, 2015.
  • Erroll Wood, Tadas Baltrušaitis, Louis-Philippe Morency, Peter Robinson, and Andreas Bulling. Learning an appearance-based gaze estimator from one million synthesised images. In ACM ETRA, pages 131–138, 2016.
  • Ashish Shrivastava, Tomas Pfister, Oncel Tuzel, Joshua Susskind, Wenda Wang, and Russell Webb. Learning from simulated and unsupervised images through adversarial training. In CVPR, 2017.
  • Kangwook Lee, Hoon Kim, and Changho Suh. Simulated+unsupervised learning with adaptive data generation and bidirectional mappings. In ICLR, 2018.
  • Yaroslav Ganin, Daniil Kononenko, Diana Sungatullina, and Victor Lempitsky. DeepWarp: Photorealistic image resynthesis for gaze manipulation. In ECCV, pages 311–326.
  • Yu Yu, Gang Liu, and Jean-Marc Odobez. Improving few-shot user-specific gaze adaptation via gaze redirection synthesis. In CVPR, 2019.
  • Zhe He, Adrian Spurr, Xucong Zhang, and Otmar Hilliges. Photo-realistic monocular gaze redirection using generative adversarial networks. In ICCV, 2019.
  • Kyle Krafka, Aditya Khosla, Petr Kellnhofer, Harini Kannan, Suchendra Bhandarkar, Wojciech Matusik, and Antonio Torralba. Eye Tracking for Everyone. In CVPR, 2016.
  • Xucong Zhang, Yusuke Sugano, Mario Fritz, and Andreas Bulling. Appearance-based gaze estimation in the wild. In CVPR, 2015.
  • Seonwook Park, Shalini De Mello, Pavlo Molchanov, Umar Iqbal, Otmar Hilliges, and Jan Kautz. Few-shot adaptive gaze estimation. In ICCV, 2019.
  • Yunjey Choi, Minje Choi, Munyoung Kim, Jung-Woo Ha, Sunghun Kim, and Jaegul Choo. StarGAN: Unified generative adversarial networks for multi-domain image-to-image translation. In CVPR, pages 8789–8797, 2018.
  • Zhenliang He, Wangmeng Zuo, Meina Kan, Shiguang Shan, and Xilin Chen. AttGAN: Facial attribute editing by only changing what you want. TIP, 28(11):5464–5478, 2019.
  • Albert Pumarola, Antonio Agudo, Aleix M Martinez, Alberto Sanfeliu, and Francesc Moreno-Noguer. GANimation: Anatomically-aware facial animation from a single image. In ECCV, pages 818–833, 2018.
  • Po-Wei Wu, Yu-Jing Lin, Che-Han Chang, Edward Y Chang, and Shih-Wei Liao. RelGAN: Multi-domain image-to-image translation via relative attributes. In ICCV, pages 5914–5922, 2019.
  • Erroll Wood, Tadas Baltrušaitis, Louis-Philippe Morency, Peter Robinson, and Andreas Bulling. Learning an appearance-based gaze estimator from one million synthesised images. In ACM ETRA, pages 131–138, 2016.
  • Justin Johnson, Alexandre Alahi, and Li Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In ECCV, pages 694–711.
  • Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In ICCV, pages 2223–2232, 2017.
  • Erroll Wood, Tadas Baltrušaitis, Louis-Philippe Morency, Peter Robinson, and Andreas Bulling. GazeDirector: Fully articulated eye gaze redirection in video. In Computer Graphics Forum, volume 37, pages 217–225. Wiley Online Library, 2018.
  • Xucong Zhang, Yusuke Sugano, Mario Fritz, and Andreas Bulling. MPIIGaze: Real-world dataset and deep appearance-based gaze estimation. TPAMI, 2019.
  • Kang Wang, Rui Zhao, and Qiang Ji. A hierarchical generative model for eye image synthesis and eye gaze estimation. In CVPR, 2018.
  • Seonwook Park, Xucong Zhang, Andreas Bulling, and Otmar Hilliges. Learning to find eye region landmarks for remote gaze estimation in unconstrained settings. In ACM ETRA, 2018.
  • Yifei Huang, Minjie Cai, Zhenqiang Li, and Yoichi Sato. Predicting gaze in egocentric video by learning task-dependent attention transition. In ECCV, 2018.
  • Taesung Park, Ming-Yu Liu, Ting-Chun Wang, and Jun-Yan Zhu. Semantic image synthesis with spatially-adaptive normalization. In CVPR, pages 2337–2346, 2019.
  • Hsin-Ying Lee, Hung-Yu Tseng, Jia-Bin Huang, Maneesh Singh, and Ming-Hsuan Yang. Diverse image-to-image translation via disentangled representations. In ECCV, pages 35–51, 2018.
  • Youssef Alami Mejjati, Christian Richardt, James Tompkin, Darren Cosker, and Kwang In Kim. Unsupervised attention-guided image-to-image translation. In NeurIPS, pages 3693–3703, 2018.
  • Marek Kowalski, Stephan J. Garbin, Virginia Estellers, Tadas Baltrušaitis, Matthew Johnson, and Jamie Shotton. CONFIG: Controllable neural face image generation. In ECCV, 2020.
  • Mehdi Mirza and Simon Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.
  • Geoffrey E Hinton, Alex Krizhevsky, and Sida D Wang. Transforming auto-encoders. In ICANN, 2011.
  • Xu Chen, Jie Song, and Otmar Hilliges. Monocular neural image based rendering with continuous view control. In ICCV, 2019.
  • Siva Karthik Mustikovela, Varun Jampani, Shalini De Mello, Sifei Liu, Umar Iqbal, Carsten Rother, and Jan Kautz. Self-supervised viewpoint learning from image collections. In CVPR, 2020.
  • Thu Nguyen-Phuoc, Chuan Li, Lucas Theis, Christian Richardt, and Yong-Liang Yang. HoloGAN: Unsupervised learning of 3D representations from natural images. In ICCV, pages 7588–7597, 2019.
  • Daniel E Worrall, Stephan J Garbin, Daniyar Turmukhambetov, and Gabriel J Brostow. Interpretable transformations with encoder-decoder networks. In ICCV, 2017.
  • Helge Rhodin, Mathieu Salzmann, and Pascal Fua. Unsupervised geometry-aware representation for 3D human pose estimation. In ECCV, 2018.
  • Alexey Dosovitskiy and Thomas Brox. Generating images with perceptual similarity metrics based on deep networks. In NeurIPS, pages 658–666, 2016.
  • Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In NeurIPS, pages 2672–2680, 2014.
  • P. Isola, J. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial networks. In CVPR, 2017.
  • Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2014.
  • Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. CoRR, abs/1512.03385, 2015.
  • Yusuke Sugano, Yasuyuki Matsushita, and Yoichi Sato. Learning-by-Synthesis for Appearance-based 3D Gaze Estimation. In CVPR, 2014.
  • Xucong Zhang, Yusuke Sugano, and Andreas Bulling. Revisiting data normalization for appearance-based gaze estimation. In ETRA, 2018.
  • Xucong Zhang, Yusuke Sugano, Mario Fritz, and Andreas Bulling. It’s written all over your face: Full-face appearance-based gaze estimation. In CVPRW, 2017.
  • Kenneth Alberto Funes Mora, Florent Monay, and Jean-Marc Odobez. EYEDIAP: A database for the development and evaluation of gaze estimation algorithms from RGB and RGB-D cameras. In ACM ETRA, 2014.
  • Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. beta-VAE: Learning basic visual concepts with a constrained variational framework. In ICLR, 2017.
  • Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, pages 586–595, 2018.
  • G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger. Densely connected convolutional networks. In CVPR, pages 2261–2269, 2017.
  • Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
  • J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR, 2009.
  • Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C Courville. Improved training of Wasserstein GANs. In NeurIPS, 2017.
  • Peiyun Hu and Deva Ramanan. Finding tiny faces. In CVPR, 2017.
  • J. Deng, Y. Zhou, S. Cheng, and S. Zaferiou. Cascade multi-view hourglass model for robust 3D face alignment. In FG, 2018.
  • Patrik Huber, Guosheng Hu, Rafael Tena, Pouria Mortazavian, P Koppen, William J Christmas, Matthias Ratsch, and Josef Kittler. A multiresolution 3D morphable face model and fitting framework. In VISIGRAPP, 2016.
Authors
Yufeng Zheng
Seonwook Park