Controllable Person Image Synthesis with Attribute-Decomposed GAN

CVPR, pp. 5083-5092, 2020.

Keywords:
fusion module, style code, global texture encoding, adaptive instance normalization, Structural Similarity (14+ more)

Abstract:

This paper introduces the Attribute-Decomposed GAN, a novel generative model for controllable person image synthesis, which can produce realistic person images with desired human attributes (e.g., pose, head, upper clothes and pants) provided in various source inputs. The core idea of the proposed model is to embed human attributes into...

Introduction
  • Person image synthesis (PIS), a challenging problem in areas of Computer Vision and Computer Graphics, has huge potential applications for image editing, movie making, person re-identification (Re-ID), virtual clothes try-on and so on.
  • The authors propose a brand-new task that aims at synthesizing person images with controllable human attributes, including pose and component attributes such as head, upper clothes and pants.
  • The proposed model embeds component attributes into the latent space to construct the style code and encodes the keypoint-based 2D skeleton extracted from the person image as the pose code, which enables intuitive component-specific control of the synthesis by freely editing the style code.
  • The authors' method can automatically synthesize high-quality person images with desired component attributes under arbitrary poses and can be widely applied to pose transfer, Re-ID, garment transfer and attribute-specific data augmentation
Highlights
  • Person image synthesis (PIS), a challenging problem in areas of Computer Vision and Computer Graphics, has huge potential applications for image editing, movie making, person re-identification (Re-ID), virtual clothes try-on and so on
  • Detailed results are shown in the following subsections and more are available in the supplemental materials (Supp)
  • Inception Score (IS) [32] and Structural Similarity (SSIM) [37] are the two most commonly used evaluation metrics in the person image synthesis task; they were first used in PG2 [23]
  • We introduce a new metric called the contextual (CX) score, which was proposed for image transformation [25] and uses the cosine distance between deep features to measure the similarity of two non-aligned images, ignoring the spatial position of the features
  • We presented a novel Attribute-Decomposed GAN for controllable person image synthesis, which allows flexible and continuous control of human attributes
  • Our method introduces a new generator architecture which embeds the source person image into the latent space as a series of decomposed component codes and recombines these codes in a specific order to construct the full style code
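The idea of recombining decomposed component codes in a fixed order to form the full style code can be illustrated with a minimal sketch (the component set and the 8-dimensional code size here are illustrative assumptions, not the paper's exact architecture):

```python
import numpy as np

# Illustrative component set; each component would get its own style code
# from a separate texture encoder in the actual model.
COMPONENT_ORDER = ["head", "upper_clothes", "pants"]

def build_style_code(component_codes):
    """Concatenate per-component codes in a fixed order into the full style code.

    component_codes: dict mapping component name -> 1-D numpy array.
    """
    return np.concatenate([component_codes[name] for name in COMPONENT_ORDER])

codes = {name: np.random.randn(8) for name in COMPONENT_ORDER}
style = build_style_code(codes)
print(style.shape)  # (24,)
```

Because each component occupies a fixed slice of the style code, swapping a single component's code taken from another source image changes only that attribute in the synthesized result.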
Methods
  • The authors' goal is to synthesize high-quality person images with user-controlled human attributes, such as pose, head, upper clothes and pants.
  • The corresponding keypoint-based pose P ∈ R^{18×H×W} of I, an 18-channel heat map encoding the locations of the 18 joints of a human body, can be automatically extracted via an existing pose estimation method [5].
  • Pt and a source person image Is are fed into the generator, and the synthesized image Ig, which follows the appearance of Is but adopts the pose Pt, is challenged for realism by the discriminators.
  • The authors give a detailed description of each part of the model
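The keypoint-to-heat-map encoding of the pose P ∈ R^{18×H×W} can be sketched as follows (the Gaussian spread `sigma` and the missing-joint convention are illustrative assumptions; the 18 joints come from the pose estimator of [5]):

```python
import numpy as np

def keypoints_to_heatmaps(keypoints, height, width, sigma=6.0):
    """Encode 2-D joint locations as an 18-channel heat map P in R^{18 x H x W}.

    keypoints: array of shape (18, 2) with (x, y) joint coordinates;
               a negative coordinate marks a missing joint (empty channel).
    """
    heatmaps = np.zeros((len(keypoints), height, width), dtype=np.float32)
    ys, xs = np.mgrid[0:height, 0:width]
    for i, (x, y) in enumerate(keypoints):
        if x < 0 or y < 0:  # joint not detected by the pose estimator
            continue
        heatmaps[i] = np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * sigma ** 2))
    return heatmaps

joints = np.array([[32, 20]] + [[-1, -1]] * 17)  # one visible joint
P = keypoints_to_heatmaps(joints, height=128, width=64)
print(P.shape)  # (18, 128, 64)
```

Each channel peaks at its joint's location, giving the generator a spatial pose code it can condition on directly.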
Results
  • The authors verify the effectiveness of the proposed network for attribute-guided person image synthesis tasks and illustrate its superiority over other state-of-the-art methods.
  • CX can explicitly assess the texture coherence between two images, making it suitable for measuring the appearance consistency of the generated image with the source image (denoted CX-GS) and with the ground truth (CX-GT).
  • In addition to these computed metrics, the authors conducted a user study in which humans assessed the realism of the synthesized images
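The cosine-distance matching underlying the CX score can be sketched in simplified form (the actual metric [25] operates on pretrained VGG features and adds normalization and bandwidth terms omitted here):

```python
import numpy as np

def contextual_similarity(feats_a, feats_b, eps=1e-8):
    """Simplified contextual similarity between two non-aligned feature sets.

    feats_a, feats_b: arrays of shape (N, D) and (M, D) of deep features.
    For every feature in A, find its best cosine match in B and average;
    spatial position is ignored, so the measure tolerates misalignment.
    """
    a = feats_a / (np.linalg.norm(feats_a, axis=1, keepdims=True) + eps)
    b = feats_b / (np.linalg.norm(feats_b, axis=1, keepdims=True) + eps)
    cos = a @ b.T                      # (N, M) pairwise cosine similarities
    return cos.max(axis=1).mean()      # best match per feature, averaged

x = np.random.randn(50, 64)
print(contextual_similarity(x, x))  # identical feature sets score close to 1.0
```

Because the best match is searched over all positions, the score stays high when the generated image shows the same textures as the source even under a different pose.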
Conclusion
  • The authors presented a novel Attribute-Decomposed GAN for controllable person image synthesis, which allows flexible and continuous control of human attributes.
  • The authors' method introduces a new generator architecture which embeds the source person image into the latent space as a series of decomposed component codes and recombines these codes in a specific order to construct the full style code.
  • The authors believe that the solution of using an off-the-shelf human parser to automatically separate component attributes from the entire person image could inspire future research in settings with insufficient data annotation.
  • The authors' method is well suited to generating person images and can potentially be adapted to other image synthesis tasks
Tables
  • Table1: Quantitative comparison with state-of-the-art methods on DeepFashion
  • Table2: Results of the user study (%). R2G means the percentage of real images rated as generated w.r.t. all real images. G2R means the percentage of generated images rated as real w.r.t. all generated images. The user preference of the most realistic images w.r.t. source persons is shown in the last row
Related work
  • 2.1. Image Synthesis

    Due to their remarkable results, Generative Adversarial Networks (GANs) [13] have become powerful generative models for image synthesis [16, 44, 4] in the last few years. The image-to-image translation task was solved with conditional GANs [26] in Pix2pix [16] and extended to the high-resolution level in Pix2pixHD [36]. Zhu et al. [44] introduced an unsupervised method, CycleGAN, exploiting cycle consistency to generate images across two domains with unlabeled data. Much of the work focused on improving the quality of GAN-synthesized images via stacked architectures [43, 27], more interpretable latent representations [7] or self-attention mechanisms [42]. StyleGAN [18] synthesized impressive images by proposing a brand-new generator architecture which controls the generator via adaptive instance normalization (AdaIN) [15], an outcome of the style transfer literature [10, 11, 17]. However, these techniques have limited scalability in handling attribute-guided person synthesis, due to the complex appearance of persons and the sparse pose representation with only several keypoints. Our method, built on GANs, overcomes these challenges with a novel generator architecture designed around attribute decomposition.
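AdaIN [15], mentioned above as the style-injection mechanism in StyleGAN, simply replaces the per-channel statistics of a content feature map with those of a style input; a minimal sketch (assuming NCHW feature tensors):

```python
import numpy as np

def adain(content, style, eps=1e-5):
    """Adaptive instance normalization (Huang & Belongie, 2017).

    content, style: feature maps of shape (N, C, H, W).
    Normalizes each content channel to zero mean / unit variance per sample,
    then rescales it with the corresponding style channel's statistics.
    """
    c_mean = content.mean(axis=(2, 3), keepdims=True)
    c_std = content.std(axis=(2, 3), keepdims=True) + eps
    s_mean = style.mean(axis=(2, 3), keepdims=True)
    s_std = style.std(axis=(2, 3), keepdims=True)
    return s_std * (content - c_mean) / c_std + s_mean

c = np.random.randn(1, 3, 8, 8)          # content features
s = 2.0 * np.random.randn(1, 3, 8, 8) + 5.0  # style features with shifted stats
out = adain(c, s)
```

After the transform, each output channel carries the style input's mean and variance while preserving the content's spatial structure, which is why a compact style code can modulate an entire feature map.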
Funding
  • This work was supported by National Natural Science Foundation of China (Grant No.: 61672043 and 61672056), Beijing Nova Program of Science and Technology (Grant No.: Z191100001119077), Key Laboratory of Science, Technology and Standard in Press Industry (Key Laboratory of Intelligent Press Media Technology)
Reference
  • Kfir Aberman, Rundi Wu, Dani Lischinski, Baoquan Chen, and Daniel Cohen-Or. Learning character-agnostic motion for motion retargeting in 2d. arXiv preprint arXiv:1905.01680, 2019. 2
  • Vijay Badrinarayanan, Alex Kendall, and Roberto Cipolla. Segnet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE transactions on pattern analysis and machine intelligence, 39(12):2481–2495, 2017. 6
  • Guha Balakrishnan, Amy Zhao, Adrian V Dalca, Fredo Durand, and John Guttag. Synthesizing images of humans in unseen poses. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8340– 8348, 2018. 2
  • Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale gan training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096, 2018. 2
  • Zhe Cao, Tomas Simon, Shih-En Wei, and Yaser Sheikh. Realtime multi-person 2d pose estimation using part affinity fields. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7291–7299, 2017. 3
  • Caroline Chan, Shiry Ginosar, Tinghui Zhou, and Alexei A Efros. Everybody dance now. In Proceedings of the IEEE International Conference on Computer Vision, pages 5933– 5942, 2019. 2
  • Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. Infogan: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in neural information processing systems, pages 2172–2180, 2016. 2
  • Vincent Dumoulin, Jonathon Shlens, and Manjunath Kudlur. A learned representation for artistic style. arXiv preprint arXiv:1610.07629, 2016. 5
  • Patrick Esser, Ekaterina Sutter, and Bjorn Ommer. A variational u-net for conditional appearance and shape generation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8857–8866, 2018. 1, 2, 5
  • Leon Gatys, Alexander S Ecker, and Matthias Bethge. Texture synthesis using convolutional neural networks. In Advances in Neural Information Processing Systems, pages 262–270, 2015. 2
  • Leon A Gatys, Alexander S Ecker, and Matthias Bethge. Image style transfer using convolutional neural networks. In Computer Vision and Pattern Recognition (CVPR), 2016 IEEE Conference on, pages 2414–2423. IEEE, 2016. 2, 4
  • Ke Gong, Xiaodan Liang, Dongyu Zhang, Xiaohui Shen, and Liang Lin. Look into person: Self-supervised structuresensitive learning and a new benchmark for human parsing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 932–940, 2017. 4, 6
  • Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014. 2
  • Zhenliang He, Wangmeng Zuo, Meina Kan, Shiguang Shan, and Xilin Chen. Attgan: Facial attribute editing by only changing what you want. IEEE Transactions on Image Processing, 2019. 2, 3
  • Xun Huang and Serge Belongie. Arbitrary style transfer in real-time with adaptive instance normalization. In Proceedings of the IEEE International Conference on Computer Vision, pages 1501–1510, 2017. 2, 4, 5
  • Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1125–1134, 2017. 2, 5
  • Justin Johnson, Alexandre Alahi, and Li Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In European Conference on Computer Vision, pages 694–711. Springer, 2016. 2
  • Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4401–4410, 2019. 2
  • Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014. 6
  • Christoph Lassner, Gerard Pons-Moll, and Peter V Gehler. A generative model of people in clothing. In Proceedings of the IEEE International Conference on Computer Vision, pages 853–862, 2017. 2
  • Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollar, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In European conference on computer vision, pages 740–755. Springer, 2014. 4
  • Ziwei Liu, Ping Luo, Shi Qiu, Xiaogang Wang, and Xiaoou Tang. Deepfashion: Powering robust clothes recognition and retrieval with rich annotations. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1096–1104, 2016. 6
  • Liqian Ma, Xu Jia, Qianru Sun, Bernt Schiele, Tinne Tuytelaars, and Luc Van Gool. Pose guided person image generation. In Advances in Neural Information Processing Systems, pages 406–416, 2017. 1, 2, 3, 6, 7
  • Liqian Ma, Qianru Sun, Stamatios Georgoulis, Luc Van Gool, Bernt Schiele, and Mario Fritz. Disentangled person image generation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 99–108, 2018. 1, 2, 7
  • Roey Mechrez, Itamar Talmi, and Lihi Zelnik-Manor. The contextual loss for image transformation with non-aligned data. In Proceedings of the European Conference on Computer Vision (ECCV), pages 768–783, 2018. 6
  • Mehdi Mirza and Simon Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014. 2
  • Goncalo Mordido, Haojin Yang, and Christoph Meinel. Dropout-gan: Learning from a dynamic ensemble of discriminators. arXiv preprint arXiv:1807.11346, 2018. 2
  • Gerard Pons-Moll, Sergi Pujades, Sonny Hu, and Michael J Black. Clothcap: Seamless 4d clothing capture and retargeting. ACM Transactions on Graphics (TOG), 36(4):73, 2017. 2
  • Albert Pumarola, Antonio Agudo, Alberto Sanfeliu, and Francesc Moreno-Noguer. Unsupervised person image synthesis in arbitrary poses. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8620–8628, 2018. 3
  • Amit Raj, Patsorn Sangkloy, Huiwen Chang, James Hays, Duygu Ceylan, and Jingwan Lu. Swapnet: Image based garment transfer. In European Conference on Computer Vision, pages 679–695. Springer, 2018. 2
  • Olaf Ronneberger, Philipp Fischer, and Thomas Brox. Unet: Convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pages 234–241. Springer, 2015. 2
  • Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training gans. In Advances in neural information processing systems, pages 2234–2242, 2016. 6
  • Aliaksandr Siarohin, Enver Sangineto, Stephane Lathuiliere, and Nicu Sebe. Deformable gans for pose-based human image generation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3408– 3416, 2018. 1, 2, 5, 6, 7
  • Sijie Song, Wei Zhang, Jiaying Liu, and Tao Mei. Unsupervised person image generation with semantic parsing transformation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2357– 2366, 2019. 3
  • Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Guilin Liu, Andrew Tao, Jan Kautz, and Bryan Catanzaro. Video-tovideo synthesis. arXiv preprint arXiv:1808.06601, 2018. 2
  • Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Andrew Tao, Jan Kautz, and Bryan Catanzaro. High-resolution image synthesis and semantic manipulation with conditional gans. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 8798–8807, 2018. 2
  • Zhou Wang, Alan C Bovik, Hamid R Sheikh, Eero P Simoncelli, et al. Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing, 13(4):600–612, 2004. 6
  • Shan Yang, Tanya Ambert, Zherong Pan, Ke Wang, Licheng Yu, Tamara Berg, and Ming C Lin. Detailed garment recovery from a single-view image. arXiv preprint arXiv:1608.01250, 2016. 2
  • Weidong Yin, Yanwei Fu, Leonid Sigal, and Xiangyang Xue. Semi-latent gan: Learning to generate and modify facial images from attributes. arXiv preprint arXiv:1704.02166, 2017. 2, 3
  • Mihai Zanfir, Alin-Ionut Popa, Andrei Zanfir, and Cristian Sminchisescu. Human appearance transfer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5391–5399, 2018. 2
  • Gang Zhang, Meina Kan, Shiguang Shan, and Xilin Chen. Generative adversarial network with spatial attention for face attribute editing. In Proceedings of the European Conference on Computer Vision (ECCV), pages 417–432, 2018. 2, 3
  • Han Zhang, Ian Goodfellow, Dimitris Metaxas, and Augustus Odena. Self-attention generative adversarial networks. arXiv preprint arXiv:1805.08318, 2018. 2
  • Han Zhang, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaogang Wang, Xiaolei Huang, and Dimitris N Metaxas. Stackgan: Text to photo-realistic image synthesis with stacked generative adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 5907– 5915, 2017. 2
  • Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired image-to-image translation using cycleconsistent adversarial networks. In Proceedings of the IEEE international conference on computer vision, pages 2223– 2232, 2017. 2
  • Shizhan Zhu, Raquel Urtasun, Sanja Fidler, Dahua Lin, and Chen Change Loy. Be your own prada: Fashion synthesis with structural coherence. In Proceedings of the IEEE International Conference on Computer Vision, pages 1680–1688, 2017. 2
  • Zhen Zhu, Tengteng Huang, Baoguang Shi, Miao Yu, Bofei Wang, and Xiang Bai. Progressive pose attention transfer for person image generation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2347–2356, 2019. 2, 3, 5, 6, 7