We have found it difficult to optimize the adversarial objective in isolation: standard procedures often lead to the well-known problem of mode collapse, where all input images map to the same output image and the optimization fails to make progress
Unpaired Image-To-Image Translation Using Cycle-Consistent Adversarial Networks
2017 IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV), no. 1 (2017): 2242-2251
Image-to-image translation is a class of vision and graphics problems where the goal is to learn the mapping between an input image and an output image using a training set of aligned image pairs. However, for many tasks, paired training data will not be available. We present an approach for learning to translate an image from a source do...更多
下载 PDF 全文
- Image-to-image translation is a class of vision and graphics problems where the goal is to learn the mapping between an input image and an output image using a training set of aligned image pairs.
- The authors seek an algorithm that can learn to translate between domains without paired input-output examples (Figure 2, right).
- The authors adopt an adversarial loss to learn the mapping such that the translated image cannot be distinguished from images in the target domain.
- Image-to-image translation is a class of vision and graphics problems where the goal is to learn the mapping between an input image and an output image using a training set of aligned image pairs
- In practice, we have found it difficult to optimize the adversarial objective in isolation: standard procedures often lead to the well-known problem of mode collapse, where all input images map to the same output image and the optimization fails to make progress 
- Our objective contains kinds of two terms: adversarial losses  for matching the distribution of generated images to the data distribution in the target domain; and a cycle consistency loss to prevent the where λ controls the relative importance of the two objectives. learned mappings G and F from contradicting each other
- We introduce a similar adversarial loss for the mapping function F : Y → X and its discriminator DX as well: i.e. LGAN(F, DX , Y, X)
- We first compare our approach against recent methods for unpaired image-to-image translation on paired datasets where ground truth input-output pairs are available for evaluation
- We study the importance of both the adversarial loss and the cycle consistency loss, and compare our full method against several variants
- The authors' objective contains kinds of two terms: adversarial losses  for matching the distribution of generated images to the data distribution in the target domain; and a cycle consistency loss to prevent the where λ controls the relative importance of the two objectives.
- Adversarial training can, in theory, learn mappings G and F that produce outputs identically distributed as target domains Y and X respectively .
- The authors first compare the approach against recent methods for unpaired image-to-image translation on paired datasets where ground truth input-output pairs are available for evaluation.
- Pixel loss + GAN  Like the method, Shrivastava et al  uses an adversarial loss to train a translation from X to Y .
- The authors exclude pixel loss + GAN and feature loss + GAN in the figures, as both of the methods fail to produce results at all close to the target domain.
- The authors evaluate the method with the cycle loss in only one direction: GAN+forward cycle loss Ex∼pdata(x)[ F (G(x)) − x 1], or GAN+backward cycle loss Ey∼pdata(y)[ G(F (y)) − y 1] (Equation 2) and find that it often incurs training instability and causes mode collapse, especially for the direction of the mapping that was removed.
- Collection style transfer (Figure 8) The authors train the model on landscape photographs downloaded from Flickr and WikiArt. Note that unlike recent work on “neural style transfer” , the method learns to mimic the style of an entire set of artworks (e.g. Van Gogh), rather than transferring the style of a single selected piece of art (e.g. Starry Night).
- Photo generation from paintings (Figure 9) For painting→photo, the authors find that it is helpful to introduce an additional loss to encourage the mapping to preserve color composition between the input and output.
- When learning the mapping between Monet’s paintings and Flickr photographs, the generator often maps paintings of daytime to photographs taken during sunset, because such a mapping may be valid under the adversarial loss and cycle consistency loss.
- On the task of dog→cat transfiguration, the learned translation degenerates to making minimal changes to the input (Figure 12).
- Table1: AMT “real vs fake” test on maps↔aerial photos
- Table2: FCN-scores for different methods, evaluated on Cityscapes labels→photos
- Table3: Classification performance of photo→labels for different methods on cityscapes
- Generative Adversarial Networks (GANs) [14, 58] have achieved impressive results in image generation [5, 35], image editing , and representation learning [35, 39, 33]. Recent methods adopt the same idea for conditional image generation applications, such as text2image , image inpainting , and future prediction , as well as to other domains like videos  and 3D models . The key to GANs’ success is the idea of an adversarial loss that forces the generated images to be, in principle, indistinguishable from real images. This is particularly powerful for image generation tasks, as this is exactly the objective that much of computer graphics aims to optimize. We adopt an adversarial loss to learn the mapping such that the translated image cannot be distinguished from images in the target domain.
Image-to-Image Translation The idea of image-to-image translation goes back at least to Hertzmann et al.’s Image Analogies , who employ a nonparametric texture model  on a single input-output training image pair. More
- This work was supported in part by NSF SMA-1514512, NSF IIS1633310, a Google Research Award, Intel Corp, and hardware donations from NVIDIA
- JYZ is supported by the Facebook Graduate Fellowship and TP is supported by the Samsung Scholarship
- Y. Aytar, L. Castrejon, C. Vondrick, H. Pirsiavash, and A. Torralba. Cross-modal scene networks. arXiv preprint arXiv:1610.09003, 2016. 3
- K. Bousmalis, N. Silberman, D. Dohan, D. Erhan, and D. Krishnan. Unsupervised pixel-level domain adaptation with generative adversarial networks. arXiv preprint arXiv:1612.05424, 2016. 3
- R. W. Brislin. Back-translation for cross-cultural research. Journal of cross-cultural psychology, 1(3):185– 216, 1970. 2, 3
- M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele. The cityscapes dataset for semantic urban scene understanding. In CVPR, 2016. 2, 6
- E. L. Denton, S. Chintala, R. Fergus, et al. Deep generative image models using a laplacian pyramid of adversarial networks. In NIPS, pages 1486–1494, 2012
- J. Donahue, P. Krahenbuhl, and T. Darrell. Adversarial feature learning. arXiv preprint arXiv:1605.09782, 2015
- V. Dumoulin, I. Belghazi, B. Poole, A. Lamb, M. Arjovsky, O. Mastropietro, and A. Courville. Adversarially learned inference. arXiv preprint arXiv:1606.00704, 2016. 5
- A. A. Efros and T. K. Leung. Texture synthesis by non-parametric sampling. In ICCV, volume 2, pages 1033–103IEEE, 1999. 2
- D. Eigen and R. Fergus. Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In ICCV, pages 2650–2658, 2015. 2
- L. A. Gatys, M. Bethge, A. Hertzmann, and E. Shechtman. Preserving color in neural artistic style transfer. arXiv preprint arXiv:1606.05897, 2016. 3
- L. A. Gatys, A. S. Ecker, and M. Bethge. Image style transfer using convolutional neural networks. CVPR, 2016. 3, 6, 8
- C. Godard, O. Mac Aodha, and G. J. Brostow. Unsupervised monocular depth estimation with left-right consistency. In CVPR, 2017. 3
- I. Goodfellow. Nips 2016 tutorial: Generative adversarial networks. arXiv preprint arXiv:1701.00160, 2016. 2, 4
- I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In NIPS, 202, 3, 4, 5
- D. He, Y. Xia, T. Qin, L. Wang, N. Yu, T. Liu, and W.-Y. Ma. Dual learning for machine translation. In NIPS, pages 820–828, 2016. 3
- K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, pages 770– 778, 204
- A. Hertzmann, C. E. Jacobs, N. Oliver, B. Curless, and D. H. Salesin. Image analogies. In SIGGRAPH, pages 327–340. ACM, 2001. 2
- G. E. Hinton and R. R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006. 4
- Q.-X. Huang and L. Guibas. Consistent shape maps via semidefinite programming. In Computer Graphics Forum, volume 32, pages 177–186. Wiley Online Library, 2013. 3
- P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-toimage translation with conditional adversarial networks. In CVPR, 2017. 2, 3, 4, 5
- J. Johnson, A. Alahi, and L. Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In ECCV, pages 694–711. Springer, 2016. 2, 3, 4
- L. Karacan, Z. Akata, A. Erdem, and E. Erdem. Learning to generate images of outdoor scenes from attributes and semantic layouts. arXiv preprint arXiv:1612.00215, 2016. 3
- D. P. Kingma and M. Welling. Auto-encoding variational bayes. ICLR, 2014. 3
- P.-Y. Laffont, Z. Ren, X. Tao, C. Qian, and J. Hays. Transient attributes for high-level understanding and editing of outdoor scenes. ACM Transactions on Graphics (TOG), 33(4):149, 2014. 2
- C. Ledig, L. Theis, F. Huszar, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang, et al. Photo-realistic single image super-resolution using a generative adversarial network. arXiv preprint arXiv:1609.04802, 2016. 4
- C. Li and M. Wand. Precomputed real-time texture synthesis with markovian generative adversarial networks. ECCV, 2016. 4
- M.-Y. Liu, T. Breuel, and J. Kautz. Unsupervised image-to-image translation networks. arXiv preprint arXiv:1703.00848, 2017. 3
- M.-Y. Liu and O. Tuzel. Coupled generative adversarial networks. In NIPS, pages 469–477, 2016. 3, 5
- J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, pages 3431–3440, 2015. 2, 3, 6
- A. Makhzani, J. Shlens, N. Jaitly, I. Goodfellow, and B. Frey. Adversarial autoencoders. arXiv preprint arXiv:1511.05644, 2015. 4
- X. Mao, Q. Li, H. Xie, R. Y. Lau, and Z. Wang. Multiclass generative adversarial networks with the l2 loss function. arXiv preprint arXiv:1611.04076, 2016. 4
- M. Mathieu, C. Couprie, and Y. LeCun. Deep multiscale video prediction beyond mean square error. ICLR, 2016. 2
- M. F. Mathieu, J. Zhao, A. Ramesh, P. Sprechmann, and Y. LeCun. Disentangling factors of variation in deep representation using adversarial training. In NIPS, pages 5040–5048, 2016. 2
- D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell, and A. A. Efros. Context encoders: Feature learning by inpainting. CVPR, 2016. 2
- A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015. 2
- S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee. Generative adversarial text to image synthesis. arXiv preprint arXiv:1605.05396, 2016. 2
- R. Rosales, K. Achan, and B. J. Frey. Unsupervised image translation. In iccv, pages 472–478, 2003. 3
- O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. Imagenet large scale visual recognition challenge. IJCV, 115(3):211–252, 2015. 6
- T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. Improved techniques for training gans. arXiv preprint arXiv:1606.03498, 2016. 2
- P. Sangkloy, J. Lu, C. Fang, F. Yu, and J. Hays. Scribbler: Controlling deep image synthesis with sketch and color. In CVPR, 2017. 3
- Y. Shih, S. Paris, F. Durand, and W. T. Freeman. Datadriven hallucination of different times of day from a single outdoor photo. ACM Transactions on Graphics (TOG), 32(6):200, 2013. 2
- A. Shrivastava, T. Pfister, O. Tuzel, J. Susskind, W. Wang, and R. Webb. Learning from simulated and unsupervised images through adversarial training. arXiv preprint arXiv:1612.07828, 2016. 3, 4, 5
- K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014. 5
- N. Sundaram, T. Brox, and K. Keutzer. Dense point trajectories by gpu-accelerated large displacement optical flow. In ECCV, pages 438–451. Springer, 2010. 3
- Y. Taigman, A. Polyak, and L. Wolf. Unsupervised cross-domain image generation. arXiv preprint arXiv:1611.02200, 2016. 3, 6
- D. Turmukhambetov, N. D. Campbell, S. J. Prince, and J. Kautz. Modeling object appearance using contextconditioned component analysis. In CVPR, pages 4156– 4164, 2015. 6
- M. Twain. The Jumping Frog: in English, then in French, and then Clawed Back into a Civilized Language Once More by Patient, Unremunerated Toil. 1903. 3
- D. Ulyanov, V. Lebedev, A. Vedaldi, and V. Lempitsky. Texture networks: Feed-forward synthesis of textures and stylized images. In Int. Conf. on Machine Learning (ICML), 2016. 3
- D. Ulyanov, A. Vedaldi, and V. Lempitsky. Instance normalization: The missing ingredient for fast stylization. arXiv preprint arXiv:1607.08022, 2016. 4
- C. Vondrick, H. Pirsiavash, and A. Torralba. Generating videos with scene dynamics. In NIPS, pages 613–621, 2016. 2
- F. Wang, Q. Huang, and L. J. Guibas. Image cosegmentation via consistent functional maps. In ICCV, pages 849–856, 2013. 3
- X. Wang and A. Gupta. Generative image modeling using style and structure adversarial networks. ECCV, 2016. 2
- J. Wu, C. Zhang, T. Xue, B. Freeman, and J. Tenenbaum. Learning a probabilistic latent space of object shapes via 3d generative-adversarial modeling. In NIPS, pages 82–90, 2016. 2
- S. Xie and Z. Tu. Holistically-nested edge detection. In ICCV, 2015. 2
- Z. Yi, H. Zhang, T. Gong, Tan, and M. Gong. Dualgan: Unsupervised dual learning for image-to-image translation. In ICCV, 2017. 3
- C. Zach, M. Klopschitz, and M. Pollefeys. Disambiguating visual relations using loop constraints. In CVPR, pages 1426–1433. IEEE, 2010. 3
- R. Zhang, P. Isola, and A. A. Efros. Colorful image colorization. In ECCV, 2016. 2
- J. Zhao, M. Mathieu, and Y. LeCun. Energybased generative adversarial network. arXiv preprint arXiv:1609.03126, 2016. 2
- T. Zhou, Y. Jae Lee, S. X. Yu, and A. A. Efros. Flowweb: Joint image set alignment by weaving consistent, pixelwise correspondences. In CVPR, pages 1191–1200, 2015. 3
- T. Zhou, P. Krahenbuhl, M. Aubry, Q. Huang, and A. A. Efros. Learning dense correspondence via 3d-guided cycle consistency. In CVPR, pages 117–126, 2016. 2, 3
- J.-Y. Zhu, P. Krahenbuhl, E. Shechtman, and A. A. Efros. Generative visual manipulation on the natural image manifold. In ECCV, 2016. 2