# Unsupervised Pixel-Level Domain Adaptation with Generative Adversarial Networks

CVPR, 2017.

Abstract:

Collecting well-annotated image datasets to train modern machine learning algorithms is prohibitively expensive for many tasks. One appealing alternative is rendering synthetic data where ground-truth annotations are generated automatically. Unfortunately, models trained purely on rendered images often fail to generalize to real images.

Introduction

- Large and well-annotated datasets such as ImageNet [9], COCO [29] and Pascal VOC [12] are considered crucial to advancing computer vision research.
- Creating such datasets is prohibitively expensive.
- One alternative is the use of synthetic data for model training.
- Models naively trained on synthetic data typically do not generalize to real images.

Highlights

- We perform all experiments using the same protocol to ensure a fair and meaningful comparison. Performance on this validation set serves as an upper bound for a satisfactory validation metric in unsupervised domain adaptation
- As we discuss in Section 4.5, we evaluate our model in a semi-supervised setting with 1,000 labeled examples in the target domain, confirming that our pixel-level domain adaptation method still improves upon the naive approach of training on this small set of labeled target examples
- We present a state-of-the-art method for performing unsupervised domain adaptation
- Our pixel-level domain adaptation (PixelDA) models outperform previous work on a set of unsupervised domain adaptation scenarios; in the challenging “Synthetic Cropped Linemod to Cropped Linemod” scenario, our model more than halves the pose estimation error compared to the previous best result
- Our model decouples the process of domain adaptation from the task-specific architecture, and provides the added benefit of being easy to understand via the visualization of the adapted image outputs of the model.

Results

- The authors have not found a universally applicable way to optimize hyperparameters for unsupervised domain adaptation.
- The MNIST-M digits have been generated by using MNIST digits as a binary mask to invert the colors of a background image
- It is clear from Figure 3 that in the “MNIST to MNIST-M” case, the model not only generates backgrounds from different noise vectors z but also learns this inversion process.
- In the depth channel, it learns a plausible noise model.
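The MNIST-M construction described in the Results above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the paper's code: the function name and the binarization threshold of 127 are assumptions.

```python
import numpy as np

def make_mnist_m(digit, background):
    """Sketch of MNIST-M generation: use the digit as a binary mask to
    invert the colors of a background patch.

    digit:      (H, W) grayscale MNIST image, values in [0, 255]
    background: (H, W, 3) RGB patch, values in [0, 255]
    The name and the 127 threshold are illustrative assumptions.
    """
    mask = (digit > 127)[..., None]  # binary mask, broadcast over RGB channels
    return np.where(mask, 255 - background, background).astype(np.uint8)
```

Pixels covered by the digit stroke are color-inverted; the rest of the background patch is left untouched, which is exactly the inversion process the generator must learn to mimic.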

Conclusion

- The authors present a state-of-the-art method for performing unsupervised domain adaptation.
- The authors' PixelDA models outperform previous work on a set of unsupervised domain adaptation scenarios, and in the case of the challenging “Synthetic Cropped Linemod to Cropped Linemod” scenario, the model more than halves the error for pose estimation compared to the previous best result.
- The authors' model decouples the process of domain adaptation from the task-specific architecture, and provides the added benefit of being easy to understand via the visualization of the adapted image outputs of the model.

Summary

- Large and well-annotated datasets such as ImageNet [9], COCO [29] and Pascal VOC [12] are considered crucial to advancing computer vision research.
- Because our PixelDA model maps one image to another at the pixel level, we can alter the task-specific architecture without having to re-train the domain adaptation component.
- Our method outperforms the state-of-the-art unsupervised domain adaptation techniques on a range of datasets for object classification and pose estimation, while generating images that look very similar to the target domain.
- We begin by explaining our model for unsupervised pixel-level domain adaptation (PixelDA) in the context of image classification, though our method is not specific to this particular task.
- The qualitative evaluation involves the examination of the ability of our method to learn the underlying pixel adaptation process from the source to the target domain by visually inspecting the generated images.
- We evaluate our model using the aforementioned combinations of source and target datasets, and compare the performance of our model’s task architecture T to that of other state-of-the-art unsupervised domain adaptation techniques based on the same task architecture T .
- It is clear that our method is able to learn the underlying transformation process that is required to adapt the original source images to images that look like they could belong in the target domain.
- Top Row: Source RGB and depth image pairs x^s from Synth Cropped Linemod; Middle Row: the samples adapted with our model G with random noise z; Bottom Row: the nearest neighbors between the generated samples in the middle row and images from the target training set.
- Our quantitative evaluation (Tables 1 and 2) illustrates the ability of our model to adapt the source images to the target domain style, but raises two questions: is it important that the backgrounds of the source images are black, and how successful are data-augmentation strategies that use a randomly chosen background image instead?
- As demonstrated in Table 3, PixelDA is able to improve upon training ‘Source-only’ models on source images of objects on either black or random Imagenet backgrounds.
- We retrain our best model using a subset of images from the source and target domains which includes only half of the object classes for the “Synthetic Cropped Linemod” to “Cropped Linemod” scenario.
- Our PixelDA models outperform previous work on a set of unsupervised domain adaptation scenarios, and in the case of the challenging “Synthetic Cropped Linemod to Cropped Linemod” scenario, our model more than halves the error for pose estimation compared to the previous best result.
- Our model decouples the process of domain adaptation from the task-specific architecture, and provides the added benefit of being easy to understand via the visualization of the adapted image outputs of the model.

- Table 1: Mean classification accuracy (%) for the digit datasets.
- Table 2: Mean classification accuracy and pose error for the “Synthetic Cropped Linemod to Cropped Linemod” scenario.
- Table 3: Mean classification accuracy and pose error when varying the background of images from the source domain. For these experiments we used only the RGB portions of the images, as there is no trivial or typical way to add backgrounds to depth images. For comparison, we display results with black backgrounds and ImageNet backgrounds (INet), with the “Source Only” setting and with our model for the RGB-only case.
- Table 4: Performance of our model trained on only 6 out of 11 object classes.
- Table 5: The effect of the task and content losses on the standard deviation (std) of our model's performance on the “Synth Cropped Linemod to Cropped Linemod” scenario. L_t^s means we use source data to train T; L_t^a means we use generated (adapted) data to train T; L_c means we use our content-similarity loss.
- Table 6: Semi-supervised experiments for the “Synthetic Cropped Linemod to Cropped Linemod” scenario.

Related work

- Learning to perform unsupervised domain adaptation is an open theoretical and practical problem. While much prior work exists, our literature review focuses primarily on Convolutional Neural Network (CNN) methods due to their empirical superiority on the problem [14, 31, 41, 44].

Unsupervised Domain Adaptation: Ganin et al. [13, 14] and Ajakan et al. [3] introduced the Domain-Adversarial Neural Network (DANN): an architecture trained to extract domain-invariant features. Their model's first few layers are shared by two classifiers: the first predicts task-specific class labels when provided with source data, while the second is trained to predict the domain of its inputs. DANNs minimize the domain classification loss with respect to parameters specific to the domain classifier, while maximizing it with respect to the parameters that are common to both classifiers. This minimax optimization becomes possible in a single step via the use of a gradient reversal layer. While DANN's approach to domain adaptation is to make the features extracted from both domains similar, our approach is to adapt the source images to look as if they were drawn from the target domain. Tzeng et al. [44] and Long et al. [31] proposed versions of DANNs where the maximization of the domain classification loss is replaced by the minimization of the Maximum Mean Discrepancy (MMD) metric [21], computed between features extracted from sets of samples from each domain. Ghifary et al. [17] propose an alternative model in which the task loss for the source domain is combined with a reconstruction loss for the target domain, resulting in domain-invariant features. Bousmalis et al. [5] introduce a model that explicitly separates the components that are private to each domain from those that are common to both domains. They make use of a reconstruction loss for each domain, a similarity loss (e.g., DANN, MMD) which encourages domain invariance, and a difference loss which encourages the common and private representation components to be complementary.
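The gradient reversal layer mentioned above is simple enough to sketch directly. Below is a minimal, framework-free version in NumPy; the class name and the `lam` scaling parameter are illustrative, and a real implementation would hook into an autodiff framework's custom-gradient mechanism rather than expose forward/backward by hand.

```python
import numpy as np

class GradientReversal:
    """Sketch of DANN's gradient reversal layer: identity in the forward
    pass, and gradients scaled by -lam in the backward pass, so that
    minimizing the domain loss downstream of this layer maximizes it with
    respect to the shared feature extractor upstream."""

    def __init__(self, lam=1.0):
        self.lam = lam

    def forward(self, x):
        # Features pass through unchanged.
        return x

    def backward(self, grad_output):
        # Reverse (and scale) the incoming gradient.
        return -self.lam * grad_output
```

This single layer is what lets DANN run its minimax optimization in one backward pass: the domain classifier receives ordinary gradients, while the shared feature layers receive reversed ones.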

Funding

- Our method outperforms the state-of-the-art unsupervised domain adaptation techniques on a range of datasets for object classification and pose estimation, while generating images that look very similar to the target domain (see Figure 1)
- We demonstrate that while the task and content losses do not improve the overall performance of the model, they dramatically stabilize training

Study subjects and analysis

samples: 32

We optimize the objective in Equation 1 for the “MNIST to USPS” and “MNIST to MNIST-M” scenarios, and the one in Equation 4 for the “Synthetic Cropped Linemod to Cropped Linemod” scenario. We use batches of 32 samples from each domain, and the input images are zero-centered and rescaled to [−1, 1]. In our implementation, we let G take the form of a convolutional residual neural network that maintains the resolution of the original image, as shown in Figure 2. z is a vector of N_z elements, each sampled from a uniform distribution z_i ∼ U(−1, 1).
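The input pipeline described above can be sketched as follows; the exact scaling constant (127.5, assuming 8-bit inputs), the 28×28 image size, and the noise dimensionality are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def preprocess(batch_uint8):
    """Zero-center uint8 images and rescale them to [-1, 1]
    (assumes 8-bit inputs, hence the 127.5 constant: 0 -> -1, 255 -> 1)."""
    return batch_uint8.astype(np.float32) / 127.5 - 1.0

def sample_noise(batch_size, n_z):
    """Draw the noise vector z with each element z_i ~ U(-1, 1)."""
    return rng.uniform(-1.0, 1.0, size=(batch_size, n_z))

# A batch of 32 samples, as in the text; 28x28 grayscale is illustrative.
images = preprocess(rng.integers(0, 256, size=(32, 28, 28, 1), dtype=np.uint8))
z = sample_noise(32, 10)
```

Keeping inputs and noise in the same [−1, 1] range is a common choice for GAN generators with tanh-style outputs, and matches the rescaling stated in the text.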

samples: 6060

Once G is trained, we fix its weights and pass the full training set of the source domain through it to generate the images used for training the task classifier T. We then evaluate the performance of T on the entire set of unobserved objects (6,060 samples), and on the test set of the target domain for all objects, for direct comparison with Table 2.

Stability Study: We also evaluate the importance of the different components of our model.

samples: 1000

When a small set of 1,000 labeled target samples is available to our model, it is able to improve upon baselines trained on either just these 1,000 samples or the synthetic training set augmented with these labeled target samples.

target samples: 1000

We compare our model on the target domain against the two following baselines: (a) training a classifier only on these 1,000 target samples without any domain adaptation, a setting we refer to as ‘1,000-only’; and (b) training a classifier on these 1,000 target samples and the entire Synthetic Cropped Linemod training set with no domain adaptation, a setting we refer to as ‘Synth+1000’. As one can see from Table 6, our model greatly improves upon the naive setting of incorporating a few target domain samples during training.

Reference

- [1] M. Abadi et al. Tensorflow: Large-scale machine learning on heterogeneous distributed systems. Preprint arXiv:1603.04467, 2016.
- [2] D. B. F. Agakov. The im algorithm: a variational approach to information maximization. In Advances in Neural Information Processing Systems 16: Proceedings of the 2003 Conference, volume 16, page 201. MIT Press, 2004.
- [3] H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, and M. Marchand. Domain-adversarial neural networks. Preprint arXiv:1412.4446, 2014.
- [4] P. Arbelaez, M. Maire, C. Fowlkes, and J. Malik. Contour detection and hierarchical image segmentation. TPAMI, 33(5):898–916, 2011.
- [5] K. Bousmalis, G. Trigeorgis, N. Silberman, D. Krishnan, and D. Erhan. Domain separation networks. In Proc. Neural Information Processing Systems (NIPS), 2016.
- [6] R. Caseiro, J. F. Henriques, P. Martins, and J. Batista. Beyond the shortest path: Unsupervised domain adaptation by sampling subspaces along the spline flow. In CVPR, 2015.
- [7] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel. Infogan: Interpretable representation learning by information maximizing generative adversarial nets. Preprint arXiv:1606.03657, 2016.
- [8] P. Christiano, Z. Shah, I. Mordatch, J. Schneider, T. Blackwell, J. Tobin, P. Abbeel, and W. Zaremba. Transfer from simulation to real world through learning deep inverse dynamics model. Preprint arXiv:1610.03518, 2016.
- [9] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
- [10] J. S. Denker, W. Gardner, H. P. Graf, D. Henderson, R. Howard, W. E. Hubbard, L. D. Jackel, H. S. Baird, and I. Guyon. Neural network recognizer for hand-written zip code digits. In NIPS, pages 323–331, 1988.
- [11] D. Eigen, C. Puhrsch, and R. Fergus. Depth map prediction from a single image using a multi-scale deep network. In NIPS, pages 2366–2374, 2014.
- [12] M. Everingham, S. A. Eslami, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman. The pascal visual object classes challenge: A retrospective. International Journal of Computer Vision, 111(1):98–136, 2015.
- [13] Y. Ganin and V. Lempitsky. Unsupervised domain adaptation by backpropagation. Preprint arXiv:1409.7495, 2014.
- [14] Y. Ganin et al. Domain-adversarial training of neural networks. JMLR, 17(59):1–35, 2016.
- [15] L. A. Gatys, A. S. Ecker, and M. Bethge. A neural algorithm of artistic style. Preprint arXiv:1508.06576, 2015.
- [16] L. A. Gatys, A. S. Ecker, and M. Bethge. Image style transfer using convolutional neural networks. In CVPR, pages 2414–2423, 2016.
- [17] M. Ghifary, W. B. Kleijn, M. Zhang, D. Balduzzi, and W. Li. Deep reconstruction-classification networks for unsupervised domain adaptation. In ECCV, pages 597–613. Springer, 2016.
- [18] B. Gong, Y. Shi, F. Sha, and K. Grauman. Geodesic flow kernel for unsupervised domain adaptation. In CVPR, pages 2066–2073. IEEE, 2012.
- [19] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In NIPS, pages 2672–2680, 2014.
- [20] R. Gopalan, R. Li, and R. Chellappa. Domain adaptation for object recognition: An unsupervised approach. In ICCV, 2011.
- [21] A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Scholkopf, and A. Smola. A kernel two-sample test. JMLR, pages 723–773, 2012.
- [22] S. Hinterstoisser et al. Model based training, detection and pose estimation of texture-less 3d objects in heavily cluttered scenes. In ACCV, 2012.
- [23] D. Q. Huynh. Metrics for 3d rotations: Comparison and analysis. Journal of Mathematical Imaging and Vision, 35(2):155–164, 2009.
- [24] J. Johnson, A. Alahi, and L. Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. Preprint arXiv:1603.08155, 2016.
- [25] M. Johnson-Roberson, C. Barto, R. Mehta, S. N. Sridhar, and R. Vasudevan. Driving in the matrix: Can virtual worlds replace human-generated annotations for real world tasks? Preprint arXiv:1610.01983, 2016.
- [26] D. Kingma and J. Ba. Adam: A method for stochastic optimization. Preprint arXiv:1412.6980, 2014.
- [27] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
- [28] C. Ledig, L. Theis, F. Huszar, J. Caballero, A. Aitken, A. Tejani, J. Totz, Z. Wang, and W. Shi. Photo-realistic single image super-resolution using a generative adversarial network. Preprint arXiv:1609.04802, 2016.
- [29] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollar, and C. L. Zitnick. Microsoft COCO: Common objects in context. In ECCV, pages 740–755. Springer, 2014.
- [30] M.-Y. Liu and O. Tuzel. Coupled generative adversarial networks. Preprint arXiv:1606.07536, 2016.
- [31] M. Long and J. Wang. Learning transferable features with deep adaptation networks. In ICML, 2015.
- [32] A. Mahendran, H. Bilen, J. Henriques, and A. Vedaldi. Researchdoom and cocodoom: Learning computer vision with games. Preprint arXiv:1610.02431, 2016.
- [33] A. Odena, V. Dumoulin, and C. Olah. Deconvolution and checkerboard artifacts. http://distill.pub/2016/deconv-checkerboard/, 2016.
- [34] A. Odena, C. Olah, and J. Shlens. Conditional image synthesis with auxiliary classifier GANs. Preprint arXiv:1610.09585, 2016.
- [35] W. Qiu and A. Yuille. Unrealcv: Connecting computer vision to unreal engine. Preprint arXiv:1609.01326, 2016.
- [36] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. Preprint arXiv:1511.06434, 2015.
- [37] S. R. Richter, V. Vineet, S. Roth, and V. Koltun. Playing for data: Ground truth from computer games. In ECCV, pages 102–118. Springer, 2016.
- [38] A. A. Rusu, M. Vecerik, T. Rothorl, N. Heess, R. Pascanu, and R. Hadsell. Sim-to-real robot learning from pixels with progressive nets. Preprint arXiv:1610.04286, 2016.
- [39] K. Saenko et al. Adapting visual category models to new domains. In ECCV. Springer, 2010.
- [40] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. Improved techniques for training GANs. Preprint arXiv:1606.03498, 2016.
- [41] B. Sun, J. Feng, and K. Saenko. Return of frustratingly easy domain adaptation. In AAAI, 2016.
- [42] E. Tzeng, C. Devin, J. Hoffman, C. Finn, X. Peng, S. Levine, K. Saenko, and T. Darrell. Towards adapting deep visuomotor representations from simulated to real environments. Preprint arXiv:1511.07111, 2015.
- [43] E. Tzeng, J. Hoffman, T. Darrell, and K. Saenko. Simultaneous deep transfer across domains and tasks. In CVPR, pages 4068–4076, 2015.
- [44] E. Tzeng, J. Hoffman, N. Zhang, K. Saenko, and T. Darrell. Deep domain confusion: Maximizing for domain invariance. Preprint arXiv:1412.3474, 2014.
- [45] P. Wohlhart and V. Lepetit. Learning descriptors for object recognition and 3d pose estimation. In CVPR, pages 3109–3118, 2015.
- [46] D. Yoo, N. Kim, S. Park, A. S. Paek, and I. S. Kweon. Pixel-level domain transfer. Preprint arXiv:1603.07442, 2016.
- [47] Y. Zhu, R. Mottaghi, E. Kolve, J. J. Lim, A. Gupta, L. Fei-Fei, and A. Farhadi. Target-driven visual navigation in indoor scenes using deep reinforcement learning. Preprint arXiv:1609.05143, 2016.
