Learning from Synthetic Animals

CVPR, pp. 12383-12392, 2019.


Abstract:

Despite great success in human parsing, progress in parsing other deformable articulated objects, such as animals, is still limited by the lack of labeled data. In this paper, we use synthetic images and ground truth generated from CAD animal models to address this challenge. To bridge the gap between real and synthetic images, we propose a novel consistency-constrained semi-supervised learning (CC-SSL) method that leverages both spatial and temporal constraints.
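Table 1 below notes that the synthetic training images use randomized backgrounds and textures. The snippet that follows is a minimal, illustrative sketch of that kind of background randomization: it composites a pre-rendered animal image (with an alpha mask) onto a randomly cropped background. The function name, array shapes, and the use of plain NumPy arrays are assumptions made for illustration; the paper's actual rendering pipeline from CAD models is not shown here.

```python
import numpy as np

def composite_randomized(render_rgba, backgrounds):
    """Paste a rendered animal (H x W x 4 uint8, alpha channel = foreground mask)
    onto a random crop of a randomly chosen background image (hypothetical helper)."""
    h, w, _ = render_rgba.shape
    bg = backgrounds[np.random.randint(len(backgrounds))]
    y0 = np.random.randint(bg.shape[0] - h + 1)   # assumes the background is at least as large
    x0 = np.random.randint(bg.shape[1] - w + 1)
    bg_crop = bg[y0:y0 + h, x0:x0 + w, :3].astype(np.float32)
    rgb = render_rgba[..., :3].astype(np.float32)
    alpha = render_rgba[..., 3:4].astype(np.float32) / 255.0
    out = alpha * rgb + (1.0 - alpha) * bg_crop   # alpha-blend foreground over background
    return out.astype(np.uint8)

# Dummy usage: a 256x256 render composited over one of four 512x512 random "backgrounds".
render = np.zeros((256, 256, 4), dtype=np.uint8)
backgrounds = [np.random.randint(0, 256, (512, 512, 3), dtype=np.uint8) for _ in range(4)]
image = composite_randomized(render, backgrounds)
```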

Introduction
  • Thanks to large-scale annotated datasets and powerful Convolutional Neural Networks (CNNs), the state of the art in human parsing has advanced rapidly.
  • There is little previous work on parsing animals.
  • Parsing animals is important for many tasks, including, but not limited to, monitoring wild-animal behavior, developing bio-inspired robots, and building motion-capture systems.
  • One main obstacle to parsing animals is the scarcity of labeled datasets.
  • Annotating large scale datasets for animals is prohibitively expensive.
  • Most existing approaches for parsing humans, which often require enormous amounts of annotated data [1, 32], are less suited for parsing animals.
Highlights
  • Thanks to large-scale annotated datasets and powerful Convolutional Neural Networks (CNNs), the state of the art in human parsing has advanced rapidly.
  • When using real image labels, we show that models trained jointly on synthetic and real images achieve better results than models trained only on real images.
  • We present our results in two setups: an unsupervised domain adaptation setting, where real-image annotations are not available, and a setting where labeled real images are available.
  • To bridge the domain gap, we present a novel consistency-constrained semi-supervised learning (CC-SSL) method, which leverages both spatial and temporal constraints (a minimal sketch of the consistency idea follows this list).
  • We further demonstrate that the models trained using synthetic data achieve better generalization performance across different domains in the Visual Domain Adaptation Challenge dataset
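As a rough illustration of the consistency idea behind CC-SSL, the sketch below generates pseudo-labels on an unlabeled real image and keeps a keypoint only if the prediction agrees with the prediction on a horizontally flipped copy of the same image and both are confident. This is only a minimal sketch under stated assumptions: `model`, `flip_pairs`, and both thresholds are hypothetical, and the paper's full method also uses temporal consistency across video frames and an iterative curriculum, which are not shown.

```python
import numpy as np

def heatmap_peaks(hm):
    """hm: (K, H, W) keypoint heatmaps -> (K, 2) peak coordinates (x, y) and (K,) peak scores."""
    K, H, W = hm.shape
    flat = hm.reshape(K, -1)
    idx = flat.argmax(axis=1)
    scores = flat.max(axis=1)
    coords = np.stack([idx % W, idx // W], axis=1)
    return coords, scores

def pseudo_labels(model, image, flip_pairs, dist_thresh=4.0, score_thresh=0.5):
    """Keep a keypoint only if the prediction on the image and on its horizontal
    flip agree (spatial consistency) and both predictions are confident.
    `model` maps an H x W x 3 image to (K, H, W) heatmaps (hypothetical)."""
    hm = model(image)                          # predict on the original image
    hm_flip = model(image[:, ::-1, :])         # predict on the mirrored image
    hm_flip = hm_flip[:, :, ::-1][flip_pairs]  # un-flip and swap left/right joints
    p1, s1 = heatmap_peaks(hm)
    p2, s2 = heatmap_peaks(hm_flip)
    dist = np.linalg.norm(p1 - p2, axis=1)
    keep = (dist < dist_thresh) & (s1 > score_thresh) & (s2 > score_thresh)
    return p1, keep  # use the kept keypoints as supervision for the next training round
```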
Methods
  • The authors quantitatively test the approach on the TigDog dataset [28] in Section 4.2.
  • The authors compare the method with other popular unsupervised domain adaptation methods, such as CycleGAN [40], BDL [21] and CyCADA [14].
  • The authors qualitatively show keypoint detection for other animals, such as elephants, sheep, and dogs, for which no labeled real images are available.
  • To show domain generalization ability, the authors annotated animal keypoints from the Visual Domain Adaptation Challenge dataset (VisDA2019).
  • In Section 4.3, the authors evaluate the models on these images from different visual domains.
  • The rich ground truth in the synthetic data enables tasks beyond 2D pose estimation, so the authors visualize part segmentation on horses and tigers and demonstrate the effectiveness of multi-task learning in Section 4.4 (a joint-loss sketch is given below).
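A minimal sketch of the kind of joint objective such multi-task training could use is shown below, combining a heatmap loss for 2D keypoints with a per-pixel cross-entropy loss for part segmentation. The tensor shapes, loss choices, and weighting are assumptions made for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def multitask_loss(kpt_pred, kpt_gt, seg_logits, seg_labels, seg_weight=1.0):
    """kpt_pred, kpt_gt: (B, K, H, W) keypoint heatmaps;
    seg_logits: (B, C, H, W) part-segmentation logits;
    seg_labels: (B, H, W) integer part labels in [0, C)."""
    loss_kpt = F.mse_loss(kpt_pred, kpt_gt)              # heatmap regression for keypoints
    loss_seg = F.cross_entropy(seg_logits, seg_labels)   # per-pixel part classification
    return loss_kpt + seg_weight * loss_seg
```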
Results
  • The authors' main results are summarized in Table 1.
  • The authors present the results in two setups: an unsupervised domain adaptation setting, where real-image annotations are not available, and a setting where labeled real images are available.
  • The authors visualize the predicted keypoints in Figure 3.
  • Even for some extreme poses, such as horses being ridden or lying on the ground, the method still generates accurate predictions.
  • The observations for tigers are similar.
Conclusion
  • The authors present a simple yet effective method using synthetic images to parse animals.
  • When using real image labels, the authors show that models trained jointly on synthetic and real images achieve better results than models trained only on real images.
  • The authors further demonstrate that the models trained using synthetic data achieve better generalization performance across different domains in the Visual Domain Adaptation Challenge dataset.
  • The authors build a synthetic dataset containing 10+ animals with diverse poses and rich ground truth, and show that multi-task learning is effective.
Tables
  • Table 1: Horse and tiger 2D pose estimation accuracy, PCK@0.05. Synthetic data are rendered with randomized backgrounds and textures. Synthetic only reports results when no real-image labels are available; Synthetic + Real reports results when real-image labels are available. In both scenarios, our proposed CC-SSL based methods achieve the best performance.
  • Table 2: Horse and tiger 2D pose estimation accuracy, PCK@0.05, on VisDA2019. We present our results under two settings: Visible Kpts Accuracy only accounts for visible keypoints; Full Kpts Accuracy also includes self-occluded keypoints. Under all settings, our proposed methods outperform the Real baseline.
  • Table 3: Horse and tiger 2D pose estimation, PCK@0.05, with multi-task learning. We show that models trained jointly on 2D keypoints and part segmentation generalize better to real images.
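For reference, the sketch below shows one common way to compute a PCK@0.05-style score of the kind reported in Tables 1, 2, and 3: a predicted keypoint counts as correct if it lies within 5% of a per-image reference length from the ground truth, and, as in the Visible Kpts Accuracy setting of Table 2, only visible keypoints are averaged. The exact normalization used in the paper is not restated here, so `ref_length` is an assumption.

```python
import numpy as np

def pck(pred, gt, visible, ref_length, thresh=0.05):
    """pred, gt: (N, K, 2) keypoint coordinates; visible: (N, K) boolean mask;
    ref_length: (N,) per-image normalization length (an assumed convention)."""
    dist = np.linalg.norm(pred - gt, axis=-1)          # (N, K) pixel distances
    correct = dist <= thresh * ref_length[:, None]     # within 5% of the reference length
    return float(correct[visible].mean())              # accuracy over visible keypoints only
```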
Related work
  • 2.1. Animal Parsing

    Though there exist large-scale datasets containing animals for classification, detection, and instance segmentation, only a small number of datasets have been built for pose estimation [28, 39, 5, 27, 20] and animal part segmentation [8]. Moreover, annotating keypoints or parts is time-consuming, and these datasets cover only a tiny portion of the animal species in the world.

    Due to the lack of annotations, synthetic data has been widely used to address this problem [43, 3, 44, 45]. Similar to the SMPL model [24] for humans, [45] proposes a method to learn articulated SMAL shape models for animals. Later, [44] extracts more 3D shape details and is able to model new species. Unfortunately, these methods are built on manually extracted silhouettes and keypoint annotations. Recently, [43] proposes to copy texture from real animals and predicts 3D meshes of animals in an end-to-end manner. Most related to our method is [3], where the authors propose a method to estimate animal poses on real images using synthetic silhouettes. Different from [3], which requires an additional robust segmentation model for real images during inference, our strategy does not require any additional models.
Contributions
  • Proposes a novel consistency-constrained semi-supervised learning method
  • Demonstrates the effectiveness of our method on highly deformable animals, such as horses and tigers
  • Proposes a method where models are trained using synthetic CAD models
  • Shows that our models achieve similar performance to models trained on real data, but without using any annotation of real images
  • Proposes an effective method which allows for accurate keypoint prediction across domains
References
  • Mykhaylo Andriluka, Leonid Pishchulin, Peter V. Gehler, and Bernt Schiele. 2D human pose estimation: New benchmark and state of the art analysis. In CVPR, pages 3686–3693, 2014.
  • Yoshua Bengio, Jerome Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. In ICML, pages 41–48, 2009.
  • Benjamin Biggs, Thomas Roddick, Andrew W. Fitzgibbon, and Roberto Cipolla. Creatures great and SMAL: Recovering the shape and motion of animals from video. CoRR, abs/1811.05804, 2018.
  • Konstantinos Bousmalis, George Trigeorgis, Nathan Silberman, Dilip Krishnan, and Dumitru Erhan. Domain separation networks. In NeurIPS, pages 343–351, 2016.
  • Jinkun Cao, Hongyang Tang, Haoshu Fang, Xiaoyong Shen, Cewu Lu, and Yu-Wing Tai. Cross-domain adaptation for animal pose estimation. CoRR, abs/1908.05806, 2019.
  • Angel X. Chang, Thomas A. Funkhouser, Leonidas J. Guibas, Pat Hanrahan, Qi-Xing Huang, Zimo Li, Silvio Savarese, Manolis Savva, Shuran Song, Hao Su, Jianxiong Xiao, Li Yi, and Fisher Yu. ShapeNet: An information-rich 3D model repository. CoRR, abs/1512.03012, 2015.
  • Wenzheng Chen, Huan Wang, Yangyan Li, Hao Su, Zhenhua Wang, Changhe Tu, Dani Lischinski, Daniel Cohen-Or, and Baoquan Chen. Synthesizing training images for boosting human 3D pose estimation. In 3DV, pages 479–488, 2016.
  • Xianjie Chen, Roozbeh Mottaghi, Xiaobai Liu, Sanja Fidler, Raquel Urtasun, and Alan L. Yuille. Detect what you can: Detecting and representing objects using holistic models and body parts. In CVPR, pages 1979–1986, 2014.
  • Jaehoon Choi, Taekyung Kim, and Changick Kim. Self-ensembling with GAN-based data augmentation for domain adaptation in semantic segmentation. CoRR, abs/1909.00589, 2019.
  • Yifan Ding, Liqiang Wang, Deliang Fan, and Boqing Gong. A semi-supervised two-stage approach to learning from noisy labels. In WACV, pages 1215–1224, 2018.
  • Alexey Dosovitskiy, Philipp Fischer, Eddy Ilg, Philip Hausser, Caner Hazirbas, Vladimir Golkov, Patrick van der Smagt, Daniel Cremers, and Thomas Brox. FlowNet: Learning optical flow with convolutional networks. In ICCV, pages 2758–2766, 2015.
  • Geoffrey French, Michal Mackiewicz, and Mark H. Fisher. Self-ensembling for visual domain adaptation. In ICLR, 2018.
  • Sheng Guo, Weilin Huang, Haozhi Zhang, Chenfan Zhuang, Dengke Dong, Matthew R. Scott, and Dinglong Huang. CurriculumNet: Weakly supervised learning from large-scale web images. In ECCV, pages 139–154, 2018.
  • Judy Hoffman, Eric Tzeng, Taesung Park, Jun-Yan Zhu, Phillip Isola, Kate Saenko, Alexei A. Efros, and Trevor Darrell. CyCADA: Cycle-consistent adversarial domain adaptation. In ICML, pages 1994–2003, 2018.
  • Xun Huang, Ming-Yu Liu, Serge J. Belongie, and Jan Kautz. Multimodal unsupervised image-to-image translation. In ECCV, pages 179–196, 2018.
  • Lu Jiang, Deyu Meng, Qian Zhao, Shiguang Shan, and Alexander G. Hauptmann. Self-paced curriculum learning. In AAAI, pages 2694–2700, 2015.
  • Youngdong Kim, Junho Yim, Juseung Yun, and Junmo Kim. NLNL: Negative learning for noisy labels. CoRR, abs/1908.07387, 2019.
  • Samuli Laine and Timo Aila. Temporal ensembling for semi-supervised learning. In ICLR, 2017.
  • Dong-Hyun Lee. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. In Workshop on Challenges in Representation Learning, ICML, volume 3, page 2, 2013.
  • Shuyuan Li, Jianguo Li, Weiyao Lin, and Hanlin Tang. Amur tiger re-identification in the wild. CoRR, abs/1906.05586, 2019.
  • Yunsheng Li, Lu Yuan, and Nuno Vasconcelos. Bidirectional learning for domain adaptation of semantic segmentation. In CVPR, pages 6936–6945, 2019.
  • Ming-Yu Liu, Thomas Breuel, and Jan Kautz. Unsupervised image-to-image translation networks. In NeurIPS, pages 700–708, 2017.
  • Mingsheng Long, Yue Cao, Jianmin Wang, and Michael I. Jordan. Learning transferable features with deep adaptation networks. In ICML, pages 97–105, 2015.
  • Matthew Loper, Naureen Mahmood, Javier Romero, Gerard Pons-Moll, and Michael J. Black. SMPL: A skinned multi-person linear model. ACM Trans. Graph., 34(6):248:1–248:16, 2015.
  • Zak Murez, Soheil Kolouri, David J. Kriegman, Ravi Ramamoorthi, and Kyungnam Kim. Image to image translation for domain adaptation. In CVPR, pages 4500–4509, 2018.
  • Alejandro Newell, Kaiyu Yang, and Jia Deng. Stacked hourglass networks for human pose estimation. In ECCV, pages 483–499, 2016.
  • David Novotny, Diane Larlus, and Andrea Vedaldi. I have seen enough: Transferring parts across categories. In BMVC, 2016.
  • Luca Del Pero, Susanna Ricco, Rahul Sukthankar, and Vittorio Ferrari. Articulated motion discovery using pairs of trajectories. In CVPR, pages 2151–2160, 2015.
  • Aayush Prakash, Shaad Boochoon, Mark Brophy, David Acuna, Eric Cameracci, Gavriel State, Omer Shapira, and Stan Birchfield. Structured domain randomization: Bridging the reality gap by context-aware synthetic data. In ICRA, pages 7249–7255, 2019.
  • Ilija Radosavovic, Piotr Dollar, Ross B. Girshick, Georgia Gkioxari, and Kaiming He. Data distillation: Towards omni-supervised learning. In CVPR, pages 4119–4128, 2018.
  • Aruni Roy Chowdhury, Prithvijit Chakrabarty, Ashish Singh, SouYoung Jin, Huaizu Jiang, Liangliang Cao, and Erik G. Learned-Miller. Automatic adaptation of object detectors to new domains using self-training. In CVPR, pages 780–790, 2019.
  • Benjamin Sapp and Ben Taskar. MODEC: Multimodal decomposable models for human pose estimation. In CVPR, pages 3674–3681, 2013.
  • Baochen Sun and Kate Saenko. Deep CORAL: Correlation alignment for deep domain adaptation. In ECCV Workshops, pages 443–450, 2016.
  • Jonathan Tremblay, Aayush Prakash, David Acuna, Mark Brophy, Varun Jampani, Cem Anil, Thang To, Eric Cameracci, Shaad Boochoon, and Stan Birchfield. Training deep networks with synthetic data: Bridging the reality gap by domain randomization. In CVPR, pages 969–977, 2018.
  • Eric Tzeng, Judy Hoffman, Trevor Darrell, and Kate Saenko. Simultaneous deep transfer across domains and tasks. In ICCV, pages 4068–4076, 2015.
  • Eric Tzeng, Judy Hoffman, Kate Saenko, and Trevor Darrell. Adversarial discriminative domain adaptation. In CVPR, pages 2962–2971, 2017.
  • Eric Tzeng, Judy Hoffman, Ning Zhang, Kate Saenko, and Trevor Darrell. Deep domain confusion: Maximizing for domain invariance. CoRR, abs/1412.3474, 2014.
  • Gul Varol, Javier Romero, Xavier Martin, Naureen Mahmood, Michael J. Black, Ivan Laptev, and Cordelia Schmid. Learning from synthetic humans. In CVPR, pages 4627–4635, 2017.
  • P. Welinder, S. Branson, T. Mita, C. Wah, F. Schroff, S. Belongie, and P. Perona. Caltech-UCSD Birds 200. Technical Report CNS-TR-2010-001, California Institute of Technology, 2010.
  • Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In ICCV, pages 2242–2251, 2017.
  • Yang Zou, Zhiding Yu, B. V. K. Vijaya Kumar, and Jinsong Wang. Unsupervised domain adaptation for semantic segmentation via class-balanced self-training. In ECCV, pages 297–313, 2018.
  • Yang Zou, Zhiding Yu, Xiaofeng Liu, B. V. K. Vijaya Kumar, and Jinsong Wang. Confidence regularized self-training. CoRR, abs/1908.09822, 2019.
  • Silvia Zuffi, Angjoo Kanazawa, Tanya Y. Berger-Wolf, and Michael J. Black. Three-D safari: Learning to estimate zebra pose, shape, and texture from images "in the wild". CoRR, abs/1908.07201, 2019.
  • Silvia Zuffi, Angjoo Kanazawa, and Michael J. Black. Lions and tigers and bears: Capturing non-rigid, 3D, articulated shape from images. In CVPR, pages 3955–3963, 2018.
  • Silvia Zuffi, Angjoo Kanazawa, David W. Jacobs, and Michael J. Black. 3D menagerie: Modeling the 3D shape and pose of animals. In CVPR, pages 5524–5532, 2017.