# Variational Interaction Information Maximization for Cross-domain Disentanglement

NeurIPS 2020

Abstract

Cross-domain disentanglement is the problem of learning representations partitioned into domain-invariant and domain-specific representations, which is a key to successful domain transfer or measuring semantic distance between two domains. Grounded in information theory, we cast the simultaneous learning of domain-invariant and domain-s...

Introduction

- There has been great interest in learning disentangled representations for various purposes, such as identifying sources of variation [3, 4, 15, 18, 20] for interpretability, obtaining representations invariant to nuisance factors [1, 8, 30, 34, 39, 40], and domain transfer [12, 26, 28, 35, 44, 47].
- Cross-domain disentanglement requires a model to learn a representation explicitly separated into three parts: a domain-invariant representation shared across two data domains, and two domain-specific representations, each exclusive to one domain.
- This task is challenging because those representations must be (1) disentangled, so that they are independent of one another, and (2) informative, so that every factor of variation is captured in the right part of the representation.
- In prior models, it is not obvious how to interpret each module or to identify the key factors that contribute to disentanglement.

Highlights

- Many models have been proposed to tackle important tasks related to cross-domain disentanglement, such as image-to-image translation [12, 26, 28, 35, 44] and Zero-Shot Sketch-Based Image Retrieval (ZS-SBIR) [6, 9, 22, 23, 27, 38]
- We compare the Interaction Information Auto-Encoder (IIAE) with various baselines, SAE [23], FRWGAN [9], ZSIH [38], CAAE [22], SEM-PCYC [6], and LCALE [27], which are designed for ZS-SBIR or general zero-shot learning
- Given a sketch of a cannon as a query, IIAE wrongly retrieves a motorcycle and a saw; the motorcycle is semantically relevant to the cannon because of its wheels, whereas the saw is visually close to the motorcycle
- The proposed approach, coined Interaction Information Auto-Encoder, extends the Variational Auto-Encoder (VAE) with a novel regularization inspired by information theory, which is principled, interpretable, and nicely integrated into the Evidence Lower Bound (ELBO) objective to encourage disentanglement of domain-specific and shared representations
- We further show that our model achieves state-of-the-art performance in the zero-shot sketch-based image retrieval task, even without external knowledge
- The effectiveness of the proposed method is demonstrated on multiple applications, such as image-to-image translation and image retrieval

Methods

- Consider a set of paired data sampled from an unknown joint distribution (x, y) ∼ pD(x, y), where the elements x ∈ X and y ∈ Y of each pair come from different domains X and Y, respectively.
- x and y can be images in different styles sharing the same semantic content, or images of different content sharing the same factors of variation.
- Given this data, the goal of cross-domain disentanglement is to find a structured representation that factorizes into three parts: domain-specific representations ZX and ZY that capture the distinctive and exclusive characteristics of domains X and Y, respectively, and a shared representation ZS that captures the common factors shared across the domains.
- The authors leave all implementation details and hyperparameter settings to supplementary material D.
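
The three-part structure described above can be written as a latent-variable model. The factorization below is a sketch, assuming that the three latents have independent priors and that each domain is decoded from its own exclusive latent together with the shared one. The lowercase symbols z^X, z^S, z^Y and the decoder parameters θ are illustrative notation, not taken from this summary.

```latex
p_\theta(x, y) = \int p(z^X)\, p(z^S)\, p(z^Y)\;
                 p_\theta(x \mid z^X, z^S)\; p_\theta(y \mid z^Y, z^S)\;
                 dz^X\, dz^S\, dz^Y
```

Under this assumption, any dependence between x and y has to flow through z^S, which is why the shared latent is the natural place for common factors.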

Results

- The result implies that IIAE successfully learns to associate the semantic structure of sketches and images while generalizing well to unseen classes. This can be explained by two different information constraints on the shared representation: Eq (13) forces ZS to discard domain-specific information, while Eq (4) encourages ZS to be a minimal sufficient statistic so that it generalizes well to unseen classes.
- Additional visualizations of the ZS-SBIR results are in supplementary material C.3.
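
To make the role of the shared representation concrete, here is a toy calculation of interaction information on discrete variables. It is only an illustration and not the paper's variational estimator. It assumes the convention I(X;Y;Z) = I(X;Z) − I(X;Z|Y) (sign conventions differ across the literature), and the three independent fair bits `s`, `a`, `b` are a hypothetical setup in which X = (s, a) and Y = (s, b).

```python
import itertools
import math
from collections import defaultdict


def entropy(pmf):
    """Shannon entropy in bits of a dict mapping outcomes to probabilities."""
    return -sum(p * math.log2(p) for p in pmf.values() if p > 0)


def marginal(joint, idx):
    """Project a joint pmf over outcome tuples onto the coordinates in idx."""
    out = defaultdict(float)
    for outcome, p in joint.items():
        out[tuple(outcome[i] for i in idx)] += p
    return out


def mutual_info(joint, a, b):
    """I(A;B) = H(A) + H(B) - H(A,B)."""
    return (entropy(marginal(joint, a)) + entropy(marginal(joint, b))
            - entropy(marginal(joint, a + b)))


def cond_mutual_info(joint, a, b, c):
    """I(A;B|C) = H(A,C) + H(B,C) - H(A,B,C) - H(C)."""
    return (entropy(marginal(joint, a + c)) + entropy(marginal(joint, b + c))
            - entropy(marginal(joint, a + b + c)) - entropy(marginal(joint, c)))


def interaction_info(joint, x, y, z):
    """I(X;Y;Z) = I(X;Z) - I(X;Z|Y), under the convention assumed here."""
    return mutual_info(joint, x, z) - cond_mutual_info(joint, x, z, y)


# Hypothetical toy world: outcome = (s, a, b), three independent fair bits.
joint = {bits: 1 / 8 for bits in itertools.product((0, 1), repeat=3)}
X, Y = (0, 1), (0, 2)  # X sees (s, a); Y sees (s, b)

shared = interaction_info(joint, X, Y, (0,))     # Z copies the shared bit s
exclusive = interaction_info(joint, X, Y, (1,))  # Z copies the X-only bit a
print(shared, exclusive)
```

In this toy setup, a code that copies the shared bit carries one bit of interaction information, while a code that copies the X-only bit carries none, which matches the intuition that the shared representation should hold exactly what the two domains have in common.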

Conclusion

- The authors investigate an approach for cross-domain disentanglement. The proposed approach, coined Interaction Information Auto-Encoder, extends the VAE with a novel regularization inspired by information theory, which is principled, interpretable, and nicely integrated into the ELBO objective to encourage disentanglement of domain-specific and shared representations.
- The authors' method provides an information-theoretic perspective on representation learning, and is likely to accelerate research in areas that involve datasets from two data domains with some common factors of variation.
- One such area is image-to-image translation, which the authors tackle in this paper.
- The authors do not see any serious consequences of system failure.

Objectives

- The authors' objective is to train a generative model pθ(x, y) that, by optimizing θ, maximizes the likelihood of pairs drawn from pD(x, y), and that disentangles the exclusive representations ZX and ZY from the shared representation ZS.
- The authors' goal is to learn a latent variable model with a maximum likelihood objective (the ELBO in Eq (3)) under the information regularization for cross-domain disentanglement (Eq (11)).
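
For orientation, a generic evidence lower bound for a joint model with latents z = (z^X, z^S, z^Y) has the form below. This is the textbook ELBO with an assumed variational posterior q_φ. The paper's Eq (3) and the regularizer in Eq (11) are not reproduced in this summary and may be arranged differently.

```latex
\log p_\theta(x, y) \;\ge\;
  \mathbb{E}_{q_\phi(z \mid x, y)}\big[\log p_\theta(x, y \mid z)\big]
  \;-\; D_{\mathrm{KL}}\big(q_\phi(z \mid x, y)\,\|\,p(z)\big)
```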

- Table 1: Cross-domain translation results on MNIST-CDCB [12] (top half) and Cars [36] (bottom half) generated by IIAE. In MNIST-CDCB, the domain-specific factors are color variation in the background (X) and in the foreground (Y), while the common factor is the digit identity. In Cars, domain-specific factors exist only in Y, as views at 23 different yaw angles, while the views in X are fixed to the front view. The shared factor is the car identity.
- Table 2: Retrieval based on the shared (exclusive) representation on the MNIST-CDCB [12], Maps [17], and Facades [41] datasets. CD/CB stand for colored digit/background, S/M stand for satellite/map, and F/L stand for facade/label, respectively.
- Table 3: Evaluation on the Sketchy Extended dataset [29, 37]. WordEmb stands for word embedding.

Related work

- Invariant representation. Representation learning [25] focuses on extracting features from the data that are informative for the given task. The information bottleneck (IB) [40] was introduced as an information-theoretic regularization method that achieves a minimal sufficient encoding by constraining the amount of information the latent variable encodes about the observed variable. IB enables the encoder to filter out nuisance factors and thus to generalize well. IB was later extended to the deep variational information bottleneck (VIB) [1], which parameterizes IB with a neural network and optimizes a variational lower bound of the IB objective. VIB showed a close relationship to VAEs [21] and β-VAEs [15] by extending the model to unsupervised learning. Based on VIB, several methods were developed [34, 39] to learn encoders that capture only the factors of variation invariant to a given attribute. Similarly, a variant of VIB was proposed by [8] to learn a domain-invariant representation by discarding domain-specific variations. The gradient reversal layer (GRL) [10] is another approach to achieving an invariant representation, and has been widely adopted for tasks such as unlearning bias in the input data [19], domain adaptation [10, 12], and zero-shot image retrieval [5]. The idea of learning invariant representations has been explored in zero-shot learning as well [6, 9, 22, 23, 27, 38], aiming to achieve a domain-invariant representation by regularizing the model with multiple tasks or objectives.
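
The information bottleneck trade-off described above is commonly written as the Lagrangian below. The symbols are assumptions for illustration: X is the observed input, T is the task target, Z is the encoding, and β > 0 is a trade-off weight.

```latex
\min_{p(z \mid x)} \; I(X; Z) \;-\; \beta\, I(Z; T)
```

The first term penalizes how much the encoding retains about the input, and the second rewards how much it retains about the target, which is the sense in which the resulting encoding is "minimal" and "sufficient".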

Funding

- This work was supported by the National Research Foundation (NRF) of Korea (NRF2019R1A2C1087634 and NRF-2019M3F2A1072238), the Ministry of Science and Information communication Technology (MSIT) of Korea (IITP No 2020-0-00940, IITP No 2019-0-00075, IITP No 2017-0-01779, IITP No 2020-0-00153, and IITP No 2016-0-00464), the ETRI (Contract No 20ZS1100), and Samsung Electronics.

Study subjects and analysis

- MNIST-CDCB [12]: each pair (x, y) consists of two images of the same digit in different color patterns. Images in domain X have color variations in the background, while those in domain Y have color variations in the foreground. We use 50,000 / 10,000 pairs of train/test samples following [24].
- Cars [36]: a dataset of car CAD images with equally spaced variations in orientation, 4 different angles in pitch and 24 in yaw. We employ 92 pairs (x, y) per car, where x is fixed to the frontal view at each pitch and y is a rotated view at one of the remaining 23 yaw angles. Out of the resulting 16,836 pairs from 183 cars, we assign 16,192 pairs from 176 cars to the train set and 644 pairs from 7 cars to the test set.
- Facades [41]: each pair (x, y) is made up of a semantic label map and a photo of the same building. We use 400 / 100 / 106 pairs of train/valid/test samples following [41].
- Maps [17]: each pair (x, y) is composed of a map image and a satellite image of the same area. We use 1096 / 1098 pairs of train/test samples following [17].
- Method. Translating an image across domains (X → Y or Y → X) can be done naturally by our method.
- Results. Following [12], we compute the nearest neighbor using the Euclidean distance and evaluate the performance by the Recall@1 metric.
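
The retrieval protocol mentioned above (Euclidean nearest neighbor, scored by Recall@1) can be sketched as follows. This is a minimal illustration, not the authors' evaluation code. The function name `recall_at_1` is hypothetical, and it assumes that the correct match for query `i` is the gallery item at the same index `i`, as would hold for paired test data.

```python
import math


def recall_at_1(queries, gallery):
    """Fraction of queries whose Euclidean nearest gallery item is their pair.

    queries, gallery: equal-length lists of equal-dimension feature vectors.
    Assumption: queries[i] and gallery[i] come from the same (x, y) pair.
    """
    hits = 0
    for i, q in enumerate(queries):
        # Index of the gallery vector closest to q (first index wins ties).
        nearest = min(range(len(gallery)),
                      key=lambda j: math.dist(q, gallery[j]))
        hits += int(nearest == i)
    return hits / len(queries)


# Toy usage with 2-D features: every query sits next to its own pair.
gallery = [[1.0, 0.0], [0.0, 1.0]]
queries = [[0.9, 0.0], [0.0, 0.9]]
print(recall_at_1(queries, gallery))  # prints 1.0
```

Running the same function on features from the shared representation and on features from an exclusive representation gives the two kinds of numbers that Table 2 contrasts.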

References

- A. Alemi, I. Fischer, J. Dillon, and K. Murphy. Deep variational information bottleneck. In ICLR, 2017.
- A. Bell. The co-information lattice, 921–926. In Proceedings of the 4th International Symposium on Independent Component Analysis and Blind Source Separation (ICA2003), Nara, Japan, 2003.
- R. T. Q. Chen, X. Li, R. Grosse, and D. Duvenaud. Isolating sources of disentanglement in variational autoencoders. In Advances in Neural Information Processing Systems, 2018.
- X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel. Infogan: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in neural information processing systems, pages 2172–2180, 2016.
- S. Dey, P. Riba, A. Dutta, J. Llados, and Y.-Z. Song. Doodle to search: Practical zero-shot sketch-based image retrieval. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
- A. Dutta and Z. Akata. Semantically tied paired cycle consistency for zero-shot sketch-based image retrieval. In CVPR, 2019.
- B. Esmaeili, H. Wu, S. Jain, A. Bozkurt, N. Siddharth, B. Paige, D. H. Brooks, J. Dy, and J.-W. van de Meent. Structured disentangled representations. In K. Chaudhuri and M. Sugiyama, editors, Proceedings of Machine Learning Research, volume 89 of Proceedings of Machine Learning Research, pages 2525–2534. PMLR, 16–18 Apr 2019.
- M. Federici, A. Dutta, P. Forré, N. Kushman, and Z. Akata. Learning robust representations via multi-view information bottleneck. In International Conference on Learning Representations, 2020.
- R. Felix, V. B. Kumar, I. Reid, and G. Carneiro. Multi-modal cycle-consistent generalized zero-shot learning. In Proceedings of the European Conference on Computer Vision (ECCV), pages 21–37, 2018.
- Y. Ganin and V. Lempitsky. Unsupervised domain adaptation by backpropagation. In F. Bach and D. Blei, editors, Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pages 1180–1189, Lille, France, 07–09 Jul 2015. PMLR.
- S. Gao, R. Brekelmans, G. V. Steeg, and A. Galstyan. Auto-encoding total correlation explanation. In K. Chaudhuri and M. Sugiyama, editors, Proceedings of Machine Learning Research, volume 89 of Proceedings of Machine Learning Research, pages 1157–1166. PMLR, 16–18 Apr 2019.
- A. Gonzalez-Garcia, J. van de Weijer, and Y. Bengio. Image-to-image translation for cross-domain disentanglement. 2018.
- I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 2672–2680. Curran Associates, Inc., 2014.
- I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, NIPS'14, pages 2672–2680, Cambridge, MA, USA, 2014. MIT Press.
- I. Higgins, L. Matthey, A. Pal, C. Burgess, X. Glorot, M. Botvinick, S. Mohamed, and A. Lerchner. beta-VAE: Learning basic visual concepts with a constrained variational framework. In ICLR, 2017.
- W.-N. Hsu, Y. Zhang, and J. Glass. Unsupervised learning of disentangled and interpretable representations from sequential data. In Advances in neural information processing systems, pages 1878–1889, 2017.
- P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial networks. CVPR, 2017.
- Y. Jeong and H. O. Song. Learning discrete and continuous factors of data via alternating disentanglement. In International Conference on Machine Learning (ICML), 2019.
- B. Kim, H. Kim, K. Kim, S. Kim, and J. Kim. Learning not to learn: Training deep neural networks with biased data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9012–9020, 2019.
- H. Kim and A. Mnih. Disentangling by factorising. In J. Dy and A. Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 2649–2658, Stockholmsmässan, Stockholm Sweden, 10–15 Jul 2018. PMLR.
- D. P. Kingma and M. Welling. Auto-encoding variational bayes. In Y. Bengio and Y. LeCun, editors, 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14-16, 2014, Conference Track Proceedings, 2014.
- S. Kiran Yelamarthi, S. Krishna Reddy, A. Mishra, and A. Mittal. A zero-shot framework for sketch based image retrieval. In The European Conference on Computer Vision (ECCV), September 2018.
- E. Kodirov, T. Xiang, and S. Gong. Semantic autoencoder for zero-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3174–3183, 2017.
- Y. LeCun. The mnist database of handwritten digits. Technical report, 1998. URL http://yann.lecun.com/exdb/mnist/.
- Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 521(7553):436–444, 2015.
- H.-Y. Lee, H.-Y. Tseng, J.-B. Huang, M. Singh, and M.-H. Yang. Diverse image-to-image translation via disentangled representations. In Proceedings of the European conference on computer vision (ECCV), pages 35–51, 2018.
- K. Lin, X. Xu, L. Gao, Z. Wang, and H. T. Shen. Learning cross-aligned latent embeddings for zero-shot cross-modal retrieval. In Association for the Advancement of Artificial Intelligence, 2020.
- A. H. Liu, Y.-C. Liu, Y.-Y. Yeh, and Y.-C. F. Wang. A unified feature disentangler for multi-domain image translation and manipulation. In Advances in Neural Information Processing Systems 31, pages 2590–2599, 2018.
- L. Liu, F. Shen, Y. Shen, X. Liu, and L. Shao. Deep sketch hashing: Fast free-hand sketch-based image retrieval. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2862–2871, 2017.
- C. Louizos, K. Swersky, Y. Li, M. Welling, and R. S. Zemel. The variational fair autoencoder. In Y. Bengio and Y. LeCun, editors, 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings, 2016.
- W. McGill. Multivariate information transmission. Transactions of the IRE Professional Group on Information Theory, 4(4):93–111, 1954.
- T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pages 3111–3119, 2013.
- G. A. Miller. Wordnet: a lexical database for english. Communications of the ACM, 38(11): 39–41, 1995.
- D. Moyer, S. Gao, R. Brekelmans, A. Galstyan, and G. Ver Steeg. Invariant representations without adversarial training. In Advances in Neural Information Processing Systems, pages 9084–9093, 2018.
- O. Press, T. Galanti, S. Benaim, and L. Wolf. Emerging disentanglement in auto-encoder based unsupervised image content transfer. In International Conference on Learning Representations, 2019.
- S. E. Reed, Y. Zhang, Y. Zhang, and H. Lee. Deep visual analogy-making. In Advances in neural information processing systems, pages 1252–1260, 2015.
- P. Sangkloy, N. Burnell, C. Ham, and J. Hays. The sketchy database: learning to retrieve badly drawn bunnies. ACM Transactions on Graphics (TOG), 35(4):1–12, 2016.
- Y. Shen, L. Liu, F. Shen, and L. Shao. Zero-shot sketch-image hashing. In Proceeding of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
- J. Song, P. Kalluri, A. Grover, S. Zhao, and S. Ermon. Learning controllable fair representations. In International Conference on Artificial Intelligence and Statistics, 2018.
- N. Tishby, F. C. Pereira, and W. Bialek. The information bottleneck method. arXiv preprint physics/0004057, 2000.
- R. Tylecek and R. Šára. Spatial pattern templates for recognition of objects with regular structure. In J. Weickert, M. Hein, and B. Schiele, editors, Pattern Recognition, pages 364–374, Berlin, Heidelberg, 2013. Springer Berlin Heidelberg. ISBN 978-3-642-40602-7.
- S. Watanabe. Information theoretical analysis of multivariate correlation. IBM Journal of research and development, 4(1):66–82, 1960.
- Q. Xie, Z. Dai, Y. Du, E. H. Hovy, and G. Neubig. Controllable invariance through adversarial feature learning. In I. Guyon, U. von Luxburg, S. Bengio, H. M. Wallach, R. Fergus, S. V. N. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, pages 585–596, 2017.
- X. Yu, Y. Chen, S. Liu, T. Li, and G. Li. Multi-mapping image-to-image translation via learning disentanglement. In Advances in Neural Information Processing Systems, pages 2990–2999, 2019.
- S. Zhao, J. Song, and S. Ermon. Learning hierarchical features from deep generative models. In D. Precup and Y. W. Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 4091–4099, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR.
- J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Computer Vision (ICCV), 2017 IEEE International Conference on, 2017.
- J.-Y. Zhu, R. Zhang, D. Pathak, T. Darrell, A. A. Efros, O. Wang, and E. Shechtman. Toward multimodal image-to-image translation. In Advances in Neural Information Processing Systems, 2017.
