# Implicit Rank-Minimizing Autoencoder

NeurIPS 2020.

Abstract:

An important component of autoencoders is the method by which the information capacity of the latent representation is minimized or limited. In this work, the rank of the covariance matrix of the codes is implicitly minimized by relying on the fact that gradient descent learning in multi-layer linear networks leads to minimum-rank solutions...

Introduction

- Optimizing a linear multi-layer neural network through gradient descent leads to a low-rank solution.
- The method, the Implicit Rank-Minimizing Autoencoder (IRMAE), consists of inserting extra linear layers between the encoder and the decoder of a standard autoencoder.
- The authors empirically demonstrate IRMAE’s regularization behavior on a synthetic dataset and show that it learns good representations with a much smaller latent dimension.
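
The idea of inserting extra linear layers between the encoder and the decoder can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: the one-layer `tanh` encoder, the dimensions, and the helper names `apply_linear_stack` and `collapse` are all assumptions made for the example. The sketch also checks a general fact about linear maps: a stack of linear layers computes the same function as the single matrix product of those layers.

```python
import numpy as np

rng = np.random.default_rng(0)

d = 8   # latent dimension (hypothetical value)
l = 4   # number of extra linear layers (hypothetical value)

# Extra square linear layers W_1, ..., W_l, randomly initialized.
linear_layers = [rng.normal(scale=1.0 / np.sqrt(d), size=(d, d)) for _ in range(l)]

def encoder(x, E):
    """Stand-in nonlinear encoder: a single tanh layer (an assumption)."""
    return np.tanh(x @ E)

def apply_linear_stack(z, layers):
    """Apply W_1, then W_2, ..., then W_l, with no nonlinearity in between."""
    for W in layers:
        z = z @ W.T  # rows of z are codes, so W z becomes z @ W.T
    return z

def collapse(layers):
    """Multiply the stack into one d x d matrix W = W_l ... W_2 W_1."""
    W = np.eye(layers[0].shape[0])
    for W_i in layers:
        W = W_i @ W
    return W

x = rng.normal(size=(5, 20))   # batch of 5 toy inputs with 20 features
E = rng.normal(size=(20, d))   # toy encoder weights
z = encoder(x, E)

z_stacked = apply_linear_stack(z, linear_layers)
z_collapsed = z @ collapse(linear_layers).T

print(np.allclose(z_stacked, z_collapsed))
```

Because the stacked and collapsed forms agree, the extra layers do not enlarge the set of functions the model can represent; any effect they have must come from how gradient descent trains the factorized form.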

Highlights

- Optimizing a linear multi-layer neural network through gradient descent leads to a low-rank solution
- This phenomenon is known as implicit regularization and has been extensively studied in the context of matrix factorization [9, 1, 19], linear regression [22, 6], logistic regression [23], and linear convolutional neural networks [8]
- The baseline model, the L1- and L2-regularized models, and the Implicit Rank-Minimizing Autoencoder (IRMAE) with l = 2 all yield excellent reconstructions on the validation set. This result shows that IRMAE with l = 2 is able to learn a good latent representation with a rank close to the intrinsic dimension, while L1 and L2 regularization tend to use a much larger latent space
- The rank of the covariance matrix of the codes is implicitly minimized by relying on the fact that gradient descent learning in multi-layer linear networks leads to minimum-rank solutions
- We demonstrate the validity of the method on several image generation and representation learning tasks

Results

- The authors demonstrate superior representation learning performance of the method against a standard deterministic autoencoder and comparable performance to a variational autoencoder on the MNIST and CelebA datasets through a variety of generative tasks, including interpolation, sample generation from noise, PCA interpolation in low dimension, and a downstream classification task.
- The authors propose a method of inserting extra linear layers in deep neural networks for rank regularization.
- The authors demonstrate superior performance of the method compared to a standard deterministic autoencoder and a variational autoencoder on a variety of generative and downstream classification tasks.
- IRMAE consists of adding extra linear matrices $W_1, W_2, \dots, W_l$ between the encoder and the decoder, where each $W_i \in \mathbb{R}^{d \times d}$ is randomly initialized.
- These matrices encourage latent variables to use a lower number of dimensions and effectively minimize the rank of the covariance matrix of the latent space.
- This result shows that IRMAE with l = 2 is able to learn a good latent representation with a rank close to the intrinsic dimension, while L1 and L2 regularization tend to use a much larger latent space.
- The authors evaluate the model on a variety of representation learning tasks: interpolation between data points, sample generation from random noise, a downstream classification task, and PCA interpolation in latent space.
- Additional experiments are reported in the supplementary material, including comparisons of IRMAE with other deterministic AEs and with AEs of various latent dimensions, and the effect of varying the linear layer depth in IRMAE.
- The authors perform several ablation studies to verify that the effect of dimensionality reduction comes from the extra linear neural network and its optimization dynamics.
- The rank of the covariance matrix of the codes is implicitly minimized by relying on the fact that gradient descent learning in multi-layer linear networks leads to minimum-rank solutions.
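
The quantity discussed above, the rank of the covariance matrix of the codes, can be measured from a batch of latent vectors. The sketch below is one hedged way to do so: the function names, the relative threshold `rel_tol`, and the toy codes are assumptions for illustration and are not taken from the paper.

```python
import numpy as np

def code_covariance_spectrum(codes):
    """Singular values of the covariance matrix of the codes, largest first."""
    centered = codes - codes.mean(axis=0, keepdims=True)
    cov = centered.T @ centered / (codes.shape[0] - 1)
    return np.linalg.svd(cov, compute_uv=False)

def numerical_rank(singular_values, rel_tol=1e-6):
    """Count singular values above rel_tol times the largest (threshold is an assumption)."""
    return int(np.sum(singular_values > rel_tol * singular_values[0]))

rng = np.random.default_rng(0)

# Toy codes: 1000 samples in a 16-dimensional latent space
# that only vary along 3 directions.
basis = rng.normal(size=(3, 16))
codes = rng.normal(size=(1000, 3)) @ basis

s = code_covariance_spectrum(codes)
print(numerical_rank(s))
```

On codes from a trained autoencoder the spectrum usually decays gradually rather than dropping to exactly zero, so the count is sensitive to the chosen threshold; plotting the full spectrum is often more informative than a single number.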

Conclusion

- By inserting a number of extra linear layers between the encoder and the decoder, the system spontaneously learns representations with a low effective dimension.
- The model, dubbed Implicit Rank-Minimizing Autoencoder (IRMAE), is simple and deterministic, and learns a low-rank latent space.
- The authors demonstrate the validity of the method on several image generation and representation learning tasks.

- Table 1: FID score (smaller is better) for samples from various models on MNIST/CelebA.
- Table 2: Downstream classification on the MNIST dataset. An MLP head is added on top of the encoder pretrained by each method, so all models share the same architecture. The pretrained encoder is not fine-tuned, except in the purely supervised version. The representation learned by IRMAE obtains significantly higher accuracy than the baselines and the supervised version in the low-labeled-data regime.
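
For reference, the FID score reported in the first table is, as defined by Heusel et al. (listed in the references), the Fréchet distance between two Gaussians fitted to Inception-network features of real and generated samples. The symbols below are introduced here for illustration: $(\mu_r, \Sigma_r)$ and $(\mu_g, \Sigma_g)$ denote the feature mean and covariance for real and generated images, respectively.

$$
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2 + \mathrm{Tr}\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
$$

The distance is zero when the two Gaussians coincide, which is why a smaller score is better.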

Related work

- The implicit regularization provided by gradient descent optimization is widely believed to be one of the keys to the generalization ability of deep neural networks. Many works focusing on linear cases study this behavior empirically and theoretically. Soudry et al. [23] show that implicit bias helps in learning logistic regression. Saxe et al. [22] study two-layer linear regression and theoretically demonstrate that continuous gradient descent can lead to a low-rank solution. Gidel et al. [6] extend this theory to the discrete case for linear regression problems. In the field of matrix factorization, Gunasekar et al. [9] theoretically prove that gradient descent can arrive at the minimal nuclear norm solution. Arora et al. [1] extend this concept to deep linear networks, demonstrating theoretically and empirically that a deep linear network can arrive at low-rank solutions. Gunasekar et al. [8] prove that gradient descent has a regularization effect in linear convolutional networks. All these works try to understand why gradient descent helps generalization in existing approaches. In contrast, we take advantage of this phenomenon to develop better algorithms. Moreover, current studies of implicit regularization require a small gradient and vanishing initialization, while our method is universal: it can be used with all types of optimizers and allows combination with more complicated components.
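
The effect studied in the matrix factorization literature cited above can be explored with a small experiment. The sketch below is a toy setup chosen for illustration, not an experiment from the paper: it fits the observed entries of a rank-1 matrix by plain gradient descent, once with a single matrix and once with a product of three matrices. The function name `train_factorization`, the problem size, the step size, and the initialization scale are all assumptions.

```python
import numpy as np

def train_factorization(target, mask, depth, steps=3000, lr=0.1,
                        init_scale=0.1, seed=0):
    """Fit observed entries of `target` with a product of `depth` square
    matrices, trained by plain gradient descent from a small random init."""
    rng = np.random.default_rng(seed)
    d = target.shape[0]
    Ws = [init_scale * rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(depth)]
    losses = []
    for _ in range(steps):
        # prefix[i] is the product of the first i layers (prefix[0] = I).
        prefix = [np.eye(d)]
        for W in Ws:
            prefix.append(W @ prefix[-1])
        residual = mask * (prefix[-1] - target)
        losses.append(0.5 * float(np.sum(residual ** 2)))
        # suffix accumulates the product of the layers above layer i.
        suffix = np.eye(d)
        grads = [None] * depth
        for i in reversed(range(depth)):
            grads[i] = suffix.T @ residual @ prefix[i].T
            suffix = suffix @ Ws[i]
        for W, g in zip(Ws, grads):
            W -= lr * g
    product = np.eye(d)
    for W in Ws:
        product = W @ product
    return product, losses

rng = np.random.default_rng(1)
d = 10
u = rng.normal(size=d)
v = rng.normal(size=d)
target = np.outer(u / np.linalg.norm(u), v / np.linalg.norm(v))  # rank-1 target
mask = rng.random((d, d)) < 0.5                                   # observed entries

W_shallow, losses_shallow = train_factorization(target, mask, depth=1)
W_deep, losses_deep = train_factorization(target, mask, depth=3)

# Print singular values normalized by the largest one for each depth.
for name, W in [("depth 1", W_shallow), ("depth 3", W_deep)]:
    s = np.linalg.svd(W, compute_uv=False)
    print(name, np.round(s / s[0], 3))
```

Work such as Arora et al. [1] suggests that the deeper product tends toward a lower-rank fit of the observed entries, but the printed spectra depend on the seed, initialization scale, step size, and number of steps, so no particular output is claimed here.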

References

- [1] Sanjeev Arora, Nadav Cohen, Wei Hu, and Yuping Luo. Implicit regularization in deep matrix factorization. In Advances in Neural Information Processing Systems (NeurIPS ’19), pages 7413–7424, 2019.
- [2] Yoshua Bengio, Aaron C. Courville, and Pascal Vincent. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35:1798–1828, 2013.
- [3] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. arXiv preprint arXiv:2002.05709, 2020.
- [4] Eizaburo Doi and Michael S. Lewicki. Sparse coding of natural images using an overcomplete set of limited capacity units. In L. K. Saul, Y. Weiss, and L. Bottou, editors, Advances in Neural Information Processing Systems 17, pages 377–384. MIT Press, 2005.
- [5] Partha Ghosh, Mehdi S. M. Sajjadi, Antonio Vergari, Michael Black, and Bernhard Schölkopf. From variational to deterministic autoencoders. In International Conference on Learning Representations (ICLR ’20), 2020.
- [6] Gauthier Gidel, Francis Bach, and Simon Lacoste-Julien. Implicit regularization of discrete gradient dynamics in linear neural networks. In Advances in Neural Information Processing Systems 32 (NeurIPS ’19), pages 3202–3211, 2019.
- [7] Rostislav Goroshin and Yann LeCun. Saturating auto-encoders. In International Conference on Learning Representations (ICLR ’13), April 2013.
- [8] Suriya Gunasekar, Jason D. Lee, Daniel Soudry, and Nathan Srebro. Implicit bias of gradient descent on linear convolutional networks. In Advances in Neural Information Processing Systems (NeurIPS ’18), pages 9461–9471, 2018.
- [9] Suriya Gunasekar, Blake E. Woodworth, Srinadh Bhojanapalli, Behnam Neyshabur, and Nati Srebro. Implicit regularization in matrix factorization. In Advances in Neural Information Processing Systems 30 (NeurIPS ’17), pages 6151–6159, 2017.
- [10] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. arXiv preprint arXiv:1911.05722, 2019.
- [11] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Advances in Neural Information Processing Systems 30 (NeurIPS ’17), pages 6626–6637, 2017.
- [12] Irina Higgins, Loïc Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew M. Botvinick, Shakir Mohamed, and Alexander Lerchner. beta-VAE: Learning basic visual concepts with a constrained variational framework. In International Conference on Learning Representations (ICLR ’17), 2017.
- [13] Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2014.
- [14] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. In Proceedings of the IEEE, volume 86, pages 2278–2324, 1998.
- [15] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In 2015 IEEE International Conference on Computer Vision (ICCV ’15), pages 3730–3738, 2015.
- [16] Ishan Misra and Laurens van der Maaten. Self-supervised learning of pretext-invariant representations. In Conference on Computer Vision and Pattern Recognition (CVPR ’20), 2020.
- [17] Andrew Ng. Sparse autoencoder. 2000.
- [18] Marc’Aurelio Ranzato, Y-Lan Boureau, and Yann LeCun. Sparse feature learning for deep belief networks. In Advances in Neural Information Processing Systems (NIPS 2007), volume 20, 2007.
- [19] Noam Razin and Nadav Cohen. Implicit regularization in deep learning may not be explainable by norms. arXiv preprint arXiv:2005.06398, 2020.
- [20] Salah Rifai, Pascal Vincent, Xavier Muller, Xavier Glorot, and Yoshua Bengio. Contractive auto-encoders: Explicit invariance during feature extraction. In International Conference on Machine Learning (ICML ’11), 2011.
- [21] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning internal representations by error propagation. In Parallel Distributed Processing: Explorations in the Microstructure of Cognition, pages 318–362. MIT Press, 1986.
- [22] Andrew M. Saxe, James L. McClelland, and Surya Ganguli. A mathematical theory of semantic development in deep neural networks. Proceedings of the National Academy of Sciences of the United States of America, 116(23):11537–11546, 2019.
- [23] Daniel Soudry, Elad Hoffer, and Nathan Srebro. The implicit bias of gradient descent on separable data. In International Conference on Learning Representations (ICLR ’18), 2018.
- [24] Ilya Tolstikhin, Olivier Bousquet, Sylvain Gelly, and Bernhard Schölkopf. Wasserstein autoencoders. In International Conference on Learning Representations (ICLR ’18), 2018.
- [25] Aäron van den Oord, Oriol Vinyals, and Koray Kavukcuoglu. Neural discrete representation learning. In Advances in Neural Information Processing Systems (NeurIPS ’17), 2017.
- [26] Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. Extracting and composing robust features with denoising autoencoders. In International Conference on Machine Learning (ICML ’08), 2008.
