Evidential Sparsification of Multimodal Latent Spaces in Conditional Variational Autoencoders

NeurIPS 2020

Abstract

Discrete latent spaces in variational autoencoders have been shown to effectively capture the data distribution for many real-world problems such as natural language understanding, human intent prediction, and visual scene representation. However, discrete latent spaces need to be sufficiently large to capture the complexities of real-world …

Introduction
  • Variational autoencoders (VAEs) with discrete latent spaces have recently shown great success in real-world applications, such as natural language processing [1], image generation [2, 3], and human intent prediction [4].
  • Prohibitively large discrete latent spaces are required to accurately learn complex data distributions [10, 11], thereby causing difficulties in interpretability and rendering downstream tasks computationally challenging.
Highlights
  • Contributions: We introduce a novel method grounded in evidential theory for sparsifying the discrete latent space of a trained conditional variational autoencoder (CVAE).
  • We consider CVAE architectures designed for the tasks of class-conditioned image generation and pedestrian trajectory prediction
  • Model: We present a proof of concept of our method for sparsifying multimodal discrete latent spaces on the CVAE architecture shown in Fig. 1.
  • We present a fully analytical methodology for post hoc discrete latent space reduction in CVAEs
  • Our work focuses on sparsifying the latent space of a conditional variational autoencoder (CVAE)
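The evidential sparsification idea in these highlights can be illustrated with a toy sketch. This is an illustrative simplification inspired by the Dempster-Shafer reading of a softmax layer's logits as accumulated evidence, not the authors' exact algorithm; the feature and weight values below are invented for the example:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def evidential_filter(features, weights):
    """Toy post hoc sparsification of a softmax layer (illustrative only).
    features: (J,) nonnegative activations; weights: (J, K) softmax-layer weights.
    A class is kept only if at least one feature contributes strictly positive
    centered evidence toward it; the softmax mass of dropped classes is
    redistributed by renormalization."""
    contrib = features[:, None] * weights                      # (J, K) per-feature logit contributions
    evidence = contrib - contrib.mean(axis=1, keepdims=True)   # center across classes
    direct = np.clip(evidence, 0.0, None).sum(axis=0)          # singleton support per class
    p = softmax(contrib.sum(axis=0))                           # original softmax distribution
    p = np.where(direct > 1e-8, p, 0.0)                        # drop unsupported classes
    return p / p.sum()

features = np.array([1.0, 1.0])
weights = np.array([[2.0, 1.0, 0.0],
                    [0.0, 1.5, 0.5]])
p = evidential_filter(features, weights)  # class 2 receives no direct evidence
```

Here the third class carries softmax mass but never receives positive centered evidence from any feature, so it is filtered out while the remaining two classes keep their relative (multimodal) ordering.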
Methods
  • The authors consider CVAE architectures designed for the tasks of class-conditioned image generation and pedestrian trajectory prediction
  • These real-world tasks require high degrees of distributional multimodality.
  • The authors compare the method to the softmax distribution and to sparsemax, a popular class-reduction technique that achieves a sparse distribution by projecting an input vector onto the probability simplex [24].
  • By design, both the method and sparsemax compute an implicit threshold for each input query post hoc.
  • The code to reproduce the results can be found at: https://github.com/sisl/EvidentialSparsification
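For reference, the sparsemax baseline has a closed-form solution; a minimal NumPy sketch of the simplex projection (not the authors' code at the repository above):

```python
import numpy as np

def sparsemax(z):
    """Euclidean projection of logits z onto the probability simplex
    (Martins and Astudillo, 2016); returns a sparse probability vector."""
    z = np.asarray(z, dtype=float)
    z_sorted = np.sort(z)[::-1]           # descending order
    k = np.arange(1, z.size + 1)
    cssv = np.cumsum(z_sorted)
    support = 1 + k * z_sorted > cssv     # classes that stay in the support
    k_z = k[support][-1]                  # support size
    tau = (cssv[k_z - 1] - 1) / k_z       # implicit per-input threshold
    return np.maximum(z - tau, 0.0)

p = sparsemax([1.2, 1.0, -1.0])           # → [0.6, 0.4, 0.0]
```

The threshold tau is computed per input query, which is the "implicit threshold" behavior the bullet above refers to; logits far below the leaders are clipped to exactly zero.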
Results
  • The authors intentionally use a simplistic CVAE architecture trained on a reasonably simple task to 1) demonstrate the capability of the latent space reduction technique to improve performance post hoc and 2) characterize the results.
  • The authors' algorithm achieves an 89% reduction in the 512 latent classes required to represent each latent variable, while maintaining the multimodality of the distribution.
  • The authors' filtered latent space kept 2–12 latent classes out of 25 total (51.7% of the test set resulted in a six-dimensional latent space), achieving more than a 50% reduction.
Conclusion
  • The authors present a fully analytical methodology for post hoc discrete latent space reduction in CVAEs.
  • The authors leave the investigation of evidential latent space reduction at training time to future work.
  • The authors intend the work to be applicable in the domain of robotics.
  • The authors' work is more broadly applicable to any domain that would benefit from sparsifying the discrete latent distribution of a pre-trained CVAE.
  • Although the authors extensively validate the work empirically, they do not provide theoretical safety guarantees for the removed latent classes, so sufficient safety testing is required for any downstream task.
  • The authors hope that the contribution will enable future positive research outcomes within the fields of robotics, generative modeling, and evidential theory.
Tables
  • Table 1: Downstream classification performance on 1600 sampled images (25 samples × 64 classes) shows that our sparse distribution maintains the original softmax performance, unlike sparsemax. For comparison, the classifier was evaluated on a held-out subset of 1920 images from the original miniImageNet training set. Higher is better for all metrics, and bold highlights the best-performing latent distributions.
  • Table 2: Our proposed distribution maintains the performance of softmax while sparsifying the latent space. It also maintains the multimodality of the original distribution, unlike sparsemax, which collapses to a unimodal distribution, as seen when computing the minimum over the top five most likely latent modes. Direct sampling metrics were computed over 2000 samples from the Trajectron++ network, and the top-five metrics were computed over 500 samples per latent class. For all metrics, lower is better, and the best performance is highlighted in bold.
Related Work
  • In recent years, a number of new perspectives on the softmax function have been presented. The Gumbel-Softmax distribution was introduced to allow backpropagation through categorical distributions, giving rise to the popularity of discrete latent spaces within CVAE architectures [7, 29]. Related works in low-dimensional encodings for VAEs generally focus on regularization [36] and enforcing structure in the latent space [37] during training, but they do not sparsify the latent space post hoc. Sensoy et al. [38] present an evidential approach to epistemic uncertainty estimation in neural networks for classification tasks. They propose learning Dirichlet distribution parameters to form a distribution over softmax functions. The Dirichlet parameters serve as evidence towards singleton classes, resulting in a loss that regularizes misleading evidence towards the vacuous mass. Unlike Sensoy et al. [38], we focus on post hoc distributional sparsification rather than capturing epistemic uncertainty. Duch and Itert [39] suggest a post hoc modification to further disperse uncertainty among the classes, thus flattening the softmax distribution to improve classification performance. We are interested in the opposite objective: removing the classes that have probability mass assigned to them purely through uncertainty allocation, thereby generating a more sharply peaked distribution.
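The Gumbel-Softmax relaxation mentioned above admits a short sketch. This NumPy illustration covers only the sampling step; in a CVAE it would run on autograd tensors with an annealed temperature:

```python
import numpy as np

def gumbel_softmax_sample(logits, tau=1.0, rng=None):
    """Draw one relaxed ('soft' one-hot) sample from a categorical
    distribution via the Gumbel-Softmax trick. Lower tau pushes the
    sample closer to a discrete one-hot vector."""
    rng = np.random.default_rng() if rng is None else rng
    u = rng.uniform(1e-12, 1.0, size=len(logits))
    g = -np.log(-np.log(u))                      # Gumbel(0, 1) noise
    y = (np.asarray(logits, dtype=float) + g) / tau
    e = np.exp(y - y.max())                      # numerically stable softmax
    return e / e.sum()

sample = gumbel_softmax_sample([2.0, 1.0, 0.5], tau=0.5,
                               rng=np.random.default_rng(0))
```

Because the sample is a deterministic, differentiable function of the logits given the noise, gradients can flow through it during training, which is what made large discrete latent spaces practical in CVAEs.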
Study Subjects and Analysis
  • 25 samples per class: downstream classification evaluated on 1600 sampled images (25 samples × 64 classes); see Table 1.
  • 2000 samples: direct sampling metrics computed from the Trajectron++ network, with top-five metrics computed over 500 samples per latent class; see Table 2.
  • 100 samples: behavior prediction on the ETH pedestrian dataset [35], averaging 100 samples from Trajectron++'s output for each latent class.

Figure captions recovered from this section:
  • The filtered distribution (green) outperforms the softmax (blue) and sparsemax (orange) baselines across training iterations on the MNIST dataset; lower is better for distance metrics to the target distribution. On ETH, the method selects distinct, interpretable modes in the latent space while capturing the ground truth, whereas sparsemax occasionally misses the ground truth due to its aggressive filtering scheme.
  • The CVAE architecture used for MNIST image generation: the last layer in each MLP is a softmax layer; at test time, p(z | y) is used to sample the latent space.
  • The CVAE architecture used for NotMNIST image generation: the last layer in each encoder network block is the softmax layer; at test time, only the input query y serves as input to the encoder.

References
  • [1] Yishu Miao and Phil Blunsom. Language as a latent variable: Discrete generative models for sentence compression. In Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 319–328, 2016.
  • [2] Aaron van den Oord, Oriol Vinyals, and Koray Kavukcuoglu. Neural discrete representation learning. In Advances in Neural Information Processing Systems (NeurIPS), pages 6306–6315, 2017.
  • [3] Ali Razavi, Aaron van den Oord, and Oriol Vinyals. Generating diverse high-fidelity images with VQ-VAE-2. In Advances in Neural Information Processing Systems (NeurIPS), pages 14837–14847, 2019.
  • [4] Tim Salzmann, Boris Ivanovic, Punarjay Chakravarty, and Marco Pavone. Trajectron++: Dynamically-feasible trajectory forecasting with heterogeneous data. arXiv, 2020.
  • [5] Edward Schmerling, Karen Leung, Wolf Vollprecht, and Marco Pavone. Multimodal probabilistic model-based planning for human-robot interaction. In International Conference on Robotics and Automation (ICRA), pages 1–9. IEEE, 2018.
  • [6] Thomas M. Moerland, Joost Broekens, and Catholijn M. Jonker. Learning multimodal transition dynamics for model-based reinforcement learning. In 29th Benelux Conference on Artificial Intelligence, page 362, 2017.
  • [7] Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with Gumbel-Softmax. arXiv, 2016.
  • [8] Boris Ivanovic and Marco Pavone. The Trajectron: Probabilistic multi-agent trajectory modeling with dynamic spatiotemporal graphs. In International Conference on Computer Vision (ICCV). IEEE, 2019.
  • [9] Rachit Singh. Sequential discrete latent variables for language modeling. Bachelor's thesis, Harvard University, 2018.
  • [10] Lukasz Kaiser, Samy Bengio, Aurko Roy, Ashish Vaswani, Niki Parmar, Jakob Uszkoreit, and Noam Shazeer. Fast decoding in sequence models using discrete latent variables. In International Conference on Machine Learning (ICML), pages 2390–2399, 2018.
  • [11] Arash Vahdat, William Macready, Zhengbing Bian, Amir Khoshaman, and Evgeny Andriyash. DVAE++: Discrete variational autoencoders with overlapping transformations. In International Conference on Machine Learning (ICML), pages 5035–5044, 2018.
  • [12] Brian Ichter, James Harrison, and Marco Pavone. Learning sampling distributions for robot motion planning. In International Conference on Robotics and Automation (ICRA), pages 7087–7094. IEEE, 2018.
  • [13] Sandeep Chinchali, Apoorva Sharma, James Harrison, Amine Elhafsi, Daniel Kang, Evgenya Pergament, Eyal Cidon, Sachin Katti, and Marco Pavone. Network offloading policies for cloud robotics: a learning-based approach. In Robotics: Science and Systems (RSS), 2019.
  • [14] Alex X. Lee, Richard Zhang, Frederik Ebert, Pieter Abbeel, Chelsea Finn, and Sergey Levine. Stochastic adversarial video prediction. arXiv, 2018.
  • [15] Namhoon Lee, Wongun Choi, Paul Vernaza, Christopher B. Choy, Philip H. S. Torr, and Manmohan Chandraker. DESIRE: Distant future prediction in dynamic scenes with interacting agents. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 336–345. IEEE, 2017.
  • [16] Kihyuk Sohn, Honglak Lee, and Xinchen Yan. Learning structured output representation using deep conditional generative models. In Advances in Neural Information Processing Systems (NeurIPS), pages 3483–3491, 2015.
  • [17] Arthur P. Dempster. A generalization of Bayesian inference. Classic Works of the Dempster-Shafer Theory of Belief Functions, pages 73–104, 2008.
  • [18] Fabio Cuzzolin. Visions of a generalized probability theory. arXiv, 2018.
  • [19] Thierry Denoeux. Logistic regression, neural networks and Dempster-Shafer theory: A new perspective. Knowledge-Based Systems, 2019.
  • [20] Glenn Shafer. A Mathematical Theory of Evidence, volume 42. Princeton University Press, 1976.
  • [21] Thierry Denoeux. Introduction to belief functions. 4th School on Belief Functions and their Applications, 2017.
  • [22] Barry R. Cobb and Prakash P. Shenoy. On the plausibility transformation method for translating belief function models to probability models. International Journal of Approximate Reasoning, 41(3):314–330, 2006.
  • [23] Philippe Smets. Belief functions: the disjunctive rule of combination and the generalized Bayesian theorem. International Journal of Approximate Reasoning, 9(1):1–35, 1993.
  • [24] Andre Martins and Ramon Astudillo. From softmax to sparsemax: A sparse model of attention and multi-label classification. In International Conference on Machine Learning (ICML), pages 1614–1623, 2016.
  • [25] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
  • [26] Oriol Vinyals, Charles Blundell, Timothy Lillicrap, and Daan Wierstra. Matching networks for one shot learning. In Advances in Neural Information Processing Systems (NeurIPS), pages 3630–3638, 2016.
  • [27] Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv, 2017.
  • [28] Yaroslav Bulatov. NotMNIST dataset. Google (Books/OCR), Tech. Rep., 2011. Available: http://yaroslavvb.blogspot.it/2011/09/notmnist-dataset.html
  • [29] Chris J. Maddison, Andriy Mnih, and Yee Whye Teh. The concrete distribution: A continuous relaxation of discrete random variables. In International Conference on Learning Representations (ICLR), 2017.
  • [30] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 248–255. IEEE, 2009.
  • [31] Aaron van den Oord, Nal Kalchbrenner, Lasse Espeholt, Oriol Vinyals, Alex Graves, and Koray Kavukcuoglu. Conditional image generation with PixelCNN decoders. In Advances in Neural Information Processing Systems (NeurIPS), pages 4790–4798, 2016.
  • [32] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In Proceedings of the British Machine Vision Conference (BMVC), pages 87.1–87.12, 2016.
  • [33] Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. Beta-VAE: Learning basic visual concepts with a constrained variational framework. In International Conference on Learning Representations (ICLR), volume 2, page 6, 2017.
  • [34] Shengjia Zhao, Jiaming Song, and Stefano Ermon. InfoVAE: Balancing learning and inference in variational autoencoders. In Conference on Artificial Intelligence. AAAI, 2019.
  • [35] Stefano Pellegrini, Andreas Ess, Konrad Schindler, and Luc Van Gool. You'll never walk alone: Modeling social behavior for multi-target tracking. In International Conference on Computer Vision (ICCV), pages 261–268. IEEE, 2009.
  • [36] Emile Mathieu, Tom Rainforth, N. Siddharth, and Yee Whye Teh. Disentangling disentanglement in variational autoencoders. In International Conference on Machine Learning (ICML), pages 4402–4412, 2019.
  • [37] Adam Kosiorek, Hyunjik Kim, Yee Whye Teh, and Ingmar Posner. Sequential attend, infer, repeat: Generative modelling of moving objects. In Advances in Neural Information Processing Systems (NeurIPS), pages 8606–8616, 2018.
  • [38] Murat Sensoy, Lance Kaplan, and Melih Kandemir. Evidential deep learning to quantify classification uncertainty. In Advances in Neural Information Processing Systems (NeurIPS), pages 3179–3189, 2018.
  • [39] Włodzisław Duch and Łukasz Itert. A posteriori corrections to classification methods. In Neural Networks and Soft Computing, pages 406–411.
  • [40] Anirban Laha, Saneem Ahmed Chemmengath, Priyanka Agrawal, Mitesh Khapra, Karthik Sankaranarayanan, and Harish G. Ramaswamy. On controllable sparse alternatives to softmax. In Advances in Neural Information Processing Systems (NeurIPS), pages 6422–6432, 2018.
  • [41] Gonçalo M. Correia, Vlad Niculae, Wilker Aziz, and André F. T. Martins. Efficient marginalization of discrete and structured latent variables via sparsity. arXiv preprint arXiv:2007.01919, 2020.
  • [42] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR), 2015.
  • Appendix, belief-function axiom 3 (superadditivity): for every positive integer n and every collection A1, ..., An of subsets of Z, Bel(A1 ∪ ⋯ ∪ An) ≥ Σ_{∅ ≠ I ⊆ {1, ..., n}} (−1)^{|I|+1} Bel(⋂_{i ∈ I} Ai).
Authors
Masha Itkina
Boris Ivanovic
Ransalu Senanayake