# To Relieve Your Headache of Training an MRF, Take AdVIL

ICLR, 2020.

Abstract:

We propose a black-box algorithm called {\it Adversarial Variational Inference and Learning} (AdVIL) to perform inference and learning on a general Markov random field (MRF). AdVIL employs two variational distributions to approximately infer the latent variables and estimate the partition function of an MRF, respectively.

Highlights

- Markov random fields (MRFs) find applications in a variety of machine learning areas (Krahenbuhl & Koltun, 2011; Salakhutdinov & Larochelle, 2010; Lafferty et al, 2001)
- We propose Adversarial Variational Inference and Learning (AdVIL) to relieve some of the headache of learning a Markov random field model
- We evaluate AdVIL in various undirected generative models, including restricted Boltzmann machines (RBM) (Ackley et al, 1985), deep Boltzmann machines (DBM) (Salakhutdinov & Hinton, 2009), and Gaussian restricted Boltzmann machines (GRBM) (Hinton & Salakhutdinov, 2006), on several real datasets
- We empirically demonstrate that (1) compared to the black-box NVIL (Kuleshov & Ermon, 2017) method, AdVIL provides a tighter estimate of the log partition function and achieves much better log-likelihood results; and (2) compared to contrastive divergence based methods (Hinton, 2002; Welling & Sutton, 2005), AdVIL can deal with a broader family of Markov random fields without model-specific analysis and obtain better results when the model structure gets complex, as in deep Boltzmann machines
- We present a detailed analysis of AdVIL in restricted Boltzmann machines, whose energy function is defined as $E(v, h) = -b^\top v - v^\top W h - c^\top h$
- We show the ability of AdVIL to learn a Gaussian restricted Boltzmann machine on the continuous Frey faces dataset
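The RBM energy function above, and the free energy that AdVIL's encoder bounds, can be sketched in a few lines of NumPy. This is only an illustration under our own assumptions (function names, toy sizes, and zero biases are ours, not the paper's):

```python
import numpy as np

def rbm_energy(v, h, W, b, c):
    """Binary RBM energy: E(v, h) = -b^T v - v^T W h - c^T h."""
    return -b @ v - v @ W @ h - c @ h

def rbm_free_energy(v, W, b, c):
    """Free energy F(v) = -log sum_h exp(-E(v, h)), computed analytically
    by summing out each binary hidden unit independently."""
    return -b @ v - np.logaddexp(0.0, c + v @ W).sum()

# Toy RBM with 4 visible and 3 hidden units.
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(4, 3))
b = np.zeros(4)
c = np.zeros(3)
v = np.array([1.0, 0.0, 1.0, 1.0])
h = np.array([0.0, 1.0, 1.0])
print(rbm_energy(v, h, W, b, c), rbm_free_energy(v, W, b, c))
```

The analytic free energy is tractable here only because the hidden units are conditionally independent given `v`; for models like DBMs it is intractable, which is where a variational encoder becomes necessary.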

Summary

- Markov random fields (MRFs) find applications in a variety of machine learning areas (Krahenbuhl & Koltun, 2011; Salakhutdinov & Larochelle, 2010; Lafferty et al, 2001).
- NVIL introduces a variational distribution and derives an upper bound of the partition function in a general MRF, in the same spirit as amortized inference (Kingma & Welling, 2013; Rezende et al, 2014; Mnih & Gregor, 2014) for directed models.
- AdVIL introduces a variational encoder to infer the latent variables, which provides an upper bound of the free energy.
- AdVIL introduces a variational decoder for the MRF, which provides a lower bound of the log partition function.
- We empirically demonstrate that (1) compared to the black-box NVIL (Kuleshov & Ermon, 2017) method, AdVIL provides a tighter estimate of the log partition function and achieves much better log-likelihood results; and (2) compared to contrastive divergence based methods (Hinton, 2002; Welling & Sutton, 2005), AdVIL can deal with a broader family of MRFs without model-specific analysis and obtain better results when the model structure gets complex as in DBM.
- Existing traditional methods (Neal, 2001; Hinton, 2002; Winn & Bishop, 2005; Wainwright & Jordan, 2006; Rother et al, 2007) can be used to estimate the log partition function but are nontrivial to extend to learning general MRFs. Some methods (Winn & Bishop, 2005; Neal, 2001) require an expensive inference procedure for each update of the model, and others (Hinton, 2002; Rother et al, 2007) cannot be directly applied to general cases (e.g., DBM).
- AdVIL obtains the objective function in a unified perspective on the black-box inference and learning in general MRFs. Note that dealing with latent variables in MRFs is nontrivial (Kim & Bengio, 2016) and existing work focuses on fully observable models.
- We would like to demonstrate that AdVIL has the ability to deal with highly intractable models such as a DBM conveniently and effectively, compared to standard CD-based methods (Hinton, 2002; Welling & Hinton, 2002; Welling & Sutton, 2005) and NVIL (Kuleshov & Ermon, 2017).
- The key to AdVIL is a double variational trick that approximates the negative free energy and the log partition function separately.
- Empirical results show that AdVIL can deal with a broad family of MRFs in a fully black-box manner and outperforms both the standard contrastive divergence method and the black-box NVIL algorithm.
- Though AdVIL shows promising results, we emphasize that the black-box learning and inference of the MRFs are far from completely solved, especially on high-dimensional data.
- Based on our current results, we conjecture that AdVIL would be comparable to CD in RBM and superior to VCD in DBM on larger datasets, provided AdVIL can be trained to near convergence.
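The double variational trick described above can be checked numerically on a model small enough to enumerate. The sketch below is our own illustration, with untrained factorized Bernoulli variational distributions as an assumption: it verifies that an encoder q(h|v) upper-bounds the free energy and a decoder q(v, h) lower-bounds the log partition function.

```python
import itertools
import numpy as np

def energy(v, h, W, b, c):
    """Binary RBM energy: E(v, h) = -b^T v - v^T W h - c^T h."""
    return -b @ v - v @ W @ h - c @ h

def bernoulli_entropy(p):
    """Entropy of a factorized Bernoulli distribution with means p."""
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return float(-(p * np.log(p) + (1 - p) * np.log(1 - p)).sum())

def bern_prob(x, p):
    """Probability of binary vector x under a factorized Bernoulli with means p."""
    return float(np.prod(np.where(x == 1, p, 1 - p)))

rng = np.random.default_rng(1)
W = rng.normal(scale=0.5, size=(3, 2))
b = rng.normal(size=3)
c = rng.normal(size=2)

states_v = [np.array(s, float) for s in itertools.product([0, 1], repeat=3)]
states_h = [np.array(s, float) for s in itertools.product([0, 1], repeat=2)]

# Exact quantities by brute force on this tiny model.
logZ = np.log(sum(np.exp(-energy(vv, hh, W, b, c))
                  for vv in states_v for hh in states_h))
v = np.array([1.0, 0.0, 1.0])
free_energy = -np.log(sum(np.exp(-energy(v, hh, W, b, c)) for hh in states_h))

# Encoder q(h|v): F(v) <= E_q[E(v, h)] - H[q(h|v)].
q_h = np.array([0.3, 0.7])
enc_bound = sum(bern_prob(hh, q_h) * energy(v, hh, W, b, c)
                for hh in states_h) - bernoulli_entropy(q_h)

# Decoder q(v, h): log Z >= E_q[-E(v, h)] + H[q(v, h)].
q_v = np.array([0.5, 0.5, 0.5])
q_hd = np.array([0.5, 0.5])
dec_bound = sum(bern_prob(vv, q_v) * bern_prob(hh, q_hd) * -energy(vv, hh, W, b, c)
                for vv in states_v for hh in states_h) \
            + bernoulli_entropy(q_v) + bernoulli_entropy(q_hd)

print(f"free energy {free_energy:.3f} <= encoder bound {enc_bound:.3f}")
print(f"decoder bound {dec_bound:.3f} <= log Z {logZ:.3f}")
```

Training in AdVIL tightens both bounds; here the variational means are arbitrary, so the gaps only illustrate the bound directions, not the quality of a fitted model.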

- Table1: Annealed importance sampling (AIS) results in RBM. The results are recorded on the test set according to the best validation performance and averaged over three runs. AdVIL outperforms NVIL consistently and significantly. See the standard deviations in Appendix E.5
- Table2: AIS results in DBM. The results are recorded according to the best validation performance and averaged over three runs. AdVIL achieves higher averaged AIS results on five out of eight datasets and has a better overall performance than VCD. See the standard deviations in Appendix E.5
- Table3: Dimensions of the visible variables and sizes of the train, validation and test splits
- Table4: The model structures in RBM experiments
- Table5: The model structures in DBM experiments
- Table6: The AIS results of NVIL and AdVIL in RBM with the means and standard deviations. The results are averaged over three runs with different random seeds
- Table7: The AIS results of VCD-1 and AdVIL in DBM with the means and standard deviations. The results are averaged over three runs with different random seeds
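Tables 1, 2, 6, and 7 report annealed importance sampling (AIS) estimates of model likelihoods. As a reminder of how AIS estimates a partition function, here is a generic sketch for a small binary model; this is our own illustration (uniform base distribution, single-flip Metropolis transitions), not the paper's evaluation code:

```python
import itertools
import numpy as np

def ais_log_z(energy_fn, n_bits, n_temps=200, n_chains=500, seed=0):
    """Estimate log Z of p(x) proportional to exp(-energy_fn(x)) over {0,1}^n_bits
    with AIS, annealing from the uniform distribution via Metropolis moves."""
    rng = np.random.default_rng(seed)
    betas = np.linspace(0.0, 1.0, n_temps + 1)
    x = rng.integers(0, 2, size=(n_chains, n_bits)).astype(float)
    E = np.array([energy_fn(xi) for xi in x])
    log_w = np.zeros(n_chains)
    for k in range(1, n_temps + 1):
        # Importance weight increment: log f_k(x) - log f_{k-1}(x).
        log_w -= (betas[k] - betas[k - 1]) * E
        # One single-flip Metropolis move targeting f_k(x) ~ exp(-beta_k * E(x)).
        flip = rng.integers(0, n_bits, size=n_chains)
        x_prop = x.copy()
        rows = np.arange(n_chains)
        x_prop[rows, flip] = 1.0 - x_prop[rows, flip]
        E_prop = np.array([energy_fn(xi) for xi in x_prop])
        accept = rng.random(n_chains) < np.exp(-betas[k] * (E_prop - E))
        x[accept] = x_prop[accept]
        E[accept] = E_prop[accept]
    # log Z = log Z_0 + log mean(exp(log_w)), with Z_0 = 2^n_bits for the uniform base.
    m = log_w.max()
    return n_bits * np.log(2) + m + np.log(np.mean(np.exp(log_w - m)))

# Toy energy over 4 binary variables, small enough for an exact check.
rng = np.random.default_rng(3)
J = rng.normal(scale=0.5, size=(4, 4))
f = rng.normal(scale=0.5, size=4)
def toy_energy(x):
    return -(x @ J @ x) - f @ x

log_z_exact = np.log(sum(np.exp(-toy_energy(np.array(s, float)))
                         for s in itertools.product([0, 1], repeat=4)))
log_z_ais = ais_log_z(toy_energy, n_bits=4)
print(f"exact {log_z_exact:.3f}  AIS estimate {log_z_ais:.3f}")
```

Practical AIS for RBMs uses a base-rate RBM and block Gibbs transitions rather than uniform-base Metropolis, but the weight-accumulation logic is the same.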

Related work

- Existing traditional methods (Neal, 2001; Hinton, 2002; Winn & Bishop, 2005; Wainwright & Jordan, 2006; Rother et al, 2007) can be used to estimate the log partition function but are nontrivial to extend to learning general MRFs. Some methods (Winn & Bishop, 2005; Neal, 2001) require an expensive inference procedure for each update of the model, and others (Hinton, 2002; Rother et al, 2007) cannot be directly applied to general cases (e.g., DBM). Among these methods, contrastive divergence (CD) (Hinton, 2002) is proven effective in certain types of models and is closely related to AdVIL. Indeed, the partial derivative of the AdVIL objective with respect to $\theta$ is:

$$\frac{\partial \mathcal{L}_2(\theta, \phi, \psi)}{\partial \theta} = \mathbb{E}_{p_D(v) q(h|v)}\left[\frac{\partial E(v, h)}{\partial \theta}\right] - \mathbb{E}_{q(v, h)}\left[\frac{\partial E(v, h)}{\partial \theta}\right], \qquad (12)$$

which also naturally involves a positive phase and a negative phase and is quite similar to Eqn. (3). Notably, however, both phases average over $(v, h)$ pairs and require only knowledge of the energy function, without any further assumptions about the model. Therefore, AdVIL is better suited to general MRFs than CD (see the empirical evidence in Sec. 5.3).
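A Monte Carlo estimate of the Eqn. (12) gradient for the RBM weight matrix W (where the energy gives dE/dW = -v h^T) might look like the following sketch; the function name and batch layout are our assumptions, not the paper's code:

```python
import numpy as np

def grad_W_estimate(v_data, h_enc, v_dec, h_dec):
    """Monte Carlo estimate of the Eqn. (12) gradient for the RBM weight
    matrix W, using dE/dW = -v h^T.  v_data are data visibles with h_enc
    drawn from the encoder q(h|v); (v_dec, h_dec) are joint samples from
    the decoder q(v, h).  All inputs have shape (batch, dim)."""
    pos = -np.einsum('ni,nj->ij', v_data, h_enc) / len(v_data)  # positive phase
    neg = -np.einsum('ni,nj->ij', v_dec, h_dec) / len(v_dec)    # negative phase
    return pos - neg

# A single (v, h) pair per phase, just to show the shapes.
v_data = np.array([[1.0, 0.0]])
h_enc = np.array([[1.0, 1.0]])
v_dec = np.array([[0.0, 1.0]])
h_dec = np.array([[1.0, 0.0]])
print(grad_W_estimate(v_data, h_enc, v_dec, h_dec))
```

Note that, as the text says, both phases only evaluate (an average of derivatives of) the energy on sampled pairs; nothing here depends on the RBM's conditional structure, which is what makes the estimator black-box.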

Funding

- This work was supported by the National Key Research and Development Program of China (No. 2017YFA0700904), NSFC Projects (Nos. 61620106010, U19B2034, U1811461), Beijing NSF Project (No. L172037), Beijing Academy of Artificial Intelligence (BAAI), the Tsinghua-Huawei Joint Research Program, a grant from Tsinghua Institute for Guo Qiang, the Tiangong Institute for Intelligent Computing, the JP Morgan Faculty Research Program, and the NVIDIA NVAIL Program with GPU/DGX Acceleration.
- Li was supported by the Chinese postdoctoral innovative talent support program and the Shuimu Tsinghua Scholar program.
