# Latent Alignment and Variational Attention

Advances in Neural Information Processing Systems 31 (NIPS 2018).

Abstract:

Neural attention has become central to many state-of-the-art models in natural language processing and related domains. Attention networks are an easy-to-train and effective method for softly simulating alignment; however, the approach does not marginalize over latent alignments in a probabilistic sense. This property makes it difficult to compare attention to other alignment approaches, to compose it with probabilistic models, and to perform posterior inference conditioned on observed data.


Introduction

- Attention networks [6] have quickly become the foundation for state-of-the-art models in natural language understanding, question answering, speech recognition, image captioning, and more [15, 81, 16, 14, 63, 80, 71, 62].
- Hard attention [80] makes this connection explicit by introducing a latent variable for alignment and optimizing a bound on the log marginal likelihood using policy gradients.
- The function f gives a distribution over the output, e.g. an exponential family.
- To fit this model to data, the authors set the model parameters θ by maximizing the log marginal likelihood of training examples (x, x̃, ŷ): max_θ log p(y = ŷ | x, x̃; θ) = max_θ log E_z[f(x, z; θ)_ŷ].
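The distinction between marginalizing over the alignment z and soft attention's expectation inside f can be sketched numerically. The toy model below is an illustrative assumption, not the paper's code; all names and shapes are hypothetical:

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

# Toy setup: T source "encodings", a decoder query, z a categorical alignment over T.
rng = np.random.default_rng(0)
T, H, V = 4, 8, 10
x = rng.normal(size=(T, H))          # source representations
query = rng.normal(size=H)           # current decoder state (the "x~" query)
W = rng.normal(size=(H, V))          # output projection to a vocab of size V

p_z = softmax(x @ query)             # alignment distribution p(z | x, x~)

# Latent alignment: marginalize over z OUTSIDE f,
#   p(y | x, x~) = E_z[f(x, z)_y] = sum_z p(z) * softmax(x_z W)_y
p_y_latent = sum(p_z[z] * softmax(x[z] @ W) for z in range(T))

# Soft attention: expectation taken INSIDE f, i.e. f applied to E_z[x_z]
context = p_z @ x
p_y_soft = softmax(context @ W)

y = 3
log_marginal = np.log(p_y_latent[y])  # objective maximized by enumeration
```

Both quantities are valid output distributions, but only the first corresponds to the log marginal likelihood of a latent alignment model; enumeration over z is feasible here because T is small.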

Highlights

- Attention networks [6] have quickly become the foundation for state-of-the-art models in natural language understanding, question answering, speech recognition, image captioning, and more [15, 81, 16, 14, 63, 80, 71, 62]
- Still the latent alignment approach remains appealing for several reasons: (a) latent variables facilitate reasoning about dependencies in a probabilistically principled way, e.g. allowing composition with other models, (b) posterior inference provides a better basis for model analysis and partial predictions than strictly feed-forward models, which have been shown to underperform on alignment in machine translation [38], and (c) directly maximizing marginal likelihood may lead to better results
- We are careful to use alignment to refer to this probabilistic model (Section 2.1), and soft and hard attention to refer to two particular inference approaches used in the literature to estimate alignment models (Section 2.2)
- On the larger WMT 2017 English-German task, the superior performance of variational attention persists: our baseline soft attention reaches 24.10 BLEU score, while variational attention reaches 24.98. Note that this only reflects a reasonable setting without exhaustive tuning, yet we show that we can train variational attention at scale
- For visual question answering (VQA), the trend is largely similar, and the negative log-likelihood (NLL) results for variational attention improve on both soft and hard attention.
- Attention methods are a ubiquitous tool in areas like natural language processing, yet they are difficult to use as latent variable models.

Methods

- Setup For NMT the authors mainly use the IWSLT dataset [13].
- This dataset is relatively small, but has become a standard benchmark for experimental NMT models.
- To show that variational attention scales to large datasets, the authors experiment on the WMT 2017 English-German dataset [8], following the preprocessing in [74] except that the authors use newstest2017 as the test set.
- As the authors are interested in intrinsic evaluation in addition to the standard VQA metric, they randomly select half of the standard validation set as the test set (hence the numbers provided are not strictly comparable to existing work). While the preprocessing is the same as [2], the numbers are worse than previously reported because the authors do not apply any of the commonly-used techniques for improving VQA performance, such as data augmentation and label smoothing.

Results

**Results and Discussion**

Table 1 shows the main results.

- The authors first note that hard attention underperforms soft attention, even when its expectation is enumerated.
- For NMT, on the IWSLT 2014 German-English task, variational attention with enumeration and sampling performs comparably to optimizing the log marginal likelihood, despite the fact that it is optimizing a lower bound.
- The authors believe that this is due to the use of q(z), which conditions on the entire source/target and potentially provides a better training signal to p(z | x, x̃) through the KL term.
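The lower bound in question is the standard evidence lower bound (ELBO); in the paper's notation (x the source, x̃ the query, ŷ the observed output), it can be written as:

```latex
\log p(\hat{y} \mid x, \tilde{x})
  \;\ge\; \mathbb{E}_{q(z)}\!\left[\log f(x, z; \theta)_{\hat{y}}\right]
  \;-\; \mathrm{KL}\!\left(q(z \mid x, \tilde{x}, \hat{y}) \,\Vert\, p(z \mid x, \tilde{x})\right)
```

Because q also conditions on ŷ, the KL term pulls the prior alignment p(z | x, x̃) toward alignments that actually explain the observed output.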

Conclusion

- Attention methods are a ubiquitous tool in areas like natural language processing, yet they are difficult to use as latent variable models.
- This work explores alternative approaches to latent alignment through variational attention, with promising results.
- Future work will experiment with scaling the method on larger-scale tasks and in more complex models, such as multi-hop attention models, transformer models, and structured models, as well as utilizing these latent variables for interpretability and as a way to incorporate prior knowledge

Tables

- Table 1: Evaluation on NMT and VQA for the various models. The E column indicates whether the expectation is calculated via enumeration (Enum) or a single sample (Sample) during training. For NMT we evaluate intrinsically on perplexity (PPL; lower is better) and extrinsically on BLEU (higher is better), where for BLEU we perform beam search with beam size 10 and length penalty (see Appendix B for further details). For VQA we evaluate intrinsically on negative log-likelihood (NLL; lower is better) and extrinsically on the VQA evaluation metric (higher is better). All results except for relaxed attention use enumeration at test time.
- Table 2 (left): Performance change on NMT from exact decoding to K-max decoding with K = 5 (see Section 5 for the definition of K-max decoding). This compares test inference for variational attention; for all methods exact enumeration is better, but K-max is a reasonable approximation. Table 2 (right): Test perplexity of the different approaches while varying K to estimate E_z[p(y | x, x̃)]; dotted lines compare the soft baseline and variational attention with full enumeration. Good performance requires K > 1, but the benefits are only marginal for K > 5. Finally, it is possible to train with soft attention and test using K-max with a small performance drop (Soft KMax in Table 2, right), which possibly indicates that soft attention models are approximating latent alignment models. On the other hand, training with latent alignments and testing with soft attention performed badly.
- Table 3 (left): Comparison against the best prior work for NMT on the IWSLT 2014 German-English test set. Even with sampling, the system improves on the state of the art. On the larger WMT 2017 English-German task, the superior performance of variational attention persists: the baseline soft attention reaches 24.10 BLEU, while variational attention reaches 24.98. Note that this reflects a reasonable setting without exhaustive tuning, yet it shows that variational attention can be trained at scale. For VQA the trend is largely similar, and the NLL results for variational attention improve on soft and hard attention, although the task-specific evaluation metrics are slightly worse. Table 3 (upper right): Comparison of inference alternatives for variational attention on IWSLT 2014. Table 3 (lower right): Comparison of different models in terms of implied discrete entropy (lower = more certain alignment).
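K-max decoding can be sketched as follows: keep only the K most probable alignments and renormalize their mass before taking the expectation. The function name and exact renormalization below are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

def kmax_marginal(p_z, p_y_given_z, K=5):
    """Approximate E_z[p(y | x, z)] using only the K most probable
    alignments, renormalizing their mass (a sketch of K-max decoding;
    the paper's exact details may differ)."""
    topk = np.argsort(p_z)[-K:]
    w = p_z[topk] / p_z[topk].sum()      # renormalized alignment weights
    return sum(wi * p_y_given_z[zi] for wi, zi in zip(w, topk))

# Toy distributions: T alignment positions, vocab of size V.
rng = np.random.default_rng(1)
T, V = 20, 50
p_z = softmax(rng.normal(size=T) * 2.0)            # alignment distribution
p_y_given_z = np.stack([softmax(rng.normal(size=V)) for _ in range(T)])

exact = p_z @ p_y_given_z                          # full enumeration, O(T)
approx = kmax_marginal(p_z, p_y_given_z, K=5)      # K-max, O(K)
```

When the alignment distribution is peaked, the top-K terms carry almost all of the mass, which is why a small K suffices in the experiments above.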

Related work

- Latent alignment has long been a core problem in NLP, starting with the seminal IBM models [11], HMM-based alignment models [75], and a fast log-linear reparameterization of the IBM 2 model [20]. Neural soft attention models were originally introduced as an alternative approach for neural machine translation [6], and have subsequently been successful on a wide range of tasks (see [15] for a review of applications). Recent work has combined neural attention with traditional alignment [18, 72] and induced structure/sparsity [48, 33, 44, 85, 54, 55, 49], which can be combined with the variational approaches outlined in this paper.

In contrast to soft attention models, hard attention [80, 3] approaches use a single sample at training time instead of a distribution. These models have proven much more difficult to train, and existing works typically treat hard attention as a black-box reinforcement learning problem with log-likelihood as the reward [80, 3, 53, 26, 19]. Two notable exceptions are [4, 41]: both utilize amortized variational inference to learn a sampling distribution that is used to obtain importance-sampled estimates of the log marginal likelihood [12]. Our method uses different estimators and targets the single-sample approach for efficiency, allowing the method to be employed for NMT and VQA applications.
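The black-box policy-gradient treatment mentioned above corresponds to the score-function (REINFORCE) estimator: sample one hard alignment, then weight the gradient of its log-probability by the log-likelihood reward. A minimal single-sample sketch, with illustrative names and a stand-in reward value:

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

# grad_theta E_z[log f_y] = E_z[log f_y(z) * grad_theta log p(z)],
# estimated here with a single sample (names and shapes are illustrative).
rng = np.random.default_rng(2)
T = 6
scores = rng.normal(size=T)          # unnormalized alignment scores
p_z = softmax(scores)
z = rng.choice(T, p=p_z)             # one hard alignment sample
reward = -1.2                        # stand-in for the reward log f(x, z)_y

# For a softmax parameterization, d log p(z) / d scores = one_hot(z) - p_z.
grad_log_p = np.eye(T)[z] - p_z
grad_scores = reward * grad_log_p    # single-sample REINFORCE estimate
```

This estimator is unbiased but high-variance, which is exactly why such training is difficult and why the paper's variational approach with a learned q(z) is attractive.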

Funding

- This project was supported by a Facebook Research Award (Low Resource NMT)
- YK is supported by a Google AI PhD Fellowship
- YD is supported by a Bloomberg Research Award
- AMR gratefully acknowledges the support of NSF CCF-1704834 and an Amazon AWS Research award

Reference

- David Alvarez-Melis and Tommi S Jaakkola. A Causal Framework for Explaining the Predictions of Black-Box Sequence-to-Sequence Models. In Proceddings of EMNLP, 2017.
- Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. Bottom-up and Top-Down Attention for Image Captioning and Visual Question Answering. In Proceedings of CVPR, 2018.
- Jimmy Ba, Volodymyr Mnih, and Koray Kavukcuoglu. Multiple Object Recognition with Visual Attention. In Proceedings of ICLR, 2015.
- Jimmy Ba, Ruslan R Salakhutdinov, Roger B Grosse, and Brendan J Frey. Learning Wake-Sleep Recurrent Attention Models. In Proceedings of NIPS, 2015.
- Dzmitry Bahdanau, Philemon Brakel, Kelvin Xu, Anirudh Goyal, Ryan Lowe, Joelle Pineau, Aaron Courville, and Yoshua Bengio. An Actor-Critic Algorithm for Sequence Prediction. In Proceedings of ICLR, 2017.
- Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural Machine Translation by Jointly Learning to Align and Translate. In Proceedings of ICLR, 2015.
- Hareesh Bahuleyan, Lili Mou, Olga Vechtomova, and Pascal Poupart. Variational Attention for Sequence-to-Sequence Models. arXiv:1712.08207, 2017.
- Ondrej Bojar, Christian Buck, Rajen Chatterjee, Christian Federmann, Yvette Graham, Barry Haddow, Matthias Huck, Antonio Jimeno Yepes, Philipp Koehn, and Julia Kreutzer. Proceedings of the Second Conference on Machine Translation. Association for Computational Linguistics, 2017.
- Jorg Bornschein, Andriy Mnih, Daniel Zoran, and Danilo J. Rezende. Variational Memory Addressing in Generative Models. In Proceedings of NIPS, 2017.
- Peter F Brown, Vincent J Della Pietra, Stephen A Della Pietra, and Robert L Mercer. The Mathematics of Statistical Machine Translation: Parameter Estimation. Computational linguistics, 19(2):263–311, 1993.
- Yuri Burda, Roger Grosse, and Ruslan Salakhutdinov. Importance Weighted Autoencoders. In Proceedings of ICLR, 2015.
- Mauro Cettolo, Jan Niehues, Sebastian Stuker, Luisa Bentivogli, and Marcello Federico. Report on the 11th IWSLT evaluation campaign. In Proceedings of IWSLT, 2014.
- William Chan, Navdeep Jaitly, Quoc Le, and Oriol Vinyals. Listen, Attend and Spell. arXiv:1508.01211, 2015.
- Kyunghyun Cho, Aaron Courville, and Yoshua Bengio. Describing Multimedia Content using Attention-based Encoder-Decoder Networks. In IEEE Transactions on Multimedia, 2015.
- Jan Chorowski, Dzmitry Bahdanau, Dmitriy Serdyuk, Kyunghyun Cho, and Yoshua Bengio. Attention-Based Models for Speech Recognition. In Proceedings of NIPS, 2015.
- Junyoung Chung, Kyle Kastner, Laurent Dinh, Kratarth Goel, Aaron Courville, and Yoshua Bengio. A Recurrent Latent Variable Model for Sequential Data. In Proceedings of NIPS, 2015.
- Trevor Cohn, Cong Duy Vu Hoang, Ekaterina Vymolova, Kaisheng Yao, Chris Dyer, and Gholamreza Haffari. Incorporating Structural Alignment Biases into an Attentional Neural Translation Model. In Proceedings of NAACL, 2016.
- Yuntian Deng, Anssi Kanervisto, Jeffrey Ling, and Alexander M Rush. Image-to-Markup Generation with Coarse-to-Fine Attention. In Proceedings of ICML, 2017.
- Chris Dyer, Victor Chahuneau, and Noah A. Smith. A Simple, Fast, and Effective Reparameterization of IBM Model 2. In Proceedings of NAACL, 2013.
- Sergey Edunov, Myle Ott, Michael Auli, David Grangier, and Marc’Aurelio Ranzato. Classical Structured Prediction Losses for Sequence to Sequence Learning. In Proceedings of NAACL, 2018.
- Marco Fraccaro, Soren Kaae Sonderby, Ulrich Paquet, and Ole Winther. Sequential Neural Models with Stochastic Layers. In Proceedings of NIPS, 2016.
- Anirudh Goyal, Alessandro Sordoni, Marc-Alexandre Cote, Nan Rosemary Ke, and Yoshua Bengio. Z-Forcing: Training Stochastic Recurrent Networks. In Proceedings of NIPS, 2017.
- Will Grathwohl, Dami Choi, Yuhuai Wu, Geoffrey Roeder, and David Duvenaud. Backpropagation through the Void: Optimizing control variates for black-box gradient estimation. In Proceedings of ICLR, 2018.
- Jiatao Gu, Zhengdong Lu, Hang Li, and Victor OK Li. Incorporating Copying Mechanism in Sequence-to-Sequence Learning. In Proceedings of ACL, 2016.
- Caglar Gulcehre, Sarath Chandar, Kyunghyun Cho, and Yoshua Bengio. Dynamic Neural Turing Machine with Soft and Hard Addressing Schemes. arXiv:1607.00036, 2016.
- Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep Residual Learning for Image Recognition. In Proceedings of CVPR, 2016.
- Matthew D Hoffman, David M Blei, Chong Wang, and John Paisley. Stochastic variational inference. The Journal of Machine Learning Research, 14(1):1303–1347, 2013.
- Po-Sen Huang, Chong Wang, Sitao Huang, Dengyong Zhou, and Li Deng. Towards neural phrase-based machine translation. In Proceedings of ICLR, 2018.
- Eric Jang, Shixiang Gu, and Ben Poole. Categorical Reparameterization with Gumbel-Softmax. In Proceedings of ICLR, 2017.
- Martin Jankowiak and Fritz Obermeyer. Pathwise Derivatives Beyond the Reparameterization Trick. In Proceedings of ICML, 2018.
- Yoon Kim, Carl Denton, Luong Hoang, and Alexander M. Rush. Structured Attention Networks. In Proceedings of ICLR, 2017.
- Yoon Kim, Sam Wiseman, Andrew C Miller, David Sontag, and Alexander M Rush. Semi-amortized variational autoencoders. arXiv preprint arXiv:1802.02550, 2018.
- Diederik P. Kingma and Jimmy Ba. Adam: A Method for Stochastic Optimization. In Proceedings of ICLR, 2015.
- Diederik P. Kingma and Max Welling. Auto-Encoding Variational Bayes. In Proceedings of ICLR, 2014.
- Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, et al. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th annual meeting of the ACL on interactive poster and demonstration sessions, pages 177–180. Association for Computational Linguistics, 2007.
- Philipp Koehn and Rebecca Knowles. Six Challenges for Neural Machine Translation. arXiv:1706.03872, 2017.
- Rahul G. Krishnan, Dawen Liang, and Matthew Hoffman. On the Challenges of Learning with Inference Networks on Sparse, High-dimensional Data. In Proceedings of AISTATS, 2018.
- Rahul G. Krishnan, Uri Shalit, and David Sontag. Structured Inference Networks for Nonlinear State Space Models. In Proceedings of AAAI, 2017.
- Dieterich Lawson, Chung-Cheng Chiu, George Tucker, Colin Raffel, Kevin Swersky, and Navdeep Jaitly. Learning Hard Alignments in Variational Inference. In Proceedings of ICASSP, 2018.
- Jason Lee, Elman Mansimov, and Kyunghyun Cho. Deterministic Non-Autoregressive Neural Sequence Modeling by Iterative Refinement. arXiv:1802.06901, 2018.
- Tao Lei, Regina Barzilay, and Tommi Jaakkola. Rationalizing Neural Predictions. In Proceedings of EMNLP, 2016.
- Yang Liu and Mirella Lapata. Learning Structured Text Representations. In Proceedings of TACL, 2017.
- Minh-Thang Luong, Hieu Pham, and Christopher D. Manning. Effective Approaches to Attention-based Neural Machine Translation. In Proceedings of EMNLP, 2015.
- Xuezhe Ma, Yingkai Gao, Zhiting Hu, Yaoliang Yu, Yuntian Deng, and Eduard Hovy. Dropout with Expectation-linear Regularization. In Proceedings of ICLR, 2017.
- Chris J. Maddison, Andriy Mnih, and Yee Whye Teh. The Concrete Distribution: A Continuous Relaxation of Discrete Random Variables. In Proceedings of ICLR, 2017.
- André F. T. Martins and Ramón Fernandez Astudillo. From Softmax to Sparsemax: A Sparse Model of Attention and Multi-Label Classification. In Proceedings of ICML, 2016.
- Arthur Mensch and Mathieu Blondel. Differentiable Dynamic Programming for Structured Prediction and Attention. In Proceedings of ICML, 2018.
- Andriy Mnih and Karol Gregor. Neural Variational Inference and Learning in Belief Networks. In Proceedings of ICML, 2014.
- Andriy Mnih and Danilo J. Rezende. Variational Inference for Monte Carlo Objectives. In Proceedings of ICML, 2016.
- Volodymyr Mnih, Nicolas Heess, Alex Graves, and Koray Kavukcuoglu. Recurrent Models of Visual Attention. In Proceedings of NIPS, 2014.
- Vlad Niculae and Mathieu Blondel. A Regularized Framework for Sparse and Structured Neural Attention. In Proceedings of NIPS, 2017.
- Vlad Niculae, André F. T. Martins, Mathieu Blondel, and Claire Cardie. SparseMAP: Differentiable Sparse Structured Inference. In Proceedings of ICML, 2018.
- Roman Novak, Michael Auli, and David Grangier. Iterative Refinement for Machine Translation. arXiv:1610.06602, 2016.
- Jeffrey Pennington, Richard Socher, and Christopher D. Manning. GloVe: Global Vectors for Word Representation. In Proceedings of EMNLP, 2014.
- Colin Raffel, Minh-Thang Luong, Peter J Liu, Ron J Weiss, and Douglas Eck. Online and Linear-Time Attention by Enforcing Monotonic Alignments. In Proceedings of ICML, 2017.
- Rajesh Ranganath, Sean Gerrish, and David M. Blei. Black Box Variational Inference. In Proceedings of AISTATS, 2014.
- Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. In Proceedings of NIPS, 2015.
- Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic Backpropagation and Approximate Inference in Deep Generative Models. In Proceedings of ICML, 2014.
- Tim Rocktäschel, Edward Grefenstette, Karl Moritz Hermann, Tomas Kocisky, and Phil Blunsom. Reasoning about Entailment with Neural Attention. In Proceedings of ICLR, 2016.
- Alexander M. Rush, Sumit Chopra, and Jason Weston. A Neural Attention Model for Abstractive Sentence Summarization. In Proceedings of EMNLP, 2015.
- Philip Schulz, Wilker Aziz, and Trevor Cohn. A Stochastic Decoder for Neural Machine Translation. In Proceedings of ACL, 2018.
- Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural Machine Translation of Rare Words with Subword Units. In Proceedings of ACL, 2016.
- Iulian Vlad Serban, Alessandro Sordoni, Laurent Charlin, Ryan Lowe, Joelle Pineau, Aaron Courville, and Yoshua Bengio. A Hierarchical Latent Variable Encoder-Decoder Model for Generating Dialogues. In Proceedings of AAAI, 2017.
- Shiv Shankar, Siddhant Garg, and Sunita Sarawagi. Surprisingly Easy Hard-Attention for Sequence to Sequence Learning. In Proceedings of EMNLP, 2018.
- Bonggun Shin, Falgun H Chokshi, Timothy Lee, and Jinho D Choi. Classification of Radiology Reports Using Neural Attention Models. In Proceedings of IJCNN, 2017.
- Akash Srivastava and Charles Sutton. Autoencoding Variational Inference for Topic Models. In Proceedings of ICLR, 2017.
- Jinsong Su, Shan Wu, Deyi Xiong, Yaojie Lu, Xianpei Han, and Biao Zhang. Variational Recurrent Neural Machine Translation. In Proceedings of AAAI, 2018.
- Sainbayar Sukhbaatar, Arthur Szlam, Jason Weston, and Rob Fergus. End-To-End Memory Networks. In Proceedings of NIPS, 2015.
- Zhaopeng Tu, Zhengdong Lu, Yang Liu, Xiaohua Liu, and Hang Li. Modeling Coverage for Neural Machine Translation. In Proceedings of ACL, 2016.
- George Tucker, Andriy Mnih, Chris J. Maddison, Dieterich Lawson, and Jascha Sohl-Dickstein. REBAR: Low-variance, Unbiased Gradient Estimates for Discrete Latent Variable Models. In Proceedings of NIPS, 2017.
- Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is All You Need. In Proceedings of NIPS, 2017.
- Stephan Vogel, Hermann Ney, and Christoph Tillmann. HMM-based Word Alignment in Statistical Translation. In Proceedings of COLING, 1996.
- Ronald J. Williams. Simple Statistical Gradient-following Algorithms for Connectionist Reinforcement Learning. Machine Learning, 8, 1992.
- Sam Wiseman and Alexander M. Rush. Sequence-to-Sequence learning as Beam Search Optimization. In Proceedings of EMNLP, 2016.
- Shijie Wu, Pamela Shapiro, and Ryan Cotterell. Hard Non-Monotonic Attention for Character-Level Transduction. In Proceedings of EMNLP, 2018.
- Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Lukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado, Macduff Hughes, and Jeffrey Dean. Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation. arXiv:1609.08144, 2016.
- Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. In Proceedings of ICML, 2015.
- Zichao Yang, Xiaodong He, Jianfeng Gao, Li Deng, and Alex Smola. Stacked Attention Networks for Image Question Answering. In Proceedings of CVPR, 2016.
- Lei Yu, Phil Blunsom, Chris Dyer, Edward Grefenstette, and Tomas Kocisky. The Neural Noisy Channel. In Proceedings of ICLR, 2017.
- Lei Yu, Jan Buys, and Phil Blunsom. Online Segment to Segment Neural Transduction. In Proceedings of EMNLP, 2016.
- Biao Zhang, Deyi Xiong, Jinsong Su, Hong Duan, and Min Zhang. Variational Neural Machine Translation. In Proceedings of EMNLP, 2016.
- Chen Zhu, Yanpeng Zhao, Shuaiyi Huang, Kewei Tu, and Yi Ma. Structured Attentions for Visual Question Answering. In Proceedings of ICCV, 2017.
