MaskGAN: Better Text Generation via Filling in the ______

ICLR 2018. arXiv:1801.07736.

Abstract:

Neural text generation models are often autoregressive language models or seq2seq models. Neural autoregressive and seq2seq models that generate text by sampling words sequentially, with each word conditioned on the previous word, are state-of-the-art for several machine translation and summarization benchmarks. These benchmarks are often…

Introduction
  • Recurrent Neural Networks (RNNs) (Graves et al., 2012) are the most common generative models for sequences, as well as for sequence labeling tasks.
  • Text is typically generated from these models by sampling from a distribution that is conditioned on the previous word and on a hidden state consisting of a representation of the words generated so far.
  • These models are typically trained with maximum likelihood in an approach known as teacher forcing, where ground-truth words are fed back into the model and conditioned on when generating the rest of the sentence.
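  • To make the two regimes concrete, the sketch below contrasts teacher forcing with free-running sampling; the step function is a hypothetical stand-in for one step of an RNN language model (a uniform next-word distribution, purely for illustration), not the authors' implementation.

    import math
    import random

    VOCAB = ["the", "movie", "was", "great", "terrible", "<eos>"]

    def step(prev_word, hidden):
        # Hypothetical RNN step: maps (previous word, hidden state) to a
        # next-word distribution and an updated state. Uniform for illustration.
        probs = {w: 1.0 / len(VOCAB) for w in VOCAB}
        return probs, hidden

    def teacher_forced_log_likelihood(ground_truth):
        # Training-time pass: always condition on the ground-truth previous word.
        hidden, prev, ll = None, "<eos>", 0.0
        for word in ground_truth:
            probs, hidden = step(prev, hidden)
            ll += math.log(probs[word])
            prev = word  # feed the TRUE word back in
        return ll

    def sample(max_len=20):
        # Inference-time pass: condition on the model's OWN previous sample,
        # so errors can compound in ways never seen during training.
        hidden, prev, out = None, "<eos>", []
        for _ in range(max_len):
            probs, hidden = step(prev, hidden)
            prev = random.choices(list(probs), weights=list(probs.values()))[0]
            if prev == "<eos>":
                break
            out.append(prev)
        return out

    print(teacher_forced_log_likelihood(["the", "movie", "was", "great"]))
    print(sample())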
Highlights
  • Recurrent Neural Networks (RNNs) (Graves et al, 2012) are the most common generative model for sequences as well as for sequence labeling tasks
  • We introduce a text generation model trained on in-filling (MaskGAN)
  • We found that policy gradient methods were effective in conjunction with a learned critic, but the highly active research on training with discrete nodes may present even more stable training procedures
  • The in-filling would sometimes fill in reasonable subsequences that became implausible in the context of the adjacent surrounding words. We suspect another promising avenue would be to consider GAN training with attention-only models as in Vaswani et al. (2017)
  • We show that MaskGAN samples on a larger dataset (IMDB reviews) are significantly better than samples from the corresponding tuned MaskMLE model, as shown by human evaluation
  • We show we can produce high-quality samples despite the MaskGAN model having much higher perplexity on the ground-truth test set
Methods
  • The authors first perform pretraining.
  • First the authors train a language model using standard maximum likelihood training.
  • The authors use the pretrained language model weights for the seq2seq encoder and decoder modules.
  • With these pretrained language model weights, the authors then pretrain the seq2seq model on the in-filling task using maximum likelihood; this pretraining includes, in particular, the attention parameters described in Luong et al. (2015).
  • Initial algorithms did not include a critic, but the authors found that including the critic decreased the variance of the gradient estimates by an order of magnitude, which substantially improved training
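  • A minimal sketch of why a learned critic helps, assuming per-token rewards derived from the discriminator and a critic that predicts the expected return at each position (the names and the numpy implementation are illustrative, not the authors' code): subtracting the critic's baseline from the return-to-go leaves the policy gradient unbiased while shrinking its variance.

    import numpy as np

    def advantages(rewards, baselines, gamma=1.0):
        # rewards:   per-token rewards derived from the discriminator
        # baselines: critic's value estimate (expected return) at each position
        # Returns A_t = R_t - b_t, where R_t is the discounted return-to-go.
        rewards = np.asarray(rewards, dtype=float)
        baselines = np.asarray(baselines, dtype=float)
        returns = np.zeros_like(rewards)
        running = 0.0
        for t in reversed(range(len(rewards))):
            running = rewards[t] + gamma * running
            returns[t] = running
        return returns - baselines

    # The generator's log-probabilities of its sampled tokens are weighted by
    # these advantages; an accurate critic makes the weights much smaller in
    # magnitude (lower variance) without changing their expectation.
    print(advantages([0.1, 0.9, 0.2], baselines=[1.1, 1.0, 0.3]))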
Results
  • Evaluation of generative models continues to be an open research question. The authors seek heuristic metrics that they believe will be correlated with human evaluation.
  • BLEU score (Papineni et al., 2002) is used extensively in machine translation, where one can compare the quality of candidate translations against reference translations.
  • Motivated by this metric, the authors compute the number of unique n-grams produced by the generator that occur in the validation corpus, for small n.
  • The authors chose not to focus on architectures or hyperparameter configurations that led to small reductions in validation perplexity, but rather searched for those that improved the heuristic evaluation metrics
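  • An illustrative implementation of this heuristic (an assumption about the computation, not the authors' exact evaluation script): count the distinct n-grams in a batch of generated samples that also occur anywhere in the validation corpus.

    def ngrams(tokens, n):
        # Set of n-grams (as tuples) in one tokenized sentence.
        return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

    def unique_valid_ngrams(samples, valid_corpus, n):
        # Distinct n-grams in the generated samples that also appear in the
        # validation corpus; a rough diversity/quality heuristic.
        valid = set()
        for sent in valid_corpus:
            valid |= ngrams(sent, n)
        generated = set()
        for sent in samples:
            generated |= ngrams(sent, n)
        return len(generated & valid)

    # Toy example with tokenized sentences and n = 2 (bigrams).
    samples = [["the", "movie", "was", "great"], ["the", "movie", "was", "great"]]
    valid = [["the", "movie", "was", "terrible"], ["it", "was", "great"]]
    print(unique_valid_ngrams(samples, valid, 2))  # -> 3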
Conclusion
  • The authors' work further supports the case for matching the training and inference procedures in order to produce higher quality language samples.
  • The MaskGAN algorithm directly achieves this through GAN-training and improved the generated samples as assessed by human evaluators.
  • The authors generally found training where contiguous blocks of words were masked produced better samples.
  • In general, the authors think the proposed contiguous in-filling task is a good approach to reduce mode collapse and help with training stability for textual GANs. They show that MaskGAN samples on a larger dataset (IMDB reviews) are significantly better than samples from the corresponding tuned MaskMLE model, as shown by human evaluation.
  • Acknowledgements: The authors would like to thank George Tucker, Jascha Sohl-Dickstein, Jon Shlens, Ryan Sepassi, Jasmine Collins, Irwan Bello, Barret Zoph, Gabe Pereyra, Eric Jang, and the Google Brain team, as well as the first-year residents who humored them by listening to and commenting on almost every conceivable variation of this core idea
Tables
  • Table1: Conditional samples from PTB for both MaskGAN and MaskMLE models
  • Table2: Language model (unconditional) sample from PTB for MaskGAN
  • Table3: Conditional samples from IMDB for both MaskGAN and MaskMLE models
  • Table4: Language model (unconditional) sample from IMDB for MaskGAN
  • Table5: The perplexity is calculated using a pre-trained language model that is equivalent to the decoder (in terms of architecture and size) used in the MaskMLE and MaskGAN models. This language model was used to initialize both models
  • Table6: Diversity statistics within 1000 unconditional samples of PTB news snippets (20 words each)
  • Table7: A Mechanical Turk blind heads-up evaluation between pairs of models trained on IMDB reviews. 100 reviews (each 40 words long) from each model are unconditionally sampled and randomized. Raters are asked which sample is preferred between each pair. 300 ratings were obtained for each model pair comparison
  • Table8: A Mechanical Turk blind heads-up evaluation between pairs of models trained on PTB. 100 news snippets (each 20 words long) from each model are unconditionally sampled and randomized. Raters are asked which sample is preferred between each pair. 300 ratings were obtained for each model pair comparison
Related work
  • Research into reliably extending GAN training to discrete spaces and discrete sequences has been a highly active area. GAN training in a continuous setting allows for fully differentiable computations, permitting gradients to be passed through the discriminator to the generator. Discrete elements break this differentiability, leading researchers either to avoid the issue by reformulating the problem, to work in the continuous domain, or to consider RL methods.

    SeqGAN (Yu et al., 2017) trains a language model by using policy gradients to train the generator to fool a CNN-based discriminator that distinguishes between real and synthetic text. Both the generator and discriminator are pretrained on real and fake data before the phase of training with policy gradients. During that phase, Monte Carlo rollouts are used to obtain a useful loss signal per word. Follow-up work demonstrated text generation with RNNs without pretraining (Press et al., 2017). Additionally, Zhang et al. (2017) produced results with an RNN generator by matching high-dimensional latent representations.
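    A rough sketch of the Monte Carlo rollout idea follows, with hypothetical complete_fn and prob_real_fn callables standing in for the generator and discriminator; it follows the description above rather than SeqGAN's released code.

    import random

    def rollout_rewards(complete_fn, prob_real_fn, prefix, seq_len, n_rollouts=16):
        # complete_fn(prefix, seq_len): sample a full sequence continuing `prefix`.
        # prob_real_fn(seq): discriminator's probability that `seq` is real.
        # For each prefix length t, average the discriminator score over
        # n_rollouts completions to obtain a per-word reward.
        rewards = []
        for t in range(1, len(prefix) + 1):
            scores = [prob_real_fn(complete_fn(prefix[:t], seq_len))
                      for _ in range(n_rollouts)]
            rewards.append(sum(scores) / n_rollouts)
        return rewards

    # Toy stand-ins: random completions over a tiny vocabulary and a
    # "discriminator" that rewards sequences containing the word "good".
    vocab = ["good", "bad", "ok"]
    complete = lambda p, L: p + [random.choice(vocab) for _ in range(L - len(p))]
    prob_real = lambda s: 0.9 if "good" in s else 0.1
    print(rollout_rewards(complete, prob_real, ["this", "is"], seq_len=6))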
Contributions
  • Proposes to improve sample quality using Generative Adversarial Networks (GANs), which explicitly train the generator to produce high-quality samples and have shown a lot of success in image generation
  • Introduces an actor-critic conditional GAN that fills in missing text conditioned on the surrounding context
  • Shows qualitatively and quantitatively, evidence that this produces more realistic conditional and unconditional text samples compared to a maximum likelihood trained model
  • Reduces the impact of these problems by training our model on a text fill-in-the-blank or in-filling task
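  • A minimal sketch of this in-filling setup under the contiguous-masking variant discussed in the conclusion (the token names and the <m> symbol are illustrative assumptions, not the authors' preprocessing code):

    import random

    MASK = "<m>"

    def mask_contiguous(tokens, mask_len):
        # Replace a random contiguous block of `mask_len` tokens with MASK.
        # Returns the masked sequence (the conditioning context seen by the
        # generator) and the ground-truth tokens that were blanked out.
        start = random.randrange(0, len(tokens) - mask_len + 1)
        masked = tokens[:start] + [MASK] * mask_len + tokens[start + mask_len:]
        target = tokens[start:start + mask_len]
        return masked, target

    tokens = "the movie was surprisingly good for a low budget film".split()
    masked, target = mask_contiguous(tokens, mask_len=3)
    print(masked)  # e.g. [..., '<m>', '<m>', '<m>', ...]
    print(target)  # the blanked-out ground-truth words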
Study subjects and analysis
40-word samples
The Mechanical Turk results show that MaskGAN generates more human-looking samples than MaskMLE on the IMDB dataset. However, on the smaller PTB dataset (with 20-word instead of 40-word samples), the results are closer. We also show results with SeqGAN (trained with the same network size and vocabulary size as MaskGAN), which show that MaskGAN produces superior samples to SeqGAN

Reference
  • Dzmitry Bahdanau, Philemon Brakel, Kelvin Xu, Anirudh Goyal, Ryan Lowe, Joelle Pineau, Aaron Courville, and Yoshua Bengio. An actor-critic algorithm for sequence prediction. In International Conference on Learning Representations, 2017.
  • Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. Scheduled sampling for sequence prediction with recurrent neural networks. In Advances in Neural Information Processing Systems, pp. 1171–1179, 2015.
  • Yoshua Bengio, Rejean Ducharme, Pascal Vincent, and Christian Jauvin. A neural probabilistic language model. Journal of machine learning research, 3(Feb):1137–1155, 2003.
  • Samuel R Bowman, Luke Vilnis, Oriol Vinyals, Andrew M Dai, Rafal Jozefowicz, and Samy Bengio. Generating sentences from a continuous space. In 20th SIGNLL Conference on Computational Natural Language Learning (CoNLL), 2016.
  • Tong Che, Yanran Li, Ruixiang Zhang, R Devon Hjelm, Wenjie Li, Yangqiu Song, and Yoshua Bengio. Maximum-likelihood augmented discrete generative adversarial networks. arXiv preprint arXiv:1702.07983, 2017.
  • Thomas Degris, Patrick M Pilarski, and Richard S Sutton. Model-free reinforcement learning with continuous action in practice. In American Control Conference (ACC), 2012, pp. 2177–2182. IEEE, 2012.
  • Yarin Gal and Zoubin Ghahramani. A theoretically grounded application of dropout in recurrent neural networks. In Advances in Neural Information Processing Systems 29, pp. 1019–1027, 2016.
  • Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680, 2014.
  • Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron Courville. Improved training of wasserstein gans. arXiv preprint arXiv:1704.00028, 2017.
  • R Devon Hjelm, Athul Paul Jacob, Tong Che, Kyunghyun Cho, and Yoshua Bengio. Boundaryseeking generative adversarial networks. arXiv preprint arXiv:1702.08431, 2017.
  • Hakan Inan, Khashayar Khosravi, and Richard Socher. Tying word vectors and word classifiers: A loss framework for language modeling. In International Conference on Learning Representations, 2017.
  • Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with gumbel-softmax. In International Conference on Learning Representations, 2017.
  • Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015.
  • Alex Lamb, Anirudh Goyal, Ying Zhang, Saizheng Zhang, Aaron Courville, and Yoshua Bengio. Professor forcing: A new algorithm for training recurrent networks. In Advances In Neural Information Processing Systems, pp. 4601–4609, 2016.
  • Jiwei Li, Will Monroe, Tianlin Shi, Alan Ritter, and Dan Jurafsky. Adversarial learning for neural dialogue generation. In Conference on Empirical Methods in Natural Language Processing, 2017.
  • Minh-Thang Luong, Hieu Pham, and Christopher D Manning. Effective approaches to attention-based neural machine translation. In Conference on Empirical Methods in Natural Language Processing, pp. 1412–1421, 2015.
  • Andrew L Maas, Raymond E Daly, Peter T Pham, Dan Huang, Andrew Y Ng, and Christopher Potts. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies-Volume 1, pp. 142–150. Association for Computational Linguistics, 2011.
  • Mitchell P Marcus, Mary Ann Marcinkiewicz, and Beatrice Santorini. Building a large annotated corpus of english: The penn treebank. Computational linguistics, 19(2):313–330, 1993.
  • Tomas Mikolov, Martin Karafiat, Lukas Burget, Jan Cernocky, and Sanjeev Khudanpur. Recurrent neural network based language model. In Interspeech, volume 2, pp. 3, 2010.
  • Takeru Miyato, Andrew M Dai, and Ian Goodfellow. Virtual adversarial training for semi-supervised text classification. In International Conference on Learning Representations, volume 1050, pp. 25, 2017.
  • Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting on association for computational linguistics, pp. 311–318. Association for Computational Linguistics, 2002.
  • Ofir Press and Lior Wolf. Using the output embedding to improve language models. In 15th Conference of the European Chapter of the Association for Computational Linguistics, pp. 157– 163, 2017.
  • Ofir Press, Amir Bar, Ben Bogin, Jonathan Berant, and Lior Wolf. Language generation with recurrent generative adversarial networks without pre-training. arXiv preprint arXiv:1706.01399, 2017.
  • Sai Rajeswar, Sandeep Subramanian, Francis Dutil, Christopher Pal, and Aaron Courville. Adversarial generation of natural language. In 2nd Workshop on Representation Learning for NLP, 2017.
  • Marc’Aurelio Ranzato, Sumit Chopra, Michael Auli, and Wojciech Zaremba. Sequence level training with recurrent neural networks. arXiv preprint arXiv:1511.06732, 2015.
  • Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. In Advances in neural information processing systems, pp. 3104–3112, 2014.
  • Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction, volume 1. MIT press Cambridge, 1998.
  • Richard S Sutton, David A McAllester, Satinder P Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in neural information processing systems, pp. 1057–1063, 2000.
  • Lucas Theis, Aaron van den Oord, and Matthias Bethge. A note on the evaluation of generative models. In International Conference on Learning Representations, 2016.
  • George Tucker, Andriy Mnih, Chris J Maddison, Dieterich Lawson, and Jascha Sohl-Dickstein. Rebar: Low-variance, unbiased gradient estimates for discrete latent variable models. In 31st Conference on Neural Information Processing Systems (NIPS 2017), 2017.
  • Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In 31st Conference on Neural Information Processing Systems (NIPS 2017), 2017.
  • Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Lukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado, Macduff Hughes, and Jeffrey Dean. Google’s neural machine translation system: Bridging the gap between human and machine translation. CoRR, abs/1609.08144, 2016. URL http://arxiv.org/abs/1609.08144.
  • Lantao Yu, Weinan Zhang, Jun Wang, and Yong Yu. SeqGAN: Sequence generative adversarial nets with policy gradient. In Association for the Advancement of Artificial Intelligence, pp. 2852–2858, 2017.
    Google ScholarLocate open access versionFindings
  • Yizhe Zhang, Zhe Gan, Kai Fan, Zhi Chen, Ricardo Henao, Dinghan Shen, and Lawrence Carin. Adversarial feature matching for text generation. arXiv preprint arXiv:1706.03850, 2017.
  • Barret Zoph and Quoc V Le. Neural architecture search with reinforcement learning. In International Conference on Learning Representations, 2017.
  • Our model was trained with the Adam method for stochastic optimization (Kingma & Ba, 2015) with the default Tensorflow exponential decay rates of β1 = 0.99 and β2 = 0.999. Our model uses 2 layers of 650-unit LSTMs for both the generator and discriminator, 650-dimensional word embeddings, and variational dropout. We used Bayesian hyperparameter tuning to tune the variational dropout rate and the learning rates for the generator, discriminator and critic. We perform 3 gradient descent steps on the discriminator for every step on the generator and critic.
  • We share the embedding and softmax weights of the generator as proposed in Bengio et al. (2003); Press & Wolf (2017); Inan et al. (2017). Furthermore, to improve convergence speed, we share the embeddings of the generator and the discriminator. Additionally, as noted in our architectural section, our critic shares all of the discriminator parameters with the exception of the separate output head to estimate the value. Both our generator and discriminator use variational recurrent dropout (Gal & Ghahramani, 2016).
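  • A schematic of the update schedule described above (three discriminator steps per generator/critic step); the update functions are empty stand-ins for the actual TensorFlow ops (batch sampling, loss computation, one Adam step each), so this only illustrates the alternation, not the real training code.

    def discriminator_update():
        pass  # stand-in: sample a batch, compute the discriminator loss, apply one Adam step

    def generator_update():
        pass  # stand-in: policy-gradient step with critic-baselined advantages

    def critic_update():
        pass  # stand-in: regress value estimates toward observed returns

    def train(num_iterations, disc_steps_per_gen_step=3):
        # Alternate several discriminator updates with one generator + critic update.
        for _ in range(num_iterations):
            for _ in range(disc_steps_per_gen_step):
                discriminator_update()
            generator_update()
            critic_update()

    train(num_iterations=10)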
  • Appendix samples: conditional completions of the prefix "Pitch Black was a complete shock to me when I first saw it back in":
    • ... 1979 I was really looking forward
    • ... 1976 The promos were very well
    • ... the days when I was a
    • ... 1969 I live in New Zealand
    • ... 1951 It was funny All Interiors
    • ... the day and I was in