Language GANs Falling Short
Traditional natural language generation (NLG) models are trained using maximum likelihood estimation (MLE), which differs from the sample-generation procedure used at inference. During training, the ground-truth tokens are passed to the model; during inference, the model instead conditions on its own previously generated samples, a phenomenon coined exposure bias.
- GANs were originally applied to continuous data such as images.
- This is because the training procedure relies on backpropagating through the discriminator into the generator.
- [Yu et al., 2017] estimate the gradient to the generator via REINFORCE policy gradients [Williams, 1992].
- In their formulation, the discriminator evaluates full sequences.
- To provide earlier error attribution for incomplete sequences and to reduce the variance of the gradients, they perform k Monte-Carlo rollouts until the sentence is completed.
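The REINFORCE-with-rollouts scheme described above can be sketched in a toy form. This is a minimal illustration, not the paper's LSTM generator: the policy is a single unigram distribution `theta` over a tiny vocabulary, and `discriminator` is a hypothetical stand-in scoring function.

```python
import numpy as np

def softmax(z):
    z = z - z.max()          # numerical stability
    e = np.exp(z)
    return e / e.sum()

def rollout_reward(prefix, theta, discriminator, vocab, horizon, rng, k=4):
    """Estimate the reward of a partial sequence by k Monte-Carlo rollouts:
    complete the prefix to full length with the current policy and average
    the discriminator's score over the completions."""
    total = 0.0
    for _ in range(k):
        seq = list(prefix)
        while len(seq) < horizon:
            seq.append(rng.choice(vocab, p=softmax(theta)))
        total += discriminator(seq)
    return total / k

def reinforce_step(theta, discriminator, vocab, horizon, rng, k=4, lr=0.1):
    """One SeqGAN-style generator update for the toy unigram policy:
    each generated token is credited with the MC-rollout reward of the
    prefix ending at that token, and the REINFORCE gradient of the
    log-probability, (onehot - p), is scaled by that reward."""
    grad = np.zeros_like(theta)
    seq = []
    for _ in range(horizon):
        p = softmax(theta)
        a = rng.choice(vocab, p=p)
        seq.append(a)
        r = rollout_reward(seq, theta, discriminator, vocab, horizon, rng, k)
        grad += (np.eye(len(vocab))[a] - p) * r   # d/d theta of log p(a), times reward
    return theta + lr * grad, seq
```

A real implementation would backpropagate these per-token credits through a recurrent generator; the rollout-and-average structure is the same.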
- Our analysis continues along this path by examining the performance of these models under a sweep of temperatures. We believe this difference to be of utmost importance, as it is the necessary ingredient towards definitively showing MLE models outperform the currently proposed GAN variants on quality-diversity global metrics. Empowered with this evaluation approach, we examine several recent GAN text generation models and compare against an MLE baseline
- This research demonstrates that well-adjusted language models are a remarkably strong baseline and that temperature sweeping can provide a very clear characterization of model performance
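The temperature-sweep characterization can be sketched as follows. Quality and diversity here are simple proxies (probability mass on the greedy token, and entropy), not the BLEU/Self-BLEU metrics the paper actually uses; the point is that sweeping α traces a quality-diversity curve rather than judging a model at a single operating point.

```python
import numpy as np

def temperature_sweep(logits, alphas):
    """For each temperature alpha, characterize the softmax(logits/alpha)
    next-token distribution by a quality proxy (mass on the most likely
    token) and a diversity proxy (entropy in nats)."""
    points = []
    for a in alphas:
        z = np.asarray(logits, dtype=float) / a
        z -= z.max()                       # numerical stability
        p = np.exp(z)
        p /= p.sum()
        quality = p.max()                          # mass on the greedy token
        diversity = -(p * np.log(p + 1e-12)).sum() # entropy
        points.append((a, quality, diversity))
    return points
```

Lowering α trades diversity for quality and vice versa, which is exactly the frontier on which the paper compares MLE and GAN models.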
- GAN-based generative models have proven effective on real-valued data, but moving to discrete data raises many pernicious issues.
- On the datasets and tasks considered, the potential issues caused by exposure bias were less severe than the issues of training GANs on discrete data.
- The experiments consist of two parts: synthetic data generation and long-text generation.
- The authors use the EMNLP2017 News dataset for the long-text generation task [Guo et al., 2017].
- This corpus has become a common benchmark for neural text generation.
- These issues must be overcome before GAN-trained models improve over the strong MLE baselines.
- GAN training may eventually prove fruitful, but this research sets a clear bar that it must first surpass.
- Table1: Effect of the temperature on samples from an LM trained on the EMNLP17 News dataset. At its implicit temperature, i.e. at α = 1.0, the samples are syntactically correct but often lack global coherence. Sample quality varies predictably with temperature: at α > 1.0 the syntax breaks down, and at α = 0.0 the model always outputs the same sequence. There is a sweet spot near α = 0.7 where samples are both high-quality and diverse.
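The temperature effects the caption describes correspond to the following autoregressive decoding sketch. `logits_fn` is a hypothetical stand-in for a trained LM mapping a prefix to next-token logits; α = 0.0 is treated as greedy decoding, matching the note that the model then always emits the same sequence.

```python
import numpy as np

def sample_with_temperature(logits_fn, alpha, max_len, rng, bos=0):
    """Autoregressive sampling at temperature alpha: divide the logits by
    alpha before the softmax. alpha -> 0 approaches argmax decoding;
    alpha > 1 flattens the distribution until syntax breaks down."""
    seq = [bos]
    for _ in range(max_len):
        logits = np.asarray(logits_fn(seq), dtype=float)
        if alpha == 0.0:
            nxt = int(logits.argmax())       # deterministic greedy decoding
        else:
            z = logits / alpha
            z -= z.max()                     # numerical stability
            p = np.exp(z)
            p /= p.sum()
            nxt = int(rng.choice(len(p), p=p))
        seq.append(nxt)
    return seq[1:]                           # drop the BOS token
```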
- Table2: NLL_oracle measured on the synthetic task. All results are taken from their respective papers. An MLE-trained model with reduced temperature easily improves upon these GAN variants, producing the highest-quality samples.
- Table3: BLEU (left) and Self-BLEU (right) on test data of EMNLP2017 News. (Higher BLEU and lower Self-BLEU are better.)
- Table4: BLEU (left) and Self-BLEU (right) on test data of Image COCO. (Higher BLEU and lower Self-BLEU are better.)
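The Self-BLEU diversity metric used in these tables scores each generated sentence against all the others: high Self-BLEU means the samples resemble each other (low diversity, e.g. mode collapse). A minimal sketch with a simplified BLEU (geometric mean of clipped n-gram precisions with a brevity penalty); work in this area typically uses NLTK's implementation instead.

```python
from collections import Counter
import math

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(hyp, refs, max_n=4):
    """Simplified sentence BLEU: geometric mean of modified n-gram
    precisions against a set of references, with brevity penalty."""
    log_p = 0.0
    for n in range(1, max_n + 1):
        hyp_counts = ngrams(hyp, n)
        if not hyp_counts:
            return 0.0
        max_ref = Counter()                  # clip counts by the best ref
        for ref in refs:
            for g, c in ngrams(ref, n).items():
                max_ref[g] = max(max_ref[g], c)
        clipped = sum(min(c, max_ref[g]) for g, c in hyp_counts.items())
        if clipped == 0:
            return 0.0
        log_p += math.log(clipped / sum(hyp_counts.values()))
    closest = min((abs(len(r) - len(hyp)), len(r)) for r in refs)[1]
    bp = 1.0 if len(hyp) > closest else math.exp(1 - closest / max(len(hyp), 1))
    return bp * math.exp(log_p / max_n)

def self_bleu(corpus, max_n=4):
    """Average BLEU of each sentence against the rest of the corpus."""
    scores = [bleu(s, corpus[:i] + corpus[i + 1:], max_n)
              for i, s in enumerate(corpus)]
    return sum(scores) / len(scores)
```

A corpus of identical sentences has Self-BLEU 1.0; a corpus with no shared n-grams has Self-BLEU 0.0.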
- Table5: Three randomly sampled sentences from our model with BLEU scores closest to the training set's. The sentences lack semantic and global coherence, and they are not grammatically perfect either.
- Table6: Samples from SeqGAN, taken from [Guo et al., 2017]
- Table7: Samples from LeakGAN, taken from [Guo et al., 2017]
- Table8: Samples from our MLE model with temperature tuned to match the BLEU scores reported in [Guo et al., 2017]
- Concurrent with our work, [Semeniuta et al., 2018] demonstrated the issues of local n-gram metrics. Their extensive empirical evaluation of GAN models and language models (LMs) found no evidence that GAN-trained models outperform on the new and improved global metrics from [Cífka et al., 2018]. Our analysis continues along this path by examining the performance of these models under a sweep of temperatures. We believe this difference to be of utmost importance, as it is the necessary ingredient towards definitively showing MLE models outperform the currently proposed GAN variants on quality-diversity global metrics.
- [Ackley et al., 1988] Ackley, D. H., Hinton, G. E., and Sejnowski, T. J. (1988). A learning algorithm for Boltzmann machines. In Connectionist Models and Their Implications: Readings from Cognitive Science, pages 285–307. Ablex Publishing Corp., Norwood, NJ, USA.
- [Che et al., 2017] Che, T., Li, Y., Zhang, R., Hjelm, R. D., Li, W., Song, Y., and Bengio, Y. (2017). Maximum-likelihood augmented discrete generative adversarial networks. arXiv preprint arXiv:1702.07983.
- [Chen et al., 2018] Chen, L., Dai, S., Tao, C., Shen, D., Gan, Z., Zhang, H., Zhang, Y., and Carin, L. (2018). Adversarial text generation via feature-mover’s distance. arXiv preprint arXiv:1809.06297.
- [Cífka et al., 2018] Cífka, O., Severyn, A., Alfonseca, E., and Filippova, K. (2018). Eval all, trust a few, do wrong to none: Comparing sentence generation models. arXiv preprint arXiv:1804.07972.
- [Fedus et al., 2018] Fedus, W., Goodfellow, I., and Dai, A. M. (2018). Maskgan: Better text generation via filling in the _. arXiv preprint arXiv:1801.07736.
- [Guo et al., 2017] Guo, J., Lu, S., Cai, H., Zhang, W., Yu, Y., and Wang, J. (2017). Long text generation via adversarial training with leaked information. arXiv preprint arXiv:1709.08624.
- [Lin et al., 2017] Lin, K., Li, D., He, X., Zhang, Z., and Sun, M.-T. (2017). Adversarial ranking for language generation. In Advances in Neural Information Processing Systems, pages 3155–3165.
- [Lu et al., 2018a] Lu, S., Yu, L., Zhang, W., and Yu, Y. (2018a). CoT: Cooperative training for generative modeling. arXiv preprint arXiv:1804.03782.
- [Lu et al., 2018b] Lu, S., Zhu, Y., Zhang, W., Wang, J., and Yu, Y. (2018b). Neural text generation: Past, present and beyond. arXiv preprint arXiv:1803.07133.
- [Semeniuta et al., 2018] Semeniuta, S., Severyn, A., and Gelly, S. (2018). On accurate evaluation of gans for language generation. arXiv preprint arXiv:1806.04936.
- [Shi et al., 2018] Shi, Z., Chen, X., Qiu, X., and Huang, X. (2018). Towards diverse text generation with inverse reinforcement learning. arXiv preprint arXiv:1804.11258.
- [Vezhnevets et al., 2017] Vezhnevets, A. S., Osindero, S., Schaul, T., Heess, N., Jaderberg, M., Silver, D., and Kavukcuoglu, K. (2017). Feudal networks for hierarchical reinforcement learning. arXiv preprint arXiv:1703.01161.
- [Williams, 1992] Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. In Reinforcement Learning, pages 5–32. Springer.
- [Williams and Peng, 1991] Williams, R. J. and Peng, J. (1991). Function optimization using connectionist reinforcement learning algorithms. Connection Science, 3(3):241–268.
- [Xu et al., 2018] Xu, J., Sun, X., Ren, X., Lin, J., Wei, B., and Li, W. (2018). Dp-gan: Diversitypromoting generative adversarial network for generating informative and diversified text. arXiv preprint arXiv:1802.01345.
- [Yu et al., 2017] Yu, L., Zhang, W., Wang, J., and Yu, Y. (2017). SeqGAN: Sequence generative adversarial nets with policy gradient. In AAAI Conference on Artificial Intelligence.
- [Zhang et al., 2017] Zhang, Y., Gan, Z., Fan, K., Chen, Z., Henao, R., Shen, D., and Carin, L. (2017). Adversarial feature matching for text generation. arXiv preprint arXiv:1706.03850.
- [Zhu et al., 2018] Zhu, Y., Lu, S., Zheng, L., Guo, J., Zhang, W., Wang, J., and Yu, Y. (2018). Texygen: A benchmarking platform for text generation models. In SIGIR.
- The authors of SeqGAN and LeakGAN open-sourced a benchmarking platform for research on open-domain text generation models, called Texygen [Zhu et al., 2018]. The same authors later published [Lu et al., 2018b], a review of the current state of neural text generation that uses the Texygen platform to benchmark multiple models, including their own. It is thus fair to assume those numbers are their official results; they are the ones reported in the experimental section. Furthermore, we train our models on the exact datasets found in Texygen.
- We also report results from [Lu et al., 2018b] for MaliGAN, RankGAN and TextGAN. Because these models are not from the same authors as Texygen and [Lu et al., 2018b], we must allow that their results could be slightly better than reported: researchers are generally biased towards working harder on their own models at the expense of baselines.
- The [Lu et al., 2018a] paper does not report results on Image COCO. It does report results on the EMNLP2017 News dataset; however, at the time of this writing, it does not report Self-BLEU for its top-performing model (the one with α = 1.5). Moreover, the official implementation is not yet complete, as it only works on synthetic data. For these reasons, we do not report results for CoT in the real-data experiment.
- The [Shi et al., 2018] paper reports results on Image COCO, but it uses the 80,000-sentence training set. At the time of this writing, there is no official implementation. For these reasons, we do not report results for IRL in the real-data experiment.
- MaskGAN [Fedus et al., 2018] is a conditional generative model and therefore falls outside the scope of this work.
- SeqGAN [Yu et al., 2017]
- MaliGAN [Che et al., 2017] 0.76 0.44 0.17 0.08 0.91 0.72 0.47 0.25
- RankGAN [Lin et al., 2017] 0.69 0.39 0.18 0.09 0.90 0.68 0.45 0.30
- TextGAN [Zhang et al., 2017] 0.21 0.17 0.15 0.13 1.00 0.98 0.97 0.96
- LeakGAN [Guo et al., 2017] 0.84 0.65 0.44 0.27 0.94 0.82 0.67 0.51