Experiments on neural machine translation, text summarization, and text generation have demonstrated the effectiveness of our student-forcing optimal transport algorithm, yielding improved performance over strong baselines on these tasks
Improving Text Generation with Student Forcing Optimal Transport
EMNLP 2020, pp.9144-9156, (2020)
Neural language models are often trained with maximum likelihood estimation (MLE), where the next word is generated conditioned on the ground-truth word tokens. During testing, however, the model is instead conditioned on previously generated tokens, resulting in what is termed exposure bias. To reduce this gap between training and testing…
- Natural language generation is an essential component of many NLP applications, such as machine translation (Bahdanau et al, 2015), image captioning (You et al, 2016), text summarization (See et al, 2017), dialogue systems (Vinyals and Le, 2015), and machine comprehension (Nguyen et al, 2016).
- Generating human-like natural language is typically cast as predicting a sequence of consecutive words in a recurrent manner.
- In Recurrent Neural Network (RNN) models, this is known as Teacher-Forcing (TF) (Williams and Zipser, 1989), due to the use of ground-truth tokens for next-token prediction.
- During inference, the model must instead use its own outputs from the previous step in place of the unseen ground-truth, which is often referred to as Student-Forcing (SF).
- This discrepancy between training and inference accumulates errors along the sequence-generation trajectory (Ranzato et al., 2016a).
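The TF/SF contrast above can be made concrete with a minimal decoding sketch. The `step` function below is a hypothetical stand-in for one decoder step (any seq2seq decoder fits); the only difference between the two loops is which token is fed back:

```python
# Minimal sketch of Teacher-Forcing vs. Student-Forcing decoding.
# `step` is a hypothetical stand-in for one decoder step: it maps
# (previous token, state) to (predicted token, new state).

def decode_teacher_forcing(step, ground_truth, state):
    """Condition each step on the ground-truth token (training-time TF)."""
    preds, prev = [], "<bos>"
    for gold in ground_truth:
        pred, state = step(prev, state)
        preds.append(pred)
        prev = gold  # feed the ground-truth token back in
    return preds

def decode_student_forcing(step, length, state):
    """Condition each step on the model's own previous output (SF)."""
    preds, prev = [], "<bos>"
    for _ in range(length):
        pred, state = step(prev, state)
        preds.append(pred)
        prev = pred  # feed the model's own prediction back in
    return preds
```

Under TF a mistake at step t does not affect the input at step t+1 (the gold token is fed regardless), while under SF the mistake is fed forward, which is exactly how errors accumulate along the trajectory.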
- Our work provides the following contributions: i) We introduce a novel method for text generation called Student-Forcing optimal transport (OT) (SFOT), leveraging an OT loss to improve long-term sequence sampling. ii) A new context-preserving OT approach is proposed to effectively match a text sequence with order information. iii) We examine the necessity of integrating OT with Student-Forcing via Imitation Learning. iv) The robustness of the proposed models is demonstrated by extensive empirical evaluations on Neural Machine Translation (NMT), Text Summarization, and Natural Language Generation (NLG)
- Beyond the difference between SF and TF decoding in the two methods, we propose a "Contextualized OT with Order-Preserving Regularizer" technique, which improves both student-forcing optimal transport (SFOT) and TFOT, as shown in Table 4
- In Section 2.3, we provide theoretical justification for why SFOT can reduce exposure bias while TFOT still suffers from it: TFOT is based on partial expert trajectories and induces a biased occupancy measure, whereas our proposed SFOT conditions on previously self-generated words and can recover an optimal policy
- We have introduced SFOT to mitigate exposure bias in text generation
- Experiments on neural machine translation, text summarization, and text generation have demonstrated the effectiveness of our SFOT algorithm, yielding improved performance over strong baselines on these tasks
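To make the OT loss concrete, here is a generic entropy-regularized (Sinkhorn) OT sketch between two sequences of token embeddings. This is not the paper's exact contextualized OT with order-preserving regularizer; it only illustrates the underlying sequence-matching loss, with cosine distance as an assumed cost:

```python
import numpy as np

def sinkhorn_ot(X, Y, eps=0.1, n_iter=100):
    """Entropy-regularized OT distance between token-embedding sequences
    X (n x d) and Y (m x d), computed with Sinkhorn iterations.
    Cosine distance is used as the ground cost (an assumption here)."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    Yn = Y / np.linalg.norm(Y, axis=1, keepdims=True)
    C = 1.0 - Xn @ Yn.T                      # cosine-distance cost matrix
    n, m = C.shape
    a, b = np.full(n, 1.0 / n), np.full(m, 1.0 / m)  # uniform marginals
    K = np.exp(-C / eps)                     # Gibbs kernel
    u = np.ones(n)
    for _ in range(n_iter):                  # alternating marginal scaling
        v = b / (K.T @ u)
        u = a / (K @ v)
    P = u[:, None] * K * v[None, :]          # approximate transport plan
    return float((P * C).sum())
```

In SFOT, such a sequence-level loss between self-generated and reference tokens would be combined with the MLE objective, weighted by the OT parameter λ = 0.1 reported in the paper's setup.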
- To reasonably select the best model along the temperature sweep, the authors, motivated by Gu et al. (2019), propose the BLEU-F1 score for model evaluation.
- Figure 5 shows the BLEU-F1 score versus reverse temperature on MLE and SFOT.
- The authors observed that the best temperature is 1/1.5 for the MLE model and 1/1.4 for SFOT.
- Figure 5 indicates that the SFOT model consistently improves the MLE model on the BLEU-F1 score.
- Under similar Self-BLEU score, SFOT significantly improves the quality of LeakGAN (Guo et al, 2018), the best GAN by BLEU metric
- The proposed model captures positional and contextual information of word tokens in OT matching.
- Table1: VI-EN and EN-VI translation BLEU scores
- Table2: DE-EN and EN-DE translation BLEU scores
- Table3: Comparison of German-to-English translation examples. For each example, we show the human translation (reference) and the translations from MLE, TFOT, and SFOT. We highlight the key phrase differences between reference and translation outputs in blue and red, and annotate translation errors in bold. In the first example, SFOT correctly maintains all the information in "since winning in May election" by translating it to "since his election victory in May", whereas MLE only generates "in May" and TFOT also misses "winning" in the reference. In the second example, SFOT successfully keeps the information "Beijing", whereas MLE generates the wrong words "expiration of" and TFOT changes "Beijing" to "government"
- Table4: BLEU scores for VI-EN and EN-VI ablation study
- Table5: Results of text summarization on the English Gigaword dataset
- Table6: Human evaluation of NLG on EMNLP news 2017 dataset. 100 generated sentences from each model are rated 1-5, with means and standard deviations reported. Real sentences were rated 4.21 ± 0.44
- Table7: Examples generated by SFOT in NLG experiments
- Text Generation Natural Language Generation (NLG) is a challenging NLP task. Neural language models parameterized by autoregressive architectures are widely used for NLG. To improve global control over generated sentences, variational auto-encoders have been considered for language generation (Bowman et al., 2016; Fu et al., 2019; Fang et al., 2019; Li et al., 2020a). Recently, GPT-2 (Radford et al., 2019) and GPT-3 (Brown et al., 2020) improved generation fluency via pre-training on massive text corpora. All of these are trained with MLE using Teacher-Forcing, which is known in principle to suffer from exposure bias (Bengio et al., 2015). Several methods have been proposed to address the problem (Shao et al., 2018; Zhang et al., 2019). Adversarial training techniques were also proposed (Yu et al., 2017; Zhu et al., 2018; Che et al., 2017; Lin et al., 2017; Guo et al., 2018; Chen et al., 2018; Li et al., 2020b; Yang et al., 2019; Zhang et al., 2018; Liang et al., 2018). However, adversarial NLG models can suffer from vanishing gradients and unstable training. Indeed, Caccia et al. (2018) argue that a temperature-sweeping approach on MLE can outperform GAN-based models. Our model further improves on this line of work by adopting a principled sequence-matching loss via optimal transport and achieves state-of-the-art results on NLG tasks.
Study subjects and analysis
native speakers: 10
To reasonably select the best model along the temperature sweep, we are motivated by Gu et al. (2019) and propose the BLEU-F1 score for model evaluation. Ten native speakers are asked to rate each sentence on a scale of 1 to 5 in terms of readability and meaningfulness. The BLEU-F1 score captures the trade-off between quality and diversity simultaneously, defined as:
BLEU-F1 = (2 × BLEU × (1 − Self-BLEU)) / (BLEU + (1 − Self-BLEU))    (12)
Figure 5 shows the BLEU-F1 score versus reverse temperature on MLE and SFOT. We observed that the best temperature for MLE model is 1/1.5 and for SFOT is 1/1.4
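Eq. (12) is the harmonic mean of quality (BLEU) and diversity (1 − Self-BLEU). A small illustrative helper (the function name and guard against a zero denominator are ours):

```python
def bleu_f1(bleu, self_bleu):
    """BLEU-F1 per Eq. (12): harmonic mean of quality (BLEU) and
    diversity (1 - Self-BLEU). Both inputs are assumed to lie in [0, 1]."""
    diversity = 1.0 - self_bleu
    if bleu + diversity == 0:
        return 0.0  # degenerate case: no quality and no diversity
    return 2.0 * bleu * diversity / (bleu + diversity)
```

In a temperature sweep, one would compute `bleu_f1` at each temperature and keep the model with the highest score, as done for Figure 5.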
- Fritz Albregtsen et al. 2008. Statistical texture measures computed from gray level coocurrence matrices. Image processing laboratory, department of informatics, university of oslo, 5.
- David Alvarez-Melis and Tommi S Jaakkola. 2018. Gromov-wasserstein alignment of word embedding spaces. In EMNLP.
- Martin Arjovsky, Soumith Chintala, and Leon Bottou. 2017. Wasserstein generative adversarial networks. In ICML.
- Dzmitry Bahdanau, Philemon Brakel, Kelvin Xu, Anirudh Goyal, Ryan Lowe, Joelle Pineau, Aaron Courville, and Yoshua Bengio. 2017. An actor-critic algorithm for sequence prediction. In ICLR.
- Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In ICLR.
- Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. 2015. Scheduled sampling for sequence prediction with recurrent neural networks. In NIPS.
- Samuel R Bowman, Luke Vilnis, Oriol Vinyals, Andrew M Dai, Rafal Jozefowicz, and Samy Bengio. 2016. Generating sentences from a continuous space. CONLL.
- Tom B Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. arXiv preprint arXiv:2005.14165.
- Massimo Caccia, Lucas Caccia, William Fedus, Hugo Larochelle, Joelle Pineau, and Laurent Charlin. 2018. Language gans falling short. arXiv preprint arXiv:1811.02549.
- Mauro Cettolo, Jan Niehues, Sebastian Stuker, Luisa Bentivogli, Roldano Cattoni, and Marcello Federico. 2015. The iwslt 2015 evaluation campaign. In IWSLT.
- Tong Che, Yanran Li, Ruixiang Zhang, R Devon Hjelm, Wenjie Li, Yangqiu Song, and Yoshua Bengio. 2017. Maximum-likelihood augmented discrete generative adversarial networks. In CoRR.
- Liqun Chen, Shuyang Dai, Chenyang Tao, Haichao Zhang, Zhe Gan, Dinghan Shen, Yizhe Zhang, Guoyin Wang, Ruiyi Zhang, and Lawrence Carin. 2018. Adversarial text generation via featuremover’s distance. In NIPS.
- Liqun Chen, Yizhe Zhang, Ruiyi Zhang, Chenyang Tao, Zhe Gan, Haichao Zhang, Bai Li, Dinghan Shen, Changyou Chen, and Lawrence Carin. 2019. Improving sequence-to-sequence learning via optimal transport. In ICLR.
- Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
- Nan Ding and Radu Soricut. 2017. Cold-start reinforcement learning with softmax policy gradient. In Advances in Neural Information Processing Systems, pages 2817–2826.
- Le Fang, Chunyuan Li, Jianfeng Gao, Wen Dong, and Changyou Chen. 2019. Implicit deep latent variable models for text generation. EMNLP.
- Hao Fu, Chunyuan Li, Xiaodong Liu, Jianfeng Gao, Asli Celikyilmaz, Lawrence Carin, et al. 2019. Cyclical annealing schedule: A simple approach to mitigating KL vanishing. NAACL.
- Aude Genevay, Gabriel Peyre, and Marco Cuturi. 2018. Learning generative models with sinkhorn divergences. AISTATS.
- David Graff, Junbo Kong, Ke Chen, and Kazuaki Maeda. 2003. English gigaword. Linguistic Data Consortium, Philadelphia, 4(1):34.
- Alex Graves and Navdeep Jaitly. 2014. Towards endto-end speech recognition with recurrent neural networks. In ICML.
- Xiaodong Gu, Kyunghyun Cho, Jungwoo Ha, and Sunghun Kim. 2019. Dialogwae: Multimodal response generation with conditional wasserstein auto-encoder. ICLR.
- Jiaxian Guo, Sidi Lu, Han Cai, Weinan Zhang, Yong Yu, and Jun Wang. 2018. Long text generation via adversarial training with leaked information. In AAAI.
- Kelvin Guu, Tatsunori B Hashimoto, Yonatan Oren, and Percy Liang. 2018. Generating sentences by editing prototypes. TACL.
- Sepp Hochreiter and Jurgen Schmidhuber. 1997. Long short-term memory. Neural computation.
- Ferenc Huszar. 2015. How (not) to train your generative model: Scheduled sampling, likelihood, adversary? In CoRR.
- Guillaume Klein, Yoon Kim, Yuntian Deng, Jean Senellart, and Alexander M. Rush. 2017. OpenNMT: Open-source toolkit for neural machine translation. In ACL.
- Matt Kusner, Yu Sun, Nicholas Kolkin, and Kilian Weinberger. 2015. From word embeddings to document distances. In ICML.
- Alex M Lamb, Anirudh Goyal Alias Parth Goyal, Ying Zhang, Saizheng Zhang, Aaron C Courville, and Yoshua Bengio. 2016. Professor forcing: A new algorithm for training recurrent networks. In NIPS.
- Chunyuan Li, Xiang Gao, Yuan Li, Xiujun Li, Baolin Peng, Yizhe Zhang, and Jianfeng Gao. 2020a. Optimus: Organizing sentences via pre-trained modeling of a latent space. arXiv preprint arXiv:2004.04092.
- Dianqi Li, Yizhe Zhang, Hao Peng, Liqun Chen, Chris Brockett, Ming-Ting Sun, and Bill Dolan. 2020b. Contextualized perturbation for textual adversarial attack. arXiv preprint arXiv:2009.07502.
- Jiwei Li, Will Monroe, Tianlin Shi, Sebastien Jean, Alan Ritter, and Dan Jurafsky. 2017. Adversarial learning for neural dialogue generation. In EMNLP.
- Kevin J Liang, Chunyuan Li, Guoyin Wang, and Lawrence Carin. 2018. Generative adversarial network training is a continual learning problem. arXiv preprint arXiv:1811.11083.
- Chin-Yew Lin. 2004. Rouge: A package for automatic evaluation of summaries. Text Summarization Branches Out.
- Kevin Lin, Dianqi Li, Xiaodong He, Zhengyou Zhang, and Ming-Ting Sun. 2017. Adversarial ranking for language generation. In NIPS.
- Hao Liu, Yihao Feng, Yi Mao, Dengyong Zhou, Jian Peng, and Qiang Liu. 2018. Action-dependent control variates for policy optimization via stein's identity. In ICLR.
- Giulia Luise, Alessandro Rudi, Massimiliano Pontil, and Carlo Ciliberto. 2018. Differential properties of sinkhorn approximation for learning with wasserstein distance. arXiv:1805.11897.
- Minh-Thang Luong, Eugene Brevdo, and Rui Zhao. 2017. Neural machine translation (seq2seq) tutorial. https://github.com/tensorflow/nmt.
- Minh-Thang Luong and Christopher D Manning. 2015. Stanford neural machine translation systems for spoken language domains. In IWSLT.
- Minh-Thang Luong, Hieu Pham, and Christopher D Manning. 2015a. Effective approaches to attentionbased neural machine translation. In EMNLP.
- Minh-Thang Luong, Ilya Sutskever, Quoc V Le, Oriol Vinyals, and Wojciech Zaremba. 2015b. Addressing the rare word problem in neural machine translation. In ACL.
- Tomas Mikolov, Martin Karafiat, Lukas Burget, Jan Cernocky, and Sanjeev Khudanpur. 2010. Recurrent neural network based language model. In ISCA.
- Tri Nguyen, Mir Rosenberg, Xia Song, Jianfeng Gao, Saurabh Tiwary, Rangan Majumder, and Li Deng. 2016. Ms marco: A human generated machine reading comprehension dataset. In NIPS.
- Mohammad Norouzi, Samy Bengio, Navdeep Jaitly, Mike Schuster, Yonghui Wu, Dale Schuurmans, et al. 2016. Reward augmented maximum likelihood for neural structured prediction. In Advances In Neural Information Processing Systems, pages 1723–1731.
- Kishore Papineni, Salim Roukos, Todd Ward, and WeiJing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In ACL.
- Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In NAACL.
- Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. URL https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf.
- Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI Blog.
- Prajit Ramachandran, Peter J Liu, and Quoc V Le. 2017. Unsupervised pretraining for sequence to sequence learning. In EMNLP.
- Marc’Aurelio Ranzato, Sumit Chopra, Michael Auli, and Wojciech Zaremba. 2016a. Sequence level training with recurrent neural networks. CoRR.
- Marc’Aurelio Ranzato, Sumit Chopra, Michael Auli, and Wojciech Zaremba. 2016b. Sequence level training with recurrent neural networks. In ICLR.
- Steven J Rennie, Etienne Marcheret, Youssef Mroueh, Jerret Ross, and Vaibhava Goel. 2017. Self-critical sequence training for image captioning. In CVPR.
- Alexander M Rush, Sumit Chopra, and Jason Weston. 2015. A neural attention model for abstractive sentence summarization. arXiv preprint arXiv:1509.00685.
- Hiroaki Sakoe, Seibi Chiba, A Waibel, and KF Lee. 1990. Dynamic programming algorithm optimization for spoken word recognition. Readings in speech recognition.
- Ruslan Salakhutdinov. 2015. Learning deep generative models. Annual Review of Statistics and Its Application, 2:361–385.
- Abigail See, Peter J Liu, and Christopher D Manning. 2017. Get to the point: Summarization with pointergenerator networks. In ACL.
- Rico Sennrich, Barry Haddow, and Alexandra Birch. 2015. Neural machine translation of rare words with subword units. In ACL.
- Chenze Shao, Yang Feng, and Xilin Chen. 2018. Greedy search with probabilistic n-gram matching for neural machine translation. arXiv preprint arXiv:1809.03132.
- Bing Su, Xiaoqing Ding, Changsong Liu, and Ying Wu. 2015. Heteroscedastic max-min distance analysis. In CVPR.
- Bing Su and Gang Hua. 2018. Order-preserving optimal transport for distances between sequences. IEEE transactions on pattern analysis and machine intelligence.
- Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning with neural networks. In NIPS.
- Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In NIPS.
- Oriol Vinyals and Quoc Le. 2015. A neural conversational model. In ICML workshop.
- Xin Wang, Wenhu Chen, Yuan-Fang Wang, and William Yang Wang. 2018. No metrics are perfect: Adversarial reward learning for visual storytelling. In ACL.
- Ronald J Williams and David Zipser. 1989. A learning algorithm for continually running fully recurrent neural networks. Neural computation, 1(2):270– 280.
- Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. 2016. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.
- Yujia Xie, Xiangfeng Wang, Ruijia Wang, and Hongyuan Zha. 2018. A fast proximal point method for Wasserstein distance. In arXiv:1802.04307.
- Qian Yang, Dinghan Shen, Yong Cheng, Wenlin Wang, Guoyin Wang, Lawrence Carin, et al. 2019. An endto-end generative architecture for paraphrase generation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3123–3133.
- Quanzeng You, Hailin Jin, Zhaowen Wang, Chen Fang, and Jiebo Luo. 2016. Image captioning with semantic attention. In CVPR.
- Lantao Yu, Weinan Zhang, Jun Wang, and Yong Yu. 2017. Seqgan: Sequence generative adversarial nets with policy gradient. In AAAI.
- Wojciech Zaremba, Ilya Sutskever, and Oriol Vinyals. 2014. Recurrent neural network regularization. arXiv preprint arXiv:1409.2329.
- Ruiyi Zhang, Changyou Chen, Zhe Gan, Wenlin Wang, Liqun Chen, Dinghan Shen, Guoyin Wang, and Lawrence Carin. 2018. Sequence generation with guider network. arXiv preprint arXiv:1811.00696.
- Wen Zhang, Yang Feng, Fandong Meng, Di You, and Qun Liu. 2019. Bridging the gap between training and inference for neural machine translation. arXiv preprint arXiv:1906.02448.
- Yizhe Zhang, Zhe Gan, Kai Fan, Zhi Chen, Ricardo Henao, Dinghan Shen, and Lawrence Carin. 2017. Adversarial feature matching for text generation. In ICML.
- Yaoming Zhu, Sidi Lu, Lei Zheng, Jiaxian Guo, Weinan Zhang, Jun Wang, and Yong Yu. 2018. Texygen: a benchmarking platform for text generation models. In SIGIR.
- Dataset Two standard datasets are tested for NMT tasks. The first is a small-scale English-Vietnamese corpus from the IWSLT 2015 Evaluation Campaign (Cettolo et al., 2015), a parallel corpus of TED talks containing 133K sentence pairs. We follow the pre-processing procedure in (Luong and Manning, 2015) by replacing words with frequency less than 5 with unk. As a result, our vocabulary reduces to 17K for English and 7.7K for Vietnamese. We use TED tst2012 as the development set and TED tst2013 as the test set. For a large-scale dataset, we select an English-German corpus from the WMT16 Evaluation Campaign5, which contains 4.5M sentence pairs. Newstest 2013 is used as the development set and Newstest 2015 as the test set. We conduct sub-word tokenization on the corpus using the Byte Pair Encoding (BPE) method (Sennrich et al., 2015). Following Klein et al. (2017), we set the vocabulary size of both English and German to 32K.
- Setup We use Google's Neural Machine Translation (GNMT) system (Wu et al., 2016) as our baseline MLE model, which follows the standard architecture and hyper-parameters6 for fair comparison. All other models are built on top of it with the same network structure. We evaluate model performance using BLEU scores (Papineni et al., 2002). We set the OT weighting parameter λ = 0.1 and the order-preserving penalty weighting parameter β = 0.1.
- For English-Vietnamese translation tasks (i.e., EN-VI or VI-EN), we follow the setup in (Sutskever et al., 2014; Luong et al., 2015b,a). We use one bidirectional LSTM layer with 512 hidden units as encoder and a two-layer LSTM with 512 hidden units per layer as decoder. The embedding dimension is set to 512. We follow the attention method described in (Luong et al., 2015a) and use dropout with probability 0.2 as suggested by (Zaremba et al., 2014). All parameters are initialized uniformly in [−0.1, 0.1]. We train the model for 12 epochs using Stochastic Gradient Descent (SGD). For the first 8 epochs, we set the learning rate to 1.0. After that, we halve the learning rate at every epoch.
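The EN-VI learning-rate schedule described above (a constant rate for 8 epochs, then halving each epoch through epoch 12) can be sketched as follows; the function name and parameter names are ours:

```python
def sgd_lr(epoch, base_lr=1.0, warm_epochs=8):
    """EN-VI SGD schedule as described: keep base_lr for the first
    `warm_epochs` epochs (1-indexed), then halve it at every epoch."""
    if epoch <= warm_epochs:
        return base_lr
    return base_lr * 0.5 ** (epoch - warm_epochs)
```

With the stated 12-epoch budget, the rate decays from 1.0 to 0.0625 over the final four epochs; the EN-DE setup follows the same pattern but halves every half epoch after epoch 5.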
- For English-German translation tasks (i.e., EN-DE or DE-EN), we adopt a 2-layer bidirectional LSTM with 1024 units as encoder and a 4-layer LSTM with 1024 units per layer as decoder. The embedding dimension is set to 1024. We adopt the attention mechanism used in (Wu et al., 2016). We train the model for 10 epochs. For the first 5 epochs, we set the learning rate to 1.0, and then halve it every half epoch.
- We use the widely used English Gigaword corpus (Graff et al., 2003) for the text summarization task. We follow the pre-processing in (Rush et al., 2015). The dataset is sampled and split into train/dev/test sets of size 200K/8K/2K.
- Footnotes: 5 http://statmt.org/wmt16; 6 https://github.com/tensorflow/nmt
- The government’s decision to extend its coal policy vote will be announced in the first half of 2017.