An Actor-Critic Algorithm for Sequence Prediction

International Conference on Learning Representations, 2017.

Abstract:

We present an approach to training neural networks to generate sequences using actor-critic methods from reinforcement learning (RL). Current log-likelihood training methods are limited by the discrepancy between their training and testing modes, as models must generate tokens conditioned on their previous guesses rather than the ground-truth tokens.

Introduction
  • In many important applications of machine learning, the task is to develop a system that produces a sequence of discrete tokens given an input.
  • Because of the discrepancy between training and testing conditions (at test time the model conditions on its own previous outputs rather than on the ground truth), maximum likelihood training has been shown to be suboptimal (Bengio et al., 2015; Ranzato et al., 2015)
  • In these works, the authors argue that the network should be trained to continue generating correctly given the outputs already produced by the model, rather than the ground-truth reference outputs from the data.
  • Popular choices for the mapping f are Long Short-Term Memory (Hochreiter & Schmidhuber, 1997) and Gated Recurrent Units (Cho et al., 2014); the authors use the latter for their models (a minimal GRU-based generator sketch follows this list)
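The following is a minimal, illustrative sketch (not the authors' code) of such a GRU-based generator: a conditional recurrent mapping f that, given the previous token, the previous state, and a context vector summarizing the input, produces a distribution over the next token. It uses PyTorch, and all class and variable names are made up for illustration.

```python
# Minimal sketch of a conditional sequence generator built around a GRU cell,
# i.e. the recurrent mapping f mentioned above. Names are illustrative only.
import torch
import torch.nn as nn

class GRUGenerator(nn.Module):
    def __init__(self, vocab_size, ctx_size, emb_size=64, hid_size=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_size)
        self.cell = nn.GRUCell(emb_size + ctx_size, hid_size)  # f: (y_{t-1}, s_{t-1}, x) -> s_t
        self.output = nn.Linear(hid_size, vocab_size)           # scores for p(y_t | ...)

    def step(self, prev_token, state, context):
        inp = torch.cat([self.embed(prev_token), context], dim=-1)
        state = self.cell(inp, state)
        logits = self.output(state)
        return logits, state

    def sample(self, context, bos_id, max_len=20):
        batch = context.size(0)
        token = torch.full((batch,), bos_id, dtype=torch.long)
        state = torch.zeros(batch, self.cell.hidden_size)
        out = []
        for _ in range(max_len):
            logits, state = self.step(token, state, context)
            # At test time the model conditions on its *own* previous guesses,
            # which is exactly the train/test discrepancy discussed above.
            token = torch.multinomial(torch.softmax(logits, dim=-1), 1).squeeze(-1)
            out.append(token)
        return torch.stack(out, dim=1)
```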
Highlights
  • In many important applications of machine learning, the task is to develop a system that produces a sequence of discrete tokens given an input
  • We show that some of the techniques recently developed in deep reinforcement learning, such as having a target network, may be beneficial for sequence prediction
  • The contributions of the paper can be summarized as follows: 1) we describe how reinforcement learning methodology like the actor-critic approach can be applied to supervised learning problems with structured outputs; and 2) we investigate the performance and behavior of the new method on both a synthetic task and a real-world task of machine translation, demonstrating the improvements over maximum-likelihood and REINFORCE brought by the actor-critic training
  • We propose to use a separate recurrent neural network, parameterized by φ, as the critic (a minimal training-step sketch follows this list)
  • We showed that our method leads to significant improvements over maximum likelihood training on both a synthetic task and a machine translation benchmark
  • One interesting observation we made from the machine translation results is that the training methods that use generated predictions have a strong regularization effect
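Below is a hedged sketch of what one actor-critic update could look like under the scheme summarized above: the critic, conditioned on the ground truth, scores every candidate token along the actor's sampled prefix; the actor is pushed toward tokens with high predicted value, and the critic is fit to temporal-difference targets built from task rewards and a slowly-updated target critic. The `actor`, `critic`, `target_critic`, `reward_fn`, and helper methods such as `sample_with_logits` and `logits_given` are assumed interfaces for illustration, not the paper's actual code.

```python
# Sketch of one actor-critic update for sequence prediction; shapes and names
# are illustrative only (B = batch, T = output length, V = vocabulary size).
import torch
import torch.nn.functional as F

def actor_critic_step(actor, critic, target_critic, x, y_true,
                      actor_opt, critic_opt, reward_fn, gamma=1.0):
    # 1) Actor samples a candidate output and gives per-step token probabilities.
    y_hat, logits = actor.sample_with_logits(x)            # (B, T), (B, T, V)
    probs = F.softmax(logits, dim=-1)

    # 2) Critic predicts a value Q(a; y_hat_{<t}, y_true) for every candidate token a,
    #    conditioning on the ground-truth output as described in the summary.
    q_all = critic(x, y_hat, y_true)                        # (B, T, V)

    # 3) Actor update: raise the probability of tokens the critic values,
    #    dV/dtheta ~= sum_t sum_a dp(a)/dtheta * Q(a).
    actor_loss = -(probs * q_all.detach()).sum(dim=-1).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # 4) Critic update with temporal-difference targets built from per-step task
    #    rewards (e.g. incremental BLEU) and a slowly-updated target critic.
    rewards = reward_fn(y_hat, y_true)                      # (B, T) shaped rewards
    with torch.no_grad():
        q_next = target_critic(x, y_hat, y_true)            # (B, T, V)
        p_next = F.softmax(actor.logits_given(x, y_hat), dim=-1)
        v_next = (p_next * q_next).sum(-1)                   # expected value per step
        targets = rewards + gamma * torch.cat(
            [v_next[:, 1:], torch.zeros_like(v_next[:, :1])], dim=1)
    q_taken = q_all.gather(-1, y_hat.unsqueeze(-1)).squeeze(-1)
    critic_loss = F.mse_loss(q_taken, targets)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()
    return actor_loss.item(), critic_loss.item()
```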
Methods
  • The authors performed two sets of experiments.
  • The authors consider each character in a natural language corpus and with some probability replace it with a random character.
  • The authors call this synthetic task spelling correction (a data-generation sketch follows this list).
  • The authors' second series of experiments is done on the task of automatic machine translation using different models and datasets
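A small sketch of how such spelling-correction data could be generated; the alphabet, noise level, and corpus lines below are placeholders, not the authors' exact setup.

```python
# Generate (noisy_input, clean_target) pairs: each character of a corpus line
# is replaced by a random character with probability eta.
import random
import string

def corrupt(line, eta=0.3, alphabet=string.ascii_lowercase + " "):
    noisy = [c if random.random() > eta else random.choice(alphabet) for c in line]
    return "".join(noisy)

def make_pairs(lines, eta=0.3):
    """Yield (noisy_input, clean_target) training pairs."""
    for clean in lines:
        yield corrupt(clean, eta), clean

# Example:
# for x, y in make_pairs(["the cat sat on the mat"], eta=0.3):
#     print(x, "->", y)
```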
Results
  • The authors show that the method leads to improved performance on both a synthetic task, and for German-English machine translation.
Conclusion
  • The authors' method takes the task objective into account during training and uses the ground-truth output to aid the critic in its prediction of intermediate targets for the actor.
  • The authors' understanding is that conditioning on the sampled outputs effectively increases the diversity of training data.
  • This phenomenon makes it harder to judge whether actor-critic training meets expectations, because a noisier gradient estimate nonetheless yielded better test-set performance.
  • In future work the authors will consider larger machine translation datasets
Tables
  • Table 1: Character error rate of different methods on the spelling correction task. In the table, L is the length of the input strings and η is the probability of replacing a character with a random one. LL stands for log-likelihood training, AC and RF-C stand for the actor-critic and the REINFORCE-critic respectively, and AC+LL and RF-C+LL for the combinations of AC and RF-C with LL
  • Table 2: Our IWSLT 2014 machine translation results with a convolutional encoder compared to the previous work by Ranzato et al. Please see Table 1 for an explanation of abbreviations. The asterisk identifies results from (Ranzato et al., 2015). The numbers reported with ≤ were approximately read from Figure 6 of (Ranzato et al., 2015)
  • Table 3: Our IWSLT 2014 machine translation results with a bidirectional recurrent encoder compared to the previous work. Please see Table 1 for an explanation of abbreviations. The asterisk identifies results from (Wiseman & Rush, 2016)
  • Table 4: Our WMT 14 machine translation results compared to the previous work. Please see Table 1 for an explanation of abbreviations. The apostrophe and the asterisk identify results from (Bahdanau et al., 2015) and (Shen et al., 2015) respectively
  • Table 5: Results of an ablation study. We tried varying the actor update speed γθ, the critic update speed γφ, the value penalty coefficient λ, whether or not reward shaping is used, and whether or not temporal difference (TD) learning is used for the critic. Reported are the best training and validation scores (a sketch of the shaped per-step rewards follows this list)
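For reference, the reward shaping varied in Table 5 corresponds to decomposing a sequence-level score into per-step reward differences. A minimal sketch, with `score_fn` standing in for any prefix-level task score such as BLEU or negative character error rate:

```python
# Shaped rewards: r_t = R(y_hat_{1..t}, y) - R(y_hat_{1..t-1}, y), so the critic
# receives intermediate feedback instead of a single reward at the end.
def shaped_rewards(y_hat, y, score_fn):
    """y_hat: list of predicted tokens, y: reference, score_fn: prefix scorer."""
    rewards, prev = [], 0.0
    for t in range(1, len(y_hat) + 1):
        cur = score_fn(y_hat[:t], y)
        rewards.append(cur - prev)
        prev = cur
    return rewards  # the rewards sum to score_fn(y_hat, y)
```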
Related work
  • In other recent RL-inspired work on sequence prediction, Ranzato et al. (2015) trained a translation model by gradually transitioning from maximum likelihood learning into optimizing BLEU or ROUGE scores using the REINFORCE algorithm. However, REINFORCE is known to have very high variance and does not exploit the availability of the ground truth like the critic network does. The approach also relies on a curriculum learning scheme. Standard RL algorithms such as SARSA and OLPOMDP have also been applied to structured prediction (Maes et al., 2009). Again, these systems do not use the ground truth for value prediction.
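For contrast with the critic-based approach, here is a minimal sketch of the REINFORCE estimator for a sequence-level reward; `actor.sample_with_logits` and `reward_fn` are assumed interfaces, as in the earlier training-step sketch.

```python
# REINFORCE for a sequence-level reward such as BLEU: the whole-sequence return
# weights the log-likelihood of every sampled token, which is what makes this
# gradient estimate high-variance compared to per-token critic values.
import torch
import torch.nn.functional as F

def reinforce_loss(actor, x, y_true, reward_fn, baseline=0.0):
    y_hat, logits = actor.sample_with_logits(x)                          # (B, T), (B, T, V)
    log_probs = F.log_softmax(logits, dim=-1)
    token_logp = log_probs.gather(-1, y_hat.unsqueeze(-1)).squeeze(-1)   # (B, T)
    returns = reward_fn(y_hat, y_true) - baseline                        # (B,) sequence reward
    # Every token in the sequence receives the same (possibly noisy) credit.
    return -(returns.unsqueeze(1).detach() * token_logp).sum(dim=1).mean()
```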

    Imitation learning has also been applied to structured prediction (Vlachos, 2012). Methods of this type include the SEARN (Daumé III et al., 2009) and DAGGER (Ross et al., 2010) algorithms. These methods rely on an expert policy to provide action sequences that the policy learns to imitate. Unfortunately, it is not always easy or even possible to construct an expert policy for a task-specific score. In our approach, the critic plays a role that is similar to the expert policy, but is learned without requiring prior knowledge about the task-specific score. The recently proposed 'scheduled sampling' (Bengio et al., 2015) can also be seen as imitation learning. In this method, ground-truth tokens are occasionally replaced by samples from the model itself during training (see the sketch below). A limitation is that the k-th token of the ground-truth answer is still used as the target at step k, which might not always be the optimal strategy.
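A minimal sketch of the scheduled-sampling mixing step described above; names are illustrative and the mixing probability schedule is left to the caller.

```python
# Scheduled sampling (Bengio et al., 2015): with probability p_truth the
# ground-truth token is fed back into the decoder, otherwise the model's own sample.
import random

def scheduled_sampling_inputs(ground_truth, model_samples, p_truth):
    """Build the decoder input sequence for one training example.

    ground_truth and model_samples are token lists of equal length. Note that
    the training target at step k is still the k-th ground-truth token, which
    is the limitation pointed out above.
    """
    return [gt if random.random() < p_truth else ms
            for gt, ms in zip(ground_truth, model_samples)]
```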
Funding
  • We thank NSERC, Compute Canada, Calcul Québec, Canada Research Chairs, CIFAR, CHISTERA project M2CR (PCIN-2015-226), and Samsung Institute of Advanced Technology for their financial support
References
  • Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. In Proceedings of ICLR 2015, 2015.
  • Andrew G. Barto, Richard S. Sutton, and Charles W. Anderson. Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics, (5):834–846, 1983.
  • Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. Scheduled sampling for sequence prediction with recurrent neural networks. arXiv preprint arXiv:1506.03099, 2015.
  • Mauro Cettolo, Jan Niehues, Sebastian Stüker, Luisa Bentivogli, and Marcello Federico. Report on the 11th IWSLT evaluation campaign. In Proc. of IWSLT, 2014.
  • William Chan, Navdeep Jaitly, Quoc V. Le, and Oriol Vinyals. Listen, attend and spell. arXiv preprint arXiv:1508.01211, 2015.
  • Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, Phillipp Koehn, and Tony Robinson. One billion word benchmark for measuring progress in statistical language modeling. arXiv preprint arXiv:1312.3005, 2013.
  • Kyunghyun Cho, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014.
  • Jan Chorowski, Dzmitry Bahdanau, Dmitriy Serdyuk, Kyunghyun Cho, and Yoshua Bengio. Attention-based models for speech recognition. CoRR, abs/1506.07503, 2015. URL http://arxiv.org/abs/1506.07503.
  • Hal Daumé III and Daniel Marcu. Learning as search optimization: Approximate large margin methods for structured prediction. In Proceedings of the 22nd International Conference on Machine Learning, pp. 169–176. ACM, 2005.
  • Hal Daumé III, John Langford, and Daniel Marcu. Search-based structured prediction. Machine Learning, 75(3):297–325, 2009.
  • Jeffrey Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, and Trevor Darrell. Long-term recurrent convolutional networks for visual recognition and description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2625–2634, 2015.
  • Vaibhava Goel and William J. Byrne. Minimum Bayes-risk automatic speech recognition. Computer Speech & Language, 14(2):115–135, 2000.
  • Awni Y. Hannun, Andrew L. Maas, Daniel Jurafsky, and Andrew Y. Ng. First-pass large vocabulary continuous speech recognition using bi-directional recurrent DNNs. arXiv preprint arXiv:1408.2873, 2014.
  • Tamir Hazan, Joseph Keshet, and David A. McAllester. Direct loss minimization for structured prediction. In Advances in Neural Information Processing Systems, pp. 1594–1602, 2010.
  • Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
  • Andrej Karpathy and Li Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3128–3137, 2015.
  • Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015.
  • Ryan Kiros, Ruslan Salakhutdinov, and Richard S. Zemel. Unifying visual-semantic embeddings with multimodal neural language models. arXiv preprint arXiv:1411.2539, 2014.
  • Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
  • Chin-Yew Lin and Eduard Hovy. Automatic evaluation of summaries using n-gram co-occurrence statistics. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, Volume 1, pp. 71–78. Association for Computational Linguistics, 2003.
  • Francis Maes, Ludovic Denoyer, and Patrick Gallinari. Structured prediction with reinforcement learning. Machine Learning, 77(2-3):271–301, 2009.
  • W. Thomas Miller, Paul J. Werbos, and Richard S. Sutton. Neural Networks for Control. MIT Press, 1995.
  • Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
  • Andrew Y. Ng, Daishi Harada, and Stuart Russell. Policy invariance under reward transformations: Theory and application to reward shaping. In ICML, volume 99, pp. 278–287, 1999.
  • Franz Josef Och. Minimum error rate training in statistical machine translation. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, Volume 1, pp. 160–167. Association for Computational Linguistics, 2003.
  • Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pp. 311–318. Association for Computational Linguistics, 2002.
  • Marc'Aurelio Ranzato, Sumit Chopra, Michael Auli, and Wojciech Zaremba. Sequence level training with recurrent neural networks. arXiv preprint arXiv:1511.06732, 2015.
  • Stéphane Ross, Geoffrey J. Gordon, and J. Andrew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. arXiv preprint arXiv:1011.0686, 2010.
  • Alexander M. Rush, Sumit Chopra, and Jason Weston. A neural attention model for abstractive sentence summarization. arXiv preprint arXiv:1509.00685, 2015.
  • Mike Schuster and Kuldip K. Paliwal. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 45(11):2673–2681, 1997.
  • Shiqi Shen, Yong Cheng, Zhongjun He, Wei He, Hua Wu, Maosong Sun, and Yang Liu. Minimum risk training for neural machine translation. arXiv preprint arXiv:1512.02433, 2015.
  • Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems 27 (NIPS 2014), Montreal, Quebec, Canada, pp. 3104–3112, 2014.
  • Richard S. Sutton. Learning to predict by the methods of temporal differences. Machine Learning, 3(1):9–44, 1988.
  • Richard S. Sutton and Andrew G. Barto. Introduction to Reinforcement Learning, volume 135. MIT Press, Cambridge, 1998.
  • Richard S. Sutton, David A. McAllester, Satinder P. Singh, Yishay Mansour, et al. Policy gradient methods for reinforcement learning with function approximation. In NIPS, volume 99, pp. 1057–1063, 1999.
  • Richard Stuart Sutton. Temporal credit assignment in reinforcement learning. 1984.
  • Gerald Tesauro. TD-Gammon, a self-teaching backgammon program, achieves master-level play. Neural Computation, 6(2):215–219, 1994.
  • Theano Development Team. Theano: A Python framework for fast computation of mathematical expressions. arXiv e-prints, abs/1605.02688, May 2016. URL http://arxiv.org/abs/1605.02688.
  • John N. Tsitsiklis and Benjamin Van Roy. An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control, 42(5):674–690, 1997.
  • Bart van Merriënboer, Dzmitry Bahdanau, Vincent Dumoulin, Dmitriy Serdyuk, David Warde-Farley, Jan Chorowski, and Yoshua Bengio. Blocks and Fuel: Frameworks for deep learning. arXiv:1506.00619 [cs, stat], June 2015.
  • Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. Show and tell: A neural image caption generator. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3156–3164, 2015.
  • Ronald J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229–256, 1992.
  • Sam Wiseman and Alexander M. Rush. Sequence-to-sequence learning as beam-search optimization. arXiv preprint arXiv:1606.02960, 2016.
  • Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144, 2016.
  • Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron C. Courville, Ruslan Salakhutdinov, Richard S. Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation with visual attention. In Proceedings of the 32nd International Conference on Machine Learning (ICML 2015), Lille, France, pp. 2048–2057, 2015.
  • Wojciech Zaremba, Tomas Mikolov, Armand Joulin, and Rob Fergus. Learning simple algorithms from examples. arXiv preprint arXiv:1511.07275, 2015.