A Neural Transducer

Abstract

Sequence-to-sequence models have achieved impressive results on various tasks. However, they are unsuitable for tasks that require incremental predictions to be made as more data arrives, or for tasks that have long input and output sequences. This is because they generate an output sequence conditioned on an entire input sequence…

Introduction
  • The recently introduced sequence-to-sequence model has shown success in many tasks that map sequences to sequences, e.g., translation, speech recognition, image captioning and dialogue modeling [17, 4, 1, 6, 3, 20, 18, 15, 19]
  • This method is unsuitable for tasks where it is important to produce outputs as the input sequence arrives.
  • The transducer RNN generates local extensions to the output sequence, conditioned on the current input block and its own recurrent state.
Highlights
  • The recently introduced sequence-to-sequence model has shown success in many tasks that map sequences to sequences, e.g., translation, speech recognition, image captioning and dialogue modeling [17, 4, 1, 6, 3, 20, 18, 15, 19]
  • Instant translation systems would be much more effective if audio were translated online rather than only after entire utterances. This limitation of the sequence-to-sequence model arises because output predictions are conditioned on the entire input sequence.
  • We present a Neural Transducer, a more general class of sequence-to-sequence learning models
  • The inputs to the transducer recurrent neural network come from two sources: the encoder recurrent neural network and its own recurrent state
  • We have introduced a new model that uses partial conditioning on inputs to generate output sequences
  • This allows the model to produce output as input arrives. This is useful for speech recognition systems and will be crucial for future generations of online speech translation systems. Further, it can be useful for performing transduction over long sequences, something that is possibly difficult for sequence-to-sequence models.
Methods
  • The authors describe the model in more detail; refer to Figure 2 of the paper for an overview.

    Let x_{1···L} be the input data, which is L time steps long, where x_i represents the features at input time step i.
  • Let the transducer produce a sequence of k outputs, y_{i···(i+k)}, where 0 ≤ k < M, for any input block.
  • Each such output sequence is padded with an end-of-block symbol that is added to the vocabulary.
  • This symbol signifies that the transducer may move on and consume data from the next block.
  • When no other symbols are produced for a block, this end-of-block symbol plays a role akin to the blank symbol of CTC (see the sketch after this list).
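
    The block-wise procedure above can be summarized in a short sketch. This is a minimal illustration under stated assumptions, not the paper's implementation: encode_block and transducer_step are hypothetical stand-ins for the encoder RNN and the transducer RNN, the end-of-block symbol is written here as "<e>", and no learned parameters are involved.

    END_OF_BLOCK = "<e>"   # end-of-block padding symbol added to the vocabulary
    W = 3                  # number of input frames consumed per block
    M = 8                  # maximum number of symbols emitted per block

    def encode_block(frames, enc_state):
        """Stand-in for the encoder RNN: consume one block of frames, update state."""
        enc_state = enc_state + len(frames)           # dummy recurrent update
        return [enc_state] * len(frames), enc_state   # dummy per-frame encodings

    def transducer_step(block_encodings, dec_state):
        """Stand-in for the transducer RNN: emit one symbol conditioned on the
        current block's encoder outputs and the transducer's own recurrent state."""
        dec_state = dec_state + 1
        # Dummy policy: emit one symbol, then end the block on the following step.
        return ("y%d" % dec_state if dec_state % 2 else END_OF_BLOCK), dec_state

    def transduce(inputs):
        enc_state, dec_state, outputs = 0, 0, []
        for start in range(0, len(inputs), W):        # consume the input block by block
            block, enc_state = encode_block(inputs[start:start + W], enc_state)
            # Emit at most M symbols per block; the end-of-block symbol (akin to
            # the CTC blank when nothing else is produced) hands control back to
            # the encoder. The transducer state is carried across blocks.
            for _ in range(M):
                symbol, dec_state = transducer_step(block, dec_state)
                if symbol == END_OF_BLOCK:
                    break
                outputs.append(symbol)
        return outputs

    print(transduce(list(range(12))))   # output grows as input blocks arrive

    The key property illustrated here is that output symbols become available after each block of W input frames, rather than only after the whole input has been consumed.
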
Results
  • The authors experimented with the Neural Transducer on the toy task of adding two three-digit decimal numbers.
  • The second number is presented in reverse order, and so is the target output (see the example format sketched after this list).
  • The model can produce the first output as soon as the first digit of the second number is observed.
  • The model is able to learn this task with a very small number of units.
  • The model learns to output the digits as soon as the required information is available.
  • A block window size of W = 1 was used, with M = 8.
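
    As a concrete illustration of the data layout used in this toy task, the sketch below builds one example in the format described above. It is hypothetical: the "+" separator and the character-list representation are assumptions for illustration, and the exact formatting in the paper may differ.

    def make_example(a, b):
        # First number fed in order, second number reversed; target sum reversed.
        inputs = list(str(a)) + ["+"] + list(str(b))[::-1]
        target = list(str(a + b))[::-1]
        return inputs, target

    inputs, target = make_example(123, 456)
    print(inputs)   # ['1', '2', '3', '+', '6', '5', '4']
    print(target)   # ['9', '7', '5']  (579 reversed)

    Because the least-significant digit of the sum depends only on the least-significant digits of the two numbers, the model can emit its first output as soon as the first (reversed) digit of the second number has been seen, which matches the behavior noted above.
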
Conclusion
  • One important side effect of using partial conditioning with a blocked transducer is that it naturally alleviates the problem of “losing attention” suffered by sequence-to-sequence models.
  • The authors note that increasing the block size W so that it is as large as the input utterance makes the model similar to vanilla end-to-end models [5, 3]. They have introduced a new model that uses partial conditioning on inputs to generate output sequences.
  • This allows the model to produce output as input arrives.
  • The authors applied the model to a toy task of addition and to a phone recognition task, and showed that it can produce results comparable to the state of the art from sequence-to-sequence models.
Tables
  • Table 1: Impact of maintaining the recurrent state of the transducer across blocks on the PER. The table shows that maintaining the state of the transducer across blocks leads to much better results. For this experiment, a block size (W) of 15 frames was used. The reported number is the median of three different runs.
  • Table 2: Impact of architecture on PER. The table shows the PER on the dev set as a function of the number of layers in the encoder (2 or 3) and the number of layers in the transducer (1–4).
Related work
  • In the past few years, many proposals have been made to add more power or flexibility to neural networks, especially via the concept of augmented memory [10, 16, 21] or augmented arithmetic units [13, 14]. Our work is not concerned with memory or arithmetic components but it allows more flexibility in the model so that it can dynamically produce outputs as data come in.

    Our work is related to traditional structured prediction methods, commonplace in speech recognition. The work bears similarity to HMM-DNN [11] and CTC [7] systems. An important aspect of these approaches is that the model makes predictions at every input time step. A weakness of these models is that they typically assume conditional independence between the predictions at each output step.
References
  • Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural Machine Translation by Jointly Learning to Align and Translate. In International Conference on Learning Representations, 2015.
  • Dzmitry Bahdanau, Jan Chorowski, Dmitriy Serdyuk, Philemon Brakel, and Yoshua Bengio. End-to-end attention-based large vocabulary speech recognition. arXiv preprint arXiv:1508.04395, 2015.
  • William Chan, Navdeep Jaitly, Quoc V. Le, and Oriol Vinyals. Listen, attend and spell. arXiv preprint arXiv:1508.01211, 2015.
  • Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. In Conference on Empirical Methods in Natural Language Processing, 2014.
  • Jan Chorowski, Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. End-to-end Continuous Speech Recognition using Attention-based Recurrent NN: First Results. In NIPS Workshop on Deep Learning and Representation Learning, 2014.
  • Jan Chorowski, Dzmitry Bahdanau, Dmitriy Serdyuk, Kyunghyun Cho, and Yoshua Bengio. Attention-Based Models for Speech Recognition. In Neural Information Processing Systems, 2015.
  • Alex Graves, Abdel-rahman Mohamed, and Geoffrey Hinton. Speech recognition with deep recurrent neural networks. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6645–6649, 2013.
  • Alex Graves. Sequence Transduction with Recurrent Neural Networks. In International Conference on Machine Learning: Representation Learning Workshop, 2012.
  • Alex Graves, Abdel-rahman Mohamed, and Geoffrey Hinton. Speech Recognition with Deep Recurrent Neural Networks. In IEEE International Conference on Acoustics, Speech and Signal Processing, 2013.
  • Alex Graves, Greg Wayne, and Ivo Danihelka. Neural Turing Machines. arXiv preprint arXiv:1410.5401, 2014.
  • Geoffrey Hinton, Li Deng, Dong Yu, George E. Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara N. Sainath, and Brian Kingsbury. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, 29(6):82–97, 2012.
  • Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin A. Riedmiller. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
  • Arvind Neelakantan, Quoc V. Le, and Ilya Sutskever. Neural Programmer: Inducing latent programs with gradient descent. arXiv preprint arXiv:1511.04834, 2015.
  • Scott Reed and Nando de Freitas. Neural Programmer-Interpreters. arXiv preprint arXiv:1511.06279, 2015.
  • Alessandro Sordoni, Michel Galley, Michael Auli, Chris Brockett, Yangfeng Ji, Margaret Mitchell, Jian-Yun Nie, Jianfeng Gao, and Bill Dolan. A neural network approach to context-sensitive generation of conversational responses. arXiv preprint arXiv:1506.06714, 2015.
  • Sainbayar Sukhbaatar, Jason Weston, Rob Fergus, et al. End-to-end memory networks. In Advances in Neural Information Processing Systems, pages 2431–2439, 2015.
  • Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to Sequence Learning with Neural Networks. In Neural Information Processing Systems, 2014.
  • Oriol Vinyals, Lukasz Kaiser, Terry Koo, Slav Petrov, Ilya Sutskever, and Geoffrey Hinton. Grammar as a foreign language. In Neural Information Processing Systems, 2015.
  • Oriol Vinyals and Quoc V. Le. A neural conversational model. In ICML Deep Learning Workshop, 2015.
  • Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. Show and Tell: A Neural Image Caption Generator. In IEEE Conference on Computer Vision and Pattern Recognition, 2015.
  • Wojciech Zaremba and Ilya Sutskever. Reinforcement learning neural turing machines. arXiv preprint arXiv:1505.00521, 2015.