Fast and Accurate Deep Bidirectional Language Representations for Unsupervised Learning

ACL, pp. 823-835, 2020.


Abstract:

Even though BERT achieves successful performance improvements in various supervised learning tasks, applying BERT to unsupervised tasks still has the limitation that it requires repetitive inference for computing contextual language representations. To resolve this limitation, we propose a novel deep bidirectional language model called the Transformer-based Text Autoencoder (T-TA), which computes contextual language representations without repetition.
Introduction
  • A language model is an essential component in many NLP applications ranging from automatic speech recognition (ASR) (Chan et al, 2016; Panayotov et al, 2015) to neural machine translation (NMT) (Sutskever et al, 2014; Sennrich et al, 2016; Vaswani et al, 2017).
  • BERT (Bidirectional Encoder Representations from Transformers) (Devlin et al, 2019) and its variations have brought significant improvements in learning natural language representations, and they have achieved state-of-the-art performance on various downstream tasks such as the GLUE benchmark (Wang et al, 2019) and question answering (Rajpurkar et al, 2016)
  • This success of BERT continues in various unsupervised tasks such as N-best list reranking for ASR and NMT (Shin et al, 2019; Salazar et al, 2019), showing that deep bidirectional language representations are beneficial in unsupervised settings as well.
  • Faced with this limitation of BERT, namely its need for repetitive inference, we raise a new research question: “Can we make a deep bidirectional language model that has minimal inference time while maintaining the accuracy of BERT?”
Highlights
  • A language model is an essential component in many NLP applications ranging from automatic speech recognition (ASR) (Chan et al, 2016; Panayotov et al, 2015) to neural machine translation (NMT) (Sutskever et al, 2014; Sennrich et al, 2016; Vaswani et al, 2017)
  • We introduce a novel architecture of a deep bidirectional language model named the T-TA, which stands for Transformer-based Text Autoencoder; the overall architecture of the T-TA is shown in Figure 2
  • We observe that the bidirectional language models trained with language autoencoding (the T-TA) and masked language modeling outperform the unidirectional language model trained with causal language modeling
  • We propose a novel deep bidirectional language model named the Transformer-based Text Autoencoder (T-TA) in order to eliminate the computational overhead of applying BERT to unsupervised applications
  • Experimental results on the N-best list reranking and unsupervised semantic textual similarity tasks demonstrate that the proposed T-TA is significantly faster than the BERT-based approach, while its encoding ability is competitive with or even better than that of BERT
Methods
  • 4.1 Language Autoencoding

    In this paper, the authors propose a new learning objective named language autoencoding (LAE) for obtaining fully contextualized language representations without repetition.
  • Naively trained to reconstruct its own input, the model would only copy each input representation to the output without learning any statistics of the language.
  • To prevent this, the information flow from the i-th input to the i-th output must be blocked inside the model, as shown in Figure 1c.
  • From this LAE objective, the authors can obtain the fully contextualized language representations H^L = [H^L_1, ..., H^L_N] at once.
  • The way of blocking this information flow is described next; a rough illustrative sketch is given after this list.
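As a concrete illustration of this blocking, the sketch below masks the diagonal of a scaled dot-product attention matrix so that the i-th output never attends to the i-th input. This is a minimal sketch under assumed names and shapes, not the authors' implementation; in the full T-TA, the keys and values are additionally held fixed across layers so that stacking layers does not undo the block (see the Conclusion section).

```python
import numpy as np

def diagonal_masked_attention(Q, K, V):
    """Scaled dot-product attention in which the i-th query is blocked from
    attending to the i-th key/value, so the i-th output carries no information
    copied directly from the i-th input (the LAE constraint)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                # (N, N) attention logits
    np.fill_diagonal(scores, -1e9)                 # block the i -> i path
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)      # row-wise softmax
    return weights @ V                             # contextual representations

# Toy usage: 4 tokens with 8-dimensional representations.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
H = diagonal_masked_attention(X, X, X)
print(H.shape)  # (4, 8); row i depends only on the other three tokens
```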
Results
  • Results on Speech Recognition

    For reranking in ASR, the authors use N-best lists obtained from the dev and test sets with Seq2SeqASR, a sequence-to-sequence ASR model that the authors train on the LibriSpeech corpus.
  • For NMT, each interpolation weight is set to the value that gives the best reranking performance on each test set for each method; a reranking sketch is given after this list.
  • The Fr→En translation benefits less from reranking than the De→En translation because the base NMT system for Fr→En is stronger than that for De→En.
  • The STS Benchmark (STSb) has 5749/1500/1379 sentence pairs in its train/dev/test splits, with corresponding scores ranging from 0 to 5.
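The reranking step itself is straightforward: the base system's score for each hypothesis is interpolated with a language-model score, and the highest-scoring candidate is picked. The sketch below is a generic, hedged illustration of this procedure; the weight lam, the score functions, and the toy candidates are placeholders rather than values or interfaces from the paper.

```python
from typing import Callable, List, Tuple

def rerank_nbest(
    nbest: List[Tuple[str, float]],      # (hypothesis, base ASR/NMT log-score)
    lm_score: Callable[[str], float],    # sentence score from a language model
    lam: float,                          # interpolation weight tuned on a dev set
) -> str:
    """Return the hypothesis with the highest interpolated score."""
    def total(hyp: str, base: float) -> float:
        return base + lam * lm_score(hyp)
    best_hyp, _ = max(nbest, key=lambda pair: total(*pair))
    return best_hyp

# Toy usage with a dummy language model that prefers shorter hypotheses.
dummy_lm = lambda s: -0.5 * len(s.split())
candidates = [("the cat sat on the mat", -3.2), ("the cat sat on a mat", -3.0)]
print(rerank_nbest(candidates, dummy_lm, lam=0.4))
```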
Conclusion
  • Discussion and Analysis

  • Until now, the authors have introduced the new learning objective, language autoencoding (LAE), and the novel deep bidirectional language model, the Transformer-based Text Autoencoder (T-TA).
  • The keys and values K = V = X + P carry the information of the input tokens and are fixed across all layers, whereas the query Q^l is updated from layer to layer during inference, starting from the position embeddings Q^1 = P at the first layer.
  • Each layer computes H^l = SMSAN(Q^l, K, V) = g(Norm(Add(Q^l, f(Q^l, K, V)))) (Eq. 1); a simplified sketch of this query-stream computation follows this list.
  • In this work, the authors propose a novel deep bidirectional language model named the Transformer-based Text Autoencoder (T-TA) in order to eliminate the computational overhead of applying BERT to unsupervised applications.
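Reading Eq. (1) together with the fixed keys and values, the model can be pictured as a query stream that starts from the position embeddings and is repeatedly refined against the unchanging token representations X + P. The sketch below is a loose rendering of that recurrence under simplifying assumptions: f_context is a crude stand-in for the masked attention f, and the output transformation g is taken to be the identity, so it is not the authors' exact architecture.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    mu = x.mean(-1, keepdims=True)
    sd = x.std(-1, keepdims=True)
    return (x - mu) / (sd + eps)

def f_context(Q, K, V):
    """Stand-in for f(Q, K, V): each position receives the average of the
    other positions' values, so the i -> i path stays blocked."""
    n = V.shape[0]
    mask = 1.0 - np.eye(n)
    return (mask / (n - 1)) @ V

def t_ta_encode(X, P, n_layers=3):
    """Loose rendering of Eq. (1): H^l = g(Norm(Add(Q^l, f(Q^l, K, V)))),
    with K = V = X + P held fixed and Q^1 = P; g is the identity here."""
    K = V = X + P              # token information, fixed for all layers
    Q = P                      # the query stream starts from position embeddings
    H = Q
    for _ in range(n_layers):
        H = layer_norm(Q + f_context(Q, K, V))  # Add & Norm around f
        Q = H                  # only the query stream is updated
    return H

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 16))    # token embeddings
P = rng.normal(size=(5, 16))    # position embeddings
print(t_ta_encode(X, P).shape)  # (5, 16)
```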
Tables
  • Table 1: WERs after reranking with each language model on LibriSpeech. The ‘other’ sets are recorded in noisier environments than the ‘clean’ sets. Bold marks the best performance on each sub-task. ∗ denotes word-level language models from (Shin et al, 2019)
  • Table 2: BLEU scores after reranking with each language model on WMT13. Bold marks the best performance on each sub-task. Underlines mark the best among our implementations
  • Table 3: Pearson’s r × 100 results on the STS Benchmark. - denotes an infeasible value. Bold marks the top-2 performances on each sub-task
  • Table 4: Pearson’s r × 100 results on the SICK data. - denotes an infeasible value. Bold marks the best performance on each sub-task
  • Table 5: Oracle WERs of the 50-best lists on LibriSpeech from each ASR system
  • Table 6: Oracle BLEUs of the 50-best lists on WMT
  • Table 7: (Pseudo-)perplexities and corresponding WERs of language models on LibriSpeech
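Tables 3 and 4 report Pearson’s r × 100 between predicted sentence similarities and human scores. As a reminder of how such numbers arise, the sketch below computes cosine similarity between pooled sentence embeddings and then Pearson’s r against gold scores; it is a generic illustration with made-up data, not the authors' evaluation script.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine similarity between two sentence embeddings."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def pearson_r_x100(pred, gold):
    """Pearson's r scaled by 100, the metric reported in Tables 3 and 4."""
    return float(np.corrcoef(pred, gold)[0, 1] * 100)

# Toy usage: random stand-ins for pooled contextual representations of 3 pairs.
rng = np.random.default_rng(0)
emb_a = rng.normal(size=(3, 16))   # embeddings of the first sentences
emb_b = rng.normal(size=(3, 16))   # embeddings of the second sentences
pred = [cosine_similarity(a, b) for a, b in zip(emb_a, emb_b)]
gold = [4.2, 1.0, 3.5]             # made-up gold scores on the 0-5 scale
print(round(pearson_r_x100(pred, gold), 1))
```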
Related work
  • For autoencoding approaches to language modeling, sequence-to-sequence learning has commonly been used. These approaches encode a given sentence into a compressed vector representation, followed by a decoder that reconstructs the original sentence from the sentence-level representation (Sutskever et al, 2014; Cho et al, 2014; Dai and Le, 2015). To the best of our knowledge, however, none of them considered an autoencoder that encodes word-level representations, as BERT does, without an autoregressive decoding process.

    There have been many studies on neural network-based language models for word-level representations. Distributed word representations were proposed and gained huge interest as they were considered fundamental building blocks for natural language processing tasks (Rumelhart et al, 1986; Bengio et al, 2003; Mikolov et al, 2013b). Recently, researchers have explored contextualized representations of text, where each word has a different representation depending on the context (Peters et al, 2018; Radford et al, 2018). More recently, the Transformer-based deep bidirectional model was proposed and applied to various supervised learning tasks with huge success (Devlin et al, 2019).
Funding
  • This work was supported by the Ministry of Trade, Industry & Energy (MOTIE, Korea) under Industrial Technology Innovation Program (No.10073144) and by the NRF grant funded by the Korea government (MSIT) (NRF2016M3C4A7952587)
Reference
  • Ebru Arisoy, Abhinav Sethy, Bhuvana Ramabhadran, and Stanley Chen. 2015. Bidirectional recurrent neural network language models for automatic speech recognition. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5421–5425. IEEE.
  • Yoshua Bengio, Rejean Ducharme, Pascal Vincent, and Christian Jauvin. 2003. A neural probabilistic language model. Journal of machine learning research, 3(Feb):1137–1155.
  • Daniel Cer, Mona Diab, Eneko Agirre, Inigo Lopez-Gazpio, and Lucia Specia. 2017. Semeval-2017 task 1: Semantic textual similarity multilingual and crosslingual focused evaluation. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pages 1–14.
  • Andrew M Dai and Quoc V Le. 2015. Semi-supervised sequence learning. In Advances in neural information processing systems, pages 3079–3087.
  • Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186.
  • William H DuBay. 200The classic readability studies. Impact Information, Costa Mesa, California.
  • Kenneth Heafield. 2011. Kenlm: Faster and smaller language model queries. In Proceedings of the sixth workshop on statistical machine translation, pages 187–197. Association for Computational Linguistics.
  • Dan Hendrycks and Kevin Gimpel. 2016. Bridging nonlinearities and stochastic regularizers with gaussian error linear units. arXiv preprint arXiv:1606.08415.
  • Takaaki Hori, Shinji Watanabe, Yu Zhang, and William Chan. 2017. Advances in joint ctc-attention based end-to-end speech recognition with a deep cnn encoder and rnn-lm. Proc. Interspeech 2017, pages 949–953.
  • Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
  • Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2019. Albert: A lite bert for self-supervised learning of language representations. In International Conference on Learning Representations.
  • Marco Marelli, Luisa Bentivogli, Marco Baroni, Raffaella Bernardi, Stefano Menini, and Roberto Zamparelli. 2014. Semeval-2014 task 1: Evaluation of compositional distributional semantic models on full sentences through semantic relatedness and textual entailment. In Proceedings of the 8th international workshop on semantic evaluation (SemEval 2014), pages 1–8.
  • Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013a. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.
  • Tomas Mikolov, Martin Karafiat, Lukas Burget, Jan Cernocky, and Sanjeev Khudanpur. 2010. Recurrent neural network based language model. In Eleventh annual conference of the international speech communication association.
  • Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013b. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pages 3111–3119.
  • Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. 2015. Librispeech: an asr corpus based on public domain audio books. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5206–5210. IEEE.
  • Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. Glove: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pages 1532–1543.
  • Alvaro Peris and Francisco Casacuberta. 2015. A bidirectional recurrent neural language model for machine translation. Procesamiento del Lenguaje Natural, 55:109–116.
  • Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), volume 1, pages 2227–2237.
  • Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training.
  • Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. Squad: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392.
  • David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. 1986. Learning representations by backpropagating errors. nature, 323(6088):533–536.
  • Julian Salazar, Davis Liang, Toan Q Nguyen, and Katrin Kirchhoff. 2019. Pseudolikelihood reranking with masked language models. arXiv preprint arXiv:1910.14659.
  • Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2019. Distilbert, a distilled version of bert: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108.
  • Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Improving neural machine translation models with monolingual data. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 86–96.
  • Yusuxke Shibata, Takuya Kida, Shuichi Fukamachi, Masayuki Takeda, Ayumi Shinohara, Takeshi Shinohara, and Setsuo Arikawa. 1999. Byte pair encoding: A text compression scheme that accelerates pattern matching. Technical Report DOI-TR-161, Department of Informatics, Kyushu University.
  • Joonbo Shin, Yoonhyung Lee, and Kyomin Jung. 2019. Effective sentence scoring method using bert for speech recognition. In Asian Conference on Machine Learning, pages 1081–1093.
  • Martin Sundermeyer, Ralf Schluter, and Hermann Ney. 2012. Lstm neural networks for language modeling. In Thirteenth annual conference of the international speech communication association.
  • Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112.
  • Ashish Vaswani, Samy Bengio, Eugene Brevdo, Francois Chollet, Aidan Gomez, Stephan Gouws, Llion Jones, Łukasz Kaiser, Nal Kalchbrenner, Niki Parmar, et al. 2018. Tensor2tensor for neural machine translation. In Proceedings of the 13th Conference of the Association for Machine Translation in the Americas (Volume 1: Research Papers), pages 193–199.
  • Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.
  • Alex Wang and Kyunghyun Cho. 2019. Bert has a mouth, and it must speak: Bert as a markov random field language model. In Proceedings of the Workshop on Methods for Optimizing and Evaluating Neural Language Generation, pages 30–36.
  • Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2019. Glue: A multi-task benchmark and analysis platform for natural language understanding. In 7th International Conference on Learning Representations, ICLR 2019.
  • Shinji Watanabe, Takaaki Hori, Shigeki Karita, Tomoki Hayashi, Jiro Nishitoba, Yuya Unno, Nelson Enrique Yalta Soplin, Jahn Heymann, Matthew Wiesner, Nanxin Chen, et al. 2018. Espnet: End-to-end speech processing toolkit. Proc. Interspeech 2018, pages 2207–2211.
  • Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. 2016. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.
  • Matthew D Zeiler. 2012. Adadelta: an adaptive learning rate method. arXiv preprint arXiv:1212.5701.
  • For the input features, we use an 80-band Mel-scale spectrogram derived from the speech signal. The target sequence is processed into 5K case-insensitive sub-word units created via unigram byte-pair encoding (Shibata et al., 1999). We use an attention-based encoder-decoder model as our acoustic model. The encoder is a 5-layer bidirectional LSTM, with bottleneck layers that apply a linear transformation between consecutive LSTM layers. There is also a VGG module before the encoder, which reduces the number of encoding time steps to a quarter through two max-pooling layers. The decoder is a 2-layer bidirectional LSTM with a location-aware attention mechanism (Chorowski et al., 2015). All the layers have 1024 hidden units. The model is trained with an additional CTC objective function because the left-to-right constraint of CTC helps learn alignments between speech-text pairs (Hori et al., 2017).
  • Our model is trained for 20 epochs on the 960h LibriSpeech training data using the Adadelta optimizer (Zeiler, 2012). Using this acoustic model, we obtain 50-best decoded sentences for each input audio through the hybrid CTC-attention scoring method (Hori et al., 2017). For Seq2SeqASR, we additionally use a pre-trained RNNLM and combine its log-probability p_lm during decoding as follows: log p(y_n | y_{1:n-1}) = log p_am(y_n | y_{1:n-1}) + β log p_lm(y_n | y_{1:n-1}), (3)
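For clarity, Eq. (3) scores each next token by adding the acoustic-model log-probability and a β-weighted RNNLM log-probability. The sketch below is a minimal, hedged rendering of that combination; the probability functions and the value of beta are placeholders, not the paper's actual interfaces or settings.

```python
import math
from typing import Callable, Sequence

def fused_log_prob(
    token: str,
    history: Sequence[str],
    p_am: Callable[[str, Sequence[str]], float],  # acoustic (seq2seq) model probability
    p_lm: Callable[[str, Sequence[str]], float],  # RNNLM probability
    beta: float,                                  # LM weight from Eq. (3)
) -> float:
    """Eq. (3): log p(y_n | y_{1:n-1}) = log p_am(...) + beta * log p_lm(...)."""
    return math.log(p_am(token, history)) + beta * math.log(p_lm(token, history))

# Toy usage with uniform dummy distributions over a 10-word vocabulary.
uniform = lambda tok, hist: 0.1
print(fused_log_prob("the", [], uniform, uniform, beta=0.3))  # ≈ -2.99
```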
  • Table 5 shows the oracle word error rates (WERs) of the 50-best lists, which are measured assuming that the best sentence is always picked from the candidates. We also include the oracle WERs from the 50-best lists of (Shin et al., 2019).