Language Models with Transformers.

arXiv: Computation and Language, 2019


Abstract

The Transformer architecture is superior to RNN-based models in computational efficiency. Recently, GPT and BERT demonstrate the efficacy of Transformer models on various NLP tasks using pre-trained language models on large-scale corpora. Surprisingly, these Transformer architectures are suboptimal for language modeling itself: neither self-attention nor the positional encoding in the existing architectures is able to capture the strong word-level context required for language modeling.

Introduction
  • Modeling the sequential context in language is the key to success in many NLP tasks. Recurrent neural networks (RNNs) (Mikolov et al., 2010) memorize the sequential context in carefully designed cells.
  • Transformers are able to capture long-range dependencies, but only with vague relative token positions.
  • This results in a coarse-grained sequence representation at the sentence level.
  • Recent works such as GPT (Radford et al., 2018, 2019) and BERT (Devlin et al., 2018) show that representations learned on large-scale language-modeling datasets are effective when fine-tuned both for sentence-level tasks, such as the GLUE benchmark (Wang et al., 2018), and for token-level tasks that do not rely on word-order dependency in the context, such as question answering and NER
Highlights
  • Modeling the sequential context in language is the key to success in many NLP tasks
  • We study the problem of finding an effective Transformer architecture for language modeling
  • We identify the issue that existing Transformer architectures, such as BERT and GPT, are not able to capture the strong word-level context required in language modeling
  • We propose two approaches to address this issue: we fine-tune a subset of parameters to improve the coarse-grained representations obtained from the pre-trained Transformer models, and we add LSTM layers to capture the fine-grained word-level sequential context
  • We propose a coordinate architecture search (CAS) algorithm to select an effective architecture based on fine-tuning results (a sketch of the greedy loop follows this list)
  • We experimentally show that Coordinate Architecture Search (CAS) outperforms the state-of-the-art language models on three language-modeling benchmark datasets
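The bullets above describe CAS only at a high level: a greedy search that selects architecture changes based on fine-tuning results. The following is a minimal, hypothetical Python rendering of such a loop, not the paper's Algorithm 2; the `propose`, `finetune`, and `val_ppl` hooks and the early-stopping rule are assumptions for illustration.

```python
# Minimal sketch of a greedy, fine-tuning-driven architecture search in the
# spirit of CAS. All hooks are caller-supplied placeholders.
from typing import Callable, Iterable, Tuple, TypeVar

Model = TypeVar("Model")

def greedy_architecture_search(
    init_model: Model,
    propose: Callable[[Model], Iterable[Model]],   # candidate changes, e.g. add an LSTM, fix a block subset
    finetune: Callable[[Model], Model],            # brief fine-tuning of a candidate
    val_ppl: Callable[[Model], float],             # validation perplexity
    n_steps: int = 10,
) -> Tuple[Model, float]:
    best_model, best_ppl = init_model, val_ppl(init_model)
    for _ in range(n_steps):
        # Greedy step: fine-tune and score each candidate change in isolation.
        candidates = [finetune(c) for c in propose(best_model)]
        if not candidates:
            break
        scored = [(val_ppl(m), m) for m in candidates]
        step_ppl, step_model = min(scored, key=lambda t: t[0])
        if step_ppl >= best_ppl:
            break  # no candidate improves validation perplexity: stop early
        best_model, best_ppl = step_model, step_ppl
    return best_model, best_ppl
```

In the paper's setting, `propose` would enumerate coordinate changes such as adding an LSTM layer or choosing which Transformer blocks to update, and `val_ppl` would be validation perplexity on PTB or WikiText.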
Methods
  • The model size of GPT-CAS is 149M, much larger than ENAS's 37M.
  • The GPT vocabulary is 10k larger than ENAS's vocabulary.
  • The original implementations are based on basic word-level tokenization of PTB and WT-2.
  • The authors use sub-word tokenization (WordPiece and BPE, respectively) for the BERT and GPT architecture exploration.
  • The vocabulary sizes after basic tokenization are similar to those after sub-word tokenization, all around 30k-40k.
  • The authors therefore consider the performance comparison fair (a back-of-envelope parameter count follows this list).
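To put the vocabulary-size difference in perspective, a back-of-envelope calculation, assuming a hidden size of 768 (as in the base BERT/GPT configurations) and untied input/output embeddings; both are assumptions for illustration, and they account for only part of the 149M vs. 37M gap.

```python
# Rough effect of vocabulary size on parameter count (illustrative assumptions:
# hidden size 768, untied input embedding and output softmax).
hidden = 768
extra_vocab = 10_000                      # GPT vocab is ~10k larger than ENAS's
extra_params = 2 * extra_vocab * hidden   # extra embedding rows + extra softmax rows
print(f"{extra_params / 1e6:.1f}M extra parameters")  # ~15.4M
```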
Results
  • The authors experimentally show that CAS outperforms the state-of-the-art language models on three language model benchmark datasets.
Conclusion
  • The authors study the problem of finding an effective Transformer architecture for language modeling.
  • The authors identify the issue that existing Transformer architectures, such as BERT and GPT, are not able to capture the strong word-level context required in language modeling.
  • The authors propose two approaches to address this issue: they fine-tune a subset of parameters to improve the coarse-grained representations obtained from the pre-trained Transformer models, and they add LSTM layers to capture the fine-grained word-level sequential context.
  • The authors propose a coordinate architecture search (CAS) algorithm to select an effective architecture based on fine-tuning results.
  • The authors experimentally show that CAS outperforms the state-of-the-art language models on three language-modeling benchmark datasets
Tables
  • Table1: Performance of Coordinate Architecture Search (CAS). ‘Val’ and ‘Test’ denote validation and test perplexity respectively
  • Table2: Ablation study. Compare CAS with not adding LSTM layers (CAS-Subset) and not updating Transformer block parameters (CAS-LSTM)
  • Table3: Over-fitting example on PTB data. BERT-All: BERT with full fine-tuning including the last layer. BERT-CAS: BERT with coordinate architecture search
  • Table4: Effects of different search constraints for placing the LSTM on perplexity on the PTB data
  • Table5: Efficiency of different search methods on PTB and WT-2
  • Table6: Compare model parameter size and results with GPT-2. The GPT-2 model size and results are from (Radford et al., 2019)
  • Table7: Compare training data size with GPT-2
Study subjects and analysis
datasets: 3
It confirms our hypothesis that neither BERT nor GPT is an effective tool for language modeling. Applying them naively leads to significantly worse results than AWD-LSTM-MoS on three datasets. It demonstrates that language modeling requires strong capabilities in modeling the word-order dependency within sentences

popular language model datasets: 3
Contribution 1 is arguably more language-specific. We evaluate CAS on three popular language model datasets: PTB, WikiText-2 and WikiText-103. The BERT-based CAS achieves on average a 12.0-point perplexity gain compared to the state-of-the-art LSTM-based language model AWD-LSTM-MoS (Yang et al., 2017)
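Perplexity, the metric behind the reported gains, is the exponentiated average per-token negative log-likelihood; lower is better. A minimal illustrative snippet (not the authors' evaluation code):

```python
import math

def perplexity(token_nlls):
    """Perplexity from per-token negative log-likelihoods (in nats)."""
    return math.exp(sum(token_nlls) / len(token_nlls))

# A model averaging ~4.0 nats per token corresponds to perplexity of about 55.
print(perplexity([4.1, 3.9, 4.0]))
```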

datasets: 3
The split word pieces are denoted with ##, following Devlin et al. (2018). For the model architectures based on GPT, the three datasets are tokenized with byte-pair encoding (BPE) (Sennrich et al., 2016), where the sub-word vocabulary size is 40k following Radford et al. (2018), denoted as GPTVocab. Note that BERT and its WordPiece embedding are trained on BooksCorpus and Wikipedia, whereas GPT and its BPE are trained only on BooksCorpus
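The ## convention and the contrast between the two sub-word schemes can be seen with off-the-shelf tokenizers. The snippet below uses the Hugging Face transformers library purely as an illustration; the GPT-2 tokenizer stands in for GPT's 40k BPE vocabulary, which is an approximation rather than the vocabulary used in the paper.

```python
# Illustration only: WordPiece (BERT) prefixes non-initial word pieces with "##",
# while BPE (GPT-style) merges frequent character sequences instead.
from transformers import BertTokenizer, GPT2Tokenizer

wordpiece = BertTokenizer.from_pretrained("bert-base-uncased")
bpe = GPT2Tokenizer.from_pretrained("gpt2")  # stand-in for GPT's BPE vocabulary

sentence = "Language modeling requires strong word-level context."
print(wordpiece.tokenize(sentence))  # non-initial pieces carry the "##" prefix
print(bpe.tokenize(sentence))
print(len(wordpiece.get_vocab()), len(bpe.get_vocab()))  # ~30k vs ~50k entries
```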

datasets: 3
Training details: We evaluate CAS (Algorithm 2) with both pre-trained BERT and pre-trained GPT as the initial architecture, trained on all three datasets. The same training configuration is used across all datasets

epochs on training datasets: 50
Lastly, for AWD-LSTM-MoS with the BERT or GPT sub-word setting, we largely follow the parameter settings of the original implementation (Yang et al., 2017). We use NT-ASGD (Merity et al., 2017) to train for 50 epochs on the training datasets. Since the goal of this work is to discover the best-performing language model from the architecture perspective, we do not employ post-training methods such as the neural cache model (Grave et al., 2016) or dynamic evaluation (Krause et al., 2018)
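NT-ASGD trains with plain SGD and switches to averaged SGD once validation performance stops improving (the "non-monotonic trigger" of Merity et al., 2017). A rough sketch of that trigger follows; the hyper-parameters and the `train_epoch`/`val_loss` callables are illustrative assumptions, not the authors' settings.

```python
# Sketch of NT-ASGD's non-monotonic trigger: switch from SGD to ASGD once the
# validation loss has not improved over the best of the last `n` checks.
import torch

def nt_asgd(model, train_epoch, val_loss, lr=20.0, epochs=50, n=5):
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    history = []
    for _ in range(epochs):
        train_epoch(model, opt)          # one pass over the training data
        loss = val_loss(model)           # validation loss after the epoch
        if (isinstance(opt, torch.optim.SGD)
                and len(history) > n
                and loss > min(history[:-n])):
            # Trigger fired: start parameter averaging from this point on.
            opt = torch.optim.ASGD(model.parameters(), lr=lr, t0=0, lambd=0.0)
        history.append(loss)
    return model
```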

datasets: 3
Combining both leads to further improvement. CAS outperforms AWD-LSTM-MoS on all three datasets. Next, we unfreeze the pre-trained weights of BERT to allow full fine-tuning, including the last layer
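A minimal sketch of the difference between fine-tuning a subset of parameters and full fine-tuning ("unfreezing"), using Hugging Face's BertModel purely as an illustrative stand-in for the pre-trained Transformer; which blocks to update is exactly the kind of choice CAS searches over.

```python
# Illustration only: freeze the pre-trained weights, then selectively unfreeze.
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-uncased")

# Subset fine-tuning: freeze everything, then unfreeze only the last two blocks.
for p in model.parameters():
    p.requires_grad = False
for block in model.encoder.layer[-2:]:
    for p in block.parameters():
        p.requires_grad = True

# Full fine-tuning ("unfreeze the pre-trained weights"): make everything trainable.
# for p in model.parameters():
#     p.requires_grad = True
```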

cases: 4
Let’s look into the details of adding LSTMs. There are 4 cases, differing in where the LSTM is placed relative to the Transformer blocks; Only-LSTM implements a model consisting only of LSTM layers
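A hedged sketch of one way an LSTM layer could be attached to a pre-trained Transformer body (here, after the Transformer blocks); the module names and wiring are illustrative assumptions, not the exact variants searched in the paper.

```python
# Illustrative only: add an LSTM on top of a Transformer's hidden states to
# re-inject word-order (sequential) context before the language-model head.
import torch
import torch.nn as nn

class TransformerWithLSTM(nn.Module):
    def __init__(self, transformer: nn.Module, hidden_size: int, vocab_size: int):
        super().__init__()
        self.transformer = transformer  # pre-trained BERT/GPT body
        self.lstm = nn.LSTM(hidden_size, hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, vocab_size)

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        hidden = self.transformer(input_ids)[0]  # (batch, seq_len, hidden_size)
        hidden, _ = self.lstm(hidden)            # sequential pass over the states
        return self.head(hidden)                 # next-token logits
```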

language model benchmark datasets: 3
It uses a greedy search strategy to accelerate architecture search. We experimentally show that CAS outperforms the state-of-the-art language models on three language-modeling benchmark datasets. Although we only show the effectiveness of CAS when applying Transformer architectures to the language-modeling task, we feel it is possible to apply CAS both to other neural network architectures and to fine-tuning other NLP tasks that require strong word-level context as well

Reference
  • Alexei Baevski and Michael Auli. 2018. Adaptive input representations for neural language modeling. arXiv preprint arXiv:1809.10853.
  • Bowen Baker, Otkrist Gupta, Ramesh Raskar, and Nikhil Naik. 2017. Accelerating neural architecture search using performance prediction. arXiv preprint arXiv:1705.10823.
  • Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. 2003. A neural probabilistic language model. JMLR, 3(Feb):1137–1155.
  • Andrew Brock, Theodore Lim, James M Ritchie, and Nick Weston. 2017. SMASH: One-shot model architecture search through hypernetworks. arXiv preprint arXiv:1708.05344.
  • Han Cai, Tianyao Chen, Weinan Zhang, Yong Yu, and Jun Wang. 2018. Efficient architecture search by network transformation. In AAAI.
  • Tianqi Chen, Ian J. Goodfellow, and Jonathon Shlens. 2015. Net2Net: Accelerating learning via knowledge transfer. CoRR.
  • Zihang Dai, Zhilin Yang, Yiming Yang, William W Cohen, Jaime Carbonell, Quoc V Le, and Ruslan Salakhutdinov. 2019. Transformer-XL: Attentive language models beyond a fixed-length context. arXiv preprint arXiv:1901.02860.
  • Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
  • Thomas Elsken, Jan Hendrik Metzen, and Frank Hutter. 2017. Simple and efficient architecture search for convolutional neural networks. CoRR.
  • Edouard Grave, Armand Joulin, and Nicolas Usunier. 2016. Improving neural language models with a continuous cache. CoRR.
  • Hakan Inan, Khashayar Khosravi, and Richard Socher. 2016. Tying word vectors and word classifiers: A loss framework for language modeling. arXiv preprint arXiv:1611.01462.
  • Haifeng Jin, Qingquan Song, and Xia Hu. 2018. Efficient neural architecture search with network morphism. arXiv preprint arXiv:1806.10282.
  • Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
  • Ben Krause, Emmanuel Kahembwe, Iain Murray, and Steve Renals. 2018. Dynamic evaluation of neural sequence models. In ICML, pages 2771–2780.
  • Tao Lei, Yu Zhang, Sida I Wang, Hui Dai, and Yoav Artzi. 2018. Simple recurrent units for highly parallelizable recurrence. In EMNLP, pages 4470–4481.
  • Chenxi Liu, Barret Zoph, Jonathon Shlens, Wei Hua, Li-Jia Li, Li Fei-Fei, Alan Yuille, Jonathan Huang, and Kevin Murphy. 2017a. Progressive neural architecture search. arXiv preprint arXiv:1712.00559.
  • Hanxiao Liu, Karen Simonyan, Oriol Vinyals, Chrisantha Fernando, and Koray Kavukcuoglu. 2017b. Hierarchical representations for efficient architecture search. arXiv preprint arXiv:1711.00436.
  • Hanxiao Liu, Karen Simonyan, and Yiming Yang. 2018. DARTS: Differentiable architecture search. CoRR.
  • Gábor Melis, Chris Dyer, and Phil Blunsom. 2017. On the state of the art of evaluation in neural language models. arXiv preprint arXiv:1707.05589.
  • Stephen Merity, Nitish Shirish Keskar, and Richard Socher. 2017. Regularizing and optimizing LSTM language models. CoRR.
  • Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. 2016. Pointer sentinel mixture models. CoRR.
  • Tomáš Mikolov, Martin Karafiát, Lukáš Burget, Jan Černocký, and Sanjeev Khudanpur. 2010. Recurrent neural network based language model. In Eleventh Annual Conference of the International Speech Communication Association.
  • Andriy Mnih and Geoffrey Hinton. 2007. Three new graphical models for statistical language modelling. In ICML, pages 641–648.
  • Hieu Pham, Melody Y. Guan, Barret Zoph, Quoc V. Le, and Jeff Dean. 2018. Efficient neural architecture search via parameter sharing. In ICML, pages 4092–4101.
  • Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. URL https://s3-us-west-2.
  • Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners.
  • Esteban Real, Alok Aggarwal, Yanping Huang, and Quoc V Le. 2018. Regularized evolution for image classifier architecture search. arXiv preprint arXiv:1802.01548.
  • Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In ACL.
  • Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In NIPS, pages 5998–6008.
  • Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R Bowman. 2018. GLUE: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461.
  • Tao Wei, Changhu Wang, Yong Rui, and Chang Wen Chen. 2016. Network morphism. In ICML, pages 564–572.
  • Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.
  • Zhilin Yang, Zihang Dai, Ruslan Salakhutdinov, and William W. Cohen. 2017. Breaking the softmax bottleneck: A high-rank RNN language model. CoRR.
  • Barret Zoph and Quoc V. Le. 2016. Neural architecture search with reinforcement learning. CoRR.
  • Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V. Le. 2018. Learning transferable architectures for scalable image recognition. In CVPR.