Language Models with Transformers.
arXiv: Computation and Language, (2019)
The Transformer architecture is superior to RNN-based models in computational efficiency. Recently, GPT and BERT have demonstrated the efficacy of Transformer models on various NLP tasks using pre-trained language models on large-scale corpora. Surprisingly, these Transformer architectures are suboptimal for language modeling itself. Neither self-attention nor the positional encoding in the Transformer is able to efficiently incorporate the word-level sequential context crucial to language modeling.
- Modeling the sequential context in language is the key to success in many NLP tasks. Recurrent neural networks (RNNs) (Mikolov et al, 2010) memorize the sequential context in carefully designed cells.
- Transformers capture long-range dependencies with only vague relative token positions, resulting in a coarse-grained sequence representation at the sentence level.
- Recent works such as GPT (Radford et al, 2018, 2019) and BERT (Devlin et al, 2018) show that the representations learned on large-scale language modeling datasets are effective for fine-tuning on both sentence-level tasks, such as the GLUE benchmark (Wang et al, 2018), and token-level tasks that do not rely on word order dependency in the context, such as question answering and NER.
- Modeling the sequential context in language is the key to success in many NLP tasks
- We study the problem of finding an effective Transformer architecture for language modeling
- We identify the issue that existing Transformer architectures, such as BERT and GPT, are not able to capture the strong word-level context required in language modeling
- We propose two approaches to address this issue: first, we fine-tune a subset of parameters to improve the coarse-grained representations obtained from the pre-trained Transformer models
- We propose a coordinate architecture search (CAS) algorithm to select an effective architecture based on fine-tuning results
- We experimentally show that coordinate architecture search (CAS) outperforms state-of-the-art language models on three language model benchmark datasets
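The search described in the last two bullets can be sketched as a greedy coordinate procedure: apply one candidate modification at a time and keep it only if validation perplexity improves. The sketch below is illustrative, not the paper's implementation; the move set, the toy evaluator, and the step count are all placeholder assumptions.

```python
import random

# Illustrative sketch of a greedy coordinate-style search: apply one
# candidate modification per step (e.g. add an LSTM layer, freeze a
# subset of Transformer blocks) and keep it only if the validation
# perplexity improves. All names and the toy evaluator are assumptions.

def coordinate_search(base_arch, candidate_moves, evaluate, steps=5, seed=0):
    """Greedily accept modifications that lower validation perplexity."""
    rng = random.Random(seed)
    best_arch = list(base_arch)
    best_ppl = evaluate(best_arch)
    for _ in range(steps):
        move = rng.choice(candidate_moves)
        trial = move(list(best_arch))
        ppl = evaluate(trial)      # stands in for a short fine-tuning run
        if ppl < best_ppl:         # greedy: keep only improving changes
            best_arch, best_ppl = trial, ppl
    return best_arch, best_ppl

# Toy run: the only move appends an LSTM layer, and the toy evaluator
# rewards each added LSTM with a 5-point perplexity drop.
moves = [lambda arch: arch + ["lstm"]]
arch, ppl = coordinate_search(["block_0", "block_1"], moves,
                              lambda a: 100.0 - 5.0 * a.count("lstm"))
print(arch, ppl)  # 5 accepted moves: 5 LSTM layers, perplexity 75.0
```

The greedy acceptance rule is what makes the search cheap: no controller network is trained, and each step needs only one short fine-tuning run.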
- The model size of GPT-CAS is 149M, much larger than the 37M of ENAS.
- The GPT vocabulary is 10k larger than ENAS's vocabulary.
- The original implementations are based on basic word tokenization of PTB and WT-2, whereas the authors use sub-word tokenization (WordPiece and BPE, respectively) for BERT and GPT architecture exploration.
- The vocabulary sizes after basic tokenization and after sub-word tokenization are similar, all around 30k-40k, so the authors consider the performance comparison fair.
- Table1: Performance of Coordinate Architecture Search (CAS). ‘Val’ and ‘Test’ denote validation and test perplexity respectively
- Table2: Ablation study. Compare CAS with not adding LSTM layers (CAS-Subset) and with not updating Transformer block parameters (CAS-LSTM)
- Table3: Over-fitting example on PTB data. BERT-All: BERT with full fine-tuning including the last layer. BERT-CAS: BERT with coordinate architecture search
- Table4: Effects of different search constraints for placing the LSTM on perplexity on the PTB data
- Table5: Efficiency of different search methods on PTB and WT-2
- Table6: Compare model parameter size and results with GPT-2. The GPT-2 model size and results are from (Radford et al, 2019)
- Table7: Compare training data size with GPT-2
- Architecture search has shown promising results in tasks such as image classification (Zoph and Le, 2016; Liu et al, 2017a,b; Real et al, 2018; Zoph et al, 2018; Liu et al, 2018), object detection (Zoph et al, 2018) as well as language modeling (Zoph and Le, 2016; Pham et al, 2018; Liu et al, 2018) in NLP. Existing neural architecture search studies focus on leveraging different methods to build the neural network from scratch. For example, NAS (Zoph and Le, 2016) uses reinforcement learning to obtain an architecture for CIFAR-10 and ImageNet. Designing the architecture from scratch using reinforcement learning is very costly. Many follow-up studies focus on speeding up the search process by weight-sharing across child models (Pham et al, 2018; Cai et al, 2018), by incorporating a particular structure into the search space (Liu et al, 2017a,b), or by enabling weights prediction for each architecture (Brock et al, 2017; Baker et al, 2017). Different from the above methods, the proposed coordinate search does not involve any controllers.
Study subjects and analysis
It confirms our hypothesis that neither BERT nor GPT is an effective tool for language modeling. Applying them naively leads to significantly worse results compared to AWD-LSTM-MoS on all three datasets. It demonstrates that language modeling requires strong capabilities in modeling the word order dependency within sentences
popular language model datasets: 3
Contribution 1 is arguably more language specific. We evaluate CAS on three popular language model datasets: PTB, WikiText-2 and WikiText-103. The BERT-based CAS achieves an average gain of 12.0 perplexity points compared to the state-of-the-art LSTM-based language model AWD-LSTM-MoS (Yang et al, 2017)
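For reference, the perplexity used in these comparisons is the exponential of the average per-token negative log-likelihood; a minimal sketch with made-up probabilities:

```python
import math

# Perplexity = exp(mean negative log-likelihood per token).
# The token probabilities below are made up for illustration.

def perplexity(token_probs):
    """Perplexity of a sequence given the model's probability for each token."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

# A model that assigns every token probability 1/4 is "as confused as"
# a uniform choice among 4 tokens, i.e. perplexity 4:
print(round(perplexity([0.25, 0.25, 0.25]), 6))  # → 4.0
```

Lower is better, so a 12.0-point drop means the model is, on average, substantially less uncertain about each next token.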
The split word pieces are denoted with ## following (Devlin et al, 2018). For the model architectures based on GPT, the three datasets are tokenized with byte-pair encoding (BPE) (Sennrich et al, 2016), with a sub-word vocabulary size of 40k following (Radford et al, 2018), denoted as GPTVocab. Note that BERT and the WordPiece embedding in BERT are trained on BooksCorpus and Wikipedia, whereas GPT and its BPE are trained only on BooksCorpus
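A minimal greedy longest-match segmenter in the style of WordPiece, using the ## continuation marker mentioned above; the tiny vocabulary here is made up for illustration and is not the actual BERT vocabulary.

```python
# Greedy longest-match sub-word segmentation in the style of WordPiece:
# repeatedly take the longest prefix that exists in the vocabulary,
# marking non-initial pieces with "##". The vocabulary is a toy example.

def wordpiece_tokenize(word, vocab):
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece      # continuation marker
            if piece in vocab:
                pieces.append(piece)
                break
            end -= 1
        else:
            return ["[UNK]"]              # no matching piece found
        start = end
    return pieces

vocab = {"trans", "##form", "##er", "##s"}
print(wordpiece_tokenize("transformers", vocab))
# → ['trans', '##form', '##er', '##s']
```

This is why the sub-word vocabularies stay in the 30k-40k range: rare words are decomposed into frequent pieces rather than stored whole.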
4.2 Training Details. We evaluate CAS (Algorithm 2) with both pre-trained BERT and pre-trained GPT as the initial architecture, trained on all three datasets. The same training configuration is used across all datasets
epochs on training datasets: 50
Lastly, for AWD-LSTM-MoS with the BERT or GPT sub-word setting, we largely follow the parameter settings of the original implementation (Yang et al, 2017). We use NT-ASGD (Merity et al, 2017) to train for 50 epochs on the training datasets. Since the goal of this work is to discover the best-performing language model from the architecture perspective, we do not employ post-training methods such as the neural cache model (Grave et al, 2016) or dynamic evaluation (Krause et al, 2018)
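The non-monotonic trigger behind NT-ASGD can be sketched as follows: plain SGD runs until validation perplexity stops improving on a window of n checks, after which iterate averaging begins. This is a simplified sketch of the trigger condition only (the averaging itself is omitted), and the variable names are ours.

```python
# Simplified sketch of the NT-ASGD trigger (Merity et al, 2017): switch
# from SGD to averaged SGD once validation perplexity fails to improve
# on the best of the last n checks. The averaging step itself is omitted.

def nt_asgd_trigger(val_ppls, n=5):
    """Return the index of the check at which averaging would start, or None."""
    for t in range(n, len(val_ppls)):
        if val_ppls[t] > min(val_ppls[t - n:t]):  # perplexity plateaued
            return t
    return None

print(nt_asgd_trigger([10, 9, 8, 7, 6, 5, 5.5]))  # triggers at check 6
```

The appeal of this schedule is that it needs no hand-tuned switch point: averaging starts only once the validation curve itself signals a plateau.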
Combining both leads to further improvement. CAS outperforms AWD-LSTM-MoS on all three datasets. Next, we unfreeze the pre-trained weights of BERT to allow full fine-tuning, including the last layer
Let’s look into the details of adding LSTMs. There are 4 cases, among them Only-LSTM, which implements a model consisting only of LSTM layers, and First-LSTM
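The two named variants can be sketched as layer orderings around a Transformer stack; the list-of-layer-names encoding is our illustrative assumption, not the paper's model code.

```python
# Sketch of LSTM placement variants as layer orderings. "Only-LSTM" and
# "First-LSTM" follow the text above; the list-of-names encoding is an
# illustrative assumption, not the actual model implementation.

def build_variant(name, n_blocks=3):
    blocks = [f"transformer_{i}" for i in range(n_blocks)]
    variants = {
        "Only-LSTM": ["lstm"] * n_blocks,   # no Transformer blocks at all
        "First-LSTM": ["lstm"] + blocks,    # LSTM before the Transformer stack
    }
    return variants[name]

print(build_variant("First-LSTM"))
# → ['lstm', 'transformer_0', 'transformer_1', 'transformer_2']
```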
language model benchmark datasets: 3
It uses a greedy search strategy to accelerate architecture search. We experimentally show that CAS outperforms the state-of-the-art language models on three language model benchmark datasets. Although we only show the effectiveness of CAS when applying Transformer architectures to the language modeling task, we believe CAS can also be applied to other neural network architectures and to fine-tuning other NLP tasks that require strong word-level context
- Alexei Baevski and Michael Auli. 2018. Adaptive input representations for neural language modeling. arXiv preprint arXiv:1809.10853.
- Bowen Baker, Otkrist Gupta, Ramesh Raskar, and Nikhil Naik. 2017. Accelerating neural architecture search using performance prediction. arXiv preprint arXiv:1705.10823.
- Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. 2003. A neural probabilistic language model. JMLR, 3(Feb):1137–1155.
- Andrew Brock, Theodore Lim, James M Ritchie, and Nick Weston. 2017. Smash: one-shot model architecture search through hypernetworks. arXiv preprint arXiv:1708.05344.
- Han Cai, Tianyao Chen, Weinan Zhang, Yong Yu, and Jun Wang. 2018. Efficient architecture search by network transformation. AAAI.
- Tianqi Chen, Ian J. Goodfellow, and Jonathon Shlens. 2015. Net2net: Accelerating learning via knowledge transfer. CoRR.
- Zihang Dai, Zhilin Yang, Yiming Yang, William W Cohen, Jaime Carbonell, Quoc V Le, and Ruslan Salakhutdinov. 2019. Transformer-xl: Attentive language models beyond a fixed-length context. arXiv preprint arXiv:1901.02860.
- Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
- Thomas Elsken, Jan Hendrik Metzen, and Frank Hutter. 2017. Simple and efficient architecture search for convolutional neural networks. CoRR.
- Edouard Grave, Armand Joulin, and Nicolas Usunier. 2016. Improving neural language models with a continuous cache. CoRR.
- Hakan Inan, Khashayar Khosravi, and Richard Socher. 2016. Tying word vectors and word classifiers: A loss framework for language modeling. arXiv preprint arXiv:1611.01462.
- Haifeng Jin, Qingquan Song, and Xia Hu. 2018. Efficient neural architecture search with network morphism. arXiv preprint arXiv:1806.10282.
- Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
- Ben Krause, Emmanuel Kahembwe, Iain Murray, and Steve Renals. 2018. Dynamic evaluation of neural sequence models. In ICML, pages 2771– 2780.
- Tao Lei, Yu Zhang, Sida I Wang, Hui Dai, and Yoav Artzi. 2018. Simple recurrent units for highly parallelizable recurrence. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4470–4481.
- Chenxi Liu, Barret Zoph, Jonathon Shlens, Wei Hua, Li-Jia Li, Li Fei-Fei, Alan Yuille, Jonathan Huang, and Kevin Murphy. 2017a. Progressive neural architecture search. arXiv preprint arXiv:1712.00559.
- Hanxiao Liu, Karen Simonyan, Oriol Vinyals, Chrisantha Fernando, and Koray Kavukcuoglu. 2017b. Hierarchical representations for efficient architecture search. arXiv preprint arXiv:1711.00436.
- Hanxiao Liu, Karen Simonyan, and Yiming Yang. 2018. DARTS: Differentiable architecture search. CoRR.
- Gábor Melis, Chris Dyer, and Phil Blunsom. 2017. On the state of the art of evaluation in neural language models. arXiv preprint arXiv:1707.05589.
- Stephen Merity, Nitish Shirish Keskar, and Richard Socher. 2017. Regularizing and optimizing LSTM language models. CoRR.
- Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. 2016. Pointer sentinel mixture models. CoRR.
- Tomáš Mikolov, Martin Karafiát, Lukáš Burget, Jan Cernocky, and Sanjeev Khudanpur. 2010. Recurrent neural network based language model. In Eleventh Annual Conference of the International Speech Communication Association.
- Andriy Mnih and Geoffrey Hinton. 2007. Three new graphical models for statistical language modelling. In ICML, pages 641–648.
- Hieu Pham, Melody Y. Guan, Barret Zoph, Quoc V. Le, and Jeff Dean. 2018. Efficient neural architecture search via parameter sharing. In ICML, pages 4092–4101.
- Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training.
- Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners.
- Esteban Real, Alok Aggarwal, Yanping Huang, and Quoc V Le. 2018. Regularized evolution for image classifier architecture search. arXiv preprint arXiv:1802.01548.
- Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In ACL.
- Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In NIPS, pages 5998– 6008.
- Alex Wang, Amapreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R Bowman. 2018. Glue: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461.
- Tao Wei, Changhu Wang, Yong Rui, and Chang Wen Chen. 2016. Network morphism. In ICML, pages 564–572.
- Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. 2016. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.
- Zhilin Yang, Zihang Dai, Ruslan Salakhutdinov, and William W. Cohen. 2017. Breaking the softmax bottleneck: A high-rank RNN language model. CoRR.
- Barret Zoph and Quoc V. Le. 2016. Neural architecture search with reinforcement learning. CoRR.
- Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V. Le. 2018. Learning transferable architectures for scalable image recognition. In CVPR.