What Do Position Embeddings Learn? An Empirical Study of Pre-Trained Language Model Positional Encoding

EMNLP 2020, pp. 6840–6849

Abstract

In recent years, pre-trained Transformers have dominated the majority of NLP benchmark tasks. Many variants of pre-trained Transformers keep emerging, and most focus on designing different pre-training objectives or variants of self-attention. Embedding the position information in the self-attention mechanism is also an indispensable...

Introduction
  • Word ordering often determines the meaning of a sentence; how to utilize the position information of a word sequence has been an important topic in NLP and widely investigated recently.
  • The authors conduct a deep analysis of the learned position embeddings among three iconic pre-trained Transformer language models: BERT (Devlin et al., 2018), RoBERTa (Liu et al., 2019), and GPT-2 (Radford et al., 2019); a minimal sketch of extracting these embeddings follows this list.
  • To examine different types of NLP tasks, the authors conduct experiments on text classification, language modeling, and machine translation, and empirically analyze and explain the meaning and influence of position embeddings from different aspects.
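
The analysis starts from the raw position-embedding matrices of the three pre-trained models. The snippet below is a minimal sketch (not the authors' code) of how these matrices can be pulled out with the HuggingFace transformers library; the checkpoint names and the library choice are illustrative assumptions.

```python
from transformers import BertModel, GPT2Model, RobertaModel

# Assumed checkpoints; the paper analyzes BERT, RoBERTa, and GPT-2.
bert = BertModel.from_pretrained("bert-base-uncased")
roberta = RobertaModel.from_pretrained("roberta-base")
gpt2 = GPT2Model.from_pretrained("gpt2")

position_embeddings = {
    # BERT and RoBERTa keep an absolute-position table inside the embedding module.
    "bert": bert.embeddings.position_embeddings.weight.detach(),       # (512, 768)
    "roberta": roberta.embeddings.position_embeddings.weight.detach(), # (514, 768)
    # GPT-2 stores its learned position table as `wpe`.
    "gpt2": gpt2.wpe.weight.detach(),                                   # (1024, 768)
}

for name, table in position_embeddings.items():
    print(name, tuple(table.shape))
```
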
Highlights
  • Word ordering often determines the meaning of a sentence; how to utilize the position information of a word sequence has been an important topic in NLP and widely investigated recently
  • A common approach for modeling word ordering is to use recurrent neural networks (RNNs), such as long short-term memory (LSTM) (Hochreiter and Schmidhuber, 1997) or gated recurrent units (GRU) (Chung et al., 2014), which use a hidden state to represent the information of an ordered sequence and update model weights by backpropagation through time (BPTT) (Werbos, 1990).
  • This paper investigates the implicit meaning of pretrained Transformer position embeddings
  • Transformer encoders learn the local position information that can only be effective in masked language modeling
  • We show that different NLP tasks with different model architectures and different training objectives may utilize the position information in different ways
  • It is believed that this study will benefit future work on choosing suitable positional encoding functions, or on designing other methods of modeling position information, for target NLP tasks based on their properties
Methods
  • The authors experiment on six common text classification datasets: SST2, TREC, SUBJ, CR, MR, and MPQA.
  • The experiments imply that even though text classification allows the model to utilize all tokens when making the prediction, the absolute positions, which GPT-2 can capture, may still be salient for longer inputs.
  • The skip position attack significantly harms the performance of the masked language models, while it has no effect on the autoregressive language model (a hedged sketch of such a position perturbation follows this list).
  • Another observation is that, in Wikitext-2, the distribution of position information is not robust enough, so the difference between the position embeddings of BERT and RoBERTa is not significant.
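
One hedged reading of the skip position attack, used here only for illustration, is to feed the model non-consecutive position indices (a fixed stride) instead of 0, 1, 2, .... The sketch below assumes that reading, the HuggingFace transformers library, and a BERT masked-LM checkpoint; it is not the authors' exact setup.

```python
import torch
from transformers import BertForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased").eval()

text = "The capital of France is [MASK] ."
inputs = tokenizer(text, return_tensors="pt")
seq_len = inputs["input_ids"].size(1)

stride = 2  # assumed "skip" factor; indices must stay below the 512-entry table
skipped_ids = torch.arange(0, seq_len * stride, stride).unsqueeze(0)

with torch.no_grad():
    normal_logits = model(**inputs).logits
    skipped_logits = model(**inputs, position_ids=skipped_ids).logits

# Compare the top-1 prediction at the masked position with and without the skip.
mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero()[0].item()
to_token = tokenizer.convert_ids_to_tokens
print("consecutive positions:", to_token(int(normal_logits[0, mask_pos].argmax())))
print("skipped positions    :", to_token(int(skipped_logits[0, mask_pos].argmax())))
```
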
Conclusion
  • This paper investigates the implicit meaning of pretrained Transformer position embeddings.
  • Transformer encoders learn the local position information that can only be effective in masked language modeling.
  • The Transformer decoders for autoregressive language modeling learn about absolute positions.
  • The empirical experiments on the pre-trained position embeddings validate the hypothesis.
  • The authors show that different NLP tasks with different model architectures and different training objectives may utilize the position information in different ways.
  • It is believed that this study will benefit future work on choosing suitable positional encoding functions, or on designing other methods of modeling position information, for target NLP tasks based on their properties.
Tables
  • Table1: Mean absolute error of the reversed mapping function learned by linear regression (a sketch of this probe follows the table list)
  • Table2: Error rate of the relative position regression
  • Table3: Testing accuracy of text classification. † indicates the much shorter average lengths in TREC and MPQA, so position embeddings cannot significantly affect the results
  • Table4: Testing perplexity in Wikitext-2 and Wikitext-103
  • Table5: Statistics of Multi30k dataset
  • Table6: BLEU scores on full set and long sentences (> 2σ) of Multi30k translation data. The hyphen (-) in the table means the same as the baseline (random)
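
Table 1's "reversed mapping function" probe (recovering a position index from its embedding) can be illustrated with ordinary linear regression. The sketch below is an assumption-laden illustration, not the authors' exact protocol: scikit-learn, an 80/20 split, and the BERT position table are all choices made here for brevity.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from transformers import BertModel

# Position-embedding matrix of BERT: one 768-d vector per position index.
table = (BertModel.from_pretrained("bert-base-uncased")
         .embeddings.position_embeddings.weight.detach().numpy())  # (512, 768)
indices = np.arange(table.shape[0])

# Hold out part of the positions and try to recover their indices linearly.
X_train, X_test, y_train, y_test = train_test_split(
    table, indices, test_size=0.2, random_state=0)
reg = LinearRegression().fit(X_train, y_train)
mae = np.mean(np.abs(reg.predict(X_test) - y_test))
print(f"reversed-mapping MAE: {mae:.2f} positions")
```
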
Related work
  • The concept of using position embeddings in position-insensitive models was first proposed by convolutional seq2seq (Gehring et al., 2017), which built an encoder-decoder architecture on convolutional neural networks. Vaswani et al. (2017) proposed the Transformer, which uses the self-attention mechanism in its basic blocks; because the attention mechanism is position-insensitive, they proposed a pre-defined sinusoidal function as the positional encoding. Pre-trained language models became a trend across many NLP tasks after Peters et al. (2018) introduced ELMo. Influenced by ELMo, OpenAI GPT (Radford et al., 2018) was the first pre-trained language model to use a Transformer architecture; afterwards, many variants of pre-trained Transformers, including BERT (Devlin et al., 2018), RoBERTa (Liu et al., 2019), and GPT-2 (Radford et al., 2019), advanced NLP research tremendously. In the original Transformer, the attention computation is identical at every input position, so Shaw et al. (2018) proposed relative position representations at the attention level to address this issue. Dai et al. (2019) used a segment-level recurrence mechanism on Transformers and also utilized an adaptive version of relative position embeddings inspired by Shaw et al. (2018). Furthermore, Wang et al. (2019) extended the embedding space from real numbers to complex values, and also proposed a new learnable positional encoding function instead of a simple position embedding mapping.
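
For reference, the pre-defined sinusoidal positional encoding of Vaswani et al. (2017) mentioned above can be written in a few lines. This is the standard published formula, shown here only as a fixed baseline against the learned embeddings the paper studies; the dimensions chosen in the usage line are illustrative.

```python
import numpy as np

def sinusoidal_positional_encoding(max_len: int, d_model: int) -> np.ndarray:
    """PE[pos, 2i] = sin(pos / 10000**(2i/d_model)); PE[pos, 2i+1] = cos(...)."""
    positions = np.arange(max_len)[:, None]            # (max_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]           # (1, d_model // 2)
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions
    pe[:, 1::2] = np.cos(angles)   # odd dimensions
    return pe

print(sinusoidal_positional_encoding(512, 768).shape)  # (512, 768)
```
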
Funding
  • This work was financially supported by the Young Scholar Fellowship Program of the Ministry of Science and Technology (MOST) in Taiwan under Grant 109-2636-E-002-026.
Study subjects and analysis
common text classification datasets: 6
We experiment on six common text classification datasets: SST2, TREC, SUBJ, CR, MR, and MPQA. Since the last four datasets have no train/dev/test splits, we evaluate them with 5-fold cross-validation. We use the same model architecture as Wang et al. (2019), building a 1-layer Transformer encoder with 256 and 512 hidden sizes for self-attention and feed-forward respectively, and 8 attention heads (a minimal sketch of this encoder is given at the end of this section). This setting allows a fair comparison with pre-trained position embeddings of both encoders and decoders, in order to check whether all settings achieve similar performance. It is believed that these experimental results can guide future work in choosing a suitable positional encoding function for a specific task given its properties.

datasets in the length analysis: 3
Here we calculate the average accuracy of SUBJ, SST, and CR only, since the average lengths of TREC and MPQA are too short. The MR dataset is also excluded because its distribution of length and accuracy differs too much from the other three datasets and would introduce a large bias in the figure; note that the results of MR roughly agree with the others.
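
The following is a minimal sketch (assumed PyTorch, not the authors' released code) of the classifier described above: a single Transformer encoder layer with hidden sizes 256/512 and 8 attention heads, plus a learned absolute-position table standing in for the pre-trained position embeddings the paper plugs in. The mean-pooling readout, vocabulary size, and dimension handling are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TinyTransformerClassifier(nn.Module):
    def __init__(self, vocab_size, num_classes, max_len=512,
                 d_model=256, ff_dim=512, num_heads=8):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        # Learned absolute-position table; the paper's experiments substitute
        # pre-trained position embeddings here (details omitted in this sketch).
        self.pos_emb = nn.Embedding(max_len, d_model)
        self.encoder = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=num_heads,
            dim_feedforward=ff_dim, batch_first=True)
        self.classifier = nn.Linear(d_model, num_classes)

    def forward(self, token_ids):                       # (batch, seq_len)
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        x = self.tok_emb(token_ids) + self.pos_emb(positions)
        x = self.encoder(x)                             # (batch, seq_len, d_model)
        return self.classifier(x.mean(dim=1))           # mean-pool then classify

logits = TinyTransformerClassifier(vocab_size=30522, num_classes=2)(
    torch.randint(0, 30522, (4, 16)))
print(logits.shape)  # torch.Size([4, 2])
```
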

Reference
  • Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. 2016. Layer normalization. arXiv preprint arXiv:1607.06450.
  • Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
  • Denny Britz, Anna Goldie, Minh-Thang Luong, and Quoc Le. 2017. Massive exploration of neural machine translation architectures. arXiv preprint arXiv:1703.03906.
  • Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. 2014. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555.
  • Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc V. Le, and Ruslan Salakhutdinov. 2019. Transformer-XL: Attentive language models beyond a fixed-length context. arXiv preprint arXiv:1901.02860.
  • Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
  • Philip Gage. 1994. A new algorithm for data compression. C Users Journal, 12(2):23–38.
  • Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N. Dauphin. 2017. Convolutional sequence to sequence learning. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 1243–1252. JMLR.org.
  • Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778.
  • Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.
  • Yann LeCun, Yoshua Bengio, et al. 1995. Convolutional networks for images, speech, and time series. The Handbook of Brain Theory and Neural Networks, 3361(10):1995.
  • Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, and Luke Zettlemoyer. 2019. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461.
  • Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.
  • Myle Ott, Sergey Edunov, Alexey Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. fairseq: A fast, extensible toolkit for sequence modeling. arXiv preprint arXiv:1904.01038.
  • Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. arXiv preprint arXiv:1802.05365.
  • Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training.
  • Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI Blog, 1(8):9.
  • Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2019. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683.
  • Angus Roberts. 2005. Learning meronyms from biomedical text. In Proceedings of the ACL Student Research Workshop, pages 49–54, Ann Arbor, Michigan. Association for Computational Linguistics.
  • Rico Sennrich, Barry Haddow, and Alexandra Birch. 2015. Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909.
  • Peter Shaw, Jakob Uszkoreit, and Ashish Vaswani. 2018. Self-attention with relative position representations. arXiv preprint arXiv:1803.02155.
  • Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112.
  • Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.
  • Benyou Wang, Donghao Zhao, Christina Lioma, Qiuchi Li, Peng Zhang, and Jakob Grue Simonsen. 2019. Encoding word order in complex embeddings. arXiv preprint arXiv:1912.12333.
  • Paul J. Werbos. 1990. Backpropagation through time: what it does and how to do it. Proceedings of the IEEE, 78(10):1550–1560.
  • Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R. Salakhutdinov, and Quoc V. Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. In Advances in Neural Information Processing Systems, pages 5754–5764.