# Language Through a Prism: A Spectral Approach for Multiscale Language Representations

NeurIPS 2020.

Abstract

Language exhibits structure at different scales, ranging from subwords to words, sentences, paragraphs, and documents. To what extent do deep models capture information at these scales, and can we force them to better capture structure across this hierarchy? We approach this question by focusing on individual neurons, analyzing the beha…

Introduction

- Language exhibits structure at multiple levels, ranging from morphology at the subword level [1], word meaning at the lexical level [2], coherence and other discourse properties at the clause or sentence level [3, 4, 5], to topical and narrative structures for entire documents [6, 7].
- Any sequence of values, such as a neuron’s activations across input tokens, can be represented as a weighted sum of cosine waves with different frequencies.
- To perform operations in the frequency domain of a sequence, the authors first need a representation of the input in the frequency domain; this is the role of a spectral transform.
- The authors use the DCT because it is a real-to-real transform, is widely used in practice, and often produces fewer artifacts than the DFT when filtering [27, 29]
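The decomposition described above can be sketched in a few lines using SciPy's `dct`/`idct` (the "activation sequence" here is a synthetic toy example, not data from the paper):

```python
import numpy as np
from scipy.fft import dct, idct

# A toy "neuron activation" sequence across 8 input tokens.
activations = np.array([0.2, 0.5, 0.1, -0.3, 0.4, 0.0, -0.2, 0.6])

# Forward DCT (type II, orthonormal): one coefficient per cosine wave,
# at increasing frequencies.
coeffs = dct(activations, norm='ortho')

# The inverse DCT reconstructs the sequence exactly, confirming that any
# sequence is a weighted sum of cosines at different frequencies.
reconstructed = idct(coeffs, norm='ortho')
assert np.allclose(activations, reconstructed)
```

Because the type-II DCT with `norm='ortho'` is an orthonormal real-to-real transform, the round trip is lossless, which is what makes it safe to filter in the frequency domain and transform back.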

Highlights

- Language exhibits structure at multiple levels, ranging from morphology at the subword level [1], word meaning at the lexical level [2], coherence and other discourse properties at the clause or sentence level [3, 4, 5], to topical and narrative structures for entire documents [6, 7]
- We compare the probing performance of the vanilla BERT model with the BERT model trained with our prism layer, shown in Table 1
- The BERT model with the prism layer performs considerably better than BERT on topic (+18.8%) and dialog speech act (+6.9%) classification while maintaining high accuracy on part of speech tagging (-1.5%). These results demonstrate that the prism layer has enabled BERT to produce more general-purpose representations that capture phenomena across scales
- The multiscale representations produced by the prism layer are used by the model to perform the masked language modeling (MLM) objective. Since these representations contain information at different scales, this provides an inductive bias for the model to rely on both long-range and short-range information when performing the MLM task
- We show how to create multiscale representations by training with a prism layer, which forces different neurons to capture information about different scales
- We show that training with a prism layer increases the model’s sensitivity to long-range context, as measured by a masked language modeling task
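The prism-layer idea of forcing different neurons to capture different scales can be sketched as band-pass filtering groups of hidden dimensions along the sequence axis via the DCT. This is a minimal illustration only: the number of bands, the equal-width band boundaries, and the hard zeroing filter are assumptions for the sketch, not the paper's exact configuration.

```python
import numpy as np
from scipy.fft import dct, idct

def prism_layer(hidden, n_bands=3):
    """Band-pass filter groups of hidden dimensions at different scales.

    hidden: array of shape (seq_len, d_model) -- token representations.
    Each contiguous group of dimensions is assigned one frequency band:
    low frequencies ~ document scale, high frequencies ~ token scale.
    """
    seq_len, d_model = hidden.shape
    out = np.empty_like(hidden)
    band_edges = np.linspace(0, seq_len, n_bands + 1).astype(int)
    dim_edges = np.linspace(0, d_model, n_bands + 1).astype(int)
    for b in range(n_bands):
        dims = slice(dim_edges[b], dim_edges[b + 1])
        # Transform this group of neurons to the frequency domain ...
        coeffs = dct(hidden[:, dims], norm='ortho', axis=0)
        # ... zero out all frequencies outside this group's band ...
        mask = np.zeros(seq_len)
        mask[band_edges[b]:band_edges[b + 1]] = 1.0
        # ... and transform back, leaving only band-b variation.
        out[:, dims] = idct(coeffs * mask[:, None], norm='ortho', axis=0)
    return out

rng = np.random.default_rng(0)
h = rng.standard_normal((16, 12))
filtered = prism_layer(h)
```

With a single band the filter keeps every frequency, so `prism_layer(h, n_bands=1)` reproduces the input, which is a quick sanity check that only the masking step discards information.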

Results

- The authors compare the probing performance of the vanilla BERT model with the BERT model trained with the prism layer, shown in Table 1.
- The multiscale representations produced by the prism layer are used by the model to perform the masked language modeling (MLM) objective
- Since these representations contain information at different scales, this provides an inductive bias for the model to rely on both long-range and short-range information when performing the MLM task.
- To show this quantitatively, the authors consider an MLM problem where one hundred consecutive tokens in the middle of the input are masked.
- The model’s loss on these tokens reflects the model’s ability to rely on distant information to predict tokens without local context
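The evaluation in the last two bullets can be sketched as follows. The model's output is stubbed with random logits here (no actual language model is run), and the masked span follows the figure's description of positions 200 to 300:

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, vocab = 400, 1000
targets = rng.integers(vocab, size=seq_len)        # gold token ids
logits = rng.normal(size=(seq_len, vocab))         # stand-in for model output

# Convert logits to log-probabilities (log-softmax over the vocabulary).
log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

# 100 consecutive masked positions in the middle of the input: the model
# must use long-range context, since local context is also masked.
rows = np.arange(200, 300)
span_lp = log_probs[rows, targets[rows]].mean()
```

Averaging the log probability of the correct token over the masked span (here `span_lp`) is the quantity compared between BERT and BERT + Prism; a higher (less negative) value indicates better use of distant context.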

Conclusion

- The authors demonstrate how techniques from spectral analysis provide a principled and effective framework for separating multiscale phenomena in deep language representations.
- The authors show that training with a prism layer increases the model’s sensitivity to long-range context, as measured by a masked language modeling task.
- These results demonstrate that spectral techniques are a powerful set of tools for uncovering and modeling multiscale phenomena in deep NLP models


- Table 1: Training with a prism layer produces multiscale representations that perform comparably to or better than BERT across different tasks. Probing accuracy and standard deviation (3 trials) for different tasks on the final-layer BERT and BERT + Prism representations

Related Work

- Our work connects with several streams of research investigating multiscale structure in natural language and our models of it. Prior work has studied the extent of this structure at different scales in linguistic corpora, using tools ranging from random walk models and power spectra [50, 51] to entropy and mutual information [52]. To model this structure, researchers have conducted multiresolution analyses of text corpora by applying diffusion wavelets to term-document corpora [53], multinomial topic distributions [54], and term-term cooccurrence graphs [55]. Concerning deep learning, several works have considered the challenges of modeling different scales in distributed representations of words [8, 56] and of capturing long-term dependencies in recurrent neural networks [36, 57]. Other work conducts analytic studies of models that illuminate their scale-awareness, including the sensitivity of LSTM language models to relationships at different scales [58] and the attention patterns of Transformer models [59]. In conversation with this literature, our work provides a principled way of understanding multiscale structure in the representations of deep models, illuminating the linguistic phenomena captured at each of these scales and enabling the construction of scale-specific representations for downstream purposes.

Funding

- This work was supported in part by DARPA under agreement FA8650-19-C-7923

Data and Analysis

Data: N = 1600

- Figure: Different spectral filters extract information useful for tasks at different scales. Probing accuracy for different tasks and band-passes: a low-pass filter produces representations that yield the highest probing accuracy on topic classification, while high-passed representations have the highest probing accuracy for part-of-speech tagging; band-passing the middle frequencies is most useful for dialog speech act probing. "ORIG" refers to the performance of the original token representations. Error bars show standard deviations over three probing runs.
- Figure: Training with a prism layer significantly improves prediction of masked tokens without local context (note the log scale). Average log probability of the correct token for different indices (N = 1600). Indices between 200 and 300 are replaced with a [MASK] token in the input, requiring the model to use long-range context to generate a probability distribution for the missing token. The higher log probabilities in the masked region for the BERT + Prism model suggest the prism layer makes the model more sensitive to long-range dependencies. Shaded regions are 95% bootstrap CIs (generally too small to see without magnification).

References

- Eugene A Nida. Morphology: The descriptive analysis of words. 1949.
- D Alan Cruse. Lexical semantics. Cambridge University Press, 1986.
- Andrew Kehler. Coherence, reference, and the theory of grammar. CSLI Publications, Stanford, CA, 2002.
- Barbara J Grosz, Scott Weinstein, and Aravind K Joshi. Centering: A framework for modeling the local coherence of discourse. Computational linguistics, 21(2):203–225, 1995.
- Sandra A Thompson and William C Mann. Rhetorical structure theory: A framework for the analysis of texts. IPRA Papers in Pragmatics, 1(1):79–105, 1987.
- Marti A Hearst. Texttiling: Segmenting text into multi-paragraph subtopic passages. Computational linguistics, 23(1):33–64, 1997.
- David M Blei, Andrew Y Ng, and Michael I Jordan. Latent dirichlet allocation. Journal of machine Learning research, 3(Jan):993–1022, 2003.
- Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pages 3111–3119, 2013.
- Jeffrey Pennington, Richard Socher, and Christopher D Manning. Glove: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pages 1532–1543, 2014.
- Ryan Kiros, Yukun Zhu, Russ R Salakhutdinov, Richard Zemel, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. Skip-thought vectors. In Advances in neural information processing systems, pages 3294–3302, 2015.
- Felix Hill, Kyunghyun Cho, and Anna Korhonen. Learning distributed representations of sentences from unlabelled data. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1367–1377, San Diego, California, June 2016. Association for Computational Linguistics.
- Andrew M Dai, Christopher Olah, and Quoc V Le. Document embedding with paragraph vectors. arXiv preprint arXiv:1507.07998, 2015.
- Roberta A Sinoara, Jose Camacho-Collados, Rafael G Rossi, Roberto Navigli, and Solange O Rezende. Knowledge-enhanced document embeddings for text classification. Knowledge-Based Systems, 163:955– 971, 2019.
- Rui Lin, Shujie Liu, Muyun Yang, Mu Li, Ming Zhou, and Sheng Li. Hierarchical recurrent neural network for document modeling. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 899–907, 2015.
- Jiwei Li, Minh-Thang Luong, and Dan Jurafsky. A hierarchical neural autoencoder for paragraphs and documents. arXiv preprint arXiv:1506.01057, 2015.
- Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alex Smola, and Eduard Hovy. Hierarchical attention networks for document classification. In Proceedings of the 2016 conference of the North American chapter of the association for computational linguistics: human language technologies, pages 1480–1489, 2016.
- Kai Sheng Tai, Richard Socher, and Christopher D Manning. Improved semantic representations from tree-structured long short-term memory networks. arXiv preprint arXiv:1503.00075, 2015.
- Samuel R Bowman, Jon Gauthier, Abhinav Rastogi, Raghav Gupta, Christopher D Manning, and Christopher Potts. A fast unified model for parsing and sentence understanding. arXiv preprint arXiv:1603.06021, 2016.
- Yangfeng Ji and Jacob Eisenstein. Representation learning for text-level discourse parsing. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 13–24, 2014.
- Nils Reimers and Iryna Gurevych. Sentence-bert: Sentence embeddings using siamese bert-networks. arXiv preprint arXiv:1908.10084, 2019.
- Alexis Conneau, Germán Kruszewski, Guillaume Lample, Loïc Barrault, and Marco Baroni. What you can cram into a single vector: Probing sentence embeddings for linguistic properties. arXiv preprint arXiv:1805.01070, 2018.
- Ian Tenney, Dipanjan Das, and Ellie Pavlick. Bert rediscovers the classical nlp pipeline. arXiv preprint arXiv:1905.05950, 2019.
- Nelson F Liu, Matt Gardner, Yonatan Belinkov, Matthew Peters, and Noah A Smith. Linguistic knowledge and transferability of contextual representations. arXiv preprint arXiv:1903.08855, 2019.
- John Hewitt and Christopher D Manning. A structural probe for finding syntax in word representations. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4129–4138, 2019.
- Alan V Oppenheim. Discrete-time signal processing. Pearson Education India, 1999.
- Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
- K Ramamohan Rao and Ping Yip. Discrete cosine transform: algorithms, advantages, applications. Academic press, 2014.
- Chao Zuo, Qian Chen, and Anand Asundi. Boundary-artifact-free phase retrieval with the transport of intensity equation: fast solution with use of discrete cosine transform. Optics express, 22(8):9220–9244, 2014.
- Alan C Bovik. The essential guide to video processing. Academic Press, 2009.
- Nasir Ahmed, T Natarajan, and Kamisetty R Rao. Discrete cosine transform. IEEE Transactions on Computers, 100(1):90–93, 1974.
- Stephen Butterworth et al. On the theory of filter amplifiers. Wireless Engineer, 7(6):536–541, 1930.
- H Takahasi. On the ladder-type filter network with tchebysheff response. J. Inst. Elec. Commun. Engrs. Japan, 34(2):65–74, 1951.
- Stéphane Mallat. A wavelet tour of signal processing. Elsevier, 1999.
- Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in neural information processing systems, pages 5998–6008, 2017.
- Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. OpenAI Blog, 1(8):9, 2019.
- Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
- Matthew E Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. Deep contextualized word representations. arXiv preprint arXiv:1802.05365, 2018.
- Ian Tenney, Patrick Xia, Berlin Chen, Alex Wang, Adam Poliak, R Thomas McCoy, Najoung Kim, Benjamin Van Durme, Samuel R Bowman, Dipanjan Das, et al. What do you learn from context? probing for sentence structure in contextualized word representations. arXiv preprint arXiv:1905.06316, 2019.
- Guillaume Alain and Yoshua Bengio. Understanding intermediate layers using linear classifier probes. arXiv preprint arXiv:1610.01644, 2016.
- Allyson Ettinger, Ahmed Elgohary, and Philip Resnik. Probing for semantic evidence of composition by means of simple classification tasks. In Proceedings of the 1st Workshop on Evaluating Vector-Space Representations for NLP, pages 134–139, 2016.
- Xing Shi, Inkit Padhi, and Kevin Knight. Does string-based neural mt learn source syntax? In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1526–1534, 2016.
- Mitchell Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. Building a large annotated corpus of english: The penn treebank. 1993.
- Daniel Jurafsky, Elizabeth Shriberg, and Debra Biasca. Switchboard SWBD-DAMSL shallow-discoursefunction annotation coders manual, draft 13. Technical Report 97-02, University of Colorado, Boulder Institute of Cognitive Science, Boulder, CO, 1997.
- Elizabeth Shriberg, Rebecca Bates, Paul Taylor, Andreas Stolcke, Daniel Jurafsky, Klaus Ries, Noah Coccaro, Rachel Martin, Marie Meteer, and Carol Van Ess-Dykema. Can prosody aid the automatic classification of dialog acts in conversational speech? Language and Speech, 41(3–4):439–487, 1998.
- Andreas Stolcke, Klaus Ries, Noah Coccaro, Elizabeth Shriberg, Rebecca Bates, Daniel Jurafsky, Paul Taylor, Rachel Martin, Marie Meteer, and Carol Van Ess-Dykema. Dialogue act modeling for automatic tagging and recognition of conversational speech. Computational Linguistics, 26(3):339–371, 2000.
- Ken Lang. Newsweeder: Learning to filter netnews. In Machine Learning Proceedings 1995, pages 331–339.
- Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843, 2016.
- Trieu H Trinh, Andrew M Dai, Minh-Thang Luong, and Quoc V Le. Learning longer-term dependencies in rnns with auxiliary losses. arXiv preprint arXiv:1803.00144, 2018.
- Werner Ebeling and Alexander Neiman. Long-range correlations between letters and sentences in texts. Physica A: Statistical Mechanics and its Applications, 215(3):233–241, 1995.
- Alexey N Pavlov, Werner Ebeling, Lutz Molgedey, Amir R Ziganshin, and Vadim S Anishchenko. Scaling features of texts, images and time series. Physica A: Statistical Mechanics and its Applications, 300(12):310–324, 2001.
- Werner Ebeling and Thorsten Pöschel. Entropy and long-range correlations in literary english. EPL (Europhysics Letters), 26(4):241, 1994.
- Ronald R Coifman and Mauro Maggioni. Diffusion wavelets. Applied and Computational Harmonic Analysis, 21(1):53–94, 2006.
- Chang Wang and Sridhar Mahadevan. Multiscale analysis of document corpora based on diffusion models. In Twenty-First International Joint Conference on Artificial Intelligence, 2009.
- Vidit Jain and Jay Mahadeokar. Short-text representation using diffusion wavelets. In Proceedings of the 23rd International Conference on World Wide Web, pages 301–302, 2014.
- Aakash Sarkar and Marc Howard. Scale-dependent relationships in natural language. arXiv preprint arXiv:1912.07506, 2019.
- Sepp Hochreiter, Yoshua Bengio, Paolo Frasconi, Jürgen Schmidhuber, et al. Gradient flow in recurrent nets: the difficulty of learning long-term dependencies, 2001.
- Urvashi Khandelwal, He He, Peng Qi, and Dan Jurafsky. Sharp nearby, fuzzy far away: How neural language models use context. arXiv preprint arXiv:1805.04623, 2018.
- Kevin Clark, Urvashi Khandelwal, Omer Levy, and Christopher D Manning. What does bert look at? an analysis of bert’s attention. arXiv preprint arXiv:1906.04341, 2019.
- Jan Koutnik, Klaus Greff, Faustino Gomez, and Juergen Schmidhuber. A clockwork rnn. arXiv preprint arXiv:1402.3511, 2014.
- Salah El Hihi and Yoshua Bengio. Hierarchical recurrent neural networks for long-term dependencies. In Advances in neural information processing systems, pages 493–499, 1996.
- Junyoung Chung, Sungjin Ahn, and Yoshua Bengio. Hierarchical multiscale recurrent neural networks. arXiv preprint arXiv:1609.01704, 2016.
- Boxuan Yue, Junwei Fu, and Jun Liang. Residual recurrent neural networks for learning sequential representations. Information, 9(3):56, 2018.
- Bo Chang, Minmin Chen, Eldad Haber, and Ed H Chi. Antisymmetricrnn: A dynamical system view on recurrent neural networks. arXiv preprint arXiv:1902.09689, 2019.
- Sanjeev Arora, Yingyu Liang, and Tengyu Ma. A simple but tough-to-beat baseline for sentence embeddings. 2016.
- Y Zhang and Lai-Wan Chan. Forenet: fourier recurrent networks for time series prediction. 2000.
- Jiong Zhang, Yibo Lin, Zhao Song, and Inderjit S Dhillon. Learning long term dependencies via fourier recurrent units. arXiv preprint arXiv:1803.06585, 2018.
- Yikang Shen, Shawn Tan, Alessandro Sordoni, and Aaron Courville. Ordered neurons: Integrating tree structures into recurrent neural networks. arXiv preprint arXiv:1810.09536, 2018.
- Ossama Abdel-Hamid, Abdel-rahman Mohamed, Hui Jiang, and Gerald Penn. Applying convolutional neural networks concepts to hybrid nn-hmm model for speech recognition. In 2012 IEEE international conference on Acoustics, speech and signal processing (ICASSP), pages 4277–4280. IEEE, 2012.
- Geoffrey Hinton, Li Deng, Dong Yu, George E Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara N Sainath, et al. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal processing magazine, 29(6):82–97, 2012.
- Henri J Nussbaumer. The fast fourier transform. In Fast Fourier Transform and Convolution Algorithms, pages 80–111.
- Michael Mathieu, Mikael Henaff, and Yann LeCun. Fast training of convolutional networks through ffts. arXiv preprint arXiv:1312.5851, 2013.
- Salman H Khan, Munawar Hayat, and Fatih Porikli. Scene categorization with spectral features. In Proceedings of the IEEE International Conference on Computer Vision, pages 5638–5648, 2017.
- Yu Cheng, Felix X Yu, Rogerio S Feris, Sanjiv Kumar, Alok Choudhary, and Shi-Fu Chang. An exploration of parameter redundancy in deep networks with circulant projections. In Proceedings of the IEEE International Conference on Computer Vision, pages 2857–2865, 2015.
- Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4401–4410, 2019.
- Oren Rippel, Jasper Snoek, and Ryan P Adams. Spectral representations for convolutional neural networks. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 2449–2457. Curran Associates, Inc., 2015.
- Salman H Khan, Munawar Hayat, and Fatih Porikli. Regularization of deep neural networks with spectral dropout. Neural Networks, 110:82–90, 2019.
- Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144, 2016.
- Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, et al. Transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771, 2019.
- Wen-Hsiung Chen, CH Smith, and SC Fralick. A fast computational algorithm for the discrete cosine transform. IEEE Transactions on communications, 25(9):1004–1009, 1977.
