Emergent linguistic structure in artificial neural networks trained by self-supervision.

Proc. Natl. Acad. Sci. U.S.A. 117, no. 48 (2020): 30046–30054


Abstract

This paper explores the knowledge of linguistic structure learned by large artificial neural networks, trained via self-supervision, whereby the model simply tries to predict a masked word in a given context. Human language communication is via sequences of words, but language understanding requires constructing rich hierarchical structures …

Introduction
  • This paper explores the knowledge of linguistic structure learned by large artificial neural networks, trained via self-supervision, whereby the model tries to predict a masked word in a given context.
  • The authors show that a linear transformation of learned embeddings in these models captures parse tree distances to a surprising degree, allowing approximate reconstruction of the sentence tree structures normally assumed by linguists (a sketch of this idea follows this list).
  • These results help explain why these models have brought such large improvements across many language-understanding tasks.
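
To make the linear-transformation claim concrete, the sketch below shows a distance probe in the spirit of the structural probe cited as ref. 42: a matrix B is trained so that squared distances between linearly transformed word vectors approximate distances in the parse tree. This is a minimal illustration assuming PyTorch; the dimensions, learning rate, and tensor names are placeholders rather than the authors' exact configuration.

    # Minimal distance-probe sketch (assumes PyTorch; all names are illustrative).
    import torch

    class StructuralProbe(torch.nn.Module):
        def __init__(self, dim, rank):
            super().__init__()
            # B maps word vectors into a space where squared L2 distance should
            # approximate the number of edges between the words in the parse tree.
            self.B = torch.nn.Parameter(torch.randn(dim, rank) * 0.01)

        def forward(self, embeddings):
            # embeddings: (seq_len, dim) contextual word vectors from one model layer
            transformed = embeddings @ self.B                      # (seq_len, rank)
            diffs = transformed.unsqueeze(1) - transformed.unsqueeze(0)
            return (diffs ** 2).sum(-1)                            # (seq_len, seq_len) predicted distances

    def probe_loss(pred_dist, gold_dist, seq_len):
        # L1 gap between predicted and gold tree distances, averaged over word pairs.
        return torch.abs(pred_dist - gold_dist).sum() / (seq_len ** 2)

    # Illustrative training step on one sentence:
    # probe = StructuralProbe(dim=768, rank=64)
    # opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
    # loss = probe_loss(probe(layer_vectors), gold_tree_distances, layer_vectors.shape[0])
    # loss.backward(); opt.step(); opt.zero_grad()
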
Highlights
  • This paper explores the knowledge of linguistic structure learned by large artificial neural networks, trained via self-supervision, whereby the model tries to predict a masked word in a given context.
  • Since many of the attention heads of Bidirectional Encoder Representations from Transformers (BERT) encode individual syntactic relations, it is natural to wonder whether the representations, that is, the vectors that represent the words in each layer of BERT, embed syntax trees.
  • A fundamental question when investigating linguistic structure in neural networks is whether the internal representations of networks are reconcilable with the tree structures of sentences.
  • We find that dependency tree structures are embedded in BERT representations to a striking extent (a sketch of how a tree can be decoded from predicted distances follows this list).
  • We have demonstrated the surprising extent to which BERT, a natural language processing (NLP) representation learner trained via self-supervision on word prediction tasks, implicitly learns to recover the rich latent structure of human language.
  • This result has been demonstrated in attention: how BERT looks at sentential context when encoding a word.
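
One way to read “embed syntax trees” operationally, as noted in the highlight above: once a probe predicts a distance for every pair of words, an undirected tree can be decoded by taking a minimum spanning tree over those distances and scored by how many gold parse edges it recovers. The sketch below assumes NumPy and SciPy; the arrays and function names are illustrative.

    # Decode an undirected tree from predicted pairwise distances and score it
    # against the gold parse (assumes SciPy; distances are expected to be > 0
    # off the diagonal, since zero entries are treated as missing edges).
    import numpy as np
    from scipy.sparse.csgraph import minimum_spanning_tree

    def decode_tree(pred_dist):
        """pred_dist: (n, n) symmetric array of predicted parse-tree distances."""
        mst = minimum_spanning_tree(pred_dist)        # keeps the n-1 lowest-cost edges forming a tree
        rows, cols = mst.nonzero()
        return {tuple(sorted((int(i), int(j)))) for i, j in zip(rows, cols)}

    def undirected_attachment_score(pred_edges, gold_edges):
        """Fraction of gold (undirected) dependency edges recovered by the predicted tree."""
        gold = {tuple(sorted(e)) for e in gold_edges}
        return len(pred_edges & gold) / len(gold)
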
Methods
  • To interpret what an attention head in BERT is computing, the authors examine the most-attended-to word at each position.
  • The authors evaluate whether the attention head is expressing a particular linguistic relationship by computing how often the most-attended-to word is in that relationship with the input word (a sketch of this evaluation follows this list).
  • Probing work has shown that BERT and similar systems encode in each word representation information about the semantic role of each word in the sentence, like agent and patient [47, 48].
  • There is evidence that finer-grained attributes, like whether a doer was aware it did the action, seem to be encoded [48].
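
The attention-head evaluation referenced above can be sketched as a simple precision computation: for each word that participates in the relation of interest, check whether its most-attended-to word is exactly the word the relation points to. The code below assumes NumPy and a single head's attention matrix; array names are illustrative, not taken from the authors' released code.

    # Precision of one attention head at predicting a given linguistic relation
    # (assumes NumPy; the attention matrix and gold array are illustrative).
    import numpy as np

    def head_precision(attention, gold_targets):
        """
        attention:    (seq_len, seq_len) weights for one head; row i is word i's
                      attention distribution over all positions
        gold_targets: length-seq_len integer array; gold_targets[i] is the position of
                      the word related to word i under the relation, or -1 if none
        """
        most_attended = attention.argmax(axis=-1)     # most-attended-to word per position
        eligible = gold_targets >= 0
        if not eligible.any():
            return 0.0
        return float((most_attended[eligible] == gold_targets[eligible]).mean())
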
Results
  • Results for dependency syntax are shown in Table 1.
  • Predating the attention and structural probes, early work by Shi et al. [45] introduced the probing task of predicting the label of the smallest phrasal constituent above each word in the tree using its representation.
  • This method has been extended [46, 47] to predicting the label of the dependency edge governing a word, the label of the edge governing the word’s parent, and so on.
  • It has been shown that the presence of individual dependency edges can be predicted from probes on pairs of word representations [47, 48] (a sketch of such an edge probe follows this list).
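
The label- and edge-prediction probes summarized above share one recipe: freeze the word representations and train a small supervised classifier on top of them. Below is a minimal sketch of an edge probe over pairs of word vectors, assuming scikit-learn; the featurization and names are illustrative rather than the specific probes of refs. 45–48.

    # Minimal edge-probe sketch: a linear classifier over frozen representation pairs
    # predicts whether a dependency edge connects the two words (assumes scikit-learn).
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def pair_features(vec_i, vec_j):
        # One simple featurization: concatenate the two frozen word vectors.
        return np.concatenate([vec_i, vec_j])

    def train_edge_probe(X, y):
        """X: (num_pairs, 2 * dim) pair features; y[i] = 1 if a dependency edge holds, else 0."""
        probe = LogisticRegression(max_iter=1000)
        probe.fit(X, y)
        return probe
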
Conclusion
  • The authors have demonstrated the surprising extent to which BERT, an NLP representation learner trained via self-supervision on word prediction tasks, implicitly learns to recover the rich latent structure of human language.
  • The authors found a similar result through structural probes on internal vector representations, showing that the hierarchical tree structures of language emerge in BERT vector space.
  • That such rich information emerges through self-supervision is surprising and exciting, with intriguing implications for both NLP research and the logical problem of language acquisition.
Tables
  • Table 1: Well-performing BERT attention heads on WSJ SD dependency parsing by dependency type
  • Table 2: Precisions (%) of systems selecting a correct antecedent for a coreferent mention in the CoNLL-2012 data by mention type
  • Table 3: Results of structural probes on the WSJ SD test set (baselines in the top half, models hypothesized to encode syntax below)
Funding
  • K.C. was supported by a Google Fellowship.
  • J.H. and C.D.M. were partly funded by a gift from Tencent Corp.
References
  • P. K. Kuhl, Early language acquisition: Cracking the speech code. Nat. Rev. Neurosci. 5, 831–843 (2004).
  • O. Rambow, “The simple truth about dependency and phrase structure representations: An opinion piece” in Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, R. Kaplan, J. Burstein, M. Harper, G. Penn, Eds. (Association for Computational Linguistics, Stroudsburg, PA, 2010), pp. 337–340.
  • Z. Pizlo, Perception viewed as an inverse problem. Vis. Res. 41, 3145–3161 (2001).
  • M. P. Marcus, B. Santorini, M. A. Marcinkiewicz, Building a large annotated corpus of English: The Penn treebank. Comput. Ling. 19, 313–330 (1993).
  • 6. M. Collins, Head-driven statistical models for natural language parsing. Comput. Ling. 29, 589–637 (2003).
  • 7. D. Chen, C. D. Manning, “A fast and accurate dependency parser using neural networks” in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, A. Moschitti, B. Pang, W. Daelemans, Eds. (Association for Computational Linguistics, Stroudsburg, PA, 2014), pp. 740–750.
  • 8. T. Dozat, C. D. Manning, “Deep biaffine attention for neural dependency parsing.” https://openreview.net/pdf?id=Hk95PK9le. Accessed 21 May 2020.
  • 9. J. Schmidhuber, “An on-line algorithm for dynamic reinforcement learning and planning in reactive environments” in Proceedings of the International Joint Conference on Neural Networks (IJCNN) (Institute of Electrical and Electronic Engineers, Piscataway, NJ, 1990), pp. 253–258.
  • 10. D. Lieb, A. Lookingbill, S. Thrun, “Adaptive road following using self-supervised learning and reverse optical flow” in Proceedings of Robotics: Science and Systems (RSS), S. Thrun, G. S. Sukhatme, S. Schaal, Eds. (MIT Press, Cambridge, MA, 2005), pp. 273–280.
  • 11. W. L. Taylor, Cloze procedure: A new tool for measuring readability. Journal. Q. 30, 415–433 (1953).
  • 12. T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, J. Dean, “Distributed representations of words and phrases and their compositionality” in Advances in Neural Information Processing Systems 26, C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, K. Q. Weinberger, Eds. (Curran Associates, Red Hook, NY, 2013), pp. 3111–3119.
  • 13. J. Pennington, R. Socher, C. Manning, “Glove: Global vectors for word representation” in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, A. Moschitti, B. Pang, W. Daelemans, Eds. (Association for Computational Linguistics, Stroudsburg, PA, 2014), pp. 1532–1543.
  • 14. Y. Bengio, Learning deep architectures for AI. Found. Trends Mach. Learn. 2, 1–127 (2009).
  • 15. R. C. Berwick, P. Pietroski, B. Yankama, N. Chomsky, Poverty of the stimulus revisited. Cognit. Sci. 35, 1207–1242 (2011).
  • 16. T. L. Griffiths, Rethinking language: How probabilities shape the words we use. Proc. Natl. Acad. Sci. U.S.A. 108, 3825–3826 (2011).
  • 17. M. Peters et al., “Deep contextualized word representations” in Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, M. Walker, H. Ji, A. Stent, Eds. (Association for Computational Linguistics, Stroudsburg, PA, 2018), pp. 2227–2237.
  • 18. J. Devlin, M. W. Chang, K. Lee, K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding” in Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, J. Burstein, C. Doran, T. Solorio, Eds. (Association for Computational Linguistics, Stroudsburg, PA, 2019), pp. 4171–4186.
  • 19. N. Chomsky, Knowledge of Language: Its Nature, Origin, and Use (Praeger, New York, NY, 1986).
  • 20. J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT. https://github.com/google-research/bert. Accessed 14 May 2020.
  • 21. A. Vaswani et al., “Attention is all you need” in Advances in Neural Information Processing Systems 30, I. Guyon et al., Eds. (Curran Associates, Red Hook, NY, 2017), pp. 5998–6008.
  • 22. J. Chung, C. Gulcehre, K. Cho, Y. Bengio, Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv:1412.3555 (11 December 2014).
  • 23. D. Bahdanau, K. Cho, Y. Bengio, Neural machine translation by jointly learning to align and translate. arXiv:1409.0473 (16 January 2019).
  • 24. T. Linzen, E. Dupoux, Y. Goldberg, Assessing the ability of LSTMs to learn syntax-sensitive dependencies. Trans. Assoc. Comput. Linguist. 4, 521–535 (2016).
  • 25. K. Gulordava, P. Bojanowski, E. Grave, T. Linzen, M. Baroni, “Colorless green recurrent networks dream hierarchically” in Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, M. Walker, H. Ji, A. Stent, Eds. (Association for Computational Linguistics, Stroudsburg, PA, 2018), pp. 1195–1205.
  • 26. R. Marvin, T. Linzen, “Targeted syntactic evaluation of language models” in Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, E. Riloff, D. Chiang, J. Hockenmaier, J. Tsujii, Eds. (Association for Computational Linguistics, Stroudsburg, PA, 2018), pp. 1192–1202.
  • 27. A. Kuncoro et al., “LSTMs can learn syntax-sensitive dependencies well, but modeling structure makes them better” in Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, I. Gurevych, Y. Miyao, Eds. (Association for Computational Linguistics, Stroudsburg, PA, 2018), pp. 1426–1436.
  • 28. Y. Goldberg, Assessing BERT’s syntactic abilities. arXiv:1901.05287 (16 January 2019).
  • 29. K. Bock, C. A. Miller, Broken agreement. Cognit. Psychol. 23, 45–93 (1991).
  • 30. C. Phillips, M. W. Wagers, E. F. Lau, “Grammatical illusions and selective fallibility in real-time language comprehension” in Experiments at the Interfaces, Syntax and Semantics, J. Runner, Ed. (Emerald Group Publishing Limited, 2011), vol. 37, pp. 147–180.
  • 31. T. Luong, H. Q. Pham, C. D. Manning, “Effective approaches to attention-based neural machine translation” in Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, L. Marquez, C. Callison-Burch, J. Su, Eds. (Association for Computational Linguistics, Stroudsburg, PA, 2015), pp. 1412–1421.
  • 32. S. Sharma, R. Kiros, R. Salakhutdinov, Action recognition using visual attention. arxiv:1511.04119 (14 February 2016).
  • 33. K. Xu et al., “Show, attend and tell: Neural image caption generation with visual attention” in Proceedings of the International Conference on Machine Learning, F. Bach, D. Blei, Eds. (Proceedings of Machine Learning Research, Brookline, MA, 2015), pp. 2048–2057.
  • 34. J. K. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, Y. Bengio, “Attention-based models for speech recognition” in Advances in Neural Information Processing Systems 28, C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, R. Garnett, Eds. (Curran Associates, Red Hook, NY, 2015), pp. 577–585.
  • 35. M. P. Marcus, B. Santorini, M. A. Marcinkiewicz, A. Taylor, Treebank-3. Linguistic Data Consortium LDC99T42. https://catalog.ldc.upenn.edu/LDC99T42. Accessed 14 May 2020.
  • 36. M. C. de Marneffe, B. MacCartney, C. D. Manning, “Generating typed dependency parses from phrase structure parses” in LREC International Conference on Language Resources and Evaluation, N. Calzolari et al., Eds. (European Language Resources Association, Paris, France, 2006), pp. 449–454.
  • 37. S. Pradhan, A. Moschitti, N. Xue, O. Uryupina, Y. Zhang, “CoNLL-2012 shared task: Modeling multilingual unrestricted coreference in Ontonotes” in Joint Conference on EMNLP and CoNLL – Shared Task, S. Pradhan, A. Moschitti, N. Xue, Eds. (Association for Computational Linguistics, Stroudsburg, PA, 2012), pp. 1–40.
  • 38. H. Lee et al., “Stanford’s multi-pass sieve coreference resolution system at the CoNLL2011 shared task” in Proceedings of the Conference on Computational Natural Language Learning: Shared Task, S. Pradhan, Ed. (Association for Computational Linguistics, Stroudsburg, PA, 2011), pp. 28–34.
  • 39. A. Eriguchi, K. Hashimoto, Y. Tsuruoka, “Tree-to-sequence attentional neural machine translation” in Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, K. Erk, N. A. Smith, Eds. (Association for Computational Linguistics, Stroudsburg, PA, 2016), pp. 823–833.
  • 40. K. Chen, R. Wang, M. Utiyama, E. Sumita, T. Zhao, “Syntax-directed attention for neural machine translation” in Proceedings of the AAAI Conference on Artificial Intelligence (AAAI Press, Palo Alto, CA, 2018), pp. 4792–4799.
  • 41. E. Strubell, P. Verga, D. Andor, D. I. Weiss, A. McCallum, “Linguistically-informed self-attention for semantic role labeling” in Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, E. Riloff, D. Chiang, J. Hockenmaier, J. Tsujii, Eds. (Association for Computational Linguistics, Stroudsburg, PA, 2018), pp. 5027–5038.
  • 42. J. Hewitt, C. D. Manning, “A structural probe for finding syntax in word representations” in Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, J. Burstein, C. Doran, T. Solorio, Eds. (Association for Computational Linguistics, Stroudsburg, PA, 2019), pp. 4129–4138.
  • 43. E. Reif et al., “Visualizing and measuring the geometry of BERT” in Advances in Neural Information Processing Systems 32, H. Wallach et al., Eds. (Curran Associates, Red Hook, NY, 2019), pp. 8594–8603.
  • 44. T. K. Landauer, S. T. Dumais, A solution to Plato’s problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge. Psychol. Rev. 104, 211–240 (1997).
  • 45. X. Shi, I. Padhi, K. Knight, “Does string-based neural MT learn source syntax?” in Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, J. Su, K. Duh, X. Carreras, Eds. (Association for Computational Linguistics, Stroudsburg, PA, 2016), pp. 1526–1534.
  • 46. T. Blevins, O. Levy, L. Zettlemoyer, “Deep RNNs encode soft hierarchical syntax” in Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, I. Gurevych, Y. Miyao, Eds. (Association for Computational Linguistics, Stroudsburg, PA, 2018), pp. 14–19.
  • 47. N. F. Liu, M. Gardner, Y. Belinkov, M. E. Peters, N. A. Smith, “Linguistic knowledge and transferability of contextual representations” in Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, J. Burstein, C. Doran, T. Solorio, Eds. (Association for Computational Linguistics, Stroudsburg, PA, 2019), pp. 1073–1094.
  • 48. I. Tenney et al., “What do you learn from context? Probing for sentence structure in contextualized word representations.” https://openreview.net/pdf?id=SJzSgnRcKX. Accessed 21 May 2020.
  • 49. M. Peters, M. Neumann, L. Zettlemoyer, W. T. Yih, “Dissecting contextual word embeddings: Architecture and representation” in Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, E. Riloff, D. Chiang, J. Hockenmaier, J. Tsujii, Eds. (Association for Computational Linguistics, Stroudsburg, PA, 2018), pp. 1499–1509.
  • 50. N. Saphra, A. Lopez, “Understanding learning dynamics of language models with SVCCA” in Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, J. Burstein, C. Doran, T. Solorio, Eds. (Association for Computational Linguistics, Stroudsburg, PA, 2019), pp. 3257–3267.
  • 51. K. W. Zhang, S. R. Bowman, “Language modeling teaches you more syntax than translation does: Lessons learned through auxiliary task analysis” in Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, E. Riloff, D. Chiang, J. Hockenmaier, J. Tsujii, Eds. (Association for Computational Linguistics, Stroudsburg, PA, 2018), pp. 359–361.
  • 52. A. Kohn, “What’s in an embedding? Analyzing word embeddings through multilingual evaluation” in Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, L. Marquez, C. Callison-Burch, J. Su, Eds. (Association for Computational Linguistics, Stroudsburg, PA, 2015), pp. 2067–2073.
  • 53. A. Conneau, G. Kruszewski, G. Lample, L. Barrault, M. Baroni, “What you can cram into a single $&!#* vector: Probing sentence embeddings for linguistic properties” in Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, I. Gurevych, Y. Miyao, Eds. (Association for Computational Linguistics, Stroudsburg, PA, 2018), pp. 2126–2136.
  • 54. Y. Belinkov, N. Durrani, F. Dalvi, H. Sajjad, J. Glass, “What do neural machine translation models learn about morphology?” in Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, R. Barzilay, M.-Y. Kan, Eds. (Association for Computational Linguistics, Stroudsburg, PA, 2017), pp. 861–872.
  • 55. K. Clark, BERT attention analysis. https://github.com/clarkkev/attention-analysis. Deposited 27 June 2019.
  • 56. J. Hewitt, Structural probes. https://github.com/john-hewitt/structural-probes. Deposited 27 May 2019.
  • 57. S. Oepen et al., “SemEval 2014 task 8: Broad-coverage semantic dependency parsing” in Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014), P. Nakov, T. Zesch, Eds. (Association for Computational Linguistics, Stroudsburg, PA, 2014), pp. 63–72.
  • 58. D. Reisinger et al., Semantic proto-roles. Trans. Assoc. Comput. Linguist. 3, 475–488 (2015).
  • 59. K. Clark, U. Khandelwal, O. Levy, C. D. Manning, “What does BERT look at? An analysis of BERT’s attention” in Proceedings of the Second BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP, T. Linzen, G. Chrupała, Y. Belinkov, D. Hupkes, Eds. (Association for Computational Linguistics, Stroudsburg PA, 2019), pp. 276–286.
Authors
Kevin Clark
John Hewitt
Urvashi Khandelwal