Word Frequency Does Not Predict Grammatical Knowledge in Language Models

EMNLP 2020


Abstract

Neural language models learn, to varying degrees of accuracy, the grammatical properties of natural languages. In this work, we investigate whether there are systematic sources of variation in the language models’ accuracy. Focusing on subject-verb agreement and reflexive anaphora, we find that certain nouns are systematically understood better than others, an effect that holds across grammatical tasks, and that corpus frequency explains almost none of this variation.

Introduction
  • Neural language models (Howard and Ruder, 2018; Devlin et al, 2019; Dai et al, 2019; Yang et al, 2019; Radford et al, 2019) have achieved success in both text prediction and downstream tasks such as question-answering, text classification, and natural language inference.
  • Just as in human psycholinguistic tasks, previous work on neural LMs has observed variability in grammatical judgments between different sentences; not all violations of a grammatical constraint are judged to be bad.
  • It is not clear whether there are systematic sources of variation in these judgments, and if so, what those sources are
Highlights
  • Neural language models (Howard and Ruder, 2018; Devlin et al, 2019; Dai et al, 2019; Yang et al, 2019; Radford et al, 2019) have achieved success in both text prediction and downstream tasks such as question-answering, text classification, and natural language inference
  • We focus on the variation in grammatical knowledge that potentially exists within a neural language model
  • We have investigated the sources of variation in neural language models’ grammatical judgments
  • We found that there are systematic differences between nouns: when a language model exhibits knowledge of a noun’s grammatical properties in one task, it is more likely to do so in other tasks
  • The study found two latent dimensions of variation between nouns: one corresponding to how well the models understood a noun’s behavior with reflexive pronouns, and the other corresponding to subject-verb agreement
  • Subsequent analyses demonstrate a pair of empirical phenomena: the models learn the agreement properties of a novel noun from just a few examples, and the data supporting this few-shot learning appears to be densely distributed
Methods
  • The authors describe the process of calculating a target noun’s task performance score in more detail.
  • For each pair of task template and target noun, 500 sentences were randomly sampled by choosing lexical items from the appropriate word lists (the sizes of these word sets are shown in Table 2).
  • For each sampled sentence, 2×2 or 2×2×2 versions were generated, varying the grammaticality of the sentence and the plurality of the target noun and of any distractor nouns.
  • For the SVA Simple task, 2×2 versions are generated for every sampled sentence; a generation sketch follows below.
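To make the variant-generation step concrete, here is a minimal sketch of how the 2×2 versions of an SVA Simple sentence could be produced; the word lists and the singular/plural forms are hypothetical stand-ins, not the paper’s actual resources.

```python
import itertools
import random

# Hypothetical word lists standing in for the paper's actual resources.
TARGET_NOUNS = {"zombie": "zombies", "senator": "senators"}  # singular -> plural
VERBS = {"walks": "walk", "laughs": "laugh"}                  # singular -> plural

def sva_simple_versions(noun_sg, noun_pl, verb_sg, verb_pl):
    """Generate the 2x2 versions of an SVA Simple sentence, crossing
    target-noun plurality with verb plurality; a version is grammatical
    exactly when noun and verb number match."""
    versions = []
    for noun_plural, verb_plural in itertools.product([False, True], repeat=2):
        noun = noun_pl if noun_plural else noun_sg
        verb = verb_pl if verb_plural else verb_sg
        grammatical = noun_plural == verb_plural
        versions.append((grammatical, f"The {noun} {verb}."))
    return versions

# Fill the "The TargetNoun Verb." template with a sampled verb.
verb_sg = random.choice(list(VERBS))
for grammatical, sentence in sva_simple_versions("zombie", "zombies",
                                                 verb_sg, VERBS[verb_sg]):
    print(grammatical, sentence)
```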
Results
  • 4.1 Noun performance is correlated across tasks: the authors first examine how each noun’s performance varies across the grammatical tasks.
  • For each noun and each task, the authors measure the noun’s average performance on that task, as described above.
  • This gives 10 features per noun, corresponding to the 10 grammatical tasks.
  • Figure 1 shows the pairwise comparisons between performance on the different tasks for Transformer-XL; a sketch of this analysis follows below.
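A minimal sketch of this analysis, assuming a hypothetical noun-by-task score matrix in place of the scores produced by the Methods procedure: pairwise task correlations as in Figure 1, and PCA over the 10 task features as in Tables 5–8.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical stand-in: one row per noun, one column per grammatical task.
rng = np.random.default_rng(0)
scores = rng.normal(size=(300, 10))

# Pairwise correlations between task performances (cf. Figure 1).
task_corr = np.corrcoef(scores, rowvar=False)  # 10 x 10 matrix

# PCA over the 10 task features (cf. Table 5).
pca = PCA(n_components=10).fit(scores)
cum_var = np.cumsum(pca.explained_variance_ratio_)
print("Cumulative variance explained by top PCs:", np.round(cum_var, 3))

# The top contributors to PC k are the largest-magnitude entries of its
# eigenvector, pca.components_[k] (cf. Tables 6-8).
print("PC1 (absolute) loadings:", np.round(np.abs(pca.components_[0]), 3))
```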
Conclusion
  • The authors have investigated the sources of variation in neural language models’ grammatical judgments.
  • The authors found that there are systematic differences between nouns: when a language model exhibits knowledge of a noun’s grammatical properties in one task, it is more likely to do so in other tasks.
  • The study found two latent dimensions of variation between nouns: one corresponding to how well the models understood a noun’s behavior with reflexive pronouns, and the other corresponding to subject-verb agreement.
  • The models learn the agreement properties of a novel noun from just a few examples, and the data supporting this few-shot learning appears to be densely distributed: most types of syntactic and semantic data examined lead to improvements on the reflexive-pronoun or subject-verb agreement tasks (a fine-tuning sketch follows below).
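As a rough illustration of such a few-shot setup (not the authors’ actual training code), the sketch below fine-tunes GPT-2 on a handful of sentences containing the invented noun “blicket”; the sentences and hyperparameters are hypothetical.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.train()

# Hypothetical few-shot data exposing the novel noun's number behavior.
few_shot = [
    "The blicket walks to the store.",
    "The blicket sees itself in the mirror.",
    "Two blickets walk to the store.",
]

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
for _ in range(3):  # a few passes over a few examples
    for sentence in few_shot:
        batch = tokenizer(sentence, return_tensors="pt")
        # Standard LM objective: the labels are the input tokens themselves.
        loss = model(**batch, labels=batch["input_ids"]).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

# After fine-tuning, re-run the agreement and reflexive tasks with
# "blicket" as the target noun to measure what was learned.
```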
Tables
  • Table 1: Templates used for sentence generation. TargetNoun indicates the position of the target noun whose performance score is being calculated
  • Table 2: Size of word sets for each model
  • Table 3: The three types of training data used for syntactic fine-tuning
  • Table 4: Types of training data used for semantic fine-tuning
  • Table 5: Cumulative proportion of variance explained by the top (of 10) PCs for each model, as detailed in Section 4.1
  • Table 6: Top contributors (tasks) to the top few (of 10) PCs for Transformer-XL’s noun performance, as detailed in Section 4.1. Cells contain the task name followed by its (absolute) component value in the eigenvector
  • Table 7: Top contributors (tasks) to the top few (of 10) PCs for BERT’s noun performance, as detailed in Section 4.1. Cells contain the task name followed by its (absolute) component value in the eigenvector
  • Table 8: Top contributors (tasks) to the top few (of 10) PCs for GPT-2’s noun performance, as detailed in Section 4.1. Cells contain the task name followed by its (absolute) component value in the eigenvector
Related work
  • A number of other studies have investigated the linguistic representations of neural models, both language models specifically and networks trained using other objectives. Linzen et al (2016); Gulordava et al (2018); Kuncoro et al (2018) probe the ability of LSTMs to learn hierarchical structures. Warstadt et al (2019b) introduces a large-scale corpus of grammatical acceptability judgments, trains RNNs to predict these judgments, and concludes that the models outperform unsupervised baselines but fall far short of human performance. Lepori et al (2020) finds that tree-based RNNs outperform sequential RNNs on number prediction tasks, but that fine-tuning on an artificially generated augmentation set can bring the models closer to parity.

    Other work has focused on probing whether neural language models have acquired adequate representations of specific linguistic phenomena. Marvin and Linzen (2018) and Goldberg (2019) use a minimal pair methodology to assess the grammatical knowledge of RNNs and BERT, looking at subject-verb number agreement, reflexive anaphora, and negative polarity items. Wilcox et al (2018) examines whether RNN language models exhibit wh-licensing interactions on surprisal associated with gaps, concluding they can represent long-distance filler-gap dependencies and learn certain island constraints. Futrell et al (2019) studies whether neural language models show evidence for incremental syntactic state representations using psycholinguistic methodology. Warstadt et al (2019a) studies BERT’s knowledge of NPIs, focusing on differences between tasks: boolean classification (e.g. Linzen et al 2016 and Warstadt et al 2019b), minimal pair comparisons (e.g. Marvin and Linzen 2018 and Wilcox et al 2019), and probing tasks (e.g. Giulianelli et al 2018).
Findings
  • Frequency explains no more than 0.1% of the variation in performance
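A variance-explained figure of this kind can be obtained by regressing per-noun performance on log corpus frequency; the sketch below uses hypothetical arrays in place of the paper’s data.

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical per-noun data in place of the paper's measurements.
rng = np.random.default_rng(1)
log_freq = rng.uniform(0, 9, size=300)   # ~4 orders of magnitude of frequency
performance = rng.normal(size=300)       # per-noun task performance scores

# r**2 from a simple regression of performance on log frequency is the
# proportion of performance variance that frequency explains.
r, p = pearsonr(log_freq, performance)
print(f"Variance explained by frequency: {r**2:.2%} (p = {p:.3f})")
```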
Study subjects and analysis

minimal pairs: 500
For example, substituting the target noun “zombie” in the SVA Simple template results in: (4) The zombie Verb. Given each of these partially specified templates, 500 minimal pairs are randomly sampled by filling in the remaining lexical items (a reflexive template, for comparison: The TargetNoun the NonGenderedNoun liked PastTransVerb himself/themselves). Finally, the model’s grammatical judgments on the 500 minimal pairs are computed (by taking the difference in scores between the grammatical and ungrammatical variants) and averaged, resulting in a task performance score for the noun.

minimal pairs: 2 or 4
Given the scores for a sentence’s variants, an overall score is computed for the sentence, which captures how much the model prefers the grammatical variants to the ungrammatical variants. For each sampled sentence S, there are either 2 or 4 minimal pairs among its variants. In Example 5, a. and b. form a minimal pair, and c. and d. form a minimal pair. With two minimal pairs, the sentence score is

Score(S) = ½ · ((Score_string(s_a) − Score_string(s_b)) + (Score_string(s_c) − Score_string(s_d)))

The formula when there are four minimal pairs is similar.
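To illustrate the scoring pipeline, here is a sketch that assigns each variant a string score (its summed token log-probability under GPT-2, one plausible choice of scoring function) and combines two minimal pairs into a sentence score per the formula above; the variant sentences are illustrative.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def score_string(sentence: str) -> float:
    """Summed log-probability of the sentence's tokens under the model."""
    ids = tokenizer(sentence, return_tensors="pt")["input_ids"]
    with torch.no_grad():
        logits = model(ids).logits
    # Log-probability assigned to each actual next token.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    token_lp = log_probs.gather(1, ids[0, 1:].unsqueeze(1))
    return token_lp.sum().item()

# Two minimal pairs (grammatical, ungrammatical) for one sampled sentence S.
pairs = [
    ("The zombie walks.", "The zombie walk."),
    ("The zombies walk.", "The zombies walks."),
]
score_S = sum(score_string(g) - score_string(u) for g, u in pairs) / len(pairs)
print(f"Score(S) = {score_S:.3f}")
```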

References
  • Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. arXiv preprint arXiv:2005.14165.
  • Lucas Champollion. 2015.
  • Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc Le, and Ruslan Salakhutdinov. 2019. Transformer-XL: Attentive language models beyond a fixed-length context. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2978–2988, Florence, Italy. Association for Computational Linguistics.
  • Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.
  • C. Fellbaum, editor. 1998. WordNet: An Electronic Lexical Database. MIT Press.
  • Robert M. French. 1999. Catastrophic forgetting in connectionist networks. Trends in Cognitive Sciences, 3(4):128–135.
  • Richard Futrell, Ethan Wilcox, Takashi Morita, Peng Qian, Miguel Ballesteros, and Roger Levy. 2019. Neural language models as psycholinguistic subjects: Representations of syntactic state. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 32–42, Minneapolis, Minnesota. Association for Computational Linguistics.
  • Mario Giulianelli, Jack Harding, Florian Mohnert, Dieuwke Hupkes, and Willem Zuidema. 2018. Under the hood: Using diagnostic classifiers to investigate and improve how language models track agreement information. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 240–248, Brussels, Belgium. Association for Computational Linguistics.
  • Aaron Gokaslan and Vanya Cohen. 2019. OpenWebText corpus. http://Skylion007.github.io/OpenWebTextCorpus.
  • Yoav Goldberg. 2019. Assessing BERT’s syntactic abilities. arXiv preprint arXiv:1901.05287.
  • Kristina Gulordava, Piotr Bojanowski, Edouard Grave, Tal Linzen, and Marco Baroni. 2018. Colorless green recurrent networks dream hierarchically. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1195–1205, New Orleans, Louisiana. Association for Computational Linguistics.
  • Jeremy Howard and Sebastian Ruder. 2018. Universal language model fine-tuning for text classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 328–339, Melbourne, Australia. Association for Computational Linguistics.
  • Adhiguna Kuncoro, Chris Dyer, John Hale, Dani Yogatama, Stephen Clark, and Phil Blunsom. 2018. LSTMs can learn syntax-sensitive dependencies well, but modeling structure makes them better. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1426–1436, Melbourne, Australia. Association for Computational Linguistics.
  • Peter Nathan Lasersohn. 2011. Mass nouns and plurals. In Semantics: An International Handbook of Natural Language Meaning, pages 1131–1153. De Gruyter.
  • Jey Han Lau, Alexander Clark, and Shalom Lappin. 2017.
  • Michael A. Lepori, Tal Linzen, and R. Thomas McCoy. 2020. Representations of syntax [MASK] useful: Effects of constituency and dependency structure in recursive LSTMs. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Seattle, Washington. Association for Computational Linguistics.
  • Beth Levin. 1993. English Verb Classes and Alternations: A Preliminary Investigation. University of Chicago Press.
  • Tal Linzen, Emmanuel Dupoux, and Yoav Goldberg. 2016. Assessing the ability of LSTMs to learn syntax-sensitive dependencies. Transactions of the Association for Computational Linguistics, 4:521–535.
  • Jan Tore Lønning. 1997. Plurals and collectivity. In Handbook of Logic and Language, pages 1009–1053. Elsevier.
  • Rebecca Marvin and Tal Linzen. 2018. Targeted syntactic evaluation of language models. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1192–1202, Brussels, Belgium. Association for Computational Linguistics.
  • Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners.
  • Roger Ratcliff. 1990. Connectionist models of recognition memory: Constraints imposed by learning and forgetting functions. Psychological Review, 97(2):285.
  • Karin Kipper Schuler. 2005. VerbNet: A broad-coverage, comprehensive verb lexicon.
  • Alex Warstadt, Yu Cao, Ioana Grosu, Wei Peng, Hagen Blix, Yining Nie, Anna Alsop, Shikha Bordia, Haokun Liu, Alicia Parrish, Sheng-Fu Wang, Jason Phang, Anhad Mohananey, Phu Mon Htut, Paloma Jeretic, and Samuel R. Bowman. 2019a. Investigating BERT’s knowledge of language: Five analysis methods with NPIs. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2877–2887, Hong Kong, China. Association for Computational Linguistics.
  • Alex Warstadt, Amanpreet Singh, and Samuel R. Bowman. 2019b. Neural network acceptability judgments. Transactions of the Association for Computational Linguistics, 7:625–641.
  • Ethan Wilcox, Roger Levy, Takashi Morita, and Richard Futrell. 2018. What do RNN language models learn about filler–gap dependencies? In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 211–221, Brussels, Belgium. Association for Computational Linguistics.
  • Ethan Wilcox, Peng Qian, Richard Futrell, Miguel Ballesteros, and Roger Levy. 2019. Structural supervision improves learning of non-local grammatical dependencies. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 3302–3312, Minneapolis, Minnesota. Association for Computational Linguistics.
  • Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. 2019. HuggingFace’s Transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771.
  • Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R. Salakhutdinov, and Quoc V. Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. In Advances in Neural Information Processing Systems, pages 5754–5764.
Authors
Charles Yu
Ryan Sie
Nicolas Tedeschi
Leon Bergen