Automatic Extraction of Rules Governing Morphological Agreement
EMNLP 2020, pp. 5212-5236.
Abstract:
Creating a descriptive grammar of a language is an indispensable step for language documentation and preservation. However, at the same time it is a tedious, time-consuming task. In this paper, we take steps towards automating this process by devising an automated framework for extracting a first-pass grammatical specification from raw text.
Introduction
- While the languages of the world are amazingly diverse, one thing they share in common is their adherence to grammars — sets of morpho-syntactic rules specifying how to create sentences in the language.
- To create the training data for rule extraction, the authors first annotate raw text with part-of-speech (POS) tags, morphological analyses, and dependency trees.
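To make this step concrete, the sketch below (Python, illustrative only) shows how parsed tokens could be turned into per-feature training instances of the form (head POS, dependency relation, dependent POS) → agree/disagree; the Token class and extract_instances helper are hypothetical stand-ins for the authors' pipeline, assuming parses such as those produced by UDify.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Token:
    """One parsed word: POS tag, morphological features, dependency head."""
    pos: str             # universal POS tag, e.g. "NOUN"
    feats: dict          # morphological analysis, e.g. {"Gender": "Fem"}
    head: Optional[int]  # index of the head token, or None for the root
    deprel: str          # SUD dependency relation, e.g. "det"

def extract_instances(sentence, feature):
    """Yield ((head POS, relation, dependent POS), agrees) pairs for one
    morphological feature, skipping pairs where either side lacks it."""
    for dep in sentence:
        if dep.head is None:
            continue
        head = sentence[dep.head]
        h_val, d_val = head.feats.get(feature), dep.feats.get(feature)
        if h_val is None or d_val is None:
            continue
        yield (head.pos, dep.deprel, dep.pos), h_val == d_val

# Toy Spanish example: "la casa"; the determiner agrees in gender.
sent = [
    Token("DET",  {"Gender": "Fem", "Number": "Sing"}, 1, "det"),
    Token("NOUN", {"Gender": "Fem", "Number": "Sing"}, None, "root"),
]
print(list(extract_instances(sent, "Gender")))
# [(('NOUN', 'det', 'DET'), True)]
```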
Highlights
- An important step in the understanding and documentation of languages is the creation of a grammar sketch, a concise and human-readable description of the unique characteristics of that particular language (e.g. Huddleston (2002) for English).
- Automated Evaluation: As an alternative to the infeasible manual evaluation of all rules in every language, we propose an automated rule metric (ARM) that evaluates how well the rules extracted from a decision tree T fit unseen gold-annotated test data (a sketch follows this list).
- We evaluate the quality of the rules induced by our framework, using gold-standard syntactic analyses and learning the decision trees over triples obtained from the training portion of all Syntactic Universal Dependencies (SUD) treebanks
- In a reverse example from Catalan, the overwhelming majority (92%) of 8650 tokens are in the third person, causing our model to label all leaves as chance agreement even though person/number agreement is required in such cases.
- Data statistics are listed in Appendix A.2. We parse these sentences using the "universal" UDify model of Kondratyuk and Straka (2019), which was pre-trained on all of the Universal Dependencies (UD) treebanks. We use these automatically parsed syntactic analyses to extract the rules, which we evaluate with ARM over the gold-standard test data of the corresponding SUD treebanks.
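Taking the ARM description above at face value, a minimal sketch of such a metric is shown below; the function name, the "required"/"chance" label strings, and the unweighted averaging over triples are assumptions, since the exact weighting is not specified here.

```python
from collections import defaultdict

def automated_rule_metric(rule_labels, test_instances, threshold=0.95):
    """Score extracted rule labels against labels derived from gold test
    data: a triple counts as "required" if its empirical agreement rate
    exceeds the threshold; ARM is the fraction of triples whose tree
    label matches the test-derived label."""
    agree, total = defaultdict(int), defaultdict(int)
    for triple, agrees in test_instances:
        total[triple] += 1
        agree[triple] += int(agrees)

    matched = scored = 0
    for triple, tree_label in rule_labels.items():
        if total[triple] == 0:
            continue  # triple never occurs in the test data
        q = agree[triple] / total[triple]       # empirical agreement rate
        test_label = "required" if q > threshold else "chance"
        matched += int(tree_label == test_label)
        scored += 1
    return matched / scored if scored else 0.0
```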
Results
- The authors set this threshold to 90% after manually inspecting some of the resulting trees, choosing a value that limits the number of non-agreeing syntactic structures labeled as required-agreement (see the leaf-labeling sketch after this list).
- (Figure) The framework proceeds in three stages: (a) Rule Extraction, (b) Rule Labeling, and (c) Rule Merging. An example extracted leaf: chance-agreement, with relation = conj, det, or comp:obj; head-pos = any; child-pos = noun, where chance agreement is specified by a null hypothesis.
- Rule Learning: The authors use scikit-learn's decision tree implementation (Buitinck et al., 2013) and train a separate model for each morphological feature f in a given language (see the rule-learning sketch after this list).
- For each language/treebank, the authors extract and evaluate the top 20 most frequent "head POS, dependency relation, dependent POS" triples for the six morphological features, amounting to 120 sets of triples to be annotated. The authors present these triples with 10 randomly selected illustrative examples and ask a linguist to annotate whether the language has a rule governing agreement between the head-dependent pair for this relation.
- The authors use a threshold of 0.95: if $q_{f,t} > 0.95$, the test label $l_{\mathrm{test},f,t}$ for that triple is set to required-agreement, and otherwise to chance-agreement. Similar to the human evaluation, the authors compute a score for each triple t marking feature f.
- The authors evaluate the quality of the rules induced by the framework, using gold-standard syntactic analyses and learning the decision trees over triples obtained from the training portion of all SUD treebanks.
- To compute the morphological complexity of a language, the authors use the word entropy measure proposed by Bentz et al. (2016), which measures the average information content of words and is computed as $H(D) = -\sum_{i=1}^{|V|} p(w_i) \log p(w_i)$, where V is the vocabulary, D is the monolingual text extracted from the training portion of the respective treebank, and $p(w_i)$ is the word type frequency normalized by the total token count (see the entropy sketch after this list).
- As with person in Russian, the model produces required-agreement labels that the authors attribute to skewed data statistics in the treebanks.
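As a rough illustration of the per-feature rule-learning setup, the sketch below trains one scikit-learn decision tree on one-hot-encoded triples for a single feature; the toy data and hyperparameters are invented for illustration and are not the authors' settings.

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Toy instances for one feature (e.g. Gender): categorical triples plus
# binary labels saying whether head and dependent matched on the feature.
X = [
    ("NOUN", "det", "DET"),
    ("NOUN", "mod", "ADJ"),
    ("VERB", "subj", "NOUN"),
    ("VERB", "comp:obj", "NOUN"),
]
y = [1, 1, 1, 0]

# One-hot encode the categorical triples, then fit the tree.
model = make_pipeline(
    OneHotEncoder(handle_unknown="ignore"),
    DecisionTreeClassifier(random_state=0),
)
model.fit(X, y)
print(model.predict([("NOUN", "det", "DET")]))  # -> [1]
```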
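The 90% hard-threshold leaf labeling could then be sketched as follows; tree.apply is real scikit-learn API, but the surrounding function and label strings are assumed reconstructions of the procedure described above (the statistical-thresholding alternative of Table 5 would replace the fixed cutoff with a significance test).

```python
import numpy as np

def label_leaves(tree, X_encoded, y, threshold=0.90):
    """Label each leaf of a fitted DecisionTreeClassifier by the fraction
    of agreeing training instances (a NumPy array y) that land in it."""
    leaf_ids = tree.apply(X_encoded)  # leaf index for every instance
    labels = {}
    for leaf in np.unique(leaf_ids):
        rate = y[leaf_ids == leaf].mean()  # agreement rate in this leaf
        labels[leaf] = ("required-agreement" if rate >= threshold
                        else "chance-agreement")
    return labels
```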
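The word-entropy measure reduces to the standard unigram plug-in estimate sketched below; note that the paper also cites the James-Stein shrinkage estimator of Hausser and Strimmer (2009), which this plain version does not implement.

```python
import math
from collections import Counter

def word_entropy(tokens):
    """Plug-in estimate of H(D) = -sum_i p(w_i) * log2 p(w_i), where
    p(w_i) is the type frequency normalized by the total token count."""
    counts = Counter(tokens)
    n = sum(counts.values())
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

print(word_entropy("the cat sat on the mat".split()))  # ~2.25 bits
```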
Conclusion
- The authors observe that cross-lingual transfer learning (CLTL) already leads to high scores across all languages, even in zero-shot settings where no data from the gold-standard treebank is used.
- The authors use these automatically parsed syntactic analyses to extract the rules, which they evaluate with ARM over the gold-standard test data of the corresponding SUD treebanks.
Tables
- Table 1: The Spanish gender rules extracted in a zero-shot setting are generally similar to the ones extracted from the gold data (93%). We highlight the few mistakes that the zero-shot tree makes.
- Table 2: Dataset statistics. Training data is obtained by parsing the Leipzig corpora (Goldhahn et al., 2012) and test data is obtained from the respective treebank. Each cell denotes the number of sentences as train/test.
- Table 3: Dataset statistics. Train/Dev/Test denote the number of sentences in the respective treebank used for the target language.
- Table 4: We used the same hyperparameters for training with the related languages as specified by the authors. In the configuration file, we only change the parameters warmup_steps = 100 and start_step = 100, as recommended by the authors for low-resource languages.
- Table 5: Comparison of ARM scores for SUD treebanks under both statistical and hard thresholding.
Related work
- Bender et al. (2014) use interlinear glossed text (IGT) to extract lexical entries and morphological rules for an endangered language. They experiment with different systems which individually extract lemmas, lexical rules, word order, and the case system, some of which use hand-specified rules. Howell et al. (2017) extend this work to predict case systems for additional languages. Zamaraeva (2016) also infers morphotactics from IGT using k-means clustering. To the best of our knowledge, our work is the first to propose a framework for extracting first-pass grammatical agreement rules directly from raw text in a statistically-informed, objective way. A parallel line of work (Hellan, 2010) extracts a construction profile of a language using templates that define how sentences are constructed.
Funding
- This work is sponsored by the DARPA grant FA8750-18-2-0018 and by the National Science Foundation under grant 1761548
References
- Joseph Aoun, Elabbas Benmamoun, and Dominique Sportiche. 1994. Agreement, word order, and conjunction in some varieties of Arabic. Linguistic Inquiry, 25(2):195–220.
- Emily M. Bender, Joshua Crowgey, Michael Wayne Goodman, and Fei Xia. 2014. Learning grammar specifications from IGT: A case study of chintang. In Proceedings of the 2014 Workshop on the Use of Computational Methods in the Study of Endangered Languages, pages 43–53, Baltimore, Maryland, USA. Association for Computational Linguistics.
- Christian Bentz, Tatyana Ruzsics, Alexander Koplenig, and Tanja Samardžić. 2016. A comparison between morphological complexity measures: Typological data vs. language corpora. In Proceedings of the Workshop on Computational Linguistics for Linguistic Complexity (CL4LC), pages 142–153, Osaka, Japan. The COLING 2016 Organizing Committee.
- Robert D Borsley and Ian Roberts. 2005. The syntax of the Celtic languages: a comparative perspective. Cambridge University Press.
- Leo Breiman, Jerome Friedman, Charles J Stone, and Richard A Olshen. 1984. Classification and regression trees. CRC press.
- Keith Brown and Sarah Ogilvie. 2010. Concise encyclopedia of languages of the world. Elsevier.
- Lars Buitinck, Gilles Louppe, Mathieu Blondel, Fabian Pedregosa, Andreas Mueller, Olivier Grisel, Vlad Niculae, Peter Prettenhofer, Alexandre Gramfort, Jaques Grobler, Robert Layton, Jake VanderPlas, Arnaud Joly, Brian Holt, and Gaël Varoquaux. 2013. API design for machine learning software: experiences from the scikit-learn project. In ECML PKDD Workshop: Languages for Data Mining and Machine Learning, pages 108–122.
- Robin Cohen. 1988. Book reviews: Reasoning and discourse processes. Computational Linguistics, 14(4).
- Bernard Comrie. 1984. Reflections on verb agreement in Hindi and related languages.
- Greville G Corbett. 2009. Agreement. In Die slavischen Sprachen/The Slavic Languages.
- Harald Cramér. 1946. Mathematical methods of statistics. Princeton U. Press, Princeton, page 500.
- Dina B Crockett. 1976. Agreement in contemporary standard Russian. Slavica Publishers Inc.
- Kim Gerdes, Bruno Guillaume, Sylvain Kahane, and Guy Perrier. 2018. SUD or surface-syntactic universal dependencies: An annotation scheme nearisomorphic to UD. In Proceedings of the Second Workshop on Universal Dependencies (UDW 2018), pages 66–74, Brussels, Belgium. Association for Computational Linguistics.
- Kim Gerdes, Bruno Guillaume, Sylvain Kahane, and Guy Perrier. 2019. Improving surface-syntactic universal dependencies (SUD): MWEs and deep syntactic features. In Proceedings of the 18th International Workshop on Treebanks and Linguistic Theories (TLT, SyntaxFest 2019), pages 126–132, Paris, France. Association for Computational Linguistics.
- Dirk Goldhahn, Thomas Eckart, and Uwe Quasthoff. 2012. Building large monolingual dictionaries at the Leipzig corpora collection: From 100 to 200 languages. In LREC, volume 29, pages 31–43.
- Zellig S Harris. 1951. Methods in structural linguistics.
- Jean Hausser and Korbinian Strimmer. 2009. Entropy inference and the James-Stein estimator, with application to nonlinear gene association networks. Journal of Machine Learning Research, 10(7).
- Lars Hellan. 2010. From descriptive annotation to grammar specification. In Proceedings of the Fourth Linguistic Annotation Workshop, pages 172–176, Uppsala, Sweden. Association for Computational Linguistics.
- Kristen Howell, Emily M. Bender, Michel Lockwood, Fei Xia, and Olga Zamaraeva. 2017. Inferring case systems from IGT: Enriching the enrichment. In Proceedings of the 2nd Workshop on the Use of Computational Methods in the Study of Endangered Languages, pages 67–75.
- Rodney D Huddleston. 2002. The Cambridge grammar of the English language. Cambridge, UK; New York: Cambridge University Press.
- Dan Kondratyuk and Milan Straka. 2019. 75 languages, 1 model: Parsing universal dependencies universally. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2779–2795, Hong Kong, China. Association for Computational Linguistics.
- Joakim Nivre, Rogier Blokland, Niko Partanen, Michael Rießler, and Jack Rueter. 2018. Universal Dependencies 2.3.
- Joakim Nivre, Marie-Catherine De Marneffe, Filip Ginter, Yoav Goldberg, Jan Hajic, Christopher D Manning, Ryan McDonald, Slav Petrov, Sampo Pyysalo, Natalia Silveira, et al. 2016. Universal dependencies v1: A multilingual treebank collection. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16), pages 1659–1666.
- Joakim Nivre, Marie-Catherine de Marneffe, Filip Ginter, Jan Hajic, Christopher D. Manning, Sampo Pyysalo, Sebastian Schuster, Francis Tyers, and Daniel Zeman. 2020. Universal dependencies v2: An evergrowing multilingual treebank collection. In Proceedings of The 12th Language Resources and Evaluation Conference, pages 4034–4043, Marseille, France. European Language Resources Association.
- J. Ross Quinlan. 1986. Induction of decision trees. Machine learning, 1(1):81–106.
- Gail M Sullivan and Richard Feinn. 2012. Using effect size—or why the p value is not enough. Journal of graduate medical education, 4(3):279–282.
- Olga Zamaraeva. 2016. Inferring morphotactics from interlinear glossed text: Combining clustering and precision grammars. In Proceedings of the 14th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, pages 141–150, Berlin, Germany. Association for Computational Linguistics.
- Ran Zmigrod, Sabrina J. Mielke, Hanna Wallach, and Ryan Cotterell. 2019. Counterfactual data augmentation for mitigating gender stereotypes in languages with rich morphology. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1651–1661, Florence, Italy. Association for Computational Linguistics.
- The best hyperparameters are selected based on validation set performance. For treebanks that have no validation set, we use the default cross-validation provided by sklearn (Buitinck et al., 2013); a sketch follows below. Average model runtime is 5-10 minutes per treebank, depending on its size.
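A minimal sketch of that cross-validation fallback, reusing the per-feature decision trees from above; the search grid is hypothetical, as the appendix does not list the exact hyperparameter space.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Hypothetical grid over common decision-tree hyperparameters.
param_grid = {"max_depth": [4, 8, 16, None], "min_samples_leaf": [1, 5, 10]}

search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      param_grid, cv=5)  # sklearn's default 5-fold CV
# search.fit(X_encoded, y)  # used when a treebank has no validation split
```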