Measuring the Similarity of Grammatical Gender Systems by Comparing Partitions

EMNLP 2020, pp. 5664–5675 (2020)


Abstract

A grammatical gender system divides a lexicon into a small number of relatively fixed grammatical categories. How similar are these gender systems across languages? To quantify the similarity, we define gender systems extensionally, thereby reducing the problem of comparisons between languages’ gender systems to cluster evaluation. We bor...

Introduction
  • As many as half the world’s languages carve nouns up into classes (Corbett, 2013).
  • In these languages, nouns are subdivided into gender categories, which together comprise the language’s grammatical gender system.
  • A gender system tends to use a small, fixed number of categories with fixed usage across speakers.
  • A gender system exhaustively divides up the language’s nouns; that is, the union of gender categories is the entire nominal lexicon.
  • With respect to word semantics, Williams et al. (2019) quantify the relationship between the gender of inanimate nouns and their distributional word vectors.
Highlights
  • As many as half the world’s languages carve nouns up into classes (Corbett, 2013)
  • Armed with the first way to quantify communitywise similarity of gender systems, we ask: Do gender system similarities reflect linguistic phylogeny, or something else, like areal effects? Across 20 languages, we find that our pairwise overlap results measurably align with standard pairwise phylogenetic relationships
  • Zooming in on Indo-European, we find that we can recast pairwise similarities into an accurate phylogenetic tree, by measuring distance between gender systems and performing hierarchical agglomerative clustering
  • Adjusted mutual information (AMI) shows us that the similarity of gender systems is no better than a chance relationship; at the whole-lexicon level, influence from the common Indo-European root is absent
  • We have presented a clean method for comparing grammatical gender systems across languages: By defining gender classes extensionally, we reduced the problem to cluster evaluation from community detection
  • We note that we could further extend our measures to fuzzy partitions, which remain less explored in community detection, but are a promising avenue for future work
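
The tree-building step mentioned in the highlights (turn pairwise similarities into distances, then apply hierarchical agglomerative clustering) can be sketched as follows. This is a minimal illustration using only the Python standard library, not the authors' implementation: the average-linkage criterion, the `1 - similarity` conversion, the language names `A` to `D`, and all similarity values are assumptions made up for the example.

```python
from itertools import combinations


def average_linkage(labels, dist):
    """Agglomerative clustering with average linkage.

    labels: list of leaf names.
    dist:   dict mapping frozenset({a, b}) to the distance between leaves a and b.
    Returns the merge history as a nested tuple (a binary tree).
    """
    # Map each cluster (a tuple of its leaf names) to the subtree built so far.
    clusters = {(name,): name for name in labels}

    def cluster_distance(c1, c2):
        # Average of all leaf-to-leaf distances between the two clusters.
        total = sum(dist[frozenset((a, b))] for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        # Merge the closest pair of clusters.
        c1, c2 = min(combinations(sorted(clusters), 2),
                     key=lambda pair: cluster_distance(*pair))
        merged = (clusters.pop(c1), clusters.pop(c2))
        clusters[c1 + c2] = merged
    return next(iter(clusters.values()))


# Hypothetical similarity scores between four placeholder languages.
similarity = {
    ("A", "B"): 0.9, ("C", "D"): 0.8,
    ("A", "C"): 0.1, ("A", "D"): 0.1,
    ("B", "C"): 0.1, ("B", "D"): 0.1,
}
# Assumption: distance is taken to be 1 - similarity.
distance = {frozenset(pair): 1.0 - s for pair, s in similarity.items()}

tree = average_linkage(["A", "B", "C", "D"], distance)
print(tree)
```

In this toy input the two high-similarity pairs are merged first and then joined at the root, which mirrors how closely related languages would end up as sibling subtrees. The reference list cites SciPy, which offers optimized linkage routines; the pure-Python version here is only meant to make the procedure explicit.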
Methods
  • The authors apply each measure to the gender systems from the Swadesh lists and validate the results on NorthEuraLex.
  • The (Balto-)Slavic branch (i.e., Polish, Croatian, Slovene, Ukrainian, Slovenian, Russian, and Bulgarian) is present at the top left, and the Romance branch (i.e., French, Catalan, Italian, Spanish, and Portuguese) appears at the bottom right.
  • Outside of these blocks, AMI shows that the similarity of gender systems is no better than a chance relationship; at the whole-lexicon level, influence from the common Indo-European root is absent.
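
To make the AMI comparison concrete, here is a small sketch of adjusted mutual information between two partitions of the same word list. It is an illustration under stated assumptions, not the paper's code: it uses the permutation (hypergeometric) model for the chance baseline and the arithmetic mean of the two entropies as the normalizer, which is one common variant, and the gender labels below are invented placeholders. It also assumes at least one of the partitions has more than one cluster, since otherwise the denominator is zero.

```python
from collections import Counter
from math import comb, log


def entropy(labels):
    """Shannon entropy (in nats) of the cluster sizes in a partition."""
    n = len(labels)
    return -sum(c / n * log(c / n) for c in Counter(labels).values())


def mutual_information(u, v):
    """Mutual information (in nats) between two labelings of the same items."""
    n = len(u)
    size_u, size_v, joint = Counter(u), Counter(v), Counter(zip(u, v))
    return sum(nij / n * log(n * nij / (size_u[a] * size_v[b]))
               for (a, b), nij in joint.items())


def expected_mutual_information(u, v):
    """Expected MI when cluster sizes are fixed and items are shuffled."""
    n = len(u)
    emi = 0.0
    for a in Counter(u).values():
        for b in Counter(v).values():
            # Overlaps of size 0 contribute nothing, so start at 1.
            for nij in range(max(1, a + b - n), min(a, b) + 1):
                # Hypergeometric probability of an overlap of size nij.
                p = comb(b, nij) * comb(n - b, a - nij) / comb(n, a)
                emi += p * nij / n * log(n * nij / (a * b))
    return emi


def adjusted_mutual_information(u, v):
    """AMI with arithmetic-mean normalization (an assumed variant)."""
    mi = mutual_information(u, v)
    emi = expected_mutual_information(u, v)
    mean_h = (entropy(u) + entropy(v)) / 2
    return (mi - emi) / (mean_h - emi)


# Placeholder gender labels for the same four concepts in three toy languages.
lang_a = ["m", "m", "f", "f"]
lang_b = ["x", "x", "y", "y"]   # same grouping as lang_a, different names
lang_c = ["c", "n", "c", "n"]   # grouping that cuts across lang_a

print(adjusted_mutual_information(lang_a, lang_b))  # close to 1
print(adjusted_mutual_information(lang_a, lang_c))  # approximately -0.5
```

Because gender classes are compared extensionally, only the grouping of words matters and the class names do not: `lang_a` and `lang_b` score as identical. Values near zero or below indicate agreement no better than chance, which is the sense in which the summary says similarity outside the Slavic and Romance blocks is "no better than a chance relationship".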
Results
  • The authors use the unweighted distance of Rabinovich et al. (2017). For each combination of dataset and measure, the authors use McNemar’s test for significance and find p < 0.0001.
Conclusion
  • The authors have presented a clean method for comparing grammatical gender systems across languages: by defining gender classes extensionally, the authors reduced the problem to cluster evaluation from community detection.
  • A related challenge is East and Southeast Asian numeral classifier systems, which associate nouns with classifiers based largely on the semantic properties of the nouns (Kuo and Sera, 2009; Zhan and Levy, 2018; Liu et al., 2019).
  • Unlike gender systems, classifier systems display more idiolectal variation, and often more than one classifier can accompany a given noun (Hu, 1993).
  • The authors note that the measures could be further extended to fuzzy partitions, which remain less explored in community detection but are a promising avenue for future work.
Tables
  • Table 1: Distances of generated trees from gold tree
  • Table 2: Languages, with their subfamilies and ISO codes, used in this study
Related work
  • There is a baffling dearth of work on quantifying similarity of gender systems. There is, however, ample work on characterizing intensional gender systems, i.e., sets of grammatical rules, that can be divided (Corbett, 1991) into sets of rules based on morphology (Tucker et al., 1977; Gregersen, 1967; Wald, 1975; Plank, 1986, i.a.) and on phonology (Bidot, 1925; Tucker et al., 1977; Newman, 1979; Hayward and Corbett, 1988; Marchese, 1988). Intensional approaches, particularly those with typological leanings, contribute very fine-grained research on particular pairwise similarities for particular languages and dialects. Although we cannot survey these in detail here, we would love for our measures to contribute findings that can complement these approaches.

    Relatedly, other recent works have investigated grammatical gender and other types of noun classification systems with information-theoretic tools. For example, Williams et al. (2020b) use mutual information to quantify the strength of the relationships between declension class, grammatical gender, distributional semantics, and orthographic form in several languages. Williams et al. (2020a), which is arguably closest to this work, measure the strength of semantic relationships between inanimate nouns and the verbs or adjectives that take those nouns as arguments. That work can be seen as comparing the similarity of nouns clustered by their gender with the same nouns clustered by the adjectives that modify them or the verbs that take them as arguments.
Subjects and analysis
Individual partitions of lexicons can also be framed as members of distributions over partitions: for instance, the distribution consisting of all partitions of N items, or of all partitions of N items into K gender clusters, as in Figure 1. For example, Spanish is bi-gendered (with masculine and feminine): a lexicon of Spanish nouns (N = 1000) and their genders would come from a distribution over partitions of N = 1000 items into K = 2 clusters.

Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, pages 5664–5675, November 16–20, 2020. © 2020 Association for Computational Linguistics
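
As a rough illustration of how large the space of partitions of N items into K clusters is, the sketch below counts them with Stirling numbers of the second kind. This is an added example, not something taken from the paper: it assumes the clusters are non-empty and unlabeled (swapping the names "masculine" and "feminine" gives the same partition), and the function name `stirling2` is hypothetical.

```python
def stirling2(n, k):
    """Count the partitions of n labeled items into exactly k non-empty clusters.

    Uses the recurrence S(i, j) = j * S(i - 1, j) + S(i - 1, j - 1):
    item i either joins one of the j existing clusters or starts a new one.
    """
    row = [1] + [0] * k  # row[j] holds S(0, j); only S(0, 0) is 1
    for i in range(1, n + 1):
        new_row = [0] * (k + 1)
        for j in range(1, min(i, k) + 1):
            new_row[j] = j * row[j] + row[j - 1]
        row = new_row
    return row[k]


# Small sanity check: 4 items can be split into 2 clusters in 7 ways.
print(stirling2(4, 2))

# The two-gender example with N = 1000 nouns; for K = 2 the count is 2**(N-1) - 1.
count = stirling2(1000, 2)
print(len(str(count)))  # number of decimal digits in the count
```

The count for N = 1000 and K = 2 is astronomically large, which is why chance-corrected measures compare an observed pair of partitions against a random model over such a space rather than against raw overlap.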

References
  • Javier Artiles, Julio Gonzalo, and Satoshi Sekine. 2007. The SemEval-2007 WePS evaluation: Establishing a benchmark for the Web people search task. In Proceedings of the Fourth International Workshop on Semantic Evaluations (SemEval-2007), pages 64–69. Association for Computational Linguistics.
  • Ziv Bar-Joseph, David K. Gifford, and Tommi S. Jaakkola. 2001. Fast optimal leaf ordering for hierarchical clustering. Bioinformatics, 17:S22–S29.
  • Nicoleta Bateman and Maria Polinsky. 2010. Romanian as a two-gender language. Hypothesis A/Hypothesis B: Linguistic Explorations in Honor of David M. Perlmutter.
  • Émile Bidot. 1925. La clef du genre des substantifs français: méthode dispensant d’avoir recours au dictionnaire. Imprimerie Nouvelle.
  • Carl Buck. 1949. A Dictionary of Selected Synonyms in the Principal Indo-European Languages. University of Chicago Press.
  • Raymond B. Cattell. 1945. The description of personality: Principles and findings in a factor analysis. The American Journal of Psychology, 58(1):69–90.
  • Greville G. Corbett. 1991. Gender. Cambridge University Press, Cambridge.
  • Greville G. Corbett. 2013. Number of genders. In Matthew S. Dryer and Martin Haspelmath, editors, The World Atlas of Language Structures Online. Max Planck Institute for Evolutionary Anthropology, Leipzig.
  • Silviu Cucerzan and David Yarowsky. 2003a. Minimally supervised induction of grammatical gender. In Proceedings of the 2003 Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics.
  • Silviu Cucerzan and David Yarowsky. 2003b. Minimally supervised induction of grammatical gender. In Proceedings of the 2003 Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics, pages 40–47.
  • Leon Danon, Albert Díaz-Guilera, Jordi Duch, and Alex Arenas. 2005. Comparing community structure identification. Journal of Statistical Mechanics: Theory and Experiment, 2005(09):P09008.
  • Johannes Dellert and Gerhard Jäger. 2017. NorthEuraLex. Version 0.9.
  • Anca Dinu and Liviu P. Dinu. 2005. On the syllabic similarities of Romance languages. In International Conference on Intelligent Text Processing and Computational Linguistics, pages 785–788. Springer.
  • Carmen Dobrovie-Sorin. 2011. The syntax of Romanian: Comparative studies in Romance, volume 40. Walter de Gruyter.
  • Harold Edson Driver and Alfred Louis Kroeber. 1932. Quantitative expression of cultural relationships, volume 31. University of California Press.
  • Istvan Fodor. 1959. The origin of grammatical gender. Lingua, 8:186–214.
  • Alexander J. Gates and Yong-Yeol Ahn. 2017. The impact of random models on clustering similarity. Journal of Machine Learning Research, 18(87):1–28.
  • Russell D. Gray and Quentin D. Atkinson. 2003. Language-tree divergence times support the Anatolian theory of Indo-European origin. Nature, 426:435.
  • Simon J. Greenhill. 2011. Levenshtein distances fail to identify language relationships accurately. Computational Linguistics, 37(4):689–698.
  • Edgar A. Gregersen. 1967. Prefix and pronoun in Bantu. Published at the Waverly Press by Indiana University, Bloomington.
  • Martin Haspelmath. 2001. The European linguistic area: Standard average European. In Language typology and language universals (Handbücher zur Sprach- und Kommunikationswissenschaft), pages 1492–1510. de Gruyter.
  • Richard J. Hayward and Greville G. Corbett. 1988. Resolution rules in Qafar. Linguistics, 26:259–279.
  • Qian Hu. 1993. The Acquisition of Chinese Classifiers by Young Mandarin-speaking Children. Ph.D. thesis, Boston University.
  • Lawrence Hubert and Phipps Arabie. 1985. Comparing partitions. Journal of Classification, 2(1):193–218.
  • Muhammad Hasan Ibrahim. 2014. Grammatical gender: Its origin and development, volume 166. Walter de Gruyter.
  • N. Jardine and R. Sibson. 1971. Mathematical Taxonomy. Wiley Series in Probability and Mathematical Statistics. Wiley.
  • Eric Jones, Travis Oliphant, Pearu Peterson, et al. 2001. SciPy: Open source scientific tools for Python.
  • Judith Kaplan. 2017. From lexicostatistics to lexomics: Basic vocabulary and the study of language prehistory. Osiris, 32(1):202–223.
  • Ruth T. Kramer. 2015. The Morphosyntax of Gender, volume 58. Oxford University Press.
  • Jenny Y. Kuo and Maria D. Sera. 2009. Classifier effects on human categorization: the role of shape classifiers in Mandarin Chinese. Journal of East Asian Linguistics, 18:1–19.
  • Shijia Liu, Hongyuan Mei, Adina Williams, and Ryan Cotterell. 2019. On the idiosyncrasies of the Mandarin Chinese classifier system. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4100–4106, Minneapolis, Minnesota. Association for Computational Linguistics.
  • Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. 2008. Introduction to Information Retrieval. Cambridge University Press, New York, NY, USA.
  • Lynell Marchese. 1988. Noun classes and agreement systems in Kru: A historical approach. Agreement in Natural Language: Approaches, Theories, Descriptions. Stanford: Center for the Study of Language and Information, pages 323–341.
  • Arya D. McCarthy. 2017. Gridlock in networks: The leximin method for hierarchical community detection. Master’s thesis, Southern Methodist University.
  • Arya D. McCarthy, Tongfei Chen, and Seth Ebner. 2019a. An exact no free lunch theorem for community detection. In Complex Networks and Their Applications VIII, pages 176–187, Lisbon, Portugal. Springer International Publishing.
  • Arya D. McCarthy, Tongfei Chen, Rachel Rudinger, and David W. Matula. 2019b. Metrics matter in community detection. In Complex Networks and Their Applications VIII, pages 164–175, Lisbon, Portugal. Springer International Publishing.
  • Marina Meilă. 2003. Comparing clusterings by the variation of information. In Learning Theory and Kernel Machines, pages 173–187, Berlin, Heidelberg. Springer Berlin Heidelberg.
  • Marina Meilă. 2007. Comparing clusterings—an information based distance. Journal of Multivariate Analysis, 98(5):873–895.
  • Thomas Müller, Helmut Schmid, and Hinrich Schütze. 2013. Efficient higher-order CRFs for morphological tagging. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 322–332, Seattle, Washington, USA. Association for Computational Linguistics.
  • Daniel Müllner. 2011. Modern hierarchical, agglomerative clustering algorithms. arXiv preprint arXiv:1109.2378.
  • Vivi Nastase and Marius Popescu. 2009. What’s in a name? In some languages, grammatical gender. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, pages 1368–1377. Association for Computational Linguistics.
  • Paul Newman. 1979. Explaining Hausa feminines. Studies in African Linguistics.
  • Joakim Nivre, Mitchell Abrams, Željko Agić, Lars Ahrenberg, Lene Antonsen, Katya Aplonova, Maria Jesus Aranzabe, Gashaw Arutie, Masayuki Asahara, Luma Ateyah, et al. 2018. Universal Dependencies 2.3. LINDAT/CLARIN digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University.
  • M. Pagel, C. Renfrew, A. McMahon, and L. Trask. 2000. Time depth in historical linguistics. C. Renfrew, A. McMahon, and L. Trask, editors, pages 189–207.
  • Leto Peel, Daniel B. Larremore, and Aaron Clauset. 2017. The ground truth about metadata and community detection in networks. Science Advances, 3(5).
  • A. Pereltsvaig and M. W. Lewis. 2015. The Indo-European Controversy. Cambridge University Press.
  • Frans Plank. 1986. Paradigm size, morphological typology, and universal economy. Folia Linguistica, 20(1-2):29–48.
  • Ella Rabinovich, Noam Ordan, and Shuly Wintner. 2017. Found in translation: Reconstructing phylogenetic language trees from translations. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 530–540. Association for Computational Linguistics.
  • Don Ringe, Tandy Warnow, and Ann Taylor. 2002. Indo-European and computational cladistics. Transactions of the Philological Society, 100(1):59–129.
  • Suzanne Romaine. 1997.
  • Simone Romano, Nguyen Xuan Vinh, James Bailey, and Karin Verspoor. 2016. Adjusting for chance clustering comparison measures. Journal of Machine Learning Research, 17(1):4635–4666.
  • Maurizio Serva and Fabio Petroni. 2008. Indo-European languages tree by Levenshtein distance. EPL (Europhysics Letters), 81(6):68005.
  • Robert Reuven Sokal and Charles Duncan Michener. 1958. A Statistical Method for Evaluating Systematic Relationships. University of Kansas science bulletin. University of Kansas.
  • Morris Swadesh. 1950. Salish internal relationships. International Journal of American Linguistics, 16(4):157–167.
  • Morris Swadesh. 1952. Lexico-statistic dating of prehistoric ethnic contacts: with special reference to North American Indians and Eskimos. Proceedings of the American Philosophical Society, 96(4):452–463.
  • Morris Swadesh. 1955. Towards greater accuracy in lexicostatistic dating. International Journal of American Linguistics, 21(2):121–137.
  • Morris Swadesh. 1971/2006. The origin and diversification of language. Chicago: Aldine.
  • G. R. Tucker, W. E. Lambert, and A. Rigault. 1977. The French speaker’s skill with grammatical gender: an example of rule-governed behavior. Janua Linguarum: Series didactica. Mouton.
  • Nguyen Xuan Vinh, Julien Epps, and James Bailey. 2010. Information theoretic measures for clusterings comparison: Variants, properties, normalization and correction for chance. Journal of Machine Learning Research, 11:2837–2854.
  • Benji Wald. 1975. Animate concord in Northeast Coastal Bantu: Its linguistic and social implications as a case of grammatical convergence. Studies in African Linguistics, 6(3):267–314.
  • Benjamin Lee Whorf. 1997. The Relation of Habitual Thought and Behavior to Language, pages 443–463. Macmillan Education UK, London.
  • Adina Williams, Damián Blasi, Lawrence Wolf-Sonkin, Hanna Wallach, and Ryan Cotterell. 2019. Quantifying the semantic core of gender systems. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5733–5738, Hong Kong, China. Association for Computational Linguistics.
  • Adina Williams, Ryan Cotterell, Lawrence Wolf-Sonkin, Damián Blasi, and Hanna Wallach. 2020a. On the relationships between the grammatical genders of inanimate nouns and their co-occurring adjectives and verbs. Transactions of the Association for Computational Linguistics.
  • Adina Williams, Tiago Pimentel, Hagen Blix, Arya D. McCarthy, Eleanor Chodroff, and Ryan Cotterell. 2020b. Predicting declension class from form and meaning. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 6682–6695, Online. Association for Computational Linguistics.
  • Zhao Yang, René Algesheimer, and Claudio J. Tessone. 2016. A comparative analysis of community detection algorithms on artificial networks. Scientific Reports, 6(1):30750.
  • Meilin Zhan and Roger Levy. 2018. Comparing theories of speaker choice using a model of classifier production in Mandarin Chinese. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1997–2005, New Orleans, Louisiana. Association for Computational Linguistics.
Authors
Arya D. McCarthy
Shijia Liu