We propose a probabilistic generative model that represents latent factors as geographical distributions

Latent Geographical Factors for Analyzing the Evolution of Dialects in Contact

EMNLP 2020, pp. 959–976 (2020)


Abstract

Analyzing the evolution of dialects remains a challenging problem because contact phenomena hinder the application of the standard tree model. Previous statistical approaches to this problem resort to admixture analysis, where each dialect is seen as a mixture of latent ancestral populations. However, such ancestral populations are hardly...

Introduction
  • How languages have changed over time is a question that has attracted a lasting interest.
  • Historical–comparative linguists have done this by systematically comparing related languages and representing them as a tree.
  • The recent adoption of computer-intensive statistical methods offers additional insights (Gray and Atkinson, 2003; Bouckaert et al., 2012; Chang et al., 2015).
  • When it comes to dialects, or closely related languages, the situation is very different.
  • At most the first three principal components (PCs) are examined because subsequent PCs are hardly interpretable.
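
To make the PCA step mentioned above concrete, here is a minimal sketch, assuming a small made-up table of binary features (rows are dialects, columns are features). The data, the dialect labels, and all function names are hypothetical and are not taken from the paper. Power iteration with deflation is used only to keep the example free of external dependencies.

```python
import math
import random

# Hypothetical toy data: six dialects (rows) x four binary features (columns).
DIALECTS = ["A", "B", "C", "D", "E", "F"]
FEATURES = [
    [1, 1, 0, 0],
    [1, 1, 0, 1],
    [1, 0, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 0],
    [0, 1, 1, 1],
]


def center_columns(rows):
    """Subtract each column mean so PCA describes variation around the average."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]


def covariance(centered):
    """Feature-by-feature sample covariance of already-centered rows."""
    n, d = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]


def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def top_components(cov, k, iters=500, seed=0):
    """Return the k leading (eigenvalue, unit eigenvector) pairs."""
    rng = random.Random(seed)
    d = len(cov)
    work = [row[:] for row in cov]  # deflated copy; the input stays untouched
    found = []
    for _ in range(k):
        v = normalize([rng.gauss(0.0, 1.0) for _ in range(d)])
        for _ in range(iters):
            w = matvec(work, v)
            if math.sqrt(sum(x * x for x in w)) < 1e-12:
                break  # nothing left to extract in this direction
            v = normalize(w)
        eigenvalue = sum(a * b for a, b in zip(v, matvec(work, v)))
        found.append((eigenvalue, v))
        # Deflation: remove the component just found before the next pass.
        for i in range(d):
            for j in range(d):
                work[i][j] -= eigenvalue * v[i] * v[j]
    return found


centered = center_columns(FEATURES)
components = top_components(covariance(centered), k=3)

# Score of each dialect on the first PC: the value a map would be colored by.
pc1_scores = [sum(x * w for x, w in zip(row, components[0][1])) for row in centered]
for name, score in zip(DIALECTS, pc1_scores):
    print(f"{name}: {score:+.3f}")
```

In this toy table, dialects A to C and D to F fall on opposite sides of the first PC. The sign of a PC is arbitrary, so only the contrast between the two groups is meaningful.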
Highlights
  • How languages have changed over time is a question that has attracted a lasting interest.
  • We propose a probabilistic generative model that represents latent factors as geographical distributions (Figure 1).
  • While historical–comparative linguistics is known for the Neogrammarian doctrine of exceptionless sound laws, dialectology is dominated by the dictum, “every word has its own history.” The Atlas linguistique de la France (Gilliéron and Edmont, 1902–1910) and subsequent linguistic atlases produced by dialectologists elaborate “the geography not of dialects but of linguistic traits” (Goebl, 2018).
  • Each language is colored according to the value of a selected principal component (PC).
  • We proposed a Bayesian generative model to analyze dialectal variation.
Methods
  • 3.1 Basic Idea

    The key insight behind the proposed method is that both vertical and horizontal signals can be represented as geographical distributions.
  • If horizontal contact occurs in a certain area, leading to multiple feature values being shared by the dialects there, the authors can identify the corresponding geographical cluster.
  • A group of dialects that exclusively share the same ancestor usually occupies a continuous geographical space.
  • Because their shared evolutionary history results in many shared feature values, the corresponding geographical subspace can be identified.
  • The authors' goal is to induce latent, typically clearer geographical factors from observed geographical distributions, as illustrated in Figure 1.
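
The idea of latent factors as geographical distributions can be illustrated with a minimal sketch. Everything here is hypothetical: the dialects, the two factors, their coverage, and the weights are made-up numbers. Combining factors by adding weights and applying a softmax is an assumption made for illustration, since the summary above does not specify the paper's actual likelihood.

```python
import math

# Hypothetical setup: five dialects laid out west to east, one feature with two values.
DIALECTS = ["d0", "d1", "d2", "d3", "d4"]
VALUES = ["x", "y"]

# COVERAGE[k][l] is 1 if latent factor k is present at dialect l,
# i.e. the factor's geographical distribution (made-up).
COVERAGE = {
    "west_factor": [1, 1, 1, 0, 0],
    "east_factor": [0, 0, 1, 1, 1],
}

# WEIGHTS[k][v] says how strongly factor k favors feature value v (made-up).
WEIGHTS = {
    "west_factor": {"x": 2.0, "y": 0.0},
    "east_factor": {"x": 0.0, "y": 1.0},
}


def softmax(scores):
    """Turn arbitrary scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def value_distribution(dialect_index):
    """Assumed combination rule: add the weights of every factor covering the dialect."""
    scores = [
        sum(WEIGHTS[k][v] for k in COVERAGE if COVERAGE[k][dialect_index])
        for v in VALUES
    ]
    return dict(zip(VALUES, softmax(scores)))


distributions = [value_distribution(i) for i in range(len(DIALECTS))]
for name, dist in zip(DIALECTS, distributions):
    print(name, {v: round(p, 3) for v, p in dist.items()})
```

Dialect d2 sits where both hypothetical factors overlap. Under the assumed rule, the factor with the larger weight dominates there, so d2 still prefers value x.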
Results
  • The results are shown in Figure 2.
  • The authors can confirm that the proposed method consistently outperformed the admixture model.
  • The proposed method was better at recognizing spatial patterns.
  • This is understandable given that geography is explicitly encoded in the proposed method, whereas the admixture model ignores it.
  • The accuracy dropped more noticeably as K increased.
  • The proposed method retained a relatively high accuracy even with K = 20.
  • It used additional latent factors to capture minor but genuine patterns.
Conclusion
  • The authors' ultimate goal is to uncover the spatio-temporal dynamics of languages; in this paper, they concentrate on spatial inference.
  • [Figure panels: (a) Latent factor 20. (b) Latent factor 6. (c) Latent factor 2. (d) Latent factor 19.]
  • The feature value indicated by gray diamonds was used by many, but not all, dialects on the southwestern island of Kadavu.
  • Not surprisingly, this group gave the largest weight to latent factor 19, which concentrated on Kadavu (Figure 4(d)).
  • Latent factor 19 for i had a much larger weight than latent factor 18 for e, and as a result, the former overwhelmed the latter.
  • In this paper, the authors proposed a Bayesian generative model to analyze dialectal variation.
  • Future directions include the incorporation of phonological and morphosyntactic features, application to other languages, and most importantly, a model extension to infer temporal ordering.
Funding
  • This work was partly supported by JSPS KAKENHI Grant Numbers 18K18104 and 18KK0012.
Reference
  • David H. Alexander, John Novembre, and Kenneth Lange. 2009. Fast model-based estimation of ancestry in unrelated individuals. Genome Research, 19(9):1655–1664.
  • Julian Besag. 1974. Spatial interaction and the statistical analysis of lattice systems. Journal of the Royal Statistical Society. Series B (Methodological), 36(2):192–236.
  • Balthasar Bickel, Johanna Nichols, Taras Zakharko, Alena Witzlack-Makarevich, Kristine Hildebrandt, Michael Rießler, Lennart Bierkandt, Fernando Zúñiga, and John B. Lowe. 2017. The AUTOTYP typological databases. Version 0.1.0.
  • David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022.
  • Remco Bouckaert, Philippe Lemey, Michael Dunn, Simon J. Greenhill, Alexander V. Alekseyenko, Alexei J. Drummond, Russell D. Gray, Marc A. Suchard, and Quentin D. Atkinson. 2012. Mapping the origins and expansion of the Indo-European language family. Science, 337(6097):957–960.
  • Claire Bowern. 2012. The riddle of Tasmanian languages. Proceedings of the Royal Society B: Biological Sciences, 279(1747):4590–4595.
  • David Bryant and Vincent Moulton. 2004. NeighborNet: An agglomerative method for the construction of phylogenetic networks. Molecular Biology and Evolution, 21(2):255–265.
  • Lyle Campbell. 2004. Historical Linguistics: An Introduction (2nd edition). Edinburgh University Press.
  • Chundra A. Cathcart. 2020. A probabilistic assessment of the Indo-Aryan inner-outer hypothesis. Journal of Historical Linguistics, 10(1):42–86.
  • Will Chang, Chundra Cathcart, David Hall, and Andrew Garrett. 2015. Ancestry-constrained phylogenetic analysis supports the Indo-European steppe hypothesis. Language, 91(1):194–244.
  • Hal Daumé III. 2009. Non-parametric Bayesian areal linguistics. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 593–601.
  • Jacob Eisenstein, Brendan O’Connor, Noah A. Smith, and Eric P. Xing. 2010. A latent variable model for geographic lexical variation. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 1277–1287. Association for Computational Linguistics.
  • Jacob Eisenstein, Brendan O’Connor, Noah A. Smith, and Eric P. Xing. 2014. Diffusion of lexical change in social media. PLOS ONE, 9(11):1–13.
  • Paul A. Geraghty. 1983. The History of the Fijian Languages. University of Hawai‘i Press.
  • Jules Gilliéron and Edmond Edmont, editors. 1902–1910. Atlas linguistique de la France. Champion. (in French).
  • Hans Goebl. 2018. Dialectometry. In Charles Boberg, John Nerbonne, and Dominic Watt, editors, The Handbook of Dialectology, pages 123–142. John Wiley & Sons.
  • Russell D. Gray and Quentin D. Atkinson. 2003. Language-tree divergence times support the Anatolian theory of Indo-European origin. Nature, 426(6965):435–439.
  • Simon J. Greenhill, Thomas E. Currie, and Russell D. Gray. 2009. Does horizontal transmission invalidate cultural phylogenies? Proceedings of the Royal Society B: Biological Sciences, 276(1665):2299–2306.
  • Thomas L. Griffiths and Mark Steyvers. 2004. Finding scientific topics. Proceedings of the National Academy of Sciences of the United States of America, 101:5228–5235.
  • Martin Haspelmath, Matthew Dryer, David Gil, and Bernard Comrie, editors. 2005. The World Atlas of Language Structures. Oxford University Press.
  • Henry M. Hoenigswald. 1966. Criteria for the subgrouping of languages. In Henrik Birnbaum and Jaan Puhvel, editors, Ancient Indo-European Dialects. University of California Press.
  • B. R. Holland, K. T. Huber, A. Dress, and V. Moulton. 2002. δ plots: A tool for analyzing phylogenetic distance data. Molecular Biology and Evolution, 19(12):2051–2059.
  • Yosuke Igarashi. 2017. Phylogenetic classification of Japanese dialects using a shared innovation-based cladistic method: A proposal for the Southern Japanese branch (including Ryukyuan) and the Eastern Japanese branch (including Hachijo). First Meeting on the Reconstruction of the Proto-language of Japanese–Ryukyuan Dialects and the Construction of a Phylogenetic Tree by Means of Comparative Linguistic Methods. (in Japanese).
  • Eppie R. Jones, Gloria González-Fortes, Sarah Connell, Veronika Siska, Anders Eriksson, Rui Martiniano, Russell L. McLaughlin, Marcos Gallego Llorente, Lara M. Cassidy, Cristina Gamba, Tengiz Meshveliani, Ofer Bar-Yosef, Werner Müller, Anna Belfer-Cohen, Zinovi Matskevich, Nino Jakeli, Thomas F. G. Higham, Mathias Currat, David Lordkipanidze, Michael Hofreiter, Andrea Manica, Ron Pinhasi, and Daniel G. Bradley. 2015. Upper Palaeolithic genomes reveal deep roots of modern Eurasians. Nature Communications, 6.
  • Siva Kalyan and Alexandre François. 2018. Freeing the comparative method from the tree model: A framework for historical glottometry. Senri Ethnological Studies, 98:59–89.
  • Luke J. Kelly and Geoff K. Nicholls. 2017. Lateral transfer in stochastic Dollo models. Annals of Applied Statistics, 11(2):1146–1168.
  • Wayne Lawrence. 2006. On the subclassification of the Okinawan dialects. The Okinawa Bunka, 40(2):101–118. (in Japanese).
  • Sean Lee and Toshikazu Hasegawa. 2011. Bayesian phylogenetic analysis supports an agricultural origin of Japonic languages. Proceedings of the Royal Society B: Biological Sciences, 278(1725):3662–3669.
  • Faming Liang. 2010. A double Metropolis–Hastings sampler for spatial models with intractable normalizing constants. Journal of Statistical Computation and Simulation, 80(9):1007–1022.
  • Johann-Mattis List, Mary Walworth, Simon J. Greenhill, Tiago Tresoldi, and Robert Forkel. 2018. Sequence comparison in computational historical linguistics. Journal of Language Evolution, 3(2):130–144.
  • Giuseppe Longobardi, Cristina Guardiano, Giuseppina Silvestri, Alessio Boattini, and Andrea Ceolin. 2013. Toward a syntactic phylogeny of modern Indo-European languages. Journal of Historical Linguistics, 3(1):122–152.
  • Paolo Menozzi, Alberto Piazza, and Luigi Cavalli-Sforza. 1978. Synthetic maps of human gene frequencies in Europeans. Science, 201(4358):786–792.
  • Jesper Møller, Anthony N. Pettitt, R. Reeves, and Kasper K. Berthelsen. 2006. An efficient Markov chain Monte Carlo method for distributions with intractable normalising constants. Biometrika, 93(2):451–458.
  • Yugo Murawaki. 2015. Spatial structure of evolutionary models of dialects in contact. PLoS ONE, 10(7):1–15.
  • Yugo Murawaki. 2019. Bayesian learning of latent representations of language structures. Computational Linguistics, 45(2):199–228.
  • Yugo Murawaki and Kenji Yamauchi. 2018. A statistical model for the joint inference of vertical stability and horizontal diffusibility of typological features. Journal of Language Evolution, 3(1):13–25.
  • Iain Murray, Zoubin Ghahramani, and David J. C. MacKay. 2006. MCMC for doubly-intractable distributions. In Proceedings of the Twenty-Second Conference on Uncertainty in Artificial Intelligence, pages 359–366.
  • Radford M. Neal. 2011. MCMC using Hamiltonian dynamics. In Steve Brooks, Andrew Gelman, Galin L. Jones, and Xiao-Li Meng, editors, Handbook of Markov Chain Monte Carlo, pages 113–162. CRC Press.
  • John Nerbonne and Martijn Wieling. 2018. Statistics for aggregate variationist analyses. In Charles Boberg, John Nerbonne, and Dominic Watt, editors, The Handbook of Dialectology, pages 400–414. John Wiley & Sons.
  • Johanna Nichols and Tandy Warnow. 2008. Tutorial on computational linguistic phylogeny. Language and Linguistics Compass, 2(5):760–820.
  • Nick Patterson, Alkes L. Price, and David Reich. 2006. Population structure and eigenanalysis. PLoS Genetics, 2(12):e190.
  • Thomas Pellard. 2009. Ogami: Éléments de description d’un parler du Sud des Ryukyu. Ph.D. thesis, École des Hautes Études en Sciences Sociales (EHESS). (in French).
  • Thomas Pellard. 2018. On phylogenetic classification and bifurcations of Japonic languages. Phylesis and the History of the Japonic Languages from Philological and Field Linguistic Perspectives. (in Japanese).
  • Jonathan K. Pritchard, Matthew Stephens, and Peter Donnelly. 2000. Inference of population structure using multilocus genotype data. Genetics, 155(2):945–959.
  • Ger Reesink, Ruth Singer, and Michael Dunn. 2009. Explaining the linguistic diversity of Sahul using population models. PLoS Biology, 7(11):e1000241.
  • Laurent Sagart, Guillaume Jacques, Yunfan Lai, Robin J. Ryder, Valentin Thouzeau, Simon J. Greenhill, and Johann-Mattis List. 2019. Dated language phylogenies shed light on the ancestry of Sino-Tibetan. Proceedings of the National Academy of Sciences, 116(21):10317–10322.
  • Naruya Saitou and Timothy A. Jinam. 2017. Language diversity of the Japanese Archipelago and its relationship with human DNA diversity. Man in India, 97(1):205–228.
  • Naruya Saitou and Masatoshi Nei. 1987. The neighbor-joining method: A new method for reconstructing phylogenetic trees. Molecular Biology and Evolution, 4(4):406–425.
  • Morris Swadesh. 1952. Lexicostatistic dating of prehistoric ethnic contacts. Proceedings of American Philosophical Society, 96:452–463.
  • Kaj Syrjänen, Terhi Honkola, Jyri Lehtinen, Antti Leino, and Outi Vesakoski. 2016. Applying population genetic approaches within languages: Finnish dialects as linguistic populations. Language Dynamics and Change, 6:235–283.
  • Mary C. Towner, Mark N. Grote, Jay Venti, and Monique Borgerhoff Mulder. 2012. Cultural macroevolution on neighbor graphs: Vertical and horizontal transmission among Western North American Indian societies. Human Nature, 23(3):283–305.
  • Appendix A: A Comparison between the Model of Murawaki (2019) and the Proposed Method
Author
Yugo Murawaki