A latent variable model for geographic lexical variation

EMNLP, pp. 1277–1287 (2010)

Cited by: 795 | Views: 169

Abstract

The rapid growth of geotagged social media raises new computational possibilities for investigating geographic linguistic variation. In this paper, we present a multi-level generative model that reasons jointly about latent topics and geographical regions. High-level topics such as "sports" or "entertainment" are rendered differently in e...

Introduction
  • Sociolinguistics and dialectology study how language varies across social and regional contexts
  • Quantitative research in these fields generally proceeds by counting the frequency of a handful of previously-identified linguistic variables: pairs of phonological, lexical, or morphosyntactic features that are semantically equivalent, but whose frequency depends on social, geographical, or other factors (Paolillo, 2002; Chambers, 2009).
  • The resulting system has multiple capabilities, including: (i) analyzing lexical variation by both topic and geography; (ii) segmenting geographical space into coherent linguistic communities; and (iii) predicting author location based on text alone.
Highlights
  • In a study of morphosyntactic variation, Szmrecsanyi (2010) finds that by the most generous measure, geographical factors account for only 33% of the observed variation
  • Our analysis might well improve if non-geographical factors were considered, including age, race, gender, income and whether a location is urban or rural
Results
  • As shown in Table 1, the geographic topic model achieves the strongest performance on all metrics.
  • Note that the geographic topic model and the mixture of unigrams use identical code and parametrization – the only difference is that the geographic topic model accounts for topical variation, while the mixture of unigrams sets K = 1
  • These results validate the basic premise that it is important to model the interaction between topical and geographical variation.
  • Text regression and supervised LDA perform especially poorly on the classification metric.
  • Both methods make predictions that are averaged across
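The two kinds of score in the location-prediction comparison can be illustrated with a short sketch. It assumes that the regression score is a great-circle distance in kilometers between predicted and true coordinates, computed with the haversine formula (Sinnott, 1984), and that the classification score is plain accuracy over discrete location labels. The function names, the Earth-radius constant, and the use of mean and median as summaries are assumptions made for illustration, not details taken from the paper.

```python
import math
from statistics import mean, median

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    # Clamp to 1.0 to guard against floating-point overshoot before asin.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def regression_scores(predicted, actual):
    """Mean and median distance error in km (lower is better)."""
    errors = [haversine_km(*p, *a) for p, a in zip(predicted, actual)]
    return mean(errors), median(errors)


def classification_accuracy(predicted_labels, actual_labels):
    """Fraction of authors whose location label is predicted exactly (higher is better)."""
    hits = sum(p == a for p, a in zip(predicted_labels, actual_labels))
    return hits / len(actual_labels)
```

A prediction that averages over several plausible locations can land far from all of them, which is one way a method can score tolerably on distance while scoring poorly on accuracy.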
Conclusion
  • This paper presents a model that jointly identifies words with high regional affinity, geographically coherent linguistic regions, and the relationship between regional and topic variation.
  • The key modeling assumption is that regions and topics interact to shape observed lexical frequencies.
  • The authors validate this assumption on a prediction task in which the model outperforms strong alternatives that do not distinguish regional and topical variation.
  • The authors see this work as a first step towards an unsupervised methodology for modeling linguistic variation using raw text.
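The key modeling assumption, that regions and topics interact to shape lexical frequencies, can be made concrete with a toy sampler. This is a simplified sketch under stated assumptions and not the authors' model: the uniform choice of region, the isotropic Gaussian over coordinates, the additive regional offsets, and every name below (`sample_author`, `regional_offsets`, `region_std`, and so on) are hypothetical choices made for illustration.

```python
import math
import random


def softmax(scores):
    """Turn real-valued scores into a probability distribution."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sample_dirichlet(rng, alpha, k):
    """Draw from a symmetric Dirichlet by normalizing gamma variates."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]


def sample_author(rng, base_topics, regional_offsets, region_centers,
                  region_std, n_words, alpha=1.0):
    """Sample a region, a location, and word ids for one author.

    base_topics:      K lists of V log-weights (one per base topic)
    regional_offsets: R x K lists of V deviations from the base topics
    region_centers:   R (lat, lon) pairs
    """
    n_regions = len(regional_offsets)
    n_topics = len(base_topics)
    vocab_size = len(base_topics[0])

    region = rng.randrange(n_regions)  # latent region (assumed uniform)
    lat0, lon0 = region_centers[region]
    location = (rng.gauss(lat0, region_std), rng.gauss(lon0, region_std))
    theta = sample_dirichlet(rng, alpha, n_topics)  # author's topic mix

    words = []
    for _ in range(n_words):
        topic = rng.choices(range(n_topics), weights=theta)[0]
        # Region and topic interact: the word distribution depends on both.
        scores = [b + d for b, d in
                  zip(base_topics[topic], regional_offsets[region][topic])]
        words.append(rng.choices(range(vocab_size), weights=softmax(scores))[0])
    return region, location, words
```

With a single base topic (K = 1), the sketch reduces to a region-only word model, which corresponds loosely to the mixture-of-unigrams baseline described in the Results.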
Tables
  • Table 1: Location prediction results; lower scores are better on the regression task, higher scores are better on the classification task. Distances are in kilometers. Mean location and most common class are computed from the test set. Both the geographic topic model and supervised LDA use the best number of topics from the development set (10 and 5, respectively)
  • Table 2: Example base topics (top line) and regional variants. For the base topics, terms are ranked by log-odds compared to the background distribution. The regional variants show words that are strong compared to both the base topic and the background. Foreign-language words are shown in italics, while terms that are usually in proper nouns are shown in SMALL CAPS. See Table 3 for definitions of slang terms; see Section 7 for more explanation and details on the methodology
  • Table 3: A glossary of non-standard terms from Table 2. Definitions are obtained by manually inspecting the context in which the terms appear, and by consulting www.urbandictionary.com
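The ranking of base-topic terms by log-odds against a background distribution, as described for Table 2, might look like the sketch below. It takes one possible reading of "log-odds", namely the difference between the logit of a term's probability under the topic and its logit under the background. That reading, the function name `rank_by_log_odds`, and the example vocabulary are assumptions, and the paper's exact scoring may differ.

```python
import math


def rank_by_log_odds(topic_probs, background_probs, top_n=5):
    """Rank terms by logit(p_topic) - logit(p_background), highest first.

    Both arguments map term -> probability strictly between 0 and 1.
    """
    def logit(p):
        return math.log(p / (1.0 - p))

    scores = {
        term: logit(p) - logit(background_probs[term])
        for term, p in topic_probs.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


# Hypothetical probabilities: a frequent function word scores near zero,
# while a term that is rare in the background rises to the top.
topic = {"game": 0.2, "the": 0.3, "lakers": 0.1}
background = {"game": 0.02, "the": 0.3, "lakers": 0.001}
ranked = rank_by_log_odds(topic, background)
```

Scoring against the background rather than by raw probability keeps very common words from dominating every topic's list.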
Related work
  • The relationship between language and geography has been a topic of interest to linguists since the nineteenth century (Johnstone, 2010). An early work of particular relevance is Kurath’s Word Geography of the Eastern United States (1949), in which he conducted interviews and then mapped the occurrence of equivalent word pairs such as stoop and porch. The essence of this approach, identifying variable pairs and measuring their frequencies, remains a dominant methodology in both dialectology (Labov et al., 2006) and sociolinguistics (Tagliamonte, 2006). Within this paradigm, computational techniques are often applied to post hoc analysis: logistic regression (Sankoff et al., 2005) and mixed-effects models (Johnson, 2009) are used to measure the contribution of individual variables, while hierarchical clustering and multidimensional scaling enable aggregated inference across multiple variables (Nerbonne, 2009). However, all such work assumes that the relevant linguistic variables have already been identified, a time-consuming process involving considerable linguistic expertise. We view our work as complementary to this tradition: we work directly from raw text, identifying both the relevant features and coherent linguistic communities.
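The traditional methodology described above, picking an equivalent word pair in advance and measuring how often each variant occurs by region, can be sketched in a few lines. The function name `variant_rates`, the whitespace tokenization, and the sample documents are hypothetical; only the stoop/porch pair comes from the text.

```python
from collections import Counter, defaultdict


def variant_rates(documents, variants):
    """Relative frequency of each variant within each region.

    documents: iterable of (region, text) pairs
    variants:  tuple of words treated as semantically equivalent
    """
    counts = defaultdict(Counter)
    for region, text in documents:
        for token in text.lower().split():
            if token in variants:
                counts[region][token] += 1

    rates = {}
    for region, counter in counts.items():
        total = sum(counter.values())
        rates[region] = {v: counter[v] / total for v in variants}
    return rates


# Made-up example data, for illustration only.
docs = [
    ("New York", "we sat on the stoop"),
    ("New York", "the stoop was cold"),
    ("New York", "a porch swing"),
    ("Atlanta", "leave the porch light on"),
]
rates = variant_rates(docs, ("stoop", "porch"))
```

The variable pair has to be supplied by hand here, which is the step that working directly from raw text is meant to avoid.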
Reference
  • L. Backstrom, J. Kleinberg, R. Kumar, and J. Novak. 2008. Spatial variation in search engine queries. In Proceedings of WWW.
  • C. M. Bishop. 2006. Pattern Recognition and Machine Learning. Springer.
  • D. M. Blei and M. I. Jordan. 2006. Variational inference for Dirichlet process mixtures. Bayesian Analysis, 1:121–144.
  • D. M. Blei and J. Lafferty. 2006a. Correlated topic models. In NIPS.
  • D. M. Blei and J. Lafferty. 2006b. Dynamic topic models. In Proceedings of ICML.
  • D. M. Blei and J. D. McAuliffe. 2007. Supervised topic models. In NIPS.
  • D. M. Blei, A. Y. Ng, and M. I. Jordan. 2003. Latent Dirichlet allocation. JMLR, 3:993–1022.
  • M. Bucholtz, N. Bermudez, V. Fung, L. Edwards, and R. Vargas. 2007. Hella Nor Cal or totally So Cal? The perceptual dialectology of California. Journal of English Linguistics, 35(4):325–352.
  • F. G. Cassidy and J. H. Hall. 1985. Dictionary of American Regional English, volume 1. Harvard University Press.
  • J. Chambers. 2009. Sociolinguistic Theory: Linguistic Variation and its Social Significance. Blackwell.
  • D. J. Crandall, L. Backstrom, D. Huttenlocher, and J. Kleinberg. 2009. Mapping the world’s photos. In Proceedings of WWW, pages 761–770.
  • J. Friedman, T. Hastie, and R. Tibshirani. 2010. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1).
  • D. E. Johnson. 2009. Getting off the GoldVarb standard: Introducing Rbrul for mixed-effects variable rule analysis. Language and Linguistics Compass, 3(1):359–383.
  • B. Johnstone. 2010. Language and place. In R. Mesthrie and W. Wolfram, editors, Cambridge Handbook of Sociolinguistics. Cambridge University Press.
  • M. Joshi, D. Das, K. Gimpel, and N. A. Smith. 2010. Movie reviews and revenues: An experiment in text regression. In Proceedings of NAACL-HLT.
  • H. Kurath. 1949. A Word Geography of the Eastern United States. University of Michigan Press.
  • H. Kwak, C. Lee, H. Park, and S. Moon. 2010. What is Twitter, a social network or a news media? In Proceedings of WWW.
  • W. Labov, S. Ash, and C. Boberg. 2006. The Atlas of North American English: Phonetics, Phonology, and Sound Change. Walter de Gruyter.
  • W. Labov. 1966. The Social Stratification of English in New York City. Center for Applied Linguistics.
  • Q. Mei, C. Liu, H. Su, and C. X. Zhai. 2006. A probabilistic approach to spatiotemporal theme pattern mining on weblogs. In Proceedings of WWW, page 542.
  • Q. Mei, X. Ling, M. Wondra, H. Su, and C. X. Zhai. 2007. Topic sentiment mixture: modeling facets and opinions in weblogs. In Proceedings of WWW.
  • T. P. Minka. 2003. Estimating a Dirichlet distribution. Technical report, Massachusetts Institute of Technology.
  • J. Nerbonne. 2009. Data-driven dialectology. Language and Linguistics Compass, 3(1).
  • B. O’Connor, M. Krieger, and D. Ahn. 2010. TweetMotif: Exploratory search and topic summarization for twitter. In Proceedings of ICWSM.
  • J. C. Paolillo. 2002. Analyzing Linguistic Variation: Statistical Models and Methods. CSLI Publications.
  • M. Paul and R. Girju. 2010. A two-dimensional topic-aspect model for discovering multi-faceted topics. In Proceedings of AAAI.
  • W. D. Penny. 2001. Variational Bayes for d-dimensional Gaussian mixture models. Technical report, University College London.
  • D. Sankoff, S. A. Tagliamonte, and E. Smith. 2005. Goldvarb X: A variable rule application for Macintosh and Windows. Technical report, Department of Linguistics, University of Toronto.
  • R. W. Sinnott. 1984. Virtues of the Haversine. Sky and Telescope, 68(2).
  • B. Szmrecsanyi. 2010. Geography is overrated. In S. Hansen, C. Schwarz, P. Stoeckle, and T. Streck, editors, Dialectological and Folk Dialectological Concepts of Space. Walter de Gruyter.
  • S. A. Tagliamonte and D. Denis. 2008. Linguistic ruin? LOL! Instant messaging and teen language. American Speech, 83.
  • S. A. Tagliamonte. 2006. Analysing Sociolinguistic Variation. Cambridge University Press.
  • M. J. Wainwright and M. I. Jordan. 2008. Graphical Models, Exponential Families, and Variational Inference. Now Publishers.
  • E. P. Xing. 2005. On topic evolution. Technical Report 05-115, Center for Automated Learning and Discovery, Carnegie Mellon University.