Simple, Interpretable and Stable Method for Detecting Words with Usage Change across Corpora

Hila Gonen
Ganesh Jawahar
Djamé Seddah

ACL 2020, pp. 538–555.


Abstract:

The problem of comparing two bodies of text and searching for words that differ in their usage between them arises often in digital humanities and computational social science. This is commonly approached by training word embeddings on each corpus, aligning the vector spaces, and looking for words whose cosine distance in the aligned spaces…

Introduction
  • Analyzing differences in corpora from different sources is a central use case in digital humanities and computational social science.
  • A particular methodology is to identify individual words that are used differently in the different corpora.
  • This includes words that have their meaning changed over time periods (Kim et al., 2014; Kulkarni et al., 2015; Hamilton et al., 2016b; Kutuzov et al., 2018; Tahmasebi et al., 2018), and words that are used differently by different populations (Azarbonyad et al., 2017; Rudolph et al., 2017).
  • The common alignment-based approach, however, is sensitive to proper nouns and requires filtering them.
Highlights
  • Analyzing differences in corpora from different sources is a central use case in digital humanities and computational social science
  • We propose a new and simple method for detecting usage change, that does not involve vector space alignment (§5)
  • Instead of trying to align two different vector spaces, we propose to work directly in the shared vocabulary space: we take the neighbors of a word in a vector space to reflect its usage, and consider words that have drastically different neighbors in the spaces induced by the different corpora to be words whose usage has changed.
  • We compare our proposed method (NN) to the method of Hamilton et al. (2016b) described in Section 4 (AlignCos), in which the vector spaces are first aligned using the Orthogonal Procrustes algorithm, and words are ranked according to the cosine distance between their representations in the two spaces.
  • For comparison, we also report the top-10 words according to the AlignCos method.
  • We show that the method is considerably more stable than the popular alignment-based method popularized by Hamilton et al. (2016b), and requires less tuning and word filtering.
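As a concrete illustration of the neighbor-based idea, here is a minimal NumPy sketch (a toy re-implementation, not the authors' released toolkit; scoring each word by the size of the intersection of its top-k neighbor sets is one natural way to operationalize "drastically different neighbors"):

```python
import numpy as np

def top_k_neighbors(word, vocab, vectors, k):
    """Return the set of k nearest neighbors of `word` by cosine similarity."""
    idx = vocab.index(word)
    normed = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = normed @ normed[idx]
    sims[idx] = -np.inf  # exclude the query word itself
    return {vocab[i] for i in np.argsort(-sims)[:k]}

def usage_change_ranking(vocab, vecs_a, vecs_b, k=10):
    """Rank shared-vocabulary words by how much their neighborhoods differ.

    Words whose top-k neighbor sets barely intersect across the two
    spaces come first, i.e. are flagged as usage-change candidates.
    """
    overlap = {
        w: len(top_k_neighbors(w, vocab, vecs_a, k)
               & top_k_neighbors(w, vocab, vecs_b, k))
        for w in vocab
    }
    return sorted(vocab, key=overlap.get)  # smallest overlap first
```

Because only neighbor identities are compared, no alignment between the two spaces is needed; the two embedding models can be trained completely independently.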
Methods
  • The authors compare the proposed method (NN) to the method of Hamilton et al. (2016b) described in Section 4 (AlignCos), in which the vector spaces are first aligned using the Orthogonal Procrustes (OP) algorithm, and words are ranked according to the cosine distance between their representations in the two spaces. This method was shown by Schlechtweg et al. (2019) to outperform all the other methods it was compared against.

    The authors demonstrate the approach by using it to detect change in word usage in different scenarios.
  • The authors also compare to the longer-term (90-year) diachronic setup of Hamilton et al. (2016b), which is based on Google Books.
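The AlignCos baseline can likewise be sketched in a few lines of NumPy (a simplified illustration on toy matrices; the actual experiments use embeddings trained on real corpora and additional vocabulary filtering):

```python
import numpy as np

def procrustes_align(X, Y):
    """Orthogonal Procrustes (Schonemann, 1966): the orthogonal W
    minimizing ||XW - Y||_F, computed from the SVD of X^T Y."""
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

def align_cos_ranking(vocab, X, Y):
    """Align X onto Y's space, then rank words by the cosine distance
    between their two representations (largest distance first)."""
    Xa = X @ procrustes_align(X, Y)
    Xa /= np.linalg.norm(Xa, axis=1, keepdims=True)
    Yn = Y / np.linalg.norm(Y, axis=1, keepdims=True)
    dist = 1.0 - np.sum(Xa * Yn, axis=1)
    return [vocab[i] for i in np.argsort(-dist)]
```

Note that the rotation W is estimated from all rows at once, so words that genuinely changed still influence the alignment; this global fit is one source of the instability the paper reports for alignment-based methods.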
Results
  • The authors run the proposed method and AlignCos (Hamilton et al., 2016b) on the different scenarios described in Section 6, and manually inspect the results.
  • The authors list a few interesting words detected by the method, accompanied by a brief explanation.
  • The authors depict the top-10 words the method yields for the Age split (Table 2), accompanied by the nearest neighbors in each corpus, to better understand the context.
  • Similar tables for the other splits are provided in the Appendix
Conclusion
  • Detecting words that are used differently in different corpora is an important use-case in corpus-based research.
  • The authors present a simple and effective method for this task, demonstrating its applicability in multiple different settings.
  • The authors show that the method is considerably more stable than the popular alignment-based method popularized by Hamilton et al. (2016b), and requires less tuning and word filtering.
  • The authors suggest that researchers adopt this method, and provide an accompanying software toolkit.
Tables
  • Table 1: Statistics of the different splits
  • Table 2: Top-10 detected words from our method (NN) vs. AlignCos method (last row), for corpus split according to the age of the tweet-author. Each word from our method is accompanied by its top-10 neighbors in each of the two corpora (Young vs. Older)
  • Table 3: Results on DURel and SURel with NN and with AlignCos
  • Table 4: Top-10 detected words from our method (NN) vs. AlignCos method (last row), for corpus split according to the year of the text. Each word from our method is accompanied by its top-10 neighbors in each of the two corpora (1900 vs. 1990)
  • Table 5: Top-10 detected words from our method (NN) vs. AlignCos method (last row), for corpus split according to the gender of the tweet-author. Each word from our method is accompanied by its top-10 neighbors in each of the two corpora (Male vs. Female)
  • Table 6: Top-10 detected words from our method (NN) vs. AlignCos method (last row), for corpus split according to the occupation of the tweet-author. Each word from our method is accompanied by its top-10 neighbors in each of the two corpora (performer vs. sports)
  • Table 7: Top-10 detected words from our method (NN) vs. AlignCos method (last row), for corpus split according to the occupation of the tweet-author. Each word from our method is accompanied by its top-10 neighbors in each of the two corpora (creator vs. sports)
  • Table 8: Top-10 detected words from our method (NN) vs. AlignCos method (last row), for corpus split according to the occupation of the tweet-author. Each word from our method is accompanied by its top-10 neighbors in each of the two corpora (creator vs. performer)
  • Table 9: Top-10 detected words from our method (NN) vs. AlignCos method (last row), for corpus split according to the time of week of the tweet. Each word from our method is accompanied by its top-10 neighbors in each of the two corpora (weekday vs. weekend)
  • Table 10: Top-10 detected words from our method (NN) vs. AlignCos method (last row), for corpus split according to the year of the text. Each word from our method is accompanied by its top-10 neighbors in each of the two corpora (2014 vs. 2018)
Funding
  • This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme, grant agreement No. 802774 (iEXTRACT), and from the Israeli Ministry of Science, Technology and Space through the Israeli-French Maimonide Cooperation programme.
  • The second and third authors were partially funded by the French Research Agency projects ParSiTi (ANR-16-CE330021) and SoSweet (ANR15-CE38-0011-01), and by the French Ministry of Industry and Ministry of Foreign Affairs via the PHC Maimonide France-Israel cooperation programme.
Reference
  • Jacob Levy Abitbol, Marton Karsai, Jean-Philippe Mague, Jean-Pierre Chevrot, and Eric Fleury. 2018. Socioeconomic dependencies of linguistic patterns in twitter: A multivariate analysis. In Proceedings of the 2018 World Wide Web Conference, WWW ’18, pages 1125–1134.
  • Maria Antoniak and David Mimno. 2018. Evaluating the Stability of Embedding-based Word Similarities. Transactions of the Association for Computational Linguistics, 6:107–119.
  • Mikel Artetxe, Gorka Labaka, and Eneko Agirre. 2018a. A robust self-learning method for fully unsupervised cross-lingual mappings of word embeddings. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 789–798, Melbourne, Australia. Association for Computational Linguistics.
  • Mikel Artetxe, Gorka Labaka, Eneko Agirre, and Kyunghyun Cho. 2018b. Unsupervised Neural Machine Translation. In 6th International Conference on Learning Representations, ICLR.
  • Hosein Azarbonyad, Mostafa Dehghani, Kaspar Beelen, Alexandra Arkut, Maarten Marx, and Jaap Kamps. 2017. Words Are Malleable: Computing Semantic Shifts in Political and Media Discourse. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, CIKM ’17, pages 1509–1518.
  • Robert Bamler and Stephan Mandt. 2017. Dynamic Word Embeddings. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 380–389, International Convention Centre, Sydney, Australia. PMLR.
  • Mark Davies. 2015. Corpus of Historical American English (COHA).
  • Marco Del Tredici, Raquel Fernandez, and Gemma Boleda. 2019. Short-Term Meaning Shift: A Distributional Exploration. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2069–2075, Minneapolis, Minnesota. Association for Computational Linguistics.
  • Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), pages 4171–4186.
  • Lea Frermann and Mirella Lapata. 2016. A Bayesian Model of Diachronic Meaning Change. Transactions of the Association for Computational Linguistics, 4(0).
  • Mario Giulianelli. 2019. Lexical semantic change analysis with contextualised word representations. Master’s thesis, Institute for Logic, Language and Computation, University of Amsterdam, July.
  • William L. Hamilton, Jure Leskovec, and Dan Jurafsky. 2016a. Cultural Shift or Linguistic Drift? Comparing Two Computational Measures of Semantic Change. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2116–2121, Austin, Texas. Association for Computational Linguistics.
  • William L. Hamilton, Jure Leskovec, and Dan Jurafsky. 2016b. Diachronic Word Embeddings Reveal Statistical Laws of Semantic Change. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1489–1501, Berlin, Germany. Association for Computational Linguistics.
  • Anna Hatty, Dominik Schlechtweg, and Sabine Schulte im Walde. 2019. SURel: A Gold Standard for Incorporating Meaning Shifts into Term Extraction. In Proceedings of the Eighth Joint Conference on Lexical and Computational Semantics (*SEM 2019), pages 1–8, Minneapolis, Minnesota. Association for Computational Linguistics.
  • Tin Kam Ho, Luis A. Lastras, and Oded Shmueli. 2016. Concept Evolution Modeling Using Semantic Vectors. In Proceedings of the 25th International Conference Companion on World Wide Web, WWW ’16 Companion, pages 45–46. International World Wide Web Conferences Steering Committee.
  • Renfen Hu, Shen Li, and Shichen Liang. 2019. Diachronic sense modeling with deep contextualized word embeddings: An ecological view. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3899–3908, Florence, Italy. Association for Computational Linguistics.
  • Adam Jatowt and Kevin Duh. 2014. A Framework for Analyzing Semantic Change of Words Across Time. In Proceedings of the 14th ACM/IEEE-CS Joint Conference on Digital Libraries, JCDL ’14, pages 229– 238. IEEE Press.
  • Yova Kementchedjhieva, Sebastian Ruder, Ryan Cotterell, and Anders Søgaard. 2018. Generalizing Procrustes Analysis for Better Bilingual Dictionary Induction. In Proceedings of the 22nd Conference on Computational Natural Language Learning, pages 211–220, Brussels, Belgium. Association for Computational Linguistics.
  • Tom Kenter, Melvin Wevers, Pim Huijnen, and Maarten de Rijke. 2015. Ad Hoc Monitoring of Vocabulary Shifts over Time. In Proceedings of the 24th ACM International on Conference on Information and Knowledge Management, CIKM ’15, pages 1191–1200. ACM.
  • Yoon Kim, Yi-I Chiu, Kentaro Hanaki, Darshan Hegde, and Slav Petrov. 2014. Temporal Analysis of Language through Neural Language Models. In Proceedings of the ACL 2014 Workshop on Language Technologies and Computational Social Science, pages 61–65, Baltimore, MD, USA. Association for Computational Linguistics.
  • Maja Rudolph and David Blei. 2018. Dynamic Embeddings for Language Evolution. In Proceedings of the 2018 World Wide Web Conference, WWW ’18, pages 1003–1011.
  • Vivek Kulkarni, Rami Al-Rfou, Bryan Perozzi, and Steven Skiena. 2015. Statistically significant detection of linguistic change. In Proceedings of the 24th International Conference on World Wide Web, pages 625–635. International World Wide Web Conferences Steering Committee.
  • Andrey Kutuzov, Lilja Øvrelid, Terrence Szymanski, and Erik Velldal. 2018. Diachronic word embeddings and semantic shifts: a survey. In Proceedings of the 27th International Conference on Computational Linguistics, Santa Fe, New Mexico, USA. Association for Computational Linguistics.
  • Guillaume Lample, Alexis Conneau, Ludovic Denoyer, and Marc’Aurelio Ranzato. 2018a. Unsupervised machine translation using monolingual corpora only. In Proceedings of ICLR.
  • Guillaume Lample, Alexis Conneau, Marc’Aurelio Ranzato, Ludovic Denoyer, and Herve Jegou. 2018b. Word translation without parallel data. In 6th International Conference on Learning Representations, ICLR.
  • Laurens van der Maaten and Geoffrey Hinton. 2008. Visualizing Data using t-SNE. Journal of machine learning research, 9(Nov):2579–2605.
  • Maja Rudolph, Francisco Ruiz, Susan Athey, and David Blei. 2017. Structured embedding models for grouped data. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS’17, pages 250–260, USA. Curran Associates Inc.
  • Dominik Schlechtweg, Anna Hatty, Marco Del Tredici, and Sabine Schulte im Walde. 2019. A Wind of Change: Detecting and Evaluating Lexical Semantic Change across Times and Domains. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 732–746, Florence, Italy. Association for Computational Linguistics.
  • Dominik Schlechtweg, Sabine Schulte im Walde, and Stefanie Eckmann. 2018. Diachronic Usage Relatedness (DURel): A Framework for the Annotation of Lexical Semantic Change. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 169–174, New Orleans, Louisiana. Association for Computational Linguistics.
  • Peter H. Schönemann. 1966. A generalized solution of the Orthogonal Procrustes problem. Psychometrika, 31(1):1–10.
  • Matej Martinc, Petra Kralj Novak, and Senja Pollak. 2019. Leveraging contextual embeddings for detecting diachronic semantic shift. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics.
  • Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013a. Efficient estimation of word representations in vector space. arXiv:1301.3781.
  • Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013b. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems.
  • Sunny Mitra, Ritwik Mitra, Martin Riedl, Chris Biemann, Animesh Mukherjee, and Pawan Goyal. 2014. That’s sick dude!: Automatic identification of word sense change across different timescales. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1020–1029, Baltimore, Maryland. Association for Computational Linguistics.
  • Milan Straka and Jana Straková. 2017. Tokenizing, POS tagging, lemmatizing and parsing UD 2.0 with UDPipe. In Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pages 88–99, Vancouver, Canada. Association for Computational Linguistics.
  • Nina Tahmasebi, Lars Borin, and Adam Jatowt. 2018. Survey of Computational Approaches to Lexical Semantic Change. CoRR, abs/1811.06278.
  • Laura Wendlandt, Jonathan K. Kummerfeld, and Rada Mihalcea. 2018. Factors Influencing the Surprising Instability of Word Embeddings. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2092–2102, New Orleans, Louisiana. Association for Computational Linguistics.
  • Matti Wiegmann, Benno Stein, and Martin Potthast. 2019. Celebrity Profiling. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2611–2618, Florence, Italy. Association for Computational Linguistics.
  • Syrielle Montariol and Alexandre Allauzen. 2020. Etude des variations semantiques atravers plusieurs dimensions. In Actes de la conference conjointe JEP-TALN 2020, Nancy, France.
  • Chao Xing, Dong Wang, Chao Liu, and Yiye Lin. 2015. Normalized Word Embedding and Orthogonal Transform for Bilingual Word Translation. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1006–1011, Denver, Colorado. Association for Computational Linguistics.
  • Jaewon Yang and Jure Leskovec. 2011. Patterns of Temporal Variation in Online Media. In Proceedings of the Fourth ACM International Conference on Web Search and Data Mining, WSDM ’11, pages 177–186, New York, NY, USA. ACM.
  • Zijun Yao, Yifan Sun, Weicong Ding, Nikhil Rao, and Hui Xiong. 2018. Dynamic Word Embeddings for Evolving Semantic Discovery. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, WSDM ’18, pages 673–681. ACM.
  • Tokenization: We tokenize the English, French and Hebrew tweets using ark-twokenize-py, the Moses tokenizer and UDPipe (Straková and Straka, 2017), respectively. We lowercase all the tweets and remove hashtags, mentions, retweets and URLs. We replace all occurrences of numbers with a special token. We discard all words that do not contain at least one of the following: (1) a character from the respective language; (2) one of the punctuation marks “-”, “’”, “.”; (3) an emoji.
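A rough sketch of this filtering rule for the English case might look as follows (the `LATIN` pattern stands in for "a character from the respective language", and the emoji range is a simplifying assumption; the source does not specify how emoji are detected):

```python
import re

# "A character from the respective language" -- English case only.
LATIN = re.compile(r"[a-z]")
# The three punctuation marks the rule allows.
KEEP_PUNCT = {"-", "'", "."}
# A rough emoji range (assumption): misc symbols + main emoji blocks.
EMOJI = re.compile("[\u2600-\u27BF\U0001F300-\U0001FAFF]")

def keep_token(token):
    """Keep a token only if it contains a language character,
    one of - ' . , or an emoji, per the filtering rule above."""
    return (bool(LATIN.search(token))
            or any(ch in KEEP_PUNCT for ch in token)
            or bool(EMOJI.search(token)))
```

Tokens made entirely of other symbols (e.g. `!!!` or `@@@`) are discarded before training the embeddings.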
  • Word embeddings: We construct the word representations using the continuous skip-gram negative-sampling model from word2vec (Mikolov et al., 2013a,b), as implemented in Gensim. For all our experiments, we set the vector dimension to 300, the window size to 4, and the minimum number of occurrences of a word to 20. The rest of the hyperparameters are set to their default values.
  • We show the top-10 words our method yields for each of the different splits, accompanied by the nearest neighbors in each corpus (excluding words in the intersection), to better understand the context. For comparison, we also show the top-10 words according to the AlignCos method. The splits are the following. English: 1900 vs. 1990. The list of top-10 detected words from our method (NN) vs. the AlignCos method, for the corpus split according to the year of the English text, is displayed in Table 4.
    (Moses tokenizer: https://www.nltk.org/_modules/nltk/tokenize/moses.html)
  • French: 2014 vs. 2018. The list of top-10 detected words from our method (NN) vs. the AlignCos method, for the corpus split according to the year of the French text, is displayed in Table 10. Interesting words found in the top-10 list are the following (2014 vs. 2018): ia (in 2014, a frequent misspelling of “ya”, the vernacular contraction of “il y a”, “there is”, vs. “intelligence artificielle”, artificial intelligence), and divergent (the movie vs. the adjective). In addition, interesting words that came up in the top-30 list are the following: pls (a contraction of the borrowing “please” vs. the acronym of “Position latérale de sécurité”, lateral safety position, now used as a figurative synonym for “having a stroke”). In the same vein, and tied to political debates, we note apl (a contraction of “appel/appeler”, call/to call, vs. the controversial housing subsidies).
  • Hebrew: 2014 vs. 2018. The list of top-10 detected words from our method (NN) vs. the AlignCos method, for the corpus split according to the year of the Hebrew text, is displayed in Figure 4. Interesting words found in the top-10 list (2014 vs. 2018) are the following (we use transliteration accompanied by a literal English translation): beelohim, “in god” (a pledge word vs. religion-related usage), and Kim (a first name vs. Kim Jong-un). In addition, interesting words that came up in the top-30 list are the following: shtifat, “washing” (plumbing vs. brainwashing); miklat, “shelter” (the building vs. asylum for refugees); borot, “pit/ignorance” (plural of pit vs. ignorance).