Bridging Linguistic Typology and Multilingual Machine Translation with Multi-View Language Representations

EMNLP 2020, pp. 2391-2406.

Other Links: arxiv.org|academic.microsoft.com

Abstract:

Sparse language vectors from linguistic typology databases and learned embeddings from tasks like multilingual machine translation have been investigated in isolation, without analysing how they could benefit from each other’s language characterisation. We propose to fuse both views using singular vector canonical correlation analysis and …

Introduction
  • Recent surveys consider linguistic typology as a potential source of knowledge to support multilingual natural language processing (NLP) tasks (O’Horan et al., 2016; Ponti et al., 2019).
  • Previous work has shown that task-learned embeddings are potential candidates to predict features of a linguistic typology KB (Malaviya et al., 2017), and the goal is to evaluate whether SVCCA can enhance the NMT-learned language embeddings with typological knowledge from their KB parallel view.
  • The authors use a logistic regression classifier per U_S (URIEL) feature, which is trained with the NMT-learned or SVCCA representations in both one-language-out and one-language-family-out settings (a probe sketch follows below).
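A minimal sketch of the per-feature probe described above, assuming the representations and typological feature values are already loaded into dictionaries; all names and the data layout are illustrative rather than the authors' exact code.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def probe_feature(vectors, feature_values, held_out):
    """Train on every language except `held_out`, then predict its feature value.

    vectors:        {lang: np.ndarray}  NMT-learned or SVCCA representations
    feature_values: {lang: int}         one typological (U_S) feature per language
    """
    train_langs = [l for l in vectors if l != held_out]
    X = np.stack([vectors[l] for l in train_langs])
    y = np.array([feature_values[l] for l in train_langs])
    clf = LogisticRegression(max_iter=1000).fit(X, y)
    return clf.predict(vectors[held_out].reshape(1, -1))[0]

# One-language-out accuracy for a single feature:
# acc = np.mean([probe_feature(vecs, feat, l) == feat[l] for l in vecs])
```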
Highlights
  • Recent surveys consider linguistic typology as a potential source of knowledge to support multilingual natural language processing (NLP) tasks (O’Horan et al., 2016; Ponti et al., 2019)
  • Previous work has shown that task-learned embeddings are potential candidates to predict features of a linguistic typology knowledge base (KB) (Malaviya et al., 2017), and our goal is to evaluate whether singular vector canonical correlation analysis (SVCCA) can enhance the neural machine translation (NMT)-learned language embeddings with typological knowledge from their KB parallel view
  • We argue that a potential reason for the drop in accuracy is the method used to extract the NMT-learned embeddings, which could diminish the information embedded about each language and impact the SVCCA projection
  • We notice that specific typological knowledge is usually hard to learn in an unsupervised way, and fusing learned embeddings with KB vectors using SVCCA is feasible for inducing linguistic typology information in some scenarios (see the fusion sketch after this list)
  • We first examine how well a language phylogeny can be reconstructed from language representations
  • Comparing the two ranking approaches, we observe that SVCCA achieves a comparable performance in most of the cases
  • English is projected into the Germanic branch, while Latvian is separated from the Balto-Slavic group
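A hedged sketch of the SVCCA-style fusion idea: each aligned view (rows = languages) is first reduced with SVD to the components explaining most of the variance, then the two views are maximally correlated with CCA. The component counts and the final averaged "fused" vector are illustrative choices, not necessarily the paper's exact configuration.

```python
import numpy as np
from sklearn.cross_decomposition import CCA

def svd_reduce(X, var_kept=0.99):
    """Project X onto the top singular directions explaining `var_kept` of the variance."""
    Xc = X - X.mean(axis=0)
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    k = int(np.searchsorted(np.cumsum(s**2) / np.sum(s**2), var_kept)) + 1
    return Xc @ Vt[:k].T

def svcca_fuse(U, L, var_kept=0.99, n_components=32):
    """U: KB view (e.g. URIEL vectors), L: NMT-learned view, rows aligned by language."""
    U_r, L_r = svd_reduce(U, var_kept), svd_reduce(L, var_kept)
    cca = CCA(n_components=min(n_components, U_r.shape[1], L_r.shape[1]))
    U_c, L_c = cca.fit_transform(U_r, L_r)   # maximally correlated projections
    return (U_c + L_c) / 2.0                 # one possible multi-view vector per language
```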
Results
  • To address the task, Tan et al. (2019) trained a factored multilingual NMT model of 23 languages from Cettolo et al. (2012), where language embeddings are learned while translating from different languages, to investigate what kind of genetic information is preserved.
  • Rather than only using the multi-view representations to compute a set of clusters, the authors address the question: does the massive model need to be trained again whenever one or more new languages are added to the setting?
  • The authors use the multi-view representations to rank related languages from the vector space, as they embed information about typological and lexical relationships.
  • The authors investigate whether SVCCA is a useful method for rapidly increasing the number of languages without retraining massive models, given new entries that would otherwise require their own NMT-learned embeddings for clustering (a ranking sketch follows this list).
  • Using the NMT-learned embeddings (L_T) as in Tan et al. (2019), or the concatenation baseline, obtains similar translation results in the last three bins.
  • Table 3 shows the BLEU scores of the translation into English for the smaller multilingual models that group each child language with their candidates ranked by LANGRANK and the SVCCA-53 representations.
  • The authors note that LANGRANK prefers related languages with large datasets, as it only requires three candidates to group around half a million training samples, whereas SVCCA suggests including from three to ten languages to reach a similar amount of parallel sentences.
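An illustrative ranking routine over the fused vector space: candidates for a low-resource target language are ordered by cosine similarity. The dictionary layout and the Galician example are assumptions made for this sketch.

```python
import numpy as np

def rank_candidates(target, vectors, top_k=10):
    """Return the top_k languages whose vectors lie closest to `target`'s vector."""
    t = vectors[target]
    def cos(a, b):
        return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)
    scores = {l: cos(t, v) for l, v in vectors.items() if l != target}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

# e.g. rank_candidates("glg", svcca_vectors) could propose languages to pool with
# Galician when assembling a smaller multilingual NMT training set.
```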
Conclusion
  • This pattern indicates that the vectors are not suitable for clustering, and they might only encode enough information to perform a classification role during multilingual NMT training and inference.
  • It is possible to use language vectors from KBs or task-learned embeddings from different settings, such as one-to-many or many-to-many NMT and multilingual language modelling.
  • According to the authors, this is the first time a CCA-based method has been used to compute language-level representations.
  • The authors could rapidly project new language representations to assess tasks like clustering or ranking candidates for multilingual NMT involving massive datasets of hundreds of languages (a clustering sketch follows below).
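A minimal sketch, under assumed data structures, of how language vectors could be grouped with Ward-linkage hierarchical clustering; the resulting tree can be inspected against a gold-standard phylogeny, and the flat clusters can be used to assemble smaller multilingual NMT models.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def cluster_languages(vectors, n_clusters=5):
    """vectors: {lang: np.ndarray}; returns the hierarchical tree and flat groups."""
    langs = sorted(vectors)
    X = np.stack([vectors[l] for l in langs])
    Z = linkage(X, method="ward")                          # full hierarchical tree
    labels = fcluster(Z, t=n_clusters, criterion="maxclust")
    groups = {}
    for lang, lab in zip(langs, labels):
        groups.setdefault(int(lab), []).append(lang)
    return Z, groups                                       # tree + flat clusters
```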
Tables
  • Table 1: Avg. accuracy (↑) of typological feature prediction per NMT-learned and SVCCA(U_S, L_*) setting
  • Table 2: APTED and nAPTED scores (↓) between the gold-standard (GS) and inferred trees from all scenarios. NMT-learned and concatenation (⊕) can only reconstruct pruned trees of 16 (L_B), 12 (L_W) and 15 (L_T) languages
  • Table 3: BLEU scores (L→English) for Individual, Massive and ranking approaches. LANGRANK shows the accumulated training size (in thousands) for the top-3 candidates, whereas with SVCCA we approximate the amount of data and include the number of languages
  • Table 4: List of languages with their BLEU scores per clustering approach (IE=Indo-European)
  • Table 5: BLEU score average per language family (IE=Indo-European). Every method includes the weighted BLEU average per number of languages (#L) and the number of clusters/models. Bold and italic represent the first and second best results per family. Δ for SVCCA indicates the difference with respect to the highest score
  • Table 6: Similar to Table 2, but including the optimal values for the SVD explained variance in each setting
Funding
  • This work was supported by funding from the European Union’s Horizon 2020 research and innovation programme
  • Also, it was performed using resources provided by the Cambridge Service for Data Driven Discovery (CSD3) operated by the University of Cambridge Research Computing Service (http://www.csd3.cam.ac.uk/), provided by Dell EMC and Intel using Tier-2 funding from the Engineering and Physical Sciences Research Council (capital grant EP/P020259/1), and DiRAC funding from the Science and Technology Facilities Council (www.dirac.ac.uk)
Study subjects and analysis
specific low-resource cases: 5
6.3 Language ranking results. After discussing overall translation accuracy for all the languages, we now focus on five specific low-resource cases and how multilingual transfer enhances their performance. Table 3 shows the BLEU scores of the translation into English for the smaller multilingual models that group each child language with their candidates ranked by LANGRANK and our SVCCA-53 representations (a BLEU-scoring sketch follows).
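The exact scoring pipeline is not reproduced here; below is a minimal sacreBLEU (Post, 2018) usage sketch of how per-model BLEU scores like those in Table 3 are typically computed, with placeholder sentences.

```python
import sacrebleu

hypotheses = ["the cat sat on the mat", "it is raining"]
references = [["the cat sat on a mat", "it is raining today"]]  # one reference stream

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU = {bleu.score:.2f}")
```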

Reference
  • Antonios Anastasopoulos. 2019. A note on evaluating multilingual benchmarks.
  • Johannes Bjerva and Isabelle Augenstein. 2018a. From phonology to syntax: Unsupervised linguistic typology at different levels with language embeddings. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 907–916, New Orleans, Louisiana. Association for Computational Linguistics.
  • Johannes Bjerva and Isabelle Augenstein. 2018b. Tracking typological traits of uralic languages in distributed language representations. In Proceedings of the Fourth International Workshop on Computational Linguistics of Uralic Languages, pages 76–86, Helsinki, Finland. Association for Computational Linguistics.
  • Johannes Bjerva, Yova Kementchedjhieva, Ryan Cotterell, and Isabelle Augenstein. 2019a. A probabilistic generative model of linguistic typology. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 1529–1540, Minneapolis, Minnesota. Association for Computational Linguistics.
  • Johannes Bjerva, Robert Östling, Maria Han Veiga, Jörg Tiedemann, and Isabelle Augenstein. 2019b. What do language representations really represent? Computational Linguistics, 45(2):381–389.
  • Mauro Cettolo, Christian Girardi, and Marcello Federico. 2012. WIT3: Web inventory of transcribed and translated talks. In Proceedings of the 16th Conference of the European Association for Machine Translation (EAMT), pages 261–268, Trento, Italy.
  • Bernard Comrie. 1989. Language universals and linguistic typology: Syntax and morphology. University of Chicago press.
  • Paramveer S Dhillon, Dean P Foster, and Lyle H Ungar. 2015. Eigenwords: Spectral word embeddings. The Journal of Machine Learning Research, 16(1):3035– 3078.
  • Matthew S. Dryer and Martin Haspelmath, editors. 2013. WALS Online. Max Planck Institute for Evolutionary Anthropology, Leipzig.
  • Susan T Dumais. 2004. Latent semantic analysis. Annual review of information science and technology, 38(1):188–230.
  • Isidore Dyen, Joseph B Kruskal, and Paul Black. 1992. An indoeuropean classification: A lexicostatistical experiment. Transactions of the American Philosophical society, 82(5):iii–132.
  • Manaal Faruqui and Chris Dyer. 2014. Improving vector space word representations using multilingual correlation. In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, pages 462–471, Gothenburg, Sweden. Association for Computational Linguistics.
  • David R Hardoon, Sandor Szedmak, and John Shawe-Taylor. 2004. Canonical correlation analysis: An overview with application to learning methods. Neural computation, 16(12):2639–2664.
  • Melvin Johnson, Mike Schuster, Quoc V. Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda Viegas, Martin Wattenberg, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2017. Google’s multilingual neural machine translation system: Enabling zero-shot translation. Transactions of the Association for Computational Linguistics, 5:339–351.
  • Marcin Junczys-Dowmunt, Kenneth Heafield, Hieu Hoang, Roman Grundkiewicz, and Anthony Aue. 2018. Marian: Cost-effective high-quality neural machine translation in C++. In Proceedings of the 2nd Workshop on Neural Machine Translation and Generation, pages 129–135, Melbourne, Australia. Association for Computational Linguistics.
  • Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
  • Taku Kudo and John Richardson. 2018. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 66–71, Brussels, Belgium. Association for Computational Linguistics.
  • Sneha Kudugunta, Ankur Bapna, Isaac Caswell, and Orhan Firat. 2019. Investigating multilingual NMT representations at scale. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 1565–1575, Hong Kong, China. Association for Computational Linguistics.
  • Yu-Hsiang Lin, Chian-Yu Chen, Jean Lee, Zirui Li, Yuyan Zhang, Mengzhou Xia, Shruti Rijhwani, Junxian He, Zhisong Zhang, Xuezhe Ma, Antonios Anastasopoulos, Patrick Littell, and Graham Neubig. 2019. Choosing transfer languages for cross-lingual learning. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3125–3135, Florence, Italy. Association for Computational Linguistics.
  • Patrick Littell, David R. Mortensen, Ke Lin, Katherine Kairis, Carlisle Turner, and Lori Levin. 2017. URIEL and lang2vec: Representing languages as typological, geographical, and phylogenetic vectors. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 8–14, Valencia, Spain. Association for Computational Linguistics.
  • Chaitanya Malaviya, Graham Neubig, and Patrick Littell. 2017. Learning language representations for typology prediction. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2529–2535, Copenhagen, Denmark. Association for Computational Linguistics.
  • Yugo Murawaki. 2015. Continuous space representations of linguistic typology and their application to phylogenetic inference. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 324–334, Denver, Colorado. Association for Computational Linguistics.
  • Yugo Murawaki. 2017. Diachrony-aware induction of binary latent representations from typological features. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 451–461, Taipei, Taiwan. Asian Federation of Natural Language Processing.
  • Yugo Murawaki. 2018. Analyzing correlated evolution of multiple features using latent representations. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4371–4382, Brussels, Belgium. Association for Computational Linguistics.
  • John Nerbonne, Peter Kleiweg, Wilbert Heeringa, and Franz Manni. 2008. Projecting dialect distances to geography: Bootstrap clustering vs. noisy clustering. In Data Analysis, Machine Learning and Applications, pages 647–654, Berlin, Heidelberg. Springer Berlin Heidelberg.
  • Helen O’Horan, Yevgeni Berzak, Ivan Vulić, Roi Reichart, and Anna Korhonen. 2016. Survey on the use of typological information in natural language processing. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 1297–1308, Osaka, Japan. The COLING 2016 Organizing Committee.
  • Dominique Osborne, Shashi Narayan, and Shay B. Cohen. 2016. Encoding prior knowledge with eigenword embeddings. Transactions of the Association for Computational Linguistics, 4:417–430.
  • Robert Östling and Jörg Tiedemann. 2017. Continuous multilinguality with language vectors. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 644–649, Valencia, Spain. Association for Computational Linguistics.
  • Mateusz Pawlik and Nikolaus Augsten. 2015. Efficient computation of the tree edit distance. ACM Transactions on Database Systems (TODS), pages 3:1–3:40.
  • Mateusz Pawlik and Nikolaus Augsten. 2016. Tree edit distance: Robust and memory-efficient. Information Systems, 56:157–173.
  • Emmanouil Antonios Platanios, Mrinmaya Sachan, Graham Neubig, and Tom Mitchell. 2018. Contextual parameter generation for universal neural machine translation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 425–435, Brussels, Belgium. Association for Computational Linguistics.
  • Edoardo Maria Ponti, Helen O’Horan, Yevgeni Berzak, Ivan Vulić, Roi Reichart, Thierry Poibeau, Ekaterina Shutova, and Anna Korhonen. 2019. Modeling language variation and universals: A survey on typological linguistics for natural language processing. Computational Linguistics, 45(3):559–601.
  • Matt Post. 2018. A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 186– 191, Belgium, Brussels. Association for Computational Linguistics.
  • Ye Qi, Devendra Sachan, Matthieu Felix, Sarguna Padmanabhan, and Graham Neubig. 2018. When and why are pre-trained word embeddings useful for neural machine translation? In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 529–535, New Orleans, Louisiana. Association for Computational Linguistics.
  • Ella Rabinovich, Noam Ordan, and Shuly Wintner. 2017. Found in translation: Reconstructing phylogenetic language trees from translations. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 530–540, Vancouver, Canada. Association for Computational Linguistics.
  • Maithra Raghu, Justin Gilmer, Jason Yosinski, and Jascha Sohl-Dickstein. 2017. SVCCA: Singular vector canonical correlation analysis for deep learning dynamics and interpretability. In Advances in Neural Information Processing Systems 30, pages 6076– 6085. Curran Associates, Inc.
  • Peter J Rousseeuw. 1987. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of computational and applied mathematics, 20:53–65.
  • Devendra Sachan and Graham Neubig. 2018. Parameter sharing methods for multilingual self-attentional translation models. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 261–271, Belgium, Brussels. Association for Computational Linguistics.
  • Rico Sennrich, Orhan Firat, Kyunghyun Cho, Alexandra Birch, Barry Haddow, Julian Hitschler, Marcin Junczys-Dowmunt, Samuel Laubli, Antonio Valerio Miceli Barone, Jozef Mokry, and Maria Nadejde. 2017. Nematus: a toolkit for neural machine translation. In Proceedings of the Software Demonstrations of the 15th Conference of the European Chapter of the Association for Computational Linguistics, pages 65–68, Valencia, Spain. Association for Computational Linguistics.
  • Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715– 1725, Berlin, Germany. Association for Computational Linguistics.
  • M. Serva and F. Petroni. 2008. Indo-European languages tree by Levenshtein distance. EPL (Europhysics Letters), 81(6):68005.
  • Xu Tan, Jiale Chen, Di He, Yingce Xia, Tao Qin, and Tie-Yan Liu. 2019. Multilingual neural machine translation with language clustering. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 963–973, Hong Kong, China. Association for Computational Linguistics.
  • Yulia Tsvetkov, Sunayana Sitaram, Manaal Faruqui, Guillaume Lample, Patrick Littell, David Mortensen, Alan W Black, Lori Levin, and Chris Dyer. 2016. Polyglot neural language models: A case study in cross-lingual phonetic representation learning. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1357–1366, San Diego, California. Association for Computational Linguistics.
  • Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 5998–6008. Curran Associates, Inc.
  • Joe H Ward Jr. 1963. Hierarchical grouping to optimize an objective function. Journal of the American Statistical Association, 58(301):236–244.
  • We work with 53 languages pre-processed by Qi et al. (2018), from which we mapped the ISO 639-1 codes to the ISO 639-2 standard. However, we need to manually correct the mapping of some codes to identify the correct language vector in the URIEL (Littell et al., 2017) library: