Multilingual Neural Machine Translation with Language Clustering

EMNLP/IJCNLP (1), pp. 963-973, 2019.

DOI: https://doi.org/10.18653/v1/D19-1089

Abstract:

Multilingual neural machine translation (NMT), which translates multiple languages using a single model, is of great practical importance due to its advantages in simplifying the training process, reducing online maintenance costs, and enhancing low-resource and zero-shot translation. Given there are thousands of languages in the world …

Introduction
  • Neural machine translation (NMT) (Bahdanau et al., 2015; Luong et al., 2015b; Sutskever et al., 2014; Wu et al., 2016; Gehring et al., 2017; Vaswani et al., 2017) has witnessed rapid progress in recent years, from novel model structure developments (Gehring et al., 2017; Vaswani et al., 2017) to achieving performance comparable to humans (Hassan et al., 2018).

    Although a conventional NMT model can handle a single language translation pair (e.g., German→English, Spanish→French) well, training a separate model for each language pair is unaffordable considering there are thousands of languages in the world.
  • Johnson et al. (2017), Firat et al. (2016), Ha et al. (2016), and Lu et al. (2018) propose to share part or all of the model across multiple language pairs and achieve considerable accuracy improvements.
  • While they focus on how to translate multiple language pairs in a single model and improve the performance of the multilingual model, they do not investigate which language pairs should be trained in the same model.
  • Lu et al. (2018) propose the neural interlingua, an attentional LSTM component that links multiple encoders and decoders for different language pairs. Johnson et al. (2017) and Ha et al. (2016) use a universal encoder and decoder to handle multiple source and target languages, with a special tag added to the source input to determine which target language to output.
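
To make the universal encoder-decoder approach concrete, below is a minimal sketch of how a target-language tag can be prepended to each source sentence so that one shared model knows which language to produce. The `<2xx>` token format and the helper name are illustrative assumptions, not the exact implementation of the cited papers.

```python
# Minimal sketch of target-language tagging for a single shared multilingual
# model. The "<2xx>" token format and the helper name are illustrative
# assumptions, not the exact implementation of Johnson et al. (2017) or
# Ha et al. (2016).

def tag_source(src_sentence: str, tgt_lang: str) -> str:
    """Prepend a target-language token so one shared encoder/decoder
    knows which language to translate into."""
    return f"<2{tgt_lang}> {src_sentence}"

# All language pairs are mixed into one training corpus; the tag is the only
# signal distinguishing, e.g., German->English from German->French.
examples = [("Das ist ein Test.", "en"), ("Das ist ein Test.", "fr")]
for src, tgt in examples:
    print(tag_source(src, tgt))
# <2en> Das ist ein Test.
# <2fr> Das ist ein Test.
```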
Highlights
  • Neural machine translation (NMT) (Bahdanau et al., 2015; Luong et al., 2015b; Sutskever et al., 2014; Wu et al., 2016; Gehring et al., 2017; Vaswani et al., 2017) has witnessed rapid progress in recent years, from novel model structure developments (Gehring et al., 2017; Vaswani et al., 2017) to achieving performance comparable to humans (Hassan et al., 2018).

    Although a conventional neural machine translation model can handle a single language translation pair (e.g., German→English, Spanish→French) well, training a separate model for each language pair is unaffordable considering there are thousands of languages in the world.
  • A straightforward solution to reduce computational cost is using one model to handle the translations of multiple languages, i.e., multilingual translation.
  • Johnson et al. (2017), Firat et al. (2016), Ha et al. (2016), and Lu et al. (2018) propose to share part or all of the model across multiple language pairs and achieve considerable accuracy improvements. While they focus on how to translate multiple language pairs in a single model and improve the performance of the multilingual model, they do not investigate which language pairs should be trained in the same model.
  • We show how our language embedding based clustering boosts the performance of multilingual neural machine translation compared with the language family based clustering.
  • We show the results of random clustering (Random), using the same number of clusters as the language embedding based clustering and averaging the BLEU scores over multiple runs of random clustering (3 runs in our experiments) for comparison (a sketch of this baseline follows this list).
  • We have studied language clustering for multilingual neural machine translation.
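
The Random baseline mentioned above can be sketched as follows; the language subset, the cluster count, and the seeding are placeholder assumptions, and the paper's exact procedure may differ.

```python
import random

# Sketch of the Random clustering baseline: assign each language to one of
# k clusters uniformly at random, repeated several times so that BLEU can be
# averaged over runs. Language list and k are placeholders; a naive random
# assignment like this may also leave some clusters empty.
languages = ["Ar", "Bg", "Cs", "De", "El", "Es", "Fa", "Fr", "He", "Hi"]
k, n_runs = 7, 3

for run in range(n_runs):
    rng = random.Random(run)                      # fixed seed per run
    clusters = {i: [] for i in range(k)}
    for lang in languages:
        clusters[rng.randrange(k)].append(lang)   # uniform random assignment
    print(f"run {run}:", {i: c for i, c in clusters.items() if c})
```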
Results
  • The authors mainly show the experimental results and analyses in the many-to-one setting in Sections 5.1-5.3.
  • In this case, Individual, which uses a larger number of models, helps.
  • The authors reduce the training data of each language to 50%, 20% and 5% to check how stable the language embedding based clustering is (a subsampling sketch follows this list), as shown in Figure 5.
  • Figure 6 shows the clustering results based on language embeddings, and Table 3 shows the BLEU score of multilingual models clustered by different methods.
  • The authors find that the clustering of the Indo-European language family is more fine-grained than in the many-to-one setting, dividing the languages into their own branches within the family.
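
As a rough illustration of the data-reduction setup used for the stability check above, here is a minimal paired-subsampling sketch; the function name, seeding, and sampling scheme are assumptions, since the exact procedure is not spelled out here.

```python
import random

def subsample_parallel(src_lines, tgt_lines, fraction, seed=0):
    """Keep a random fraction of aligned sentence pairs (both sides together),
    e.g. fraction = 0.5, 0.2 or 0.05 for the 50%/20%/5% stability check."""
    assert len(src_lines) == len(tgt_lines)
    rng = random.Random(seed)
    keep = rng.sample(range(len(src_lines)), int(len(src_lines) * fraction))
    return [src_lines[i] for i in keep], [tgt_lines[i] for i in keep]

# Toy usage (real experiments would read the parallel corpus files instead):
src = [f"source sentence {i}" for i in range(1000)]
tgt = [f"target sentence {i}" for i in range(1000)]
for frac in (0.5, 0.2, 0.05):
    s, t = subsample_parallel(src, tgt, frac)
    print(frac, len(s), "pairs kept")
```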
Conclusion
  • The authors have studied language clustering for multilingual neural machine translation.
  • Experiments on 23 languages→English and English→23 languages show that language embeddings can sufficiently characterize the similarity between languages and outperform prior knowledge for language clustering in terms of the BLEU scores.
  • The authors will test the methods for many-to-many translation.
  • The authors will consider more languages to study the methods in a larger-scale setting.
  • The authors will study.
Tables
  • Table1: BLEU score of 23 languages→English with multilingual models based on different methods of language clustering: Random, Family (Language Family) and Embedding (Language Embedding)
  • Table2: BLEU score of 23 languages→English with different numbers of clusters: Universal (all the languages share one model), Individual (each language with a separate model, 23 models in total), Embedding (Language Embedding with 7 models). Data size shows the training data for each language→English
  • Table3: BLEU score of English→23 languages with multilingual models based on different methods of language clustering: Universal (all the languages share one model), Individual (each language with a separate model), Family (Language Family), Embedding (Language Embedding)
  • Table4: The size of training data for 23 language↔English in our experiments
  • Table5: The ISO 639-1 code of each language in our experiments
Funding
  • Develops a framework that clusters languages into different groups and trains one multilingual model for each cluster
  • Studies two methods for language clustering: using prior knowledge, which clusters languages according to language family, and using language embedding, which represents each language by an embedding vector and clusters the languages in the embedding space (see the sketch after this list)
  • Our experiments on 23 languages show that the first clustering method is simple and easy to understand but leads to suboptimal translation accuracy, while the second method sufficiently captures the relationship among languages and improves the translation accuracy for almost all the languages over baseline methods
  • Focuses on determining which languages should be shared in one model
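
Below is a minimal sketch of the two clustering methods under stated assumptions: the family table is a toy excerpt, and random vectors stand in for the language embeddings that would be taken from a trained universal multilingual model; the distance metric, linkage, and cluster count are illustrative choices rather than the paper's exact settings. One multilingual model would then be trained per resulting cluster.

```python
from collections import defaultdict

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

languages = ["Ar", "Bg", "Cs", "De", "El", "Es", "Fa", "Fr", "He", "Hi"]

# --- Method 1: prior knowledge (language family) ---------------------------
# Toy family table for illustration only; a real table would come from a
# linguistic resource such as Ethnologue (Paul et al., 2009).
family = {"Ar": "Afroasiatic", "He": "Afroasiatic", "Bg": "Indo-European",
          "Cs": "Indo-European", "De": "Indo-European", "El": "Indo-European",
          "Es": "Indo-European", "Fa": "Indo-European", "Fr": "Indo-European",
          "Hi": "Indo-European"}
by_family = defaultdict(list)
for lang in languages:
    by_family[family[lang]].append(lang)
print(dict(by_family))

# --- Method 2: language embeddings ------------------------------------------
# Random vectors stand in for embeddings read from a trained universal
# multilingual model; metric, linkage and cluster count are illustrative.
rng = np.random.default_rng(0)
X = np.stack([rng.normal(size=128) for _ in languages])
Z = linkage(X, method="average", metric="cosine")  # hierarchical clustering
labels = fcluster(Z, t=4, criterion="maxclust")    # cut the tree into 4 clusters
by_embedding = defaultdict(list)
for lang, label in zip(languages, labels):
    by_embedding[int(label)].append(lang)
print(dict(by_embedding))
```

The number of clusters for the embedding-based method could, for instance, be chosen with an elbow-style criterion (cf. Thorndike, 1953, in the references below).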
Reference
  • Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. ICLR 2015.
  • Dik Bakker, Andre Muller, Viveka Velupillai, Søren Wichmann, Cecil H Brown, Pamela Brown, Dmitry Egorov, Robert Mailhammer, Anthony Grant, and Eric W Holman. 2009. Adding typology to lexicostatistics: A combined approach to language classification. Linguistic Typology, 13(1):169–181.
  • Xinying Chen and Kim Gerdes. 2017. Classifying languages by dependency structure. Typologies of delexicalized universal dependency treebanks. In Proceedings of the Fourth International Conference on Dependency Linguistics (Depling 2017), September 18-20, 2017, Universita di Pisa, Italy, 139, pages 54–63. Linkoping University Electronic Press.
  • Bernard Comrie. 1989. Language universals and linguistic typology: Syntax and morphology. University of Chicago press.
  • Daxiang Dong, Hua Wu, Wei He, Dianhai Yu, and Haifeng Wang. 2015. Multi-task learning for multiple language translation. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), volume 1, pages 1723–1732.
  • Orhan Firat, Kyunghyun Cho, and Yoshua Bengio. 2016. Multi-way, multilingual neural machine translation with a shared attention mechanism. In NAACL HLT 2016, The 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego California, USA, June 12-17, 2016, pages 866–875.
  • Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N. Dauphin. 2017. Convolutional sequence to sequence learning. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, pages 1243–1252.
  • Thanh-Le Ha, Jan Niehues, and Alexander H. Waibel. 2016. Toward multilingual neural machine translation with universal encoder and decoder. CoRR, abs/1611.04798.
  • Hany Hassan, Anthony Aue, Chang Chen, Vishal Chowdhary, Jonathan Clark, Christian Federmann, Xuedong Huang, Marcin Junczys-Dowmunt, William Lewis, Mu Li, Shujie Liu, Tie-Yan Liu, Renqian Luo, Arul Menezes, Tao Qin, Frank Seide, Xu Tan, Fei Tian, Lijun Wu, Shuangzhi Wu, Yingce Xia, Dongdong Zhang, Zhirui Zhang, and Ming Zhou. 2018. Achieving human parity on automatic Chinese to English news translation. CoRR, abs/1803.05567.
  • Tianyu He, Jiale Chen, Xu Tan, and Tao Qin. 2019. Language graph distillation for low-resource machine translation. arXiv preprint arXiv:1908.06258.
  • Tianyu He, Xu Tan, Yingce Xia, Di He, Tao Qin, Zhibo Chen, and Tie-Yan Liu. 2018. Layer-wise coordination between encoder and decoder for neural machine translation. In Advances in Neural Information Processing Systems, pages 7944–7954.
  • Eric W Holman, Søren Wichmann, Cecil H Brown, Viveka Velupillai, Andre Muller, and Dik Bakker. 2008. Explorations in automated language classification. Folia Linguistica, 42(3-4):331–354.
  • Geoffrey Horrocks. 2009. Greek: A History of the Language and its Speakers. John Wiley & Sons.
  • Melvin Johnson, Mike Schuster, Quoc V. Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda B. Viegas, Martin Wattenberg, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2017. Google's multilingual neural machine translation system: Enabling zero-shot translation. Transactions of the Association for Computational Linguistics, 5:339–351.
  • Diederik Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
  • Yichong Leng, Xu Tan, Tao Qin, Xiang-Yang Li, and Tie-Yan Liu. 2019. Unsupervised pivot translation for distant languages. arXiv preprint arXiv:1906.02461.
  • Lori S. Levin, Patrick Littell, David R. Mortensen, Ke Lin, Katherine Kairis, and Carlisle Turner. 2017. URIEL and lang2vec: Representing languages as typological, geographical, and phylogenetic vectors. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2017, Valencia, Spain, April 3-7, 2017, Volume 2: Short Papers, pages 8–14.
  • Haitao Liu and Wenwen Li. 2010. Language clusters based on linguistic complex networks. Chinese Science Bulletin, 55(30):3458–3465.
  • Yichao Lu, Phillip Keung, Faisal Ladhak, Vikas Bhardwaj, Shaonan Zhang, and Jason Sun. 2018. A neural interlingua for multilingual machine translation. CoRR, abs/1804.08198.
  • Minh-Thang Luong, Quoc V. Le, Ilya Sutskever, Oriol Vinyals, and Lukasz Kaiser. 2015a. Multitask sequence to sequence learning. CoRR, abs/1511.06114.
  • Thang Luong, Hieu Pham, and Christopher D. Manning. 2015b. Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, EMNLP 2015, Lisbon, Portugal, September 17-21, 2015, pages 1412–1421.
  • Chaitanya Malaviya, Graham Neubig, and Patrick Littell. 2017. Learning language representations for typology prediction. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, EMNLP 2017, Copenhagen, Denmark, September 9-11, 2017, pages 2529–2535.
  • Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.
  • Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, July 6-12, 2002, Philadelphia, PA, USA, pages 311–318.
  • Lewis M Paul, Gary F Simons, Charles D Fennig, et al. 2009. Ethnologue: Languages of the world. Dallas, TX: SIL International. Available online at www.ethnologue.com/. Retrieved June 19, 2011.
  • Emmanouil Antonios Platanios, Mrinmaya Sachan, Graham Neubig, and Tom Mitchell. 2018. Contextual parameter generation for universal neural machine translation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 425–435.
  • Lior Rokach and Oded Maimon. 2005. Clustering methods. In Data mining and knowledge discovery handbook, pages 321–352. Springer.
  • Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In ACL 2016, August 7-12, 2016, Berlin, Germany, Volume 1: Long Papers.
  • Yanyao Shen, Xu Tan, Di He, Tao Qin, and Tie-Yan Liu. 2018. Dense information flow for neural machine translation. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1294–1303.
  • Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Lukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2016. Google’s neural machine translation system: Bridging the gap between human and machine translation. CoRR, abs/1609.08144.
  • Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, and Tie-Yan Liu. 2019. MASS: Masked sequence to sequence pre-training for language generation. arXiv preprint arXiv:1905.02450.
  • Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In NIPS 2014, December 8-13 2014, Montreal, Quebec, Canada, pages 3104–3112.
  • Xu Tan, Yi Ren, Di He, Tao Qin, and Tie-Yan Liu. 2019. Multilingual neural machine translation with knowledge distillation. In International Conference on Learning Representations.
  • Robert L Thorndike. 1953. Who belongs in the family? Psychometrika, 18(4):267–276.
  • Jorg Tiedemann and Robert Ostling. 2017. Continuous multilinguality with language vectors. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2017, Valencia, Spain, April 3-7, 2017, Volume 2: Short Papers, pages 644–649.
  • Ashish Vaswani, Samy Bengio, Eugene Brevdo, Francois Chollet, Aidan N. Gomez, Stephan Gouws, Llion Jones, Łukasz Kaiser, Nal Kalchbrenner, Niki Parmar, Ryan Sepassi, Noam Shazeer, and Jakob Uszkoreit. 2018. Tensor2tensor for neural machine translation. CoRR, abs/1803.07416.
  • Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In NIPS 2017, 4-9 December 2017, Long Beach, CA, USA, pages 6000–6010.
  • Lijun Wu, Xu Tan, Di He, Fei Tian, Tao Qin, Jianhuang Lai, and Tie-Yan Liu. 2018. Beyond error propagation in neural machine translation: Characteristics of language also matter. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing.