A Latent Morphology Model for Open-Vocabulary Neural Machine Translation

Duygu Ataman
Wilker Aziz

ICLR, 2020.


Abstract:

Translation into morphologically-rich languages challenges neural machine translation (NMT) models with extremely sparse vocabularies, where atomic treatment of surface forms is unrealistic. This problem is typically addressed by either pre-processing words into subword units or performing translation directly at the level of characters. […]

Introduction
  • Neural machine translation (NMT) models are conventionally trained by maximizing the likelihood of generating the target side of a bilingual parallel corpus one word at a time, conditioned on the full observed context.
  • Data sparsity is typically countered by segmenting words into subword units; one drawback of this approach is that the estimation of the subword vocabulary relies on word segmentation methods optimized using corpus-dependent statistics, disregarding any linguistic notion of morphology and the translation objective (a toy sketch of such a procedure, BPE merge learning, follows this list).
  • This often produces subword units that are semantically ambiguous, as they might be used in far too many lexical and syntactic contexts (Ataman et al., 2017).
  • To alleviate the sub-optimal effects of explicit segmentation and to generalize better to new morphological forms, recent studies have explored extending NMT to model translation directly at the level of characters (Kreutzer & Sokolov, 2018; Cherry et al., 2018), which, in turn, has been shown to require comparably deeper networks, since the network must learn longer-distance grammatical dependencies (Sennrich, 2017).
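To make concrete how such a subword vocabulary is estimated from corpus statistics alone, below is a toy re-implementation of the BPE merge-learning loop of Sennrich et al. (2016). The miniature corpus and the number of merges are illustrative; production toolkits such as subword-nmt add many practical details on top of this core loop.

```python
# Toy BPE merge learning (after Sennrich et al., 2016): repeatedly merge
# the most frequent adjacent symbol pair. Corpus and merge count are
# illustrative only.
import re
from collections import Counter

def get_pair_stats(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[(symbols[i], symbols[i + 1])] += freq
    return pairs

def merge_pair(pair, vocab):
    """Rewrite the vocabulary, fusing every occurrence of the pair."""
    pattern = re.compile(r'(?<!\S)' + re.escape(' '.join(pair)) + r'(?!\S)')
    return {pattern.sub(''.join(pair), word): freq for word, freq in vocab.items()}

# Each word is a space-separated symbol sequence with an end-of-word marker.
vocab = {'l o w </w>': 5, 'l o w e r </w>': 2,
         'n e w e s t </w>': 6, 'w i d e s t </w>': 3}
for _ in range(10):  # the number of merge operations is a hyperparameter
    pairs = get_pair_stats(vocab)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)
    vocab = merge_pair(best, vocab)
    print(best)
```

Note how the merges are driven purely by co-occurrence frequency: nothing prevents the loop from producing units that cut across morpheme boundaries, which is exactly the ambiguity described above.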
Highlights
  • Neural machine translation (NMT) models are conventionally trained by maximizing the likelihood of generating the target side of a bilingual parallel corpus one word at a time, conditioned on the full observed context.
  • Such adverse conditions are typical of translation involving morphologically-rich languages, where any single root may lead to exponentially many different surface realizations depending on its syntactic context.
  • We evaluate our model in machine translation against three baselines representing the conventional open-vocabulary NMT methods: architectures using atomic parameterization over either subword units segmented with the Byte-Pair Encoding (BPE) algorithm (Sennrich et al., 2016) or characters, and a hierarchical parameterization method employed for generating all words in the output.
  • In this paper we present a novel decoding architecture for NMT employing a hierarchical latent variable model to promote sparsity in lexical representations, which demonstrates promising applications for morphologically-rich and low-resource languages.
  • We evaluate our model against conventional open-vocabulary NMT solutions, such as subword- and character-level decoding methods, in translating English into three morphologically-rich languages with different morphological typologies under low- to mid-resource settings.
  • Our results show that our model can significantly outperform subword-level NMT models, while demonstrating better capacity than character-level models in coping with increased amounts of data sparsity.
Results
  • 4.4.1 THE EFFECT OF MORPHOLOGICAL TYPOLOGY

    The results in Table 1 show the performance of each model in translating English into Arabic, Czech and Turkish, measured with BLEU and chrF3 (a minimal chrF3 sketch follows this list).
  • In Turkish, the sparsest target language in our benchmark owing to its rich agglutinative morphology, character-based decoding proves more advantageous than the subword-level and hierarchical models, suggesting that increased granularity in the vocabulary units might aid in learning more accurate representations under conditions of high data sparsity.
  • The results in the English-to-Czech translation direction do not indicate a specific advantage of either method for generating fusional morphology, where morphemes are already optimized at the surface level; nevertheless, our model still achieves translation accuracy comparable to the character- and subword-level models.
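Since the comparison rests on BLEU and chrF3, here is a minimal sentence-level chrF3 sketch (Popović, 2015): an F-score with β = 3 over character n-grams of orders 1 to 6, with precision and recall averaged over the orders. This is an illustrative simplification; the official implementation differs in details such as whitespace handling and corpus-level aggregation.

```python
# Sentence-level chrF-beta over character n-grams (orders 1..max_n).
from collections import Counter

def chrf(hypothesis: str, reference: str, max_n: int = 6, beta: float = 3.0) -> float:
    hyp = hypothesis.replace(' ', '')
    ref = reference.replace(' ', '')
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp_ngrams = Counter(hyp[i:i + n] for i in range(len(hyp) - n + 1))
        ref_ngrams = Counter(ref[i:i + n] for i in range(len(ref) - n + 1))
        overlap = sum((hyp_ngrams & ref_ngrams).values())  # clipped matches
        if hyp_ngrams:
            precisions.append(overlap / sum(hyp_ngrams.values()))
        if ref_ngrams:
            recalls.append(overlap / sum(ref_ngrams.values()))
    chr_p = sum(precisions) / len(precisions) if precisions else 0.0
    chr_r = sum(recalls) / len(recalls) if recalls else 0.0
    if chr_p + chr_r == 0.0:
        return 0.0
    return (1 + beta ** 2) * chr_p * chr_r / (beta ** 2 * chr_p + chr_r)

print(round(chrf('gidiyorum', 'gidiyoruz'), 3))  # rewards partial overlap
```

Because it rewards partial character overlap, chrF3 is less punishing than BLEU for near-miss inflections, which makes it a natural companion metric for morphologically-rich target languages.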
Conclusion
  • In this paper we presented a novel decoding architecture for NMT employing a hierarchical latent variable model to promote sparsity in lexical representations, which demonstrated promising applications for morphologically-rich and low-resource languages.
  • Our model generates words one character at a time by composing two latent variables representing their lemmas and inflectional features (a schematic sketch of such a decoder follows this list).
  • We evaluate our model against conventional open-vocabulary NMT solutions, such as subword- and character-level decoding methods, in translating English into three morphologically-rich languages with different morphological typologies under low- to mid-resource settings.
  • We conduct ablation studies on the impact of feature variations on the predictions, which show that, despite being completely unsupervised, our model learns morphosyntactic information and uses it to generalize to different surface forms of words.
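As a rough illustration of this design, consider the PyTorch sketch below: a continuous lemma vector sampled with the Gaussian reparameterization trick (Kingma & Welling, 2013) is composed with approximately discrete inflectional features drawn from a binary Concrete/Gumbel-sigmoid relaxation (cf. Maddison et al., 2017), and a character-level GRU is conditioned on both. This is not the authors' exact architecture; all layer sizes, the specific relaxation, and the class name are illustrative assumptions.

```python
# Sketch of a hierarchical latent-variable word decoder: lemma (continuous)
# + inflectional features (approximately discrete) -> character decoder.
import torch
import torch.nn as nn

class LatentMorphologyDecoder(nn.Module):
    def __init__(self, ctx_dim=512, lemma_dim=100, n_features=10,
                 char_vocab=120, char_emb=64, hidden=512, temp=0.5):
        super().__init__()
        self.temp = temp
        self.lemma_mu = nn.Linear(ctx_dim, lemma_dim)
        self.lemma_logvar = nn.Linear(ctx_dim, lemma_dim)
        self.feature_logits = nn.Linear(ctx_dim, n_features)
        self.char_embed = nn.Embedding(char_vocab, char_emb)
        self.char_rnn = nn.GRU(char_emb + lemma_dim + n_features,
                               hidden, batch_first=True)
        self.char_out = nn.Linear(hidden, char_vocab)

    def forward(self, context, char_inputs):
        # context: (batch, ctx_dim), the word-level decoder state
        # char_inputs: (batch, T), gold characters shifted right
        mu, logvar = self.lemma_mu(context), self.lemma_logvar(context)
        lemma = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparam.
        logits = self.feature_logits(context)
        u = torch.rand_like(logits).clamp(1e-6, 1 - 1e-6)
        noise = torch.log(u) - torch.log1p(-u)                  # logistic noise
        features = torch.sigmoid((logits + noise) / self.temp)  # relaxed binary
        z = torch.cat([lemma, features], dim=-1)
        emb = self.char_embed(char_inputs)
        z_seq = z.unsqueeze(1).expand(-1, emb.size(1), -1)
        out, _ = self.char_rnn(torch.cat([emb, z_seq], dim=-1))
        return self.char_out(out)  # (batch, T, char_vocab) logits
```

Training such a model would add KL terms for both latent variables to the character-level cross-entropy, in the style of standard variational objectives; at test time the features can be thresholded or varied to probe the morphosyntactic behaviour mentioned in the last bullet.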
Tables
  • Table 1: Above: Machine translation accuracy in Arabic (AR), Czech (CS) and Turkish (TR) in terms of the BLEU and chrF3 metrics, as well as BLEU scores computed on output sentences tagged with the morphological analyzer (t-BLEU), using in-domain training data. Below: The performance of models trained with multi-domain data. Best scores are in bold. All improvements over the baselines are statistically significant (p-value < 0.05).
  • Table 2: Percentage of out-of-vocabulary (OOV) words in the output, normalized perplexity (PPL) per character, and the KL divergence between the reference and the outputs of systems trained with in-domain data, for different language directions (a sketch of these metrics follows this list).
  • Table 3: Above: Outputs of the LMM based on the lemma ‘git’ (‘go’) and different sets of inflectional features. Below: Examples of predicting inflections in context, with and without using features.
  • Table 4: Training sets based on the TED Talks corpora (M: million, K: thousand).
  • Table 5: The multi-domain training set (M: million, K: thousand).
  • Table 6: Development and test sets (K: thousand).
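For the diagnostics reported in Table 2, here is a rough sketch assuming the straightforward definitions (the paper's exact evaluation code may differ): perplexity normalized per output character, and the KL divergence between the empirical unigram word distributions of reference and system output.

```python
# Illustrative versions of the Table 2 diagnostics; function names are ours.
import math
from collections import Counter

def char_normalized_ppl(total_nll_nats: float, num_chars: int) -> float:
    """Perplexity normalized by the number of generated characters."""
    return math.exp(total_nll_nats / num_chars)

def kl_divergence(reference_tokens, output_tokens, eps=1e-10):
    """KL(P_ref || P_out) between empirical unigram word distributions."""
    p, q = Counter(reference_tokens), Counter(output_tokens)
    p_total, q_total = sum(p.values()), sum(q.values())
    kl = 0.0
    for word, count in p.items():
        p_w = count / p_total
        q_w = max(q[word] / q_total, eps)  # smooth words missing from output
        kl += p_w * math.log(p_w / q_w)
    return kl
```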
Funding
  • This project received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreements 825299 (GoURMET) and 688139 (SUMMA).
Reference
  • Duygu Ataman, Matteo Negri, Marco Turchi, and Marcello Federico. Linguistically-motivated vocabulary reduction for neural machine translation from Turkish to English. The Prague Bulletin of Mathematical Linguistics, 108(1):331–342, 2017.
  • Duygu Ataman, Orhan Firat, Mattia A. Di Gangi, Marcello Federico, and Alexandra Birch. On the importance of word boundaries in character-level neural machine translation. arXiv preprint arXiv:1910.06753, 2019.
  • Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
  • Antonio Valerio Miceli Barone, Jindřich Helcl, Rico Sennrich, Barry Haddow, and Alexandra Birch. Deep architectures for neural machine translation. In Proceedings of the Second Conference on Machine Translation, pp. 99–107, 2017.
  • Joost Bastings, Wilker Aziz, and Ivan Titov. Interpretable neural predictions with differentiable binary variables. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 2963–2973, 2019.
  • Yoshua Bengio, Nicholas Léonard, and Aaron Courville. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432, 2013.
  • Léon Bottou and Yann LeCun. Large scale online learning. In S. Thrun, L. K. Saul, and B. Schölkopf (eds.), Advances in Neural Information Processing Systems 16, pp. 217–224. MIT Press, 2004.
  • Mohamed Boudchiche, Azzeddine Mazroui, Mohamed Ould Abdallahi Ould Bebah, Abdelhak Lakhouaja, and Abderrahim Boudlal. AlKhalil Morpho Sys 2: A robust Arabic morpho-syntactic analyzer. Journal of King Saud University-Computer and Information Sciences, 29(2):141–146, 2017.
  • Mauro Cettolo, Christian Girardi, and Marcello Federico. WIT3: Web inventory of transcribed and translated talks. In Conference of the European Association for Machine Translation, pp. 261–268, 2012.
  • Colin Cherry, George Foster, Ankur Bapna, Orhan Firat, and Wolfgang Macherey. Revisiting character-based neural machine translation with capacity and compression. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 4295–4305, 2018.
  • Kyunghyun Cho, Bart van Merriënboer, Dzmitry Bahdanau, and Yoshua Bengio. On the properties of neural machine translation: Encoder–decoder approaches. In Proceedings of the 8th Workshop on Syntax, Semantics and Structure in Statistical Translation (SSST), pp. 103–111, 2014.
  • Jonathan H. Clark, Chris Dyer, Alon Lavie, and Noah A. Smith. Better hypothesis testing for statistical machine translation: Controlling for optimizer instability. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp. 176–181, 2011.
  • Ryan Cotterell, Sebastian J. Mielke, Jason Eisner, and Brian Roark. Are all languages equally hard to language-model? In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 2: Short Papers), pp. 536–541, 2018.
  • Hoang Cuong and Khalil Sima'an. Latent domain translation models in mix-of-domains haystack. In Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics, pp. 1928–1939, 2014.
  • Anirudh Goyal, Alessandro Sordoni, Marc-Alexandre Côté, Nan Rosemary Ke, and Yoshua Bengio. Z-forcing: Training stochastic recurrent networks. In Advances in Neural Information Processing Systems, pp. 6713–6723, 2017.
  • Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with Gumbel-Softmax. In International Conference on Learning Representations, 2017.
  • Michael I. Jordan, Zoubin Ghahramani, Tommi S. Jaakkola, and Lawrence K. Saul. An introduction to variational methods for graphical models. Machine Learning, 37(2):183–233, 1999.
  • Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
  • Guillaume Klein, Yoon Kim, Yuntian Deng, Jean Senellart, and Alexander Rush. OpenNMT: Open-source toolkit for neural machine translation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, System Demonstrations, pp. 67–72, 2017.
  • Julia Kreutzer and Artem Sokolov. Learning to segment inputs for NMT favors character-level processing. In Proceedings of the 15th International Workshop on Spoken Language Translation, pp. 166–172, 2018.
  • Ponnambalam Kumaraswamy. A generalized probability density function for double-bounded random processes. Journal of Hydrology, 46(1-2):79–88, 1980.
  • Wang Ling, Isabel Trancoso, Chris Dyer, and Alan W. Black. Character-based neural machine translation. arXiv preprint arXiv:1511.04586, 2015.
  • Pierre Lison and Jörg Tiedemann. OpenSubtitles2016: Extracting large parallel corpora from movie and TV subtitles. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC), 2016.
  • Christos Louizos, Max Welling, and Diederik P. Kingma. Learning sparse neural networks through L0 regularization. arXiv preprint arXiv:1712.01312, 2018.
  • Minh-Thang Luong and Christopher D. Manning. Achieving open vocabulary neural machine translation with hybrid word-character models. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1054–1063, 2016.
  • Minh-Thang Luong, Hieu Pham, and Christopher D. Manning. Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 1412–1421, 2015.
  • Chris J. Maddison, Andriy Mnih, and Yee Whye Teh. The Concrete distribution: A continuous relaxation of discrete random variables. In International Conference on Learning Representations, 2017.
  • Eric Nalisnick and Padhraic Smyth. Stick-breaking variational autoencoders. arXiv preprint arXiv:1605.06197, 2016.
  • Kemal Oflazer and İlker Kuruöz. Tagging and morphological disambiguation of Turkish text. In Proceedings of the Fourth Conference on Applied Natural Language Processing, pp. 144–149, 1994.
  • Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pp. 311–318, 2002.
  • Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. NIPS Autodiff Workshop, 2017.
  • Maja Popović. chrF: Character n-gram F-score for automatic MT evaluation. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pp. 392–395, 2015.
  • Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In Proceedings of the 31st International Conference on Machine Learning, volume 32 of Proceedings of Machine Learning Research, pp. 1278–1286, 2014.
  • Herbert Robbins and Sutton Monro. A stochastic approximation method. The Annals of Mathematical Statistics, 22(3):400–407, 1951.
  • Gözde Gül Şahin and Mark Steedman. Character-level models versus morphology in semantic role labeling. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 386–396, 2018.
  • Philip Schulz, Wilker Aziz, and Trevor Cohn. A stochastic decoder for neural machine translation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1243–1252, 2018.
  • Rico Sennrich. How grammatical is character-level neural machine translation? Assessing MT quality with contrastive translation pairs. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics (Volume 2: Short Papers), pp. 376–382, 2017.
  • Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1715–1725, 2016.
  • Raivis Skadiņš, Jörg Tiedemann, Roberts Rozis, and Daiga Deksne. Billions of parallel words for free: Building and using the EU Bookshop corpus. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC), pp. 1850–1855, 2014.
  • Noah A. Smith. Linguistic structure prediction. Synthesis Lectures on Human Language Technologies, 4(2):1–274, 2011.
  • Jana Straková, Milan Straka, and Jan Hajič. Open-source tools for morphology, lemmatization, POS tagging and named entity recognition. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pp. 13–18, 2014.
  • Jörg Tiedemann. News from OPUS: A collection of multilingual parallel corpora with tools and interfaces. In Recent Advances in Natural Language Processing, volume 5, pp. 237–248, 2009.
  • Jörg Tiedemann. Parallel data, tools and interfaces in OPUS. In Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC), pp. 2214–2218, 2012.
  • Francis M. Tyers and Murat Serdar Alperen. South-East European Times: A parallel corpus of Balkan languages. In Proceedings of the LREC Workshop on Exploitation of Multilingual Resources and Tools for Central and (South-) Eastern European Languages, pp. 49–53, 2010.
  • Clara Vania and Adam Lopez. From characters to words to in between: Do we capture morphology? In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 2016–2027, 2017.
  • Ronald J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229–256, 1992.
  • Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144, 2016.
  • Biao Zhang, Deyi Xiong, Jinsong Su, Hong Duan, and Min Zhang. Variational neural machine translation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 521–530, 2016.
  • Chunting Zhou and Graham Neubig. Multi-space variational encoder-decoders for semi-supervised labeled sequence transduction. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 310–320, 2017.