Distributionally Robust Multilingual Machine Translation
Multilingual neural machine translation (MNMT) learns to translate multiple language pairs with a single model, potentially improving both the accuracy and the memory-efficiency of deployed models. However, the heavy data imbalance between languages hinders the model from performing uniformly across language pairs. In this paper, we pro…
- When model capacity is limited, this results in trade-offs or decreased performance on some languages, typically low-resource languages (LRLs) (Arivazhagan et al., 2019; Wang et al., 2020b, 2021).
- To better control this trade-off, a common practice is to balance the training distribution by heuristically oversampling LRLs (Johnson et al., 2017; Neubig and Hu, 2018; Arivazhagan et al., 2019).
- While simple data balancing can improve performance on LRLs significantly, it is far from optimal.
- Previous work has indicated the importance of learning strategies that are explicitly tailored to each multilingual learning scenario (Wang et al., 2020a).
- We empirically find that naively applying existing methods to multilingual learning yields inferior results to empirical risk minimization (ERM), mostly because (1) standard distributionally robust optimization (DRO) objectives tend to be overly conservative and only take into account language pairs with very large losses, and (2) existing optimization algorithms for DRO essentially reweight the gradients of examples in a mini-batch, which implicitly changes the scale of the learning rate.
- To efficiently solve the min-max game, we propose an iterated best response scheme that, at each epoch, re-samples the training data according to the worst weighting for the current model parameters and runs ERM training on the re-sampled dataset.
- Our methodological contributions are two-fold: (i) we describe the shortcomings of the Group DRO objective (2) and propose a related training criterion that addresses them; (ii) we describe an optimization algorithm for the resulting min-max problem that is amenable to the multilingual neural machine translation (MNMT) setting.
- By taking a closer look at the BLEU scores for each individual language pair, χ-iterated best response (IBR) improves over almost all language pairs in both translation directions compared to ERM.
- In experiments we found that naively applying existing DRO objectives fails to achieve performance on par with strong baselines, often improving results on language pairs with high losses while sacrificing too much performance overall.
- We showed how to successfully apply DRO to the MNMT setting and automatically adjust the sampling distribution over language pairs, resulting in sizeable improvements in performance.
- The authors' main contribution is showing how to successfully apply DRO to the MNMT setting; to the best of their knowledge, this work is the first to do so.
- The authors present the BLEU scores for the en→any and any→en translation directions on the TED and WMT data in Tables 1 and 2, respectively.
- For both the TED and WMT datasets, χ-IBR outperforms all other baseline methods in terms of average BLEU score over all language pairs.
- The temperature τ needs to be carefully tuned to achieve adequate performance on both high-resource languages (HRLs) and LRLs.
- Table 1: BLEU scores of the best ERM model (among τ = 1/5/100; τ = 5 and τ = 100 are significantly worse than τ = 1, so we omit these results), MultiDDS (Wang et al., 2020a), and our approach on the test sets of the TED dataset. Bold (resp. underlined) values indicate the best (resp. second-best) performance for each language pair. Values under the language codes are the proportion of the language in the training data.
- Table 2: BLEU scores of ERM (τ = 1/5/100), MultiDDS, and our method on the test sets of the WMT dataset. The ratios of training data for de, fr, ta, and tr are (0.499, 0.359, 0.102, 0.039).
- Table 3: BLEU scores of different DRO objectives and algorithms, primal-dual (PD) and iterated best response (IBR), on the WMT test sets.
- Table 4: Average BLEU on the test sets for the en→any direction; BL is short for baseline loss.
- Table 5: Number of training sentences in the TED related and diverse sets, respectively.
- Table 6: Basic hyper-parameters of the Transformer.
- Training with iterated best response convincingly outperforms the same objective trained with primal-dual.
- This work was supported in part by a Facebook SRA Award and the NSF/Amazon Fairness in AI program under grant number 2040926.
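The per-epoch re-sampling scheme described in the bullets above can be sketched in a few lines. This is an illustrative approximation, not the paper's exact algorithm: `best_response_q` uses a closed-form ascent direction on the χ²-ball (ignoring non-negativity, then clipping and renormalising) in place of the bisection solver from the appendix, and `train_epoch_erm` / `eval_losses` are hypothetical callables standing in for the NMT training and validation machinery.

```python
import numpy as np

def best_response_q(losses, p_train, rho):
    """Approximate best response: up-weight language pairs with high loss,
    staying within a chi-squared ball of radius rho around p_train.
    Ignores the non-negativity constraint, then clips and renormalises."""
    losses = np.asarray(losses, dtype=float)
    p = np.asarray(p_train, dtype=float)
    centered = losses - p @ losses          # centre the losses under p
    var = p @ centered**2                   # chi^2-weighted loss variance
    if var < 1e-12:                         # all losses (nearly) equal: keep p
        return p
    q = p + np.sqrt(rho / var) * p * centered
    q = np.clip(q, 0.0, None)
    return q / q.sum()

def iterated_best_response(train_epoch_erm, eval_losses, p_train, rho, epochs):
    """One best-response re-weighting per epoch: run ERM on data re-sampled
    according to q, then move q to the worst weighting for the new model."""
    q = np.asarray(p_train, dtype=float)
    for _ in range(epochs):
        train_epoch_erm(q)                  # ERM training on re-sampled data
        q = best_response_q(eval_losses(), p_train, rho)
    return q
```

For instance, with per-pair losses (1.0, 1.0, 2.0) over three pairs, the best response shifts sampling mass onto the third pair while staying close to the training proportions.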
Study subjects and analysis
language pairs: 3
A weakness of the objective (2) is that, apart from the language pair with the largest loss, it does not take into account the value of the loss on the other language pairs. To illustrate this, consider an example with N = 3 language pairs and suppose there exist two parameter settings θ1 and θ2 with the following losses: L(θ1; D1) = 0.1, L(θ1; D2) = 0.1, L(θ1; D3) = 1.1; L(θ2; D1) = 1.0, L(θ2; D2) = 1.0, L(θ2; D3) = 1.0. We then have L_GDRO(θ1; D) = 1.1 but L_GDRO(θ2; D) = 1.0, so Group DRO prefers θ2 even though θ1 attains much lower loss on two of the three pairs. As we previewed in §2, Group DRO is a natural objective for the multilingual setting; however, in experiments we found that naively applying existing DRO objectives fails to achieve performance on par with strong baselines, often improving results on language pairs with high losses while sacrificing too much performance overall.
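The comparison in the toy example above can be checked numerically; a minimal sketch:

```python
import numpy as np

# Per-language-pair losses for the two parameter settings in the example.
losses_theta1 = np.array([0.1, 0.1, 1.1])
losses_theta2 = np.array([1.0, 1.0, 1.0])

# Group DRO (objective (2)) scores a model only by its worst-group loss,
# so it prefers theta2 despite theta1 being far better on two of three pairs.
def group_dro(losses):
    return float(losses.max())

print(group_dro(losses_theta1))  # 1.1
print(group_dro(losses_theta2))  # 1.0
```

This is exactly the over-conservatism the paper's alternative criterion is designed to fix.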
sets of language pairs: 3
Our method, which we refer to as χ-IBR, incurs negligible additional computational cost compared to ERM. While the method applies to essentially any multilingual task, we specifically demonstrate its benefit on three sets of language pairs from two multilingual machine translation datasets. We experimentally test these choices by comparing several objectives and optimization algorithms, and the results show that our method consistently outperforms existing DRO procedures and various strong baselines.
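The ERM baselines compared throughout sample language pairs with a temperature τ. A minimal sketch of that heuristic, assuming the standard p_i ∝ n_i^(1/τ) scheme of Arivazhagan et al. (2019); the function name is ours:

```python
import numpy as np

def temperature_sampling(sizes, tau):
    """Sampling distribution over language pairs: p_i proportional to
    n_i^(1/tau). tau = 1 is proportional sampling; a large tau approaches
    uniform sampling, up-weighting low-resource pairs."""
    p = np.asarray(sizes, dtype=float) ** (1.0 / tau)
    return p / p.sum()

# Training-data ratios for de, fr, ta, tr from Table 2.
sizes = [0.499, 0.359, 0.102, 0.039]
proportional = temperature_sampling(sizes, tau=1)    # follows data sizes
near_uniform = temperature_sampling(sizes, tau=100)  # roughly 0.25 per pair
```

As noted above, τ trades off HRL and LRL performance and must be tuned by hand; χ-IBR removes the need to pick this knob.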
- Roee Aharoni, Melvin Johnson, and Orhan Firat. 2019. Massively multilingual neural machine translation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 3874–3884.
- Naveen Arivazhagan, Ankur Bapna, Orhan Firat, Dmitry Lepikhin, Melvin Johnson, Maxim Krikun, Mia Xu Chen, Yuan Cao, George Foster, Colin Cherry, et al. 2019. Massively multilingual neural machine translation in the wild: Findings and challenges. arXiv preprint arXiv:1907.05019.
- Aharon Ben-Tal, Dick den Hertog, Anja De Waegenaere, Bertrand Melenberg, and Gijs Rennen. 2013. Robust solutions of optimization problems affected by uncertain probabilities. Management Science, 59(2):341–357.
- Dimitris Bertsimas, Vishal Gupta, and Nathan Kallus. 2018. Data-driven robust optimization. Mathematical Programming, Series A, 167(2):235–292.
- Stephen Boyd and Lieven Vandenberghe. 2004. Convex Optimization. Cambridge University Press.
- Sébastien Bubeck. 2015. Convex optimization: Algorithms and complexity. Foundations and Trends in Machine Learning, 8(3-4):231–357.
- Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Édouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2020. Unsupervised cross-lingual representation learning at scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8440–8451.
- Imre Csiszár. 1967. Information-type measures of difference of probability distributions and indirect observation. Studia Scientiarum Mathematicarum Hungarica, 2:299–318.
- Erick Delage and Yinyu Ye. 2010. Distributionally robust optimization under moment uncertainty with application to data-driven problems. Operations Research, 58(3):595–612.
- Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT, pages 4171–4186.
- John Duchi, Peter Glynn, and Hongseok Namkoong. 2016. Statistics of robust optimization: A generalized empirical likelihood approach. arXiv preprint arXiv:1610.03425.
- John C. Duchi and Hongseok Namkoong. 2019. Variance-based regularization with convex objectives. Journal of Machine Learning Research, 20(68):1–55.
- John C. Duchi and Hongseok Namkoong. 2020. Learning models with uniform performance via distributionally robust optimization. Annals of Statistics, to appear.
- Orhan Firat, Kyunghyun Cho, and Yoshua Bengio. 2016. Multi-way, multilingual neural machine translation with a shared attention mechanism. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 866–875.
- Thanh-Le Ha, Jan Niehues, and Alexander Waibel. 2016. Toward multilingual neural machine translation with universal encoder and decoder. arXiv preprint arXiv:1611.04798.
- Tatsunori Hashimoto, Megha Srivastava, Hongseok Namkoong, and Percy Liang. 2018. Fairness without demographics in repeated loss minimization. In Proceedings of the 35th International Conference on Machine Learning.
- Melvin Johnson, Mike Schuster, Quoc Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda Viégas, Martin Wattenberg, Greg Corrado, et al. 2017. Google's multilingual neural machine translation system: Enabling zero-shot translation. Transactions of the Association for Computational Linguistics, 5:339–351.
- Diederik Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
- Taku Kudo and John Richardson. 2018. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 66–71.
- Daniel Levy, Yair Carmon, John C. Duchi, and Aaron Sidford. 2020. Large-scale methods for distributionally robust optimization. In Advances in Neural Information Processing Systems 33.
- Hongseok Namkoong and John C. Duchi. 2016. Stochastic gradient methods for distributionally robust optimization with f-divergences. In Advances in Neural Information Processing Systems 29.
- Arkadi Nemirovski. 2004. Prox-method with rate of convergence O(1/t) for variational inequalities with Lipschitz continuous monotone operators and smooth convex-concave saddle point problems. SIAM Journal on Optimization, 15(1):229–251.
- Arkadi Nemirovski, Anatoli Juditsky, Guanghui Lan, and Alexander Shapiro. 2009. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609.
- Graham Neubig and Junjie Hu. 2018. Rapid adaptation of neural machine translation to new languages. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 875–880.
- Yonatan Oren, Shiori Sagawa, Tatsunori Hashimoto, and Percy Liang. 2019. Distributionally robust language modeling. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4218–4228.
- Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. fairseq: A fast, extensible toolkit for sequence modeling. In Proceedings of NAACL-HLT 2019: Demonstrations.
- Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In ACL 2002.
- Telmo Pires, Eva Schlinger, and Dan Garrette. 2019. How multilingual is multilingual BERT? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4996–5001.
- Matt Post. 2018. A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 186–191, Brussels, Belgium. Association for Computational Linguistics.
- Ye Qi, Devendra Sachan, Matthieu Felix, Sarguna Padmanabhan, and Graham Neubig. 2018. When and why are pre-trained word embeddings useful for neural machine translation? In Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL), New Orleans, USA.
- Tim Roughgarden. 2016. Twenty Lectures on Algorithmic Game Theory. Cambridge University Press.
- Shiori Sagawa, Pang Wei Koh, Tatsunori B. Hashimoto, and Percy Liang. 2020. Distributionally robust neural networks for group shifts: On the importance of regularization for worst-case generalization. In International Conference on Learning Representations (ICLR), Addis Ababa, Ethiopia.
- Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.
- Xinyi Wang, Yulia Tsvetkov, and Graham Neubig. 2020a. Balancing training for multilingual neural machine translation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8526–8537.
- Zirui Wang, Zachary C. Lipton, and Yulia Tsvetkov. 2020b. On negative interference in multilingual language models. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 4438–4450.
- Zirui Wang, Yulia Tsvetkov, Orhan Firat, and Yuan Cao. 2021. Gradient vaccine: Investigating and improving multi-task optimization in massively multilingual models. In International Conference on Learning Representations.
- Barret Zoph, Deniz Yuret, Jonathan May, and Kevin Knight. 2016. Transfer learning for low-resource neural machine translation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1568–1575.
- In this section, we describe the bisection procedure we use to solve the best response and update q as shown in (7). This derivation exists generically in the literature (e.g., Appendix A.1.2 of Levy et al., 2020), but we specialize it to the χ²-ball centered at p_train and include it here for completeness.
- Primal-dual algorithms (Nemirovski, 2004; Nemirovski et al., 2009) are the methods of choice to efficiently solve min-max problems.
- At each step, take gradient steps along g_{x,t} ∈ ∂_x F(x_t, y_t) and g_{y,t} ∈ ∂_y F(x_t, y_t). After T steps, return the averaged iterate z̄_T := (1/T) Σ_{t≤T} z_t, where z_t = (x_t, y_t). Assuming F is appropriately Lipschitz and that X and Y are bounded, one finds an ε-approximate saddle point in O(ε⁻²) steps. Importantly, these guarantees still hold when only stochastic unbiased estimates of g_x and g_y are available (Nemirovski et al., 2009), which is essential in large-scale settings.
- This results in the standard SGD update θ_{t+1} = θ_t − η g_θ, where g_θ is an unbiased stochastic gradient estimate that we compute following the previous section.
- This leads to the update q_{t+1} = argmin_{q ∈ Δ_N : χ²(q, p_train) ≤ ρ} ‖(q_t + η g_q) − q‖₂², in other words, the projection of the gradient ascent step (q_t + η g_q) onto the χ²-ball. We explain in Appendix B.3 how to efficiently compute this projection for arbitrary p_train.
- q-update for U = U_α. We follow the implementation of Oren et al. (2019), which runs a hybrid between primal-dual methods and best response, so we do not need to explicitly include the projection onto the CVaR uncertainty set; we discuss the option here for completeness. For the CVaR uncertainty set, it is standard to also use the negative Shannon entropy h(q) = Σ_i q_i log q_i. We refer to Appendix F.6.2 of Levy et al. (2020) and their provided code for more details on this projection.
- The χ² divergence is an instance of an f-divergence (Csiszár, 1967).
- While this projection is standard in the literature, it is often derived for the uniform case p_train = 1/m (see, e.g., Namkoong and Duchi, 2016). We show here how to efficiently perform the projection for arbitrary p_train ∈ Δ_m.
- In contrast to the uniform case (i.e., p_train = 1/m), one cannot derive the optimal dual variable λ* in closed form, and we have to solve for both dual variables. Finding an ε-accurate solution takes O(m log(1/ε)) time using cutting-plane-type methods when the dimension is O(1) (Bubeck, 2015). In the large-scale applications we consider, this is negligible compared to computing the gradient of the loss with respect to the network parameters, so the primal-dual algorithm incurs (almost) no additional computational overhead.
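The two-dual-variable structure described above can be sketched as follows. This is an illustrative reimplementation, not the paper's exact solver: it bisects on the ball multiplier λ and, for each candidate λ, runs an inner bisection on the simplex multiplier μ, instead of the cutting-plane method mentioned above.

```python
import numpy as np

def chi2(q, p):
    # chi-squared divergence between q and the training distribution p
    return np.sum((q - p) ** 2 / p)

def _q_of(v, p, lam, mu):
    # KKT stationary point of  min 1/2 ||q - v||^2  subject to the simplex
    # and the chi^2 ball, given ball multiplier lam and simplex multiplier mu
    return np.clip((v + lam - mu) / (1.0 + lam / p), 0.0, None)

def _solve_mu(v, p, lam, iters=80):
    # inner bisection on mu so that q sums to one (the sum decreases in mu)
    lo = np.min(v + lam) - (1.0 + lam / np.min(p))
    hi = np.max(v + lam)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if _q_of(v, p, lam, mid).sum() > 1.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def project_chi2_ball(v, p, rho, iters=60):
    """Project v onto {q in simplex : chi2(q, p) <= rho} by bisecting on the
    ball's dual variable lam; for each candidate, the inner bisection restores
    the simplex constraint. As lam grows, q is pulled back toward p."""
    q = _q_of(v, p, 0.0, _solve_mu(v, p, 0.0))   # plain simplex projection
    if chi2(q, p) <= rho:                        # ball constraint inactive
        return q
    lam_lo, lam_hi = 0.0, 1.0
    while chi2(_q_of(v, p, lam_hi, _solve_mu(v, p, lam_hi)), p) > rho:
        lam_hi *= 2.0                            # grow until the ball binds
    for _ in range(iters):
        lam = 0.5 * (lam_lo + lam_hi)
        q = _q_of(v, p, lam, _solve_mu(v, p, lam))
        if chi2(q, p) > rho:
            lam_lo = lam                         # still outside the ball
        else:
            lam_hi = lam                         # feasible: tighten from above
    return _q_of(v, p, lam_hi, _solve_mu(v, p, lam_hi))
```

Calling `project_chi2_ball(q_t + eta * g_q, p_train, rho)` then implements the projected gradient ascent q-update from the appendix.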