Token-level Adaptive Training for Neural Machine Translation

EMNLP 2020, pp. 1035–1046.


Abstract

There exists a token imbalance phenomenon in natural language as different tokens appear with different frequencies, which leads to different learning difficulties for tokens in Neural Machine Translation (NMT). The vanilla NMT model usually adopts trivial equal-weighted objectives for target tokens with different frequencies and tends to generate more high-frequency tokens and fewer low-frequency tokens than the references.

Introduction
  • Neural machine translation (NMT) systems (Kalchbrenner and Blunsom, 2013; Cho et al., 2014; Sutskever et al., 2014; Bahdanau et al., 2015; Gehring et al., 2017; Vaswani et al., 2017) are data-driven models, which highly depend on the training corpus.
  • The source sentence is mapped to the sequence of vectors [E_x[x_1]; ...; E_x[x_J]], where E_x[x_j] is the sum of the word embedding and the position embedding of the source word x_j (a sketch of this input layer follows this list).
  • This input sequence of vectors is fed into the encoder, and the output of the N-th layer is taken as the source hidden states.
  • In addition to the same two kinds of sublayers found in each encoder layer, a third cross-attention sublayer is inserted between them in each decoder layer, which attends over the source hidden states.
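Below is a minimal sketch of the encoder input layer described in this list, assuming a standard Transformer setup (Vaswani et al., 2017) with sinusoidal position embeddings; the class name and dimensions are illustrative, not the authors' released code.

```python
import math
import torch
import torch.nn as nn

class EncoderInput(nn.Module):
    """Maps source token ids to E(x_j) = word embedding + position embedding."""

    def __init__(self, vocab_size: int, d_model: int = 512, max_len: int = 1024):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, d_model)
        # Precompute sinusoidal position embeddings (Vaswani et al., 2017).
        pos = torch.arange(max_len).unsqueeze(1).float()
        div = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        self.register_buffer("pos_emb", pe)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, J) token ids -> (batch, J, d_model) input vectors for the encoder
        return self.word_emb(x) + self.pos_emb[: x.size(1)]
```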
Highlights
  • Neural machine translation (NMT) systems (Kalchbrenner and Blunsom, 2013; Cho et al., 2014; Sutskever et al., 2014; Bahdanau et al., 2015; Gehring et al., 2017; Vaswani et al., 2017) are data-driven models, which highly depend on the training corpus
  • We propose two heuristic criteria for designing token-level adaptive objectives and present two specific forms to alleviate the problems brought by the token imbalance phenomenon
  • To verify their effects on high- and low-frequency tokens, we divided the validation set into two subsets based on the average token frequency of the sentences (a splitting sketch follows this list); the results are given in Table 2. They show that although these two methods bring modest improvements on the translation of low-frequency tokens, they do much harm to high-frequency tokens, which has a negative impact on the overall performance. We noted that both methods reduce the weights of the high-frequency tokens to different degrees, and we argue that, since high-frequency tokens account for a large proportion of the Neural Machine Translation (NMT) corpus, this hinders their normal training
  • We focus on the token imbalance problem of NMT
  • We show that the output of vanilla NMT contains more high-frequency tokens and has lower lexical diversity
  • Our methods can improve the translation performance without extra cost and can be combined with other techniques
  • We investigated existing adaptive objectives for other tasks and proposed two heuristic criteria based on the observations
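As a hedged illustration of the Low/High validation split mentioned above (not the authors' released code), one way to rank sentences by the average training-set frequency of their tokens:

```python
from collections import Counter
from typing import List, Tuple

def split_by_avg_frequency(
    sentences: List[List[str]], train_counts: Counter
) -> Tuple[List[List[str]], List[List[str]]]:
    """Return (low-frequency subset, high-frequency subset) of the sentences."""
    def avg_freq(sent: List[str]) -> float:
        # Average training-set frequency of the tokens in this sentence.
        return sum(train_counts[t] for t in sent) / max(len(sent), 1)

    ranked = sorted(sentences, key=avg_freq)
    mid = len(ranked) // 2
    return ranked[:mid], ranked[mid:]
```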
Methods
  • The authors' work aims to explore suitable adaptive objectives that can improve the learning of low-frequency tokens and avoid harming the translation quality of high-frequency tokens.
  • The authors first investigated two existing adaptive objectives, which were proposed for solving the token imbalance problems in other tasks, and analyzed their performance.
  • There are some existing adaptive objectives which have been proven effective for other tasks.
  • The first objective the authors investigated is the form of the focal loss (Lin et al., 2017), which was proposed for the label imbalance problem in object detection: w(y_i) = (1 − p(y_i))^γ
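A minimal sketch of token-level adaptive objectives in a PyTorch-style training loop: the focal form quoted above, plus an illustrative frequency-based weighting with a temperature T. The exponential form is an assumption for illustration; the paper's exact weighting functions are given in the paper itself, not this summary.

```python
import torch
import torch.nn.functional as F

def focal_weighted_loss(logits: torch.Tensor, targets: torch.Tensor,
                        gamma: float = 2.0) -> torch.Tensor:
    # logits: (N, vocab), targets: (N,) gold token ids.
    log_p = F.log_softmax(logits, dim=-1)
    nll = F.nll_loss(log_p, targets, reduction="none")  # -log p(y_i)
    p = nll.neg().exp()                                 # p(y_i)
    return ((1.0 - p) ** gamma * nll).mean()            # w(y_i) = (1 - p(y_i))^gamma

def freq_weighted_loss(logits: torch.Tensor, targets: torch.Tensor,
                       token_counts: torch.Tensor, T: float = 1000.0) -> torch.Tensor:
    # token_counts: (vocab,) training-set count of each target token.
    # Illustrative form: rare tokens get weights near 2, frequent ones near 1,
    # so low-frequency tokens are emphasized without zeroing out frequent ones.
    w = torch.exp(-token_counts[targets].float() / T) + 1.0
    nll = F.cross_entropy(logits, targets, reduction="none")
    return (w * nll).mean()
```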
Results
  • The results are shown in Table 4
  • It shows that the contrast methods cannot bring stable improvements over the baseline system.
  • They bring excessive damage to the translation of high-frequency tokens, which is confirmed by the analysis experiments.
  • The authors' methods bring stable improvements over Baseline-FT almost without any additional computing or storage expense.
  • More analyses based on token frequency are reported in the tables below
Conclusion
  • The authors show that the output of vanilla NMT contains more high-frequency tokens and has lower lexical diversity.
  • To alleviate this problem, the authors investigated existing adaptive objectives for other tasks and proposed two heuristic criteria based on the observations.
  • The final results show that the methods can achieve significant improvement in performance, especially on sentences that contain more low-frequency tokens.
  • Further analyses show that the methods can improve the lexical diversity of the translations
Tables
  • Table1: The average frequency on the NIST training set and the proportion of tokens with different frequencies in the reference and in the translation of the vanilla NMT model (a Transformer model) on the NIST test sets. All the target tokens (BPE sub-words with 30K merge operations) of the training set are ranked by their frequencies in descending order. The 'Token Order' column gives the frequency-rank interval ([10%, 30%) means the token's frequency rank lies between the top 10% and the top 30%). The 'Average Frequency' column gives the average frequency of the tokens in each interval, which shows the token imbalance phenomenon in natural language. The last two columns show that the vanilla NMT model tends to generate more high-frequency tokens and fewer low-frequency tokens than the reference
  • Table2: BLEU on the validation set of the Chinese-English translation task. 'Low' is the subset of the validation set that contains more low-frequency tokens, while 'High' contains more high-frequency tokens
  • Table3: Performance of our methods on the validation sets for all three language pairs with different hyperparameters T. Although the best hyperparameter may differ across languages, our method easily obtains a stable improvement
  • Table4: BLEU scores on three translation tasks. The ∆ column shows the improvement compared to Baseline-FT. ** and * mean the improvement over Baseline-FT is statistically significant (Collins et al., 2005) (p < 0.01 and p < 0.05, respectively). The results show that our methods achieve significant improvements in translation quality
  • Table5: BLEU scores on different test subsets, which are grouped by their rarities according to Eq. 10. Sentences in the 'Low' subset contain more low-frequency tokens, while those in the 'High' subset contain more high-frequency tokens. The results show that our methods can improve the translation of low-frequency tokens significantly without hurting the translation of high-frequency tokens
  • Table6: EN→DE BLEU scores on different test subsets. The conclusion is identical to that in Table 5
  • Table7: The lexical diversity of translations. A larger value represents higher diversity. The results show that our methods can improve lexical diversity (a simplified diversity metric is sketched after this list)
  • Table8: Translation examples from Baseline-FT and our methods. The results show that our methods can generate low-frequency but more accurate tokens
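As a simplified stand-in for the diversity measurement behind Table 7 (the paper cites McCarthy and Jarvis, 2010, whose MTLD measure corrects for text length, so this is only a rough proxy), a plain type-token ratio over the system output can be computed like this:

```python
from typing import List

def type_token_ratio(sentences: List[List[str]]) -> float:
    """Unique tokens over total tokens across the output; higher = more diverse."""
    tokens = [t for sent in sentences for t in sent]
    return len(set(tokens)) / max(len(tokens), 1)
```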
Related work
  • Rare Word Translation. Rare word translation is one of the key challenges for NMT. Word-level NMT models are limited in handling large vocabularies because of the training complexity and computational expense. Some work tries to solve this problem by maintaining phrase tables or a back-off vocabulary (Luong et al., 2015; Jean et al., 2015; Li et al., 2016). Subword-based NMT (Sennrich et al., 2016; Luong and Manning, 2016; Wu et al., 2016) greatly reduces the vocabulary size and has gradually become the mainstream technology. Gowda and May (2020) gave a detailed analysis of the effects of the BPE size on the data distribution and translation quality. Some recent work tried to further improve the translation of rare words with the help of memory networks or pointer networks (Zhao et al., 2018; Pham et al., 2018). In contrast, our methods can improve the translation performance without extra cost and can be combined with other techniques.
Funding
  • This work was supported by the National Key R&D Program of China (No. 2017YFE0192900)

Reference
  • Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In 3rd International Conference on Learning Representations, ICLR.
  • Maher Baloch and Muhammad Rafi. 2015. An investigation on topic maps based document classification with unbalance classes. Journal of Independent Studies and Research, 13(1):50.
  • Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP, pages 1724–1734.
  • Chenhui Chu, Raj Dabre, and Sadao Kurohashi. 2017. An empirical comparison of simple domain adaptation methods for neural machine translation. arXiv preprint arXiv:1701.03214.
  • Michael Collins, Philipp Koehn, and Ivona Kucerova. 2005. Clause restructuring for statistical machine translation. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics, pages 531–540. Association for Computational Linguistics.
  • Sergey Edunov, Myle Ott, and Sam Gross. 2017. fairseq. https://github.com/pytorch/fairseq.
  • Katerina T. Frantzi, Sophia Ananiadou, and Junichi Tsujii. 1998. The C-value/NC-value method of automatic recognition for multi-word terms. In International Conference on Theory and Practice of Digital Libraries, pages 585–604. Springer.
  • Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N. Dauphin. 2017. Convolutional sequence to sequence learning. In Proceedings of the 34th International Conference on Machine Learning, ICML, pages 1243–1252.
  • Thamme Gowda and Jonathan May. 2020. Neural machine translation with imbalanced classes. CoRR, abs/2004.02334.
  • Shuhao Gu, Yang Feng, and Qun Liu. 2019. Improving domain adaptation translation with domain invariant and specific information. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), pages 3081–3091.
  • Caglar Gulcehre, Sungjin Ahn, Ramesh Nallapati, Bowen Zhou, and Yoshua Bengio. 2016. Pointing the unknown words. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016, August 7-12, 2016, Berlin, Germany, Volume 1: Long Papers.
  • Sebastien Jean, KyungHyun Cho, Roland Memisevic, and Yoshua Bengio. 2015. On using very large target vocabulary for neural machine translation. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing, ACL 2015, July 26-31, 2015, Beijing, China, Volume 1: Long Papers, pages 1–10.
  • Shaojie Jiang, Pengjie Ren, Christof Monz, and Maarten de Rijke. 2019. Improving neural response diversity with frequency-aware cross-entropy loss. In The World Wide Web Conference, WWW 2019, San Francisco, CA, USA, May 13-17, 2019, pages 2879–2885.
  • Shaojie Jiang and Maarten de Rijke. 2018. Why are sequence-to-sequence models so dull? Understanding the low-diversity problem of chatbots. In Proceedings of the 2nd International Workshop on Search-Oriented Conversational AI, SCAI@EMNLP 2018, Brussels, Belgium, October 31, 2018, pages 81–86.
  • Justin M. Johnson and Taghi M. Khoshgoftaar. 2019. Survey on deep learning with class imbalance. Journal of Big Data, 6(1):27.
  • Nal Kalchbrenner and Phil Blunsom. 2013. Recurrent continuous translation models. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1700–1709.
  • Tom Kocmi and Ondrej Bojar. 2017. Curriculum learning and minibatch bucketing in neural machine translation. In Proceedings of the International Conference Recent Advances in Natural Language Processing, RANLP 2017, Varna, Bulgaria, September 2-8, 2017, pages 379–386.
  • Jason Lee, Elman Mansimov, and Kyunghyun Cho. 2018. Deterministic non-autoregressive neural sequence modeling by iterative refinement. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1173–1182.
  • Xiaoqing Li, Jiajun Zhang, and Chengqing Zong. 2016. Towards zero unknown word in neural machine translation. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, pages 2852–2858. AAAI Press.
  • Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollar. 2017. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, pages 2980–2988.
  • Minh-Thang Luong and Christopher D. Manning. 2015. Stanford neural machine translation systems for spoken language domains. In Proceedings of the International Workshop on Spoken Language Translation, pages 76–79.
  • Minh-Thang Luong and Christopher D. Manning. 2016. Achieving open vocabulary neural machine translation with hybrid word-character models. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016, August 7-12, 2016, Berlin, Germany, Volume 1: Long Papers.
  • Thang Luong, Ilya Sutskever, Quoc V. Le, Oriol Vinyals, and Wojciech Zaremba. 2015. Addressing the rare word problem in neural machine translation. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing, ACL 2015, July 26-31, 2015, Beijing, China, Volume 1: Long Papers, pages 11–19.
  • Philip M. McCarthy and Scott Jarvis. 2010. MTLD, vocd-D, and HD-D: A validation study of sophisticated approaches to lexical diversity assessment. Behavior Research Methods, 42(2):381–392.
  • Nir Ofek, Lior Rokach, Roni Stern, and Asaf Shabtai. 2017. Fast-CBUS: A fast clustering-based undersampling method for addressing the class imbalance problem. Neurocomputing, 243:88–102.
  • Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, July 6-12, 2002, Philadelphia, PA, USA, pages 311–318.
  • Peyman Passban, Andy Way, and Qun Liu. 2018. Tailoring neural architectures for translating from morphologically rich languages. In Proceedings of the 27th International Conference on Computational Linguistics, COLING 2018, Santa Fe, New Mexico, USA, August 20-26, 2018, pages 3134–3145.
  • Gabriel Pereyra, George Tucker, Jan Chorowski, Lukasz Kaiser, and Geoffrey E. Hinton. 2017. Regularizing neural networks by penalizing confident output distributions. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Workshop Track Proceedings.
  • Ngoc-Quan Pham, Jan Niehues, and Alexander H. Waibel. 2018. Towards one-shot learning for rare-word translation with external experts. In Proceedings of the 2nd Workshop on Neural Machine Translation and Generation, NMT@ACL 2018, Melbourne, Australia, July 20, 2018, pages 100–109.
  • Emmanouil Antonios Platanios, Otilia Stretcu, Graham Neubig, Barnabas Poczos, and Tom M. Mitchell. 2019. Competence-based curriculum learning for neural machine translation. In NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), pages 1162–1172.
  • Marc'Aurelio Ranzato, Sumit Chopra, Michael Auli, and Wojciech Zaremba. 2016. Sequence level training with recurrent neural networks. In 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings.
  • Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL.
  • Chenze Shao, Xilin Chen, and Yang Feng. 2018. Greedy search with probabilistic n-gram matching for neural machine translation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018, pages 4778–4784.
  • Xu Sun, Wenjie Li, Houfeng Wang, and Qin Lu. 2014. Feature-frequency-adaptive on-line training for fast and accurate natural language processing. Computational Linguistics, 40(3):563–586.
  • Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112.
  • Mildred C. Templin. 1957. Certain Language Skills in Children: Their Development and Interrelationships.
  • Eva Vanmassenhove, Christian Hardmeier, and Andy Way. 2019a. Getting gender right in neural machine translation. CoRR, abs/1909.05088.
  • Eva Vanmassenhove, Dimitar Shterionov, and Andy Way. 2019b. Lost in translation: Loss and decay of linguistic richness in machine translation. In Proceedings of Machine Translation Summit XVII Volume 1: Research Track, MTSummit 2019, Dublin, Ireland, August 19-23, 2019, pages 222–232.
  • Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems, pages 5998–6008.
  • Thuy Vu, Aiti Aw, and Min Zhang. 2008. Term extraction through unithood and termhood unification. In Proceedings of the Third International Joint Conference on Natural Language Processing: Volume-II.
  • Rui Wang, Masao Utiyama, Lemao Liu, Kehai Chen, and Eiichiro Sumita. 2017. Instance weighting for neural machine translation domain adaptation. In EMNLP 2017, Copenhagen, Denmark, September 9-11, 2017, pages 1482–1488.
  • Shuo Wang, Zhaopeng Tu, Shuming Shi, and Yang Liu. 2020. On the inference calibration of neural machine translation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020, pages 3070–3079.
  • Wei Wei, Jinjiu Li, Longbing Cao, Yuming Ou, and Jiahang Chen. 2013. Effective detection of sophisticated online banking fraud on extremely imbalanced data. World Wide Web, 16(4):449–475.
  • Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.
  • Shiqi Zhang and Deyi Xiong. 2018. Sentence weighting for neural machine translation domain adaptation. In Proceedings of the 27th International Conference on Computational Linguistics, pages 3181–3190.
  • Wen Zhang, Yang Feng, Fandong Meng, Di You, and Qun Liu. 2019. Bridging the gap between training and inference for neural machine translation. In Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28 - August 2, 2019, Volume 1: Long Papers, pages 4334–4343.
  • Xuan Zhang, Gaurav Kumar, Huda Khayrallah, Kenton Murray, Jeremy Gwinnup, Marianna J. Martindale, Paul McNamee, Kevin Duh, and Marine Carpuat. 2018. An empirical exploration of curriculum learning for neural machine translation. CoRR, abs/1811.00739.
  • Yang Zhao, Jiajun Zhang, Zhongjun He, Chengqing Zong, and Hua Wu. 2018. Addressing troublesome words in neural machine translation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 391–400.
  • Zhi-Hua Zhou and Xu-Ying Liu. 2005. Training cost-sensitive neural networks with methods addressing the class imbalance problem. IEEE Transactions on Knowledge and Data Engineering, 18(1):63–77.
  • George Kingsley Zipf. 1949. Human Behavior and the Principle of Least Effort.