ChrEn: Cherokee English Machine Translation for Endangered Language Revitalization
EMNLP 2020, pp.577-595, (2020)
Cherokee is a highly endangered Native American language spoken by the Cherokee people. The Cherokee culture is deeply embedded in its language. However, there are approximately only 2,000 fluent first-language Cherokee speakers remaining in the world, and the number is declining every year. To help save this endangered language, we introduce ChrEn, a Cherokee-English parallel dataset, to facilitate machine translation research between Cherokee and English.
- The Cherokee people are one of the indigenous peoples of the United States.
- Before the 1600s, they lived in what is now the southeastern United States (Peake Raymond, 2008).
- There are three federally recognized nations of Cherokee.
- Src. ᎥᏝ ᎡᎶᎯ ᎠᏁᎯ ᏱᎩ, ᎾᏍᎩᏯ ᎠᏴ ᎡᎶᎯ ᎨᎢ ᏂᎨᏒᎾ ᏥᎩ.
- Ref. They are not of the world, even as I am not of the world.
- NMT. I am not the world, even as I am not of the world.
- We apply three semi-supervised methods: using additional monolingual data to train the language model for Statistical Machine Translation (SMT) (Koehn and Knowles, 2017); incorporating BERT (Devlin et al., 2019) representations into Neural Machine Translation (NMT) (Zhu et al., 2020), where we introduce four different ways to use BERT; and back-translation for both SMT and NMT (Bertoldi and Federico, 2009; Lambert et al., 2011; Sennrich et al., 2016b).
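The back-translation step above can be sketched as a simple data-augmentation loop; `reverse_translate` below is a hypothetical stand-in for a trained target-to-source model, not the paper's actual system:

```python
def back_translate_augment(monolingual_target, reverse_translate, parallel):
    """Create synthetic (source, target) pairs from target-side
    monolingual sentences using a reverse (target->source) model,
    and append them to the real parallel data."""
    synthetic = [(reverse_translate(t), t) for t in monolingual_target]
    return parallel + synthetic

# Toy demonstration: a dummy reverse "model" that just tags its input.
dummy_reverse = lambda sentence: "<syn> " + sentence
augmented = back_translate_augment(
    ["hello world"], dummy_reverse, [("osiyo", "hello")]
)
```

In practice the forward model is then retrained on the augmented pairs; the synthetic source side is noisy, which is why back-translation helps most when the monolingual data matches the test domain.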
- Our main experimental results are shown in Table 3 and Table 4. Overall, the translation performance is poor compared with the results of some high-resource translations (Sennrich et al., 2016a), which means that current popular SMT and NMT techniques still struggle to translate well between Cherokee and English, especially under the out-of-domain condition.
- 10 http://data.statmt.org/news-crawl/en/
- 11 http://www.statmt.org/wmt18/index.html
- 12 http://opus.nlpl.eu/bible-uedin.php
- 13 BLEU+c.mixed+#.1+s.exp+tok.13a+v.1.4.4
- 14 The confidence intervals in Table 3 and Table 4 are computed by the bootstrap method (Efron and Tibshirani, 1994).
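The bootstrap method of footnote 14 can be sketched as follows. This toy version resamples per-example scores and takes percentiles of the resampled mean; the paper instead resamples (hypothesis, reference) pairs and recomputes corpus BLEU on each resample, but the percentile logic is the same:

```python
import random

def bootstrap_ci(scores, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for the mean of
    per-example scores: resample with replacement, recompute the
    statistic, and take the alpha/2 and 1-alpha/2 percentiles."""
    rng = random.Random(seed)
    n = len(scores)
    stats = []
    for _ in range(n_resamples):
        resample = [scores[rng.randrange(n)] for _ in range(n)]
        stats.append(sum(resample) / n)
    stats.sort()
    lo = stats[int((alpha / 2) * n_resamples)]
    hi = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

scores = [0.10, 0.30, 0.20, 0.40, 0.25, 0.15, 0.35, 0.30]
low, high = bootstrap_ci(scores)
```

For corpus BLEU, replace `scores` with index lists into the test set and recompute the metric per resample, since BLEU does not decompose into a per-sentence mean.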
- Back-translation with our Cherokee monolingual data barely improves performance for both in-domain and out-of-domain evaluations, probably because the monolingual data is out-of-domain: 72% of its unique Cherokee tokens are unseen in the whole parallel data.
- Experiments show that SMT is significantly better than NMT under the out-of-domain condition, while NMT is better for in-domain evaluation; semi-supervised learning, transfer learning, and multilingual joint training can all improve the supervised baselines.
- 5.1 Experimental Details
The authors randomly sample 5K-100K sentences from News Crawl 2017 as the English monolingual data.
- The authors conduct a small-scale human pairwise comparison, performed by a coauthor, between the translations generated by the NMT and SMT systems.
- The authors randomly sample 50 examples from Test or Out-test, anonymously shuffle the translations from the two systems, and ask the coauthor to choose which one they think is better.
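The anonymous shuffling described above can be sketched as a blinding step; the function name and data below are illustrative, not from the paper's code:

```python
import random

def make_blind_pairs(nmt_outputs, smt_outputs, seed=0):
    """For each example, present the two system outputs in a random
    order and record the hidden assignment, so the evaluator's
    choices can later be mapped back to systems."""
    rng = random.Random(seed)
    blind, key = [], []
    for nmt, smt in zip(nmt_outputs, smt_outputs):
        if rng.random() < 0.5:
            blind.append((smt, nmt))
            key.append(("SMT", "NMT"))
        else:
            blind.append((nmt, smt))
            key.append(("NMT", "SMT"))
    return blind, key

blind, key = make_blind_pairs(["nmt out 1", "nmt out 2"],
                              ["smt out 1", "smt out 2"])
```

The evaluator sees only `blind`; the `key` list stays hidden until votes are tallied, preventing preference for a known system.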
- For English-Cherokee translation, though RNN-NMT+BERT (N5) has a better BLEU score than SMT+BT (S3) (12.2 vs. 9.9), it is liked less by humans (21 vs. 29), indicating that BLEU is possibly not a suitable metric for Cherokee evaluation.
- A detailed study is beyond the scope of this paper but is an interesting direction for future work.
- Conclusion and Future Work
In this paper, the authors make an effort to revitalize the Cherokee language by introducing a clean Cherokee-English parallel dataset, ChrEn, with 14K sentence pairs, along with 5K Cherokee monolingual sentences.
15 The author who conducted this human study was not involved in the development of the MT systems.
- Experiments show that SMT is significantly better than NMT under the out-of-domain condition, while NMT is better for in-domain evaluation; semi-supervised learning, transfer learning, and multilingual joint training can all improve the supervised baselines.
- The authors' best models achieve 15.8/12.7 BLEU for in-domain Chr-En/En-Chr translations and 6.5/5.0 BLEU for out-of-domain Chr-En/En-Chr translations.
- The authors hope these diverse baselines will serve as useful strong starting points for future work by the community.
- The authors' future work involves converting the monolingual data to parallel and collecting more data from the news domain
- Table1: An example from the development set of ChrEn. NMT denotes our RNN-NMT model
- Table2: The key statistics of our parallel and monolingual data. Note that "% Unseen unique English tokens" is in terms of the Train split; for example, 13.3% of unique English tokens in Dev are unseen in Train.
- Table3: Performance of our supervised/semi-supervised SMT/NMT systems. Bold numbers are our best out-of-domain systems together with Table 4, selected by performance on Out-dev. (±x) shows the 95% confidence interval
- Table4: Performance of our transfer and multilingual learning systems. Bold numbers are our best in-domain systems together with Table 3, selected by the performance on Dev. (±x) shows the 95% confidence interval
- Table5: Human comparison between the translations generated by our NMT and SMT systems. For "A vs. B", "Win" or "Lose" means that the evaluator favors A or B, respectively. System IDs correspond to the IDs in Table 3
- Table6: The comparison between our parallel data and the data provided on OPUS
- Table7: The hyper-parameter settings of Supervised and Semi-supervised Cherokee-English NMT systems in Table 3. Empty fields indicate that hyper-parameter is the same as the previous (left) system
- Table8: The hyper-parameter settings of Transferring Cherokee-English NMT systems in Table 4. Empty fields indicate that hyper-parameter is the same as the previous (left) system
- Table9: The hyper-parameter settings of Multilingual Cherokee-English NMT systems in Table 4. Empty fields indicate that hyper-parameter is the same as the previous (left) system
- Table10: The hyper-parameter settings of in-domain Supervised and Semi-supervised English-Cherokee NMT systems in Table 3. Empty fields indicate that hyper-parameter is the same as the previous (left) system
- Table11: The hyper-parameter settings of out-of-domain Supervised and Semi-supervised English-Cherokee NMT systems in Table 3. Empty fields indicate that hyper-parameter is the same as the previous (left) system
- Table12: The hyper-parameter settings of Transferring English-Cherokee NMT systems in Table 4. Empty fields indicate that hyper-parameter is the same as the previous (left) system
- Table13: The hyper-parameter settings of Multilingual English-Cherokee NMT systems in Table 4. Empty fields indicate that hyper-parameter is the same as the previous (left) system
- Table14: Parallel Data Sources
- Table15: Monolingual Data Sources
- Cherokee Language Revitalization. In 2008, the Cherokee Nation launched a 10-year language preservation plan (Nation, 2001), which aims to have 80% or more of the Cherokee people be fluent in the language in 50 years. Since then, many revitalization efforts have been proposed. The Cherokee Nation and the EBCI have established language immersion programs and K-12 language curricula. Several universities, including the University of Oklahoma and Stanford University, have begun offering Cherokee as a second language. However, given that Cherokee has been rated at the highest level of learning difficulty (Peake Raymond, 2008), it is hard to master without frequent language exposure. As noted by Crystal (2014), an endangered language will progress if its speakers can make use of electronic technology. Currently, the language is included among existing Unicode-compatible fonts, is supported by Gmail, and has a Wikipedia page. To revitalize Cherokee, a few Cherokee pedagogical books have been published (Holmes and Smith, 1976; Joyner, 2014), as well as several online learning platforms. Feeling (2018) provided detailed English translations and linguistic analysis of a number of Cherokee stories. A Digital Archive for American Indian Languages Preservation and Perseverance (DAILP) has been developed for transcribing, translating, and contextualizing historical Cherokee-language documents (Bourns, 2019; Cushman, 2019). However, translation between Cherokee and English can still only be done by human translators. Given that only 2,000 fluent first-language speakers are left, and the majority of them are elders, it is important and urgent to have a machine translation system that could assist them with translation. Therefore, we introduce a clean Cherokee-English parallel dataset to facilitate machine translation development and propose multiple translation systems as starting points for future work. We hope our work could attract more attention to endangered languages.
- This work was supported by NSF-CAREER Award 1846185, ONR Grant N00014-18-1-2871, and faculty awards from Google, Facebook, and Microsoft.
- The views contained in this article are those of the authors and not of the funding agency.
Study subjects and analysis
Cherokee is the only language of the Southern Iroquoian branch of the Iroquoian family; the Northern Iroquoian branch includes the Five Nations Iroquois languages. The authors contribute to Cherokee revitalization by constructing a clean Cherokee-English parallel dataset, ChrEn, which results in 14,151 pairs of sentences with around 313K English tokens and 206K Cherokee tokens. They also collect 5,210 Cherokee monolingual sentences with 93K Cherokee tokens.
fluent first-language speakers: 2000
sentence pairs: 14151
This process is time-consuming and took several months. The resulting dataset consists of 14,151 sentence pairs. After tokenization, there are around 313K English tokens and 206K Cherokee tokens.
We split the data into training/development/testing sets. We separate all the sentence pairs from newspaper articles, 512 pairs in total, and randomly split them in half as out-of-domain development and testing sets, denoted Out-dev and Out-test. The remaining sentence pairs are randomly split into in-domain Train, Dev, and Test.
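The splitting procedure can be sketched as below; the `domain` field and the 90/5/5 in-domain ratio are assumptions for illustration, not the paper's exact split sizes:

```python
import random

def split_chren(pairs, seed=0):
    """pairs: (src, tgt, domain) tuples. Pairs from the 'news' domain
    become the out-of-domain Out-dev/Out-test halves; the remainder is
    randomly split into in-domain Train/Dev/Test (90/5/5 assumed)."""
    rng = random.Random(seed)
    news = [p for p in pairs if p[2] == "news"]
    rest = [p for p in pairs if p[2] != "news"]
    rng.shuffle(news)
    half = len(news) // 2
    out_dev, out_test = news[:half], news[half:]
    rng.shuffle(rest)
    n = len(rest)
    train = rest[:int(0.9 * n)]
    dev = rest[int(0.9 * n):int(0.95 * n)]
    test = rest[int(0.95 * n):]
    return train, dev, test, out_dev, out_test

# Toy run: 24 pairs, 4 of them from the newspaper domain.
toy = [(f"chr{i}", f"en{i}", "news" if i < 4 else "bible") for i in range(24)]
train, dev, test, out_dev, out_test = split_chren(toy)
```

Holding out an entire domain (rather than random sentences) is what makes the Out-dev/Out-test evaluation a genuine out-of-domain test.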
language pairs: 4
We randomly sample 5K-100K sentences (about 0.5-10 times the size of the parallel training set) from News Crawl 2017 as our English monolingual data. We randomly sample 12K-58K examples (about 1-5 times the size of the parallel training set) for each of the 4 language pairs (Czech/German/Russian/Chinese-English) from News Commentary v13 of WMT2018 and Bible-uedin (Christodouloupoulos and Steedman, 2015) on OPUS. We apply the tokenizer and truecaser from Moses (Koehn et al., 2007).
- Elizabeth Albee. 2017. Immersion schools and language learning: A review of cherokee language revitalization efforts among the eastern band of cherokee indians.
- Alexei Baevski and Michael Auli. 2019. Adaptive input representations for neural language modeling. In International Conference on Learning Representations.
- Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings.
- Nicola Bertoldi and Marcello Federico. 2009. Domain adaptation for statistical machine translation with monolingual resources. In Proceedings of the fourth workshop on statistical machine translation, pages 182–189.
- Jeffrey Bourns. 2019. Cherokee syllabary texts: Digital documentation and linguistic description. In 2nd Conference on Language, Data and Knowledge (LDK 2019). Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik.
- Christos Christodouloupoulos and Mark Steedman. 2015. A massively parallel corpus: the bible in 100 languages. Language resources and evaluation, 49(2):375–395.
- David Crystal. 2014. Language Death. Canto Classics. Cambridge University Press.
- Ellen Cushman. 2019. Language perseverance and translation of cherokee documents. College English, 82(1):115–134.
- Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186.
- Kevin Duh, Paul McNamee, Matt Post, and Brian Thompson. 2020. Benchmarking neural and statistical machine translation on low-resource African languages.
- D. Feeling. 1994. A structured approach to learning the basic inflections of the Cherokee verb. Indian University Press, Bacone College.
- Durbin Feeling. 2018. Cherokee Narratives: A Linguistic Study. University of Oklahoma Press.
- Benjamin Frey. 2020. “data is nice:” theoretical and pedagogical implications of an eastern cherokee corpus. LD&C Special Publication.
- Caglar Gulcehre, Orhan Firat, Kelvin Xu, Kyunghyun Cho, Loic Barrault, Huei-Chi Lin, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2015. On using monolingual corpora in neural machine translation. arXiv preprint arXiv:1503.03535.
- Francisco Guzmán, Peng-Jen Chen, Myle Ott, Juan Pino, Guillaume Lample, Philipp Koehn, Vishrav Chaudhary, and Marc’Aurelio Ranzato. 2019. The flores evaluation datasets for low-resource machine translation: Nepali–english and sinhala–english. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 6100– 6113.
- Kenneth Heafield, Ivan Pouzyrevsky, Jonathan H Clark, and Philipp Koehn. 2013. Scalable modified kneser-ney language model estimation. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 690–696.
- Ruth Bradley Holmes and Betty Sharp Smith. 1976. Beginning Cherokee. University of Oklahoma Press, Norman.
- Sébastien Jean, Orhan Firat, Kyunghyun Cho, Roland Memisevic, and Yoshua Bengio. 2015. Montreal neural machine translation systems for wmt’15. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 134–140.
- Melvin Johnson, Mike Schuster, Quoc V Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda Viégas, Martin Wattenberg, Greg Corrado, et al. 2017. Google’s multilingual neural machine translation system: Enabling zero-shot translation. Transactions of the Association for Computational Linguistics, 5:339–351.
- M. Joyner. 2014. Cherokee Language Lessons. Lulu Press, Incorporated.
- Guillaume Klein, Yoon Kim, Yuntian Deng, Jean Senellart, and Alexander Rush. 2017. OpenNMT: Opensource toolkit for neural machine translation. In Proceedings of ACL 2017, System Demonstrations, pages 67–72, Vancouver, Canada. Association for Computational Linguistics.
- Tom Kocmi and Ondrej Bojar. 2018. Trivial transfer learning for low-resource neural machine translation. WMT 2018, page 244.
- Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, et al. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th annual meeting of the association for computational linguistics companion volume proceedings of the demo and poster sessions, pages 177–180.
- Philipp Koehn and Rebecca Knowles. 2017. Six challenges for neural machine translation. In Proceedings of the First Workshop on Neural Machine Translation, pages 28–39.
- Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language TechnologyVolume 1, pages 48–54. Association for Computational Linguistics.
- Surafel M Lakew, Matteo Negri, and Marco Turchi. 2020. Low resource neural machine translation: A benchmark for five african languages. In AfricaNLP workshop at ICLR 2020.
- Patrik Lambert, Holger Schwenk, Christophe Servan, and Sadaf Abdul-Rauf. 2011. Investigations on translation model adaptation using monolingual data. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 284–293. Association for Computational Linguistics.
- Guillaume Lample, Alexis Conneau, Ludovic Denoyer, and Marc’Aurelio Ranzato. 2018. Unsupervised machine translation using monolingual corpora only. In International Conference on Learning Representations.
- Minh-Thang Luong, Hieu Pham, and Christopher D Manning. 2015. Effective approaches to attentionbased neural machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1412–1421.
- Brad Montgomery-Anderson. 2008. A reference grammar of Oklahoma Cherokee. Ph.D. thesis, University of Kansas.
- Cherokee Nation. 2001. Ga-du-gi: A vision for working together to preserve the cherokee language. report of a needs assessment survey and a 10-year language revitalization plan.
- Franz Josef Och. 2003. Minimum error rate training in statistical machine translation. In Proceedings of the 41st Annual Meeting on Association for Computational Linguistics-Volume 1, pages 160–167. Association for Computational Linguistics.
- Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.
- Kishore Papineni, Salim Roukos, Todd Ward, and WeiJing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting on association for computational linguistics, pages 311–318. Association for Computational Linguistics.
- M Peake Raymond. 2008. The cherokee nation and its language: Tsalagi ayeli ale uniwonishisdi. Tahlequah, OK: Cherokee Nation.
- Matt Post. 2018. A call for clarity in reporting bleu scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 186– 191.
- Ofir Press and Lior Wolf. 2017. Using the output embedding to improve language models. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 157–163.
- Hammam Riza, Michael Purwoadi, Teduh Uliniansyah, Aw Ai Ti, Sharifah Mahani Aljunied, Luong Chi Mai, Vu Tat Thang, Nguyen Phuong Thai, Vichet Chea, Sethserey Sam, et al. 2016. Introduction of the asian language treebank. In 2016 Conference of The Oriental Chapter of International Committee for Coordination and Standardization of Speech Databases and Assessment Techniques (O-COCOSDA), pages 1–6. IEEE.
- Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016a. Edinburgh neural machine translation systems for wmt 16. In Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers, pages 371–376.
- Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016b. Improving neural machine translation models with monolingual data. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 86–96.
- Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016c. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715– 1725.
- Rico Sennrich and Biao Zhang. 2019. Revisiting lowresource neural machine translation: A case study. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 211– 221.
- Stephanie Strassel and Jennifer Tracey. 2016. Lorelei language packs: Data, tools, and resources for technology development in low resource languages. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16), pages 3273–3280.
- Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. 2016. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2818–2826.
- Jörg Tiedemann. 2012. Parallel data, tools and interfaces in opus. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC’12), pages 2214–2218.
- Hiroto Uchihara. 2016. Tone and accent in Oklahoma Cherokee, volume 3. Oxford University Press.
- Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in neural information processing systems, pages 5998–6008.
- John Wieting, Taylor Berg-Kirkpatrick, Kevin Gimpel, and Graham Neubig. 2019. Beyond bleu: Training neural machine translation with semantic similarity. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4344–4355.
- Shiyue Zhang, Gulnigar Mahmut, Dong Wang, and Askar Hamdulla. 2017. Memory-augmented chinese-uyghur neural machine translation. In 2017 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), pages 1092–1096. IEEE.
- Jinhua Zhu, Yingce Xia, Lijun Wu, Di He, Tao Qin, Wengang Zhou, Houqiang Li, and Tieyan Liu. 2020. Incorporating bert into neural machine translation. In International Conference on Learning Representations.