
ChrEn: Cherokee English Machine Translation for Endangered Language Revitalization

EMNLP 2020, pp.577-595, (2020)


Abstract

Cherokee is a highly endangered Native American language spoken by the Cherokee people. The Cherokee culture is deeply embedded in its language. However, there are approximately only 2,000 fluent first-language Cherokee speakers remaining in the world, and the number is declining every year. To help save this endangered language, we introduce ChrEn, a Cherokee-English parallel dataset of 14,151 sentence pairs, along with 5,210 Cherokee monolingual sentences, to facilitate machine translation research between Cherokee and English.

Introduction
  • The Cherokee people are one of the indigenous peoples of the United States.
  • Before the 1600s, they lived in what is now the southeastern United States (Peake Raymond, 2008).
  • There are three federally recognized nations of Cherokee.
  • Src. ᎥᏝ ᎡᎶᎯ ᎠᏁᎯ ᏱᎩ, ᎾᏍᎩᏯ ᎠᏴ ᎡᎶᎯ ᎨᎢ ᏂᎨᏒᎾ ᏥᎩ.
  • Ref. They are not of the world, even as I am not of the world.
  • NMT I am not the world, even as I am not of the world.
Highlights
  • The Cherokee people are one of the indigenous peoples of the United States
  • We apply three semi-supervised methods: using additional monolingual data to train the language model for Statistical Machine Translation (SMT) (Koehn and Knowles, 2017); incorporating BERT (Devlin et al, 2019) representations for Neural Machine Translation (NMT) (Zhu et al, 2020), where we introduce four different ways to use BERT; and the back-translation method for both SMT and NMT (Bertoldi and Federico, 2009; Lambert et al, 2011; Sennrich et al, 2016b)
  • Our main experimental results are shown in Table 3 and Table 4. Overall, the translation performance is poor compared with the results of some high-resource translations (Sennrich et al, 2016a), which means that current popular SMT and NMT techniques still struggle to translate well between Cherokee and English, especially in the out-of-domain setting.
  • Footnotes: (10) http://data.statmt.org/news-crawl/en/; (11) http://www.statmt.org/wmt18/index.html; (12) http://opus.nlpl.eu/bible-uedin.php; (13) BLEU signature: BLEU+c.mixed+#.1+s.exp+tok.13a+v.1.4.4; (14) the confidence intervals in Table 3 and Table 4 are computed by the bootstrap method (Efron and Tibshirani, 1994)
  • Back-translation with our Cherokee monolingual data barely improves performance for both in-domain and out-of-domain evaluations, probably because the monolingual data is out-of-domain; 72% of its unique Cherokee tokens are unseen in the whole parallel data
  • Experiments show that SMT is significantly better than NMT under the out-of-domain condition while NMT is better for in-domain evaluation; and that semi-supervised learning, transfer learning, and multilingual joint training can improve supervised baselines
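The bootstrap confidence intervals mentioned in footnote 14 can be sketched as follows. This is a minimal, metric-agnostic stand-in: `exact_match` is a toy corpus-level score used for illustration, not the paper's BLEU implementation.

```python
import random

def bootstrap_ci(hyps, refs, score_fn, n_resamples=1000, alpha=0.05, seed=0):
    """Estimate a (1 - alpha) confidence interval for a corpus-level
    metric by resampling sentence pairs with replacement."""
    rng = random.Random(seed)
    n = len(hyps)
    scores = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]  # resample with replacement
        scores.append(score_fn([hyps[i] for i in idx], [refs[i] for i in idx]))
    scores.sort()
    lo = scores[int(n_resamples * alpha / 2)]
    hi = scores[int(n_resamples * (1 - alpha / 2)) - 1]
    return lo, hi

def exact_match(hyps, refs):
    """Toy corpus-level score standing in for BLEU (illustrative only)."""
    return sum(h == r for h, r in zip(hyps, refs)) / len(hyps)
```

In practice a real corpus-BLEU function (e.g. from sacreBLEU) would be passed as `score_fn`; the resampling loop itself does not depend on the metric.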
Results
  • 5.1 Experimental Details

    The authors randomly sample 5K-100K sentences from News Crawl 2017 as the English monolingual data.
  • The authors conduct a small-scale human pairwise comparison, performed by a coauthor, between the translations generated by the NMT and SMT systems.
  • The authors randomly sample 50 examples from Test or Out-test, anonymously shuffle the translations from the two systems, and ask the coauthor to choose which one they think is better.
  • For English-Cherokee translation, though RNN-NMT+BERT (N5) has a better BLEU score than SMT+BT (S3) (12.2 vs. 9.9), it is liked less by humans (21 vs. 29), indicating that BLEU is possibly not a suitable metric for Cherokee evaluation.
  • A detailed study is beyond the scope of this paper but is an interesting future work direction
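The anonymized pairwise protocol above (sample 50 examples, shuffle which system appears on which side, ask a judge to pick) can be sketched as below. All function names are illustrative, not from the paper's actual tooling:

```python
import random

def blind_pairs(sys_a, sys_b, k=50, seed=0):
    """Sample k test indices and randomize the left/right order of the
    two systems' outputs, so the judge sees anonymous pairs."""
    rng = random.Random(seed)
    chosen = rng.sample(range(len(sys_a)), k)
    pairs = []
    for i in chosen:
        a_is_left = rng.random() < 0.5
        left, right = (sys_a[i], sys_b[i]) if a_is_left else (sys_b[i], sys_a[i])
        # a_is_left is kept only for unblinding after judging
        pairs.append({"idx": i, "left": left, "right": right,
                      "a_is_left": a_is_left})
    return pairs

def tally_wins(pairs, choices):
    """choices[j] is 'left' or 'right'; return (wins for A, wins for B)."""
    wins_a = sum((c == "left") == p["a_is_left"]
                 for p, c in zip(pairs, choices))
    return wins_a, len(pairs) - wins_a
```

The judge would only be shown the `left`/`right` fields; the hidden `a_is_left` flag maps the choices back to systems for the final win/lose tally.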
Conclusion
  • Conclusion and Future Work

    In this paper, the authors make an effort to revitalize the Cherokee language by introducing a clean Cherokee-English parallel dataset, ChrEn, with 14K sentence pairs, along with 5K Cherokee monolingual sentences.

    (Footnote 15: The author who conducted the human study was not involved in the development of the MT systems. The per-system win/lose counts from the human comparison are given in Table 5.)
  • Experiments show that SMT is significantly better than NMT under the out-of-domain condition while NMT is better for in-domain evaluation; and that semi-supervised learning, transfer learning, and multilingual joint training can improve supervised baselines.
  • The authors' best models achieve 15.8/12.7 BLEU for in-domain Chr-En/En-Chr translations and 6.5/5.0 BLEU for out-of-domain Chr-En/En-Chr translations.
  • The authors hope these diverse baselines will serve as useful strong starting points for future work by the community.
  • The authors' future work involves converting the monolingual data into parallel data and collecting more data from the news domain.
Tables
  • Table1: An example from the development set of ChrEn. NMT denotes our RNN-NMT model
  • Table2: The key statistics of our parallel and monolingual data. Note that “% Unseen unique English tokens” is in terms of the Train split; for example, 13.3% of unique English tokens in Dev are unseen in Train
  • Table3: Performance of our supervised/semi-supervised SMT/NMT systems. Bold numbers are our best out-ofdomain systems together with Table 4, selected by performance on Out-dev. (±x) shows 95% confidence interval
  • Table4: Performance of our transfer and multilingual learning systems. Bold numbers are our best in-domain systems together with Table 3, selected by the performance on Dev. (±x) shows the 95% confidence interval
  • Table5: Human comparison between the translations generated from our NMT and SMT systems. For “A vs. B”, “Win” or “Lose” means that the evaluator favors A or B. System IDs correspond to the IDs in Table 3
  • Table6: The comparison between our parallel data and the data provided on OPUS
  • Table7: The hyper-parameter settings of Supervised and Semi-supervised Cherokee-English NMT systems in Table 3. Empty fields indicate that hyper-parameter is the same as the previous (left) system
  • Table8: The hyper-parameter settings of Transferring Cherokee-English NMT systems in Table 4. Empty fields indicate that hyper-parameter is the same as the previous (left) system
  • Table9: The hyper-parameter settings of Multilingual Cherokee-English NMT systems in Table 4. Empty fields indicate that hyper-parameter is the same as the previous (left) system
  • Table10: The hyper-parameter settings of in-domain Supervised and Semi-supervised English-Cherokee NMT systems in Table 3. Empty fields indicate that hyper-parameter is the same as the previous (left) system
  • Table11: The hyper-parameter settings of out-of-domain Supervised and Semi-supervised English-Cherokee NMT systems in Table 3. Empty fields indicate that hyper-parameter is the same as the previous (left) system
  • Table12: The hyper-parameter settings of Transferring English-Cherokee NMT systems in Table 4. Empty fields indicate that hyper-parameter is the same as the previous (left) system
  • Table13: The hyper-parameter settings of Multilingual English-Cherokee NMT systems in Table 4. Empty fields indicate that hyper-parameter is the same as the previous (left) system
  • Table14: Parallel Data Sources
  • Table15: Monolingual Data Sources
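Table 2's “% Unseen unique tokens” statistic (the same kind of figure behind the 72% of unique Cherokee monolingual tokens unseen in the parallel data) is straightforward to compute; a minimal sketch:

```python
def pct_unseen_unique(split_tokens, train_tokens):
    """Percentage of a split's unique tokens that never occur in Train."""
    split_vocab = set(split_tokens)
    unseen = split_vocab - set(train_tokens)
    return 100.0 * len(unseen) / len(split_vocab)
```

High values of this statistic signal a vocabulary mismatch between splits, which helps explain why out-of-domain translation and back-translation both suffer.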
Related work
  • Cherokee Language Revitalization. In 2008, the Cherokee Nation launched the 10-year language preservation plan (Nation, 2001), which aims to have 80% or more of the Cherokee people fluent in the language within 50 years. Since then, many revitalization efforts have followed. The Cherokee Nation and the EBCI have established language immersion programs and K-12 language curricula. Several universities, including the University of Oklahoma and Stanford University, have begun offering Cherokee as a second language. However, given that Cherokee has been rated at the highest level of learning difficulty (Peake Raymond, 2008), it is hard to master without frequent language exposure. As Crystal (2014) notes, an endangered language will progress if its speakers can make use of electronic technology. Currently, the language is included among existing Unicode-compatible fonts, is supported by Gmail, and has a Wikipedia page. To revitalize Cherokee, a few Cherokee pedagogical books have been published (Holmes and Smith, 1976; Joyner, 2014), as well as several online learning platforms.3 Feeling (2018) provided detailed English translations and linguistic analysis of a number of Cherokee stories. A Digital Archive for American Indian Languages Preservation and Perseverance (DAILP) has been developed for transcribing, translating, and contextualizing historical Cherokee language documents (Bourns, 2019; Cushman, 2019).4 However, translation between Cherokee and English can still only be done by human translators. Given that only 2,000 fluent first-language speakers are left, and the majority of them are elders, it is important and urgent to have a machine translation system that could assist them with translation. Therefore, we introduce a clean Cherokee-English parallel dataset to facilitate machine translation development and propose multiple translation systems as starting points for future work. We hope our work can attract more attention to the revitalization of endangered languages.
Funding
  • This work was supported by NSF-CAREER Award 1846185, ONR Grant N00014-18-1-2871, and faculty awards from Google, Facebook, and Microsoft
  • The views contained in this article are those of the authors and not of the funding agency.
Study subjects and analysis
pairs: 14151
We contribute to Cherokee revitalization by constructing a clean Cherokee-English parallel dataset, ChrEn, which results in 14,151 pairs of sentences with around 313K English tokens and 206K Cherokee tokens. We also collect 5,210 Cherokee monolingual sentences with 93K Cherokee tokens

fluent first-language speakers: 2000
However, the translation between Cherokee and English still can only be done by human translators. Given that only 2,000 fluent first-language speakers are left, and the majority of them are elders, it is important and urgent to have a machine translation system that could assist them with translation. Therefore, we introduce a clean Cherokee-English parallel dataset to facilitate machine translation development and propose multiple translation systems as starting points of future works

sentence pairs: 14151
This process is time-consuming and took several months. The resulting dataset consists of 14,151 sentence pairs. After tokenization, there are around 313K English tokens and 206K Cherokee tokens

pairs: 512
We separate all the sentence pairs from newspaper articles, 512 pairs in total, and randomly split them in half as out-of-domain development and testing sets, denoted by Out-dev and Out-test. The remaining sentence pairs are randomly split into in-domain Train, Dev, and Test
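The split described above can be sketched as follows. The in-domain Dev/Test fractions here are illustrative assumptions, since the page does not state the exact in-domain split sizes:

```python
import random

def split_chren(pairs, from_newspaper, seed=42, dev_frac=0.1, test_frac=0.1):
    """Newspaper pairs -> Out-dev/Out-test (half each, at random);
    remaining pairs -> in-domain Train/Dev/Test."""
    rng = random.Random(seed)
    news = [p for p, n in zip(pairs, from_newspaper) if n]
    rest = [p for p, n in zip(pairs, from_newspaper) if not n]
    rng.shuffle(news)
    half = len(news) // 2
    out_dev, out_test = news[:half], news[half:]
    rng.shuffle(rest)
    n_dev = int(len(rest) * dev_frac)    # assumed fraction
    n_test = int(len(rest) * test_frac)  # assumed fraction
    dev = rest[:n_dev]
    test = rest[n_dev:n_dev + n_test]
    train = rest[n_dev + n_test:]
    return train, dev, test, out_dev, out_test
```

Splitting by source (newspaper vs. the rest) before any random shuffling is what makes Out-dev/Out-test a genuinely out-of-domain evaluation set.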

language pairs: 4
We randomly sample 5K-100K sentences (about 0.5-10 times the size of the parallel training set) from News Crawl 2017 as our English monolingual data. We randomly sample 12K-58K examples (about 1-5 times the size of the parallel training set) for each of the 4 language pairs (Czech/German/Russian/Chinese-English) from News Commentary v13 of WMT2018 and Bible-uedin (Christodouloupoulos and Steedman, 2015) on OPUS. We apply the tokenizer and truecaser from Moses (Koehn et al, 2007)
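The size-ratio sampling above can be sketched as a single helper; the function name and signature are illustrative, not from the paper's code:

```python
import random

def sample_monolingual(corpus, parallel_train_size, ratio, seed=0):
    """Draw about ratio x parallel_train_size sentences without
    replacement, mirroring the 0.5x-10x (5K-100K) samples above."""
    rng = random.Random(seed)
    k = min(len(corpus), int(ratio * parallel_train_size))
    return rng.sample(corpus, k)
```

Expressing the sample size as a multiple of the parallel training set keeps the monolingual-to-parallel ratio comparable across the English and multilingual experiments.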

Reference
  • Elizabeth Albee. 2017. Immersion schools and language learning: A review of Cherokee language revitalization efforts among the Eastern Band of Cherokee Indians.
  • Alexei Baevski and Michael Auli. 2019. Adaptive input representations for neural language modeling. In International Conference on Learning Representations.
  • Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings.
  • Nicola Bertoldi and Marcello Federico. 2009. Domain adaptation for statistical machine translation with monolingual resources. In Proceedings of the Fourth Workshop on Statistical Machine Translation, pages 182–189.
  • Jeffrey Bourns. 2019. Cherokee syllabary texts: Digital documentation and linguistic description. In 2nd Conference on Language, Data and Knowledge (LDK 2019). Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik.
  • Christos Christodouloupoulos and Mark Steedman. 2015. A massively parallel corpus: The Bible in 100 languages. Language Resources and Evaluation, 49(2):375–395.
  • David Crystal. 2014. Language Death. Canto Classics. Cambridge University Press.
  • Ellen Cushman. 2019. Language perseverance and translation of Cherokee documents. College English, 82(1):115–134.
  • Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186.
  • Kevin Duh, Paul McNamee, Matt Post, and Brian Thompson. 2020. Benchmarking neural and statistical machine translation on low-resource African languages.
  • D. Feeling. 1994. A structured approach to learning the basic inflections of the Cherokee verb. Indian University Press, Bacone College.
  • Durbin Feeling. 2018. Cherokee Narratives: A Linguistic Study. University of Oklahoma Press.
  • Benjamin Frey. 2020. “Data is nice:” Theoretical and pedagogical implications of an Eastern Cherokee corpus. LD&C Special Publication.
  • Caglar Gulcehre, Orhan Firat, Kelvin Xu, Kyunghyun Cho, Loic Barrault, Huei-Chi Lin, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2015. On using monolingual corpora in neural machine translation. arXiv preprint arXiv:1503.03535.
  • Francisco Guzmán, Peng-Jen Chen, Myle Ott, Juan Pino, Guillaume Lample, Philipp Koehn, Vishrav Chaudhary, and Marc’Aurelio Ranzato. 2019. The FLORES evaluation datasets for low-resource machine translation: Nepali–English and Sinhala–English. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 6100–6113.
  • Kenneth Heafield, Ivan Pouzyrevsky, Jonathan H. Clark, and Philipp Koehn. 2013. Scalable modified Kneser-Ney language model estimation. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 690–696.
  • Ruth Bradley Holmes and Betty Sharp Smith. 1976. Beginning Cherokee. University of Oklahoma Press, Norman.
  • Sébastien Jean, Orhan Firat, Kyunghyun Cho, Roland Memisevic, and Yoshua Bengio. 2015. Montreal neural machine translation systems for WMT’15. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 134–140.
  • Melvin Johnson, Mike Schuster, Quoc V. Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda Viégas, Martin Wattenberg, Greg Corrado, et al. 2017. Google’s multilingual neural machine translation system: Enabling zero-shot translation. Transactions of the Association for Computational Linguistics, 5:339–351.
  • M. Joyner. 2014. Cherokee Language Lessons. Lulu Press, Incorporated.
  • Guillaume Klein, Yoon Kim, Yuntian Deng, Jean Senellart, and Alexander Rush. 2017. OpenNMT: Open-source toolkit for neural machine translation. In Proceedings of ACL 2017, System Demonstrations, pages 67–72, Vancouver, Canada. Association for Computational Linguistics.
  • Tom Kocmi and Ondrej Bojar. 2018. Trivial transfer learning for low-resource neural machine translation. WMT 2018, page 244.
  • Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, et al. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics Companion Volume Proceedings of the Demo and Poster Sessions, pages 177–180.
  • Philipp Koehn and Rebecca Knowles. 2017. Six challenges for neural machine translation. In Proceedings of the First Workshop on Neural Machine Translation, pages 28–39.
  • Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1, pages 48–54. Association for Computational Linguistics.
  • Surafel M. Lakew, Matteo Negri, and Marco Turchi. 2020. Low resource neural machine translation: A benchmark for five African languages. In AfricaNLP workshop at ICLR 2020.
  • Patrik Lambert, Holger Schwenk, Christophe Servan, and Sadaf Abdul-Rauf. 2011. Investigations on translation model adaptation using monolingual data. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 284–293. Association for Computational Linguistics.
  • Guillaume Lample, Alexis Conneau, Ludovic Denoyer, and Marc’Aurelio Ranzato. 2018. Unsupervised machine translation using monolingual corpora only. In International Conference on Learning Representations.
  • Minh-Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1412–1421.
  • Brad Montgomery-Anderson. 2008. A reference grammar of Oklahoma Cherokee. Ph.D. thesis, University of Kansas.
  • Cherokee Nation. 2001. Ga-du-gi: A vision for working together to preserve the Cherokee language. Report of a needs assessment survey and a 10-year language revitalization plan.
  • Franz Josef Och. 2003. Minimum error rate training in statistical machine translation. In Proceedings of the 41st Annual Meeting on Association for Computational Linguistics - Volume 1, pages 160–167. Association for Computational Linguistics.
  • Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.
  • Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pages 311–318. Association for Computational Linguistics.
  • M. Peake Raymond. 2008. The Cherokee Nation and its language: Tsalagi ayeli ale uniwonishisdi. Tahlequah, OK: Cherokee Nation.
  • Matt Post. 2018. A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 186–191.
  • Ofir Press and Lior Wolf. 2017. Using the output embedding to improve language models. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 157–163.
  • Hammam Riza, Michael Purwoadi, Teduh Uliniansyah, Aw Ai Ti, Sharifah Mahani Aljunied, Luong Chi Mai, Vu Tat Thang, Nguyen Phuong Thai, Vichet Chea, Sethserey Sam, et al. 2016. Introduction of the Asian Language Treebank. In 2016 Conference of The Oriental Chapter of International Committee for Coordination and Standardization of Speech Databases and Assessment Techniques (O-COCOSDA), pages 1–6. IEEE.
  • Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016a. Edinburgh neural machine translation systems for WMT16. In Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers, pages 371–376.
  • Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016b. Improving neural machine translation models with monolingual data. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 86–96.
  • Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016c. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725.
  • Rico Sennrich and Biao Zhang. 2019. Revisiting low-resource neural machine translation: A case study. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 211–221.
  • Stephanie Strassel and Jennifer Tracey. 2016. LORELEI language packs: Data, tools, and resources for technology development in low resource languages. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16), pages 3273–3280.
  • Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. 2016. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2818–2826.
  • Jörg Tiedemann. 2012. Parallel data, tools and interfaces in OPUS. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC’12), pages 2214–2218.
  • Hiroto Uchihara. 2016. Tone and accent in Oklahoma Cherokee, volume 3. Oxford University Press.
  • Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.
  • John Wieting, Taylor Berg-Kirkpatrick, Kevin Gimpel, and Graham Neubig. 2019. Beyond BLEU: Training neural machine translation with semantic similarity. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4344–4355.
  • Shiyue Zhang, Gulnigar Mahmut, Dong Wang, and Askar Hamdulla. 2017. Memory-augmented Chinese-Uyghur neural machine translation. In 2017 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), pages 1092–1096. IEEE.
  • Jinhua Zhu, Yingce Xia, Lijun Wu, Di He, Tao Qin, Wengang Zhou, Houqiang Li, and Tie-Yan Liu. 2020. Incorporating BERT into neural machine translation. In International Conference on Learning Representations.
Author
Shiyue Zhang
Benjamin Frey