A Survey on Recent Approaches for Natural Language Processing in Low-Resource Scenarios

NAACL-HLT 2021, pp. 2545–2568.

Abstract

Current developments in natural language processing offer challenges and opportunities for low-resource languages and domains. Deep neural networks are known for requiring large amounts of training data, which might not be available in resource-lean scenarios. However, there is also a growing body of work to improve the performance in low-resource settings. [...]

Introduction
  • Most of today’s research in natural language processing (NLP) is concerned with the processing of around 10 to 20 high-resource languages with a special focus on English, and ignores thousands of languages with billions of speakers (Bender, 2019).
  • Supporting technological developments for low-resource languages can help to increase participation of the speakers’ communities in a digital world.
  • Low-resource settings concern not only low-resource languages but also other scenarios, such as non-standard domains and tasks, for which only little training data is available
Highlights
  • Most of today’s research in natural language processing (NLP) is concerned with the processing of around 10 to 20 high-resource languages with a special focus on English, and ignores thousands of languages with billions of speakers (Bender, 2019)
  • The rise of data-hungry deep-learning systems increased the performance of NLP for high-resource languages, but the shortage of large-scale data in less-resourced languages makes processing them a challenging problem
  • We propose to categorize low-resource settings along three dimensions of data availability: (i) task-specific labels, (ii) unlabeled language text, and (iii) auxiliary data. The availability of task-specific labels in the target language is the most prominent dimension, as it is necessary for supervised learning
  • We showed that it is essential to analyze resource-lean scenarios across the different dimensions of data availability
  • We hope that our discussions on open issues for the different approaches can serve as inspiration for future work in this important and active research area
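The three-dimension categorization above can be made concrete with a small sketch. This is ours, not the paper's: the class name, fields, and thresholds are illustrative assumptions for how one might describe a scenario along the three axes.

```python
from dataclasses import dataclass

# Illustrative sketch of the survey's three dimensions of data availability.
# Thresholds are invented for the example; the survey gives no fixed cut-offs.
@dataclass
class LowResourceScenario:
    task_labels: int      # (i) task-specific labeled instances in the target language
    unlabeled_text: int   # (ii) unlabeled target-language tokens
    auxiliary_data: bool  # (iii) e.g. a related high-resource language or auxiliary tasks

    def describe(self) -> str:
        parts = []
        parts.append("supervised learning feasible" if self.task_labels >= 1000
                     else "too few labels for standard supervised learning")
        parts.append("pre-training possible" if self.unlabeled_text >= 100_000
                     else "little unlabeled text")
        parts.append("transfer possible" if self.auxiliary_data
                     else "no auxiliary data")
        return "; ".join(parts)

# A hypothetical scenario: few labels, plenty of raw text, related resources exist.
hausa = LowResourceScenario(task_labels=500, unlabeled_text=2_000_000, auxiliary_data=True)
print(hausa.describe())
```

For this hypothetical scenario, the sketch reports that labels are too scarce for standard supervised learning while pre-training on unlabeled text and transfer from auxiliary data remain options.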
Methods
  • Table 1 of the paper (flattened in the original; rendered here as Method: Requirements → Outcome) surveys the following approaches:

    Data augmentation (§ 4.1): labeled data, heuristics* → additional labeled data
    Distant supervision (§ 4.2): unlabeled data, heuristics* → additional labeled data
    Cross-lingual projections (§ 4.3): unlabeled data, high-resource labeled data, cross-lingual alignment → additional labeled data
    Embeddings & pre-trained LMs (§ 5.1): unlabeled data → better language representation
    LM domain adaptation (§ 5.4): existing LM, unlabeled domain data → domain-specific language representation
    Multilingual LMs (§ 5.3): multilingual unlabeled data → multilingual feature representation
    Adversarial discriminator (§ 6): additional datasets → independent representations
    Meta-learning (§ 6): multiple auxiliary tasks → better target-task performance

    (* Heuristics are typically gathered manually.)

    For many low-resource languages and domains, only small numbers of labeled tokens are available in the Universal Dependencies project (Nivre et al., 2020), while Garrette and Baldridge (2013) limit the time of the annotators to 2 hours, resulting in up to 1–2k tokens. Loubser and Puttkammer (2020) report that most available datasets for South African languages have 40–60k labeled tokens.
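The requirements column of Table 1 can be read as a lookup from available resources to applicable methods. A minimal sketch: the mapping is transcribed from the table, while the function and resource-string names are our own shorthand.

```python
# Requirements per method, transcribed from Table 1 of the survey.
REQUIREMENTS = {
    "data augmentation":            {"labeled data", "heuristics"},
    "distant supervision":          {"unlabeled data", "heuristics"},
    "cross-lingual projections":    {"unlabeled data", "high-resource labeled data",
                                     "cross-lingual alignment"},
    "embeddings & pre-trained LMs": {"unlabeled data"},
    "LM domain adaptation":         {"existing LM", "unlabeled domain data"},
    "multilingual LMs":             {"multilingual unlabeled data"},
    "adversarial discriminator":    {"additional datasets"},
    "meta-learning":                {"multiple auxiliary tasks"},
}

def applicable_methods(available: set) -> list:
    """Return the methods whose requirements are covered by the available resources."""
    return [m for m, req in REQUIREMENTS.items() if req <= available]

# A scenario with unlabeled text and manually gathered heuristics, but no labels:
print(applicable_methods({"unlabeled data", "heuristics"}))
```

Here both distant supervision and embedding/LM pre-training apply, while methods that require labeled data or auxiliary resources do not.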

    The amount of necessary resources is task-dependent.
  • As shown for POS tagging (Plank et al., 2016) and text classification (Melamud et al., 2019), non-neural methods outperform more modern approaches in very low-resource settings, as the latter need several hundred labeled instances
  • This makes evaluations that vary the availability of a resource, such as the amount of labeled data, particularly interesting
  • The authors do not focus on a specific low-resource scenario but rather specify which kinds of resources they assume to be available
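To illustrate the data-augmentation entry (§ 4.1) with a concrete heuristic: token-level random swapping, in the spirit of simple "easy data augmentation" operations for text classification. This is a generic sketch, not an implementation from the survey.

```python
import random

def random_swap(tokens, n_swaps=1, seed=0):
    """Create a new training sentence by swapping random token pairs.

    A simple, approximately label-preserving heuristic for text
    classification: small perturbations of a labeled sentence yield
    extra labeled training data at no annotation cost.
    """
    rng = random.Random(seed)  # seeded for reproducibility
    tokens = list(tokens)
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(tokens)), 2)
        tokens[i], tokens[j] = tokens[j], tokens[i]
    return tokens

sentence = ["the", "model", "needs", "more", "labeled", "data"]
augmented = random_swap(sentence, n_swaps=2, seed=42)
print(augmented)  # same tokens, perturbed order
```

In practice each augmented copy keeps the label of the original sentence; more elaborate heuristics (synonym replacement, back-translation) follow the same pattern of generating additional labeled data.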
Conclusion
  • The authors gave a structured overview of recent work in the field of low-resource natural language processing.
  • The authors showed that it is essential to analyze resource-lean scenarios across the different dimensions of data-availability.
  • This can reveal which techniques are expected to be applicable and helpful in a specific low-resource setting.
  • The authors hope that the discussions on open issues for the different approaches can serve as inspiration for future work in this important and active research area.
Tables
  • Table 1: Overview of low-resource methods surveyed in this paper. * Heuristics are typically gathered manually.
  • Table 2: Overview of tasks covered by six different languages.
  • Table 3: Overview of existing surveys on low-resource topics.
Related work
  • In the past, surveys on specific methods or certain low-resource language families have been published; these are listed in Table 3 in the Appendix. As recent surveys on low-resource machine translation (Liu et al., 2019) and unsupervised domain adaptation (Ramponi and Plank, 2020) are already available, we do not investigate these topics further in this paper. Instead, our focus lies on general methods for low-resource, supervised natural language processing, including data augmentation, distant supervision and transfer learning. This is also in contrast to the task-specific survey by Magueresse et al. (2020), who review highly influential work for several extraction tasks but provide only a brief overview of recent approaches.

    The umbrella term low-resource covers a spectrum of scenarios with varying resource conditions. It includes work on threatened languages, such as Yongning Na, a Sino-Tibetan language with 40k speakers and only 3k written, unlabeled sentences (Adams et al., 2017). But it also covers work on specialized domains or tasks in English, which is often treated as the most high-resource language.
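Distant supervision, one of the general methods this survey focuses on, can be sketched as dictionary-based annotation: entries from an external resource (e.g. a gazetteer) are matched against unlabeled text to produce noisy labels. The gazetteer, tags, and example sentence below are toy illustrations of ours, not data from the survey.

```python
# Toy gazetteer; in practice this would come from a knowledge base or word list.
GAZETTEER = {"kano": "LOC", "abuja": "LOC", "buhari": "PER"}

def distant_labels(tokens):
    """Assign noisy NER tags by dictionary lookup; unmatched tokens get 'O'."""
    return [GAZETTEER.get(t.lower(), "O") for t in tokens]

tokens = ["Buhari", "visited", "Kano", "yesterday"]
print(distant_labels(tokens))  # ['PER', 'O', 'LOC', 'O']
```

Labels produced this way are noisy (ambiguous or missing gazetteer entries), which is why work in this area is often paired with methods for learning from noisy labels.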
Reference
  • Oliver Adams, Adam Makarucha, Graham Neubig, Steven Bird, and Trevor Cohn. 2017. Cross-lingual word embeddings for low-resource language modeling. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 937–947, Valencia, Spain. Association for Computational Linguistics.
  • Heike Adel and Hinrich Schütze. 2015. CIS at TAC cold start 2015: Neural networks and coreference resolution for slot filling. In Proceedings of TAC KBP Workshop.
  • David Ifeoluwa Adelani, Michael A. Hedderich, Dawei Zhu, Esther van den Berg, and Dietrich Klakow. 2020. Distant supervision and noisy label learning for low resource named entity recognition: A study on Hausa and Yorùbá. arXiv preprint arXiv:2003.08370.
  • Ashutosh Adhikari, Achyudh Ram, Raphael Tang, and Jimmy Lin. 2019. DocBERT: BERT for document classification. arXiv preprint arXiv:1904.08398.
  • Charu C. Aggarwal, Xiangnan Kong, Quanquan Gu, Jiawei Han, and S. Yu Philip. 2014. Active learning: A survey. In Data Classification: Algorithms and Applications, pages 571–605. CRC Press.
  • Željko Agić and Ivan Vulić. 2019. JW300: A wide-coverage parallel corpus for low-resource languages. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3204–3210, Florence, Italy. Association for Computational Linguistics.
  • Roee Aharoni and Yoav Goldberg. 2020. Unsupervised domain clusters in pretrained language models. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7747–7763, Online. Association for Computational Linguistics.
  • Alan Akbik, Duncan Blythe, and Roland Vollgraf. 2018. Contextual string embeddings for sequence labeling. In Proceedings of the 27th International Conference on Computational Linguistics, pages 1638–1649, Santa Fe, New Mexico, USA. Association for Computational Linguistics.
  • Mahmoud Al-Ayyoub, Aya Nuseir, Kholoud Alsmearat, Yaser Jararweh, and Brij Gupta. 2018. Deep learning for Arabic NLP: A survey. Journal of Computational Science, 26:522–531.
  • Jesujoba Alabi, Kwabena Amponsah-Kaakyire, David Adelani, and Cristina España-Bonet. 2020. Massive vs. curated embeddings for low-resourced languages: the case of Yorùbá and Twi. In Proceedings of The 12th Language Resources and Evaluation Conference, pages 2754–2762, Marseille, France. European Language Resources Association.
  • Görkem Algan and Ilkay Ulusoy. 2019. Image classification with deep learning in the presence of noisy labels: A survey. arXiv preprint arXiv:1912.05170.
  • Emily Alsentzer, John Murphy, William Boag, Wei-Hung Weng, Di Jindi, Tristan Naumann, and Matthew McDermott. 2019. Publicly available clinical BERT embeddings. In Proceedings of the 2nd Clinical Natural Language Processing Workshop, pages 72–78, Minneapolis, Minnesota, USA. Association for Computational Linguistics.
  • Christoph Alt, Marc Hübner, and Leonhard Hennig. 2019. Fine-tuning pre-trained transformer language models to distantly supervised relation extraction. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1388–1398, Florence, Italy. Association for Computational Linguistics.
  • Ateret Anaby-Tavor, Boaz Carmeli, Esther Goldbraich, Amir Kantor, George Kour, Segev Shlomov, Naama Tepper, and Naama Zwerdling. 2020. Do not have enough data? Deep learning to the rescue! In The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020, New York, NY, USA, February 7-12, 2020, pages 7383–7390. AAAI Press.
  • Stephen H. Bach, Daniel Rodriguez, Yintao Liu, Chong Luo, Haidong Shao, Cassandra Xia, Souvik Sen, Alexander Ratner, Braden Hancock, Houman Alborzi, Rahul Kuchhal, Christopher Ré, and Rob Malkin. 2019. Snorkel DryBell: A case study in deploying weak supervision at industrial scale. In Proceedings of the 2019 International Conference on Management of Data, SIGMOD Conference 2019, Amsterdam, The Netherlands, June 30 - July 5, 2019, pages 362–375. ACM.
  • N. Banik, M. H. Hafizur Rahman, S. Chakraborty, H. Seddiqui, and M. A. Azim. 2019. Survey on text-based sentiment analysis of Bengali language. In 2019 1st International Conference on Advances in Science, Engineering and Robotics Technology (ICASERT), pages 1–6.
  • Muazzam Bashir, Azilawati Rozaimee, and Wan Malini Wan Isa. 2017. Automatic Hausa language text summarization based on feature extraction using Naive Bayes model. World Applied Sciences Journal, 35(9):2074–2080.
  • Iz Beltagy, Kyle Lo, and Arman Cohan. 2019. SciBERT: A pretrained language model for scientific text. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3615–3620, Hong Kong, China. Association for Computational Linguistics.
  • Emily Bender. 2019. The #BenderRule: On naming the languages we study and why it matters. The Gradient.
  • Elan van Biljon, Arnu Pretorius, and Julia Kreutzer. 2020. On optimal transformer depth for low-resource language translation. arXiv preprint arXiv:2004.04418.
  • Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135–146.
  • Ondřej Bojar and Aleš Tamchyna. 2011. Improving translation model by monolingual data. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 330–336, Edinburgh, Scotland. Association for Computational Linguistics.
  • Steven Cao, Nikita Kitaev, and Dan Klein. 2020. Multilingual alignment of contextual word representations. In International Conference on Learning Representations.
  • Yixin Cao, Zikun Hu, Tat-Seng Chua, Zhiyuan Liu, and Heng Ji. 2019. Low-resource name tagging learned with weakly labeled data. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 261–270, Hong Kong, China. Association for Computational Linguistics.
  • Daoyuan Chen, Yaliang Li, Kai Lei, and Ying Shen. 2020. Relabel the noise: Joint extraction of entities and relations via cooperative multiagents. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5940–5950, Online. Association for Computational Linguistics.
  • Junfan Chen, Richong Zhang, Yongyi Mao, Hongyu Guo, and Jie Xu. 2019. Uncover the ground-truth relations in distant supervision: A neural expectation-maximization framework. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 326–336, Hong Kong, China. Association for Computational Linguistics.
  • Xilun Chen, Yu Sun, Ben Athiwaratkun, Claire Cardie, and Kilian Weinberger. 2018. Adversarial deep averaging networks for cross-lingual sentiment classification. Transactions of the Association for Computational Linguistics, 6:557–570.
  • Yong Cheng, Lu Jiang, Wolfgang Macherey, and Jacob Eisenstein. 2020. AdvAug: Robust adversarial augmentation for neural machine translation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5961–5970, Online. Association for Computational Linguistics.
  • Christos Christodoulopoulos and Mark Steedman. 2015. A massively parallel corpus: the Bible in 100 languages. Language Resources and Evaluation, 49(2):375–395.
  • Alexandra Chronopoulou, Christos Baziotis, and Alexandros Potamianos. 2019. An embarrassingly simple approach for transfer learning from pretrained language models. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2089–2095, Minneapolis, Minnesota. Association for Computational Linguistics.
  • Christopher Cieri, Mike Maxwell, Stephanie Strassel, and Jennifer Tracey. 2016. Selection criteria for low resource language programs. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pages 4543–4549, Portorož, Slovenia. European Language Resources Association (ELRA).
  • Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12:2493–2537.
  • Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2020. Unsupervised cross-lingual representation learning at scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8440–8451, Online. Association for Computational Linguistics.
  • Ryan Cotterell, Christo Kirov, John Sylak-Glassman, Géraldine Walther, Ekaterina Vylomova, Arya D. McCarthy, Katharina Kann, Sabrina J. Mielke, Garrett Nicolai, Miikka Silfverberg, David Yarowsky, Jason Eisner, and Mans Hulden. 2018. The CoNLL–SIGMORPHON 2018 shared task: Universal morphological reinflection. In Proceedings of the CoNLL–SIGMORPHON 2018 Shared Task: Universal Morphological Reinflection, pages 1–27, Brussels. Association for Computational Linguistics.
  • Jan Christian Blaise Cruz and Charibeth Cheng. 2019. Evaluating language model finetuning techniques for low-resource languages. arXiv preprint arXiv:1907.00409.
  • Xiang Dai and Heike Adel. 2020. An analysis of simple data augmentation for named entity recognition.
  • Ali Daud, Wahab Khan, and Dunren Che. 2017. Urdu language processing: a survey. Artificial Intelligence Review, 47(3):279–311.
  • Guy De Pauw, Gilles-Maurice De Schryver, Laurette Pretorius, and Lori Levin. 2011. Introduction to the special issue on African language technology. Language Resources and Evaluation, 45(3):263–269.
  • Xiang Deng and Huan Sun. 2019. Leveraging 2-hop distant supervision from table entity pairs for relation extraction. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 410–420, Hong Kong, China. Association for Computational Linguistics.
  • Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.
  • Anne Dirkson and Suzan Verberne. 2019. Transfer learning for health-related Twitter data. In Proceedings of the Fourth Social Media Mining for Health Applications (#SMM4H) Workshop & Shared Task, pages 89–92, Florence, Italy. Association for Computational Linguistics.
  • Zi-Yi Dou, Keyi Yu, and Antonios Anastasopoulos. 2019. Investigating meta-learning algorithms for low-resource natural language understanding tasks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 1192–1197, Hong Kong, China. Association for Computational Linguistics.
  • David M. Eberhard, Gary F. Simons, and Charles D. Fennig (eds.). 2019. Ethnologue: Languages of the World. Twenty-second edition.
  • Felix Abidemi Fabuni and Akeem Segun Salawu. 2005. Is Yorùbá an endangered language? Nordic Journal of African Studies, 14(3):18–18.
  • Marzieh Fadaee, Arianna Bisazza, and Christof Monz. 2017. Data augmentation for low-resource neural machine translation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 567–573, Vancouver, Canada. Association for Computational Linguistics.
  • Meng Fang and Trevor Cohn. 2016. Learning when to trust distant supervision: An application to low-resource POS tagging using cross-lingual projection. In Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning, pages 178–186, Berlin, Germany. Association for Computational Linguistics.
  • Meng Fang and Trevor Cohn. 2017. Model transfer for tagging low-resource languages using a bilingual dictionary. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 587–593, Vancouver, Canada. Association for Computational Linguistics.
  • Hao Fei, Meishan Zhang, and Donghong Ji. 2020. Cross-lingual semantic role labeling with high-quality translated training corpus. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7014–7026, Online. Association for Computational Linguistics.
  • Chelsea Finn, Pieter Abbeel, and Sergey Levine. 2017. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, ICML'17, pages 1126–1135. JMLR.org.
  • Benoît Frénay and Michel Verleysen. 2013. Classification in the presence of label noise: a survey. IEEE Transactions on Neural Networks and Learning Systems, 25(5):845–869.
  • Annemarie Friedrich, Heike Adel, Federico Tomazic, Johannes Hingerl, Renou Benteau, Anika Marusczyk, and Lukas Lange. 2020. The SOFC-exp corpus and neural approaches to information extraction in the materials science domain. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1255–1268, Online. Association for Computational Linguistics.
  • Dan Garrette and Jason Baldridge. 2013. Learning a part-of-speech tagger from two hours of annotation. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 138–147, Atlanta, Georgia. Association for Computational Linguistics.
  • Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative adversarial nets. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 2672–2680. Curran Associates, Inc.
  • Daniel Grießhaber, Ngoc Thang Vu, and Johannes Maucher. 2020. Low-resource text classification using domain-adversarial learning. Computer Speech & Language, 62:101056.
  • Aditi Sharma Grover, Karen Calteaux, Gerhard van Huyssteen, and Marthinus Pretorius. 2010. An overview of HLTs for South African Bantu languages. In Proceedings of the 2010 Annual Research Conference of the South African Institute of Computer Scientists and Information Technologists.
  • Roman Grundkiewicz, Marcin Junczys-Dowmunt, and Kenneth Heafield. 2019. Neural grammatical error correction systems with unsupervised pre-training on synthetic data. In Proceedings of the Fourteenth Workshop on Innovative Use of NLP for Building Educational Applications, pages 252–263, Florence, Italy. Association for Computational Linguistics.
  • I. Guellil, F. Azouaou, and A. Valitutti. 2019. English vs Arabic sentiment analysis: A survey presenting 100 work studies, resources and tools. In 2019 IEEE/ACS 16th International Conference on Computer Systems and Applications (AICCSA), pages 1–8.
  • Tao Gui, Qi Zhang, Haoran Huang, Minlong Peng, and Xuanjing Huang. 2017. Part-of-speech tagging for Twitter with adversarial neural networks. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2411–2420, Copenhagen, Denmark. Association for Computational Linguistics.
  • Kristina Gulordava, Piotr Bojanowski, Edouard Grave, Tal Linzen, and Marco Baroni. 2018. Colorless green recurrent networks dream hierarchically. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1195–1205, New Orleans, Louisiana. Association for Computational Linguistics.
  • Suchin Gururangan, Ana Marasović, Swabha Swayamdipta, Kyle Lo, Iz Beltagy, Doug Downey, and Noah A. Smith. 2020. Don't stop pretraining: Adapt language models to domains and tasks. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8342–8360, Online. Association for Computational Linguistics.
  • DN Hakro, AZ TALIB, and GN Mojai. 2016. Multilingual text image database for ocr. Sindh University Research Journal-SURJ (Science Series), 47(1).
    Google ScholarLocate open access versionFindings
  • BS Harish and R Kasturi Rangan. 2020. A comprehensive survey on indian regional language processing. SN Applied Sciences, 2(7):1–16.
    Google ScholarLocate open access versionFindings
  • Michael A Hedderich, David Adelani, Dawei Zhu, Jesujoba Alabi, Udia Markus, and Dietrich Klakow. 2020. Transfer learning and distant supervision for multilingual transformer models: A study on african languages. arXiv preprint arXiv:2010.03179.
    Findings
  • Michael A. Hedderich and Dietrich Klakow. 2018. Training a neural network in a low-resource setting on automatically annotated noisy data. In Proceedings of the Workshop on Deep Learning Approaches for Low-Resource NLP, pages 12–18, Melbourne. Association for Computational Linguistics.
    Google ScholarLocate open access versionFindings
  • Benjamin Heinzerling and Michael Strube. 2018. BPEmb: Tokenization-free Pre-trained Subword Embeddings in 275 Languages. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan. European Language Resources Association (ELRA).
    Google ScholarLocate open access versionFindings
  • Vu Cong Duy Hoang, Philipp Koehn, Gholamreza Haffari, and Trevor Cohn. 2018. Iterative backtranslation for neural machine translation. In Proceedings of the 2nd Workshop on Neural Machine Translation and Generation, pages 18–24, Melbourne, Australia. Association for Computational Linguistics.
    Google ScholarLocate open access versionFindings
  • Timothy Hospedales, Antreas Antoniou, Paul Micaelli, and Amos Storkey. 2020. Meta-learning in neural networks: A survey. arXiv preprint arXiv:2004.05439.
    Findings
  • Junjie Hu, Sebastian Ruder, Aditya Siddhant, Graham Neubig, Orhan Firat, and Melvin Johnson. 2020. Xtreme: A massively multilingual multi-task benchmark for evaluating cross-lingual generalization. abs/2003.11080.
    Google ScholarFindings
  • Linmei Hu, Luhao Zhang, Chuan Shi, Liqiang Nie, Weili Guan, and Cheng Yang. 2019. Improving distantly-supervised relation extraction with joint label embedding. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLPIJCNLP), pages 3821–3829, Hong Kong, China. Association for Computational Linguistics.
    Google ScholarLocate open access versionFindings
  • Kexin Huang, Jaan Altosaar, and Rajesh Ranganath. 2019. Clinicalbert: Modeling clinical notes and predicting hospital readmission. arXiv preprint arXiv:1904.05342.
    Findings
  • Yuyun Huang and Jinhua Du. 2019. Self-attention enhanced CNNs and collaborative curriculum learning for distantly supervised relation extraction. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 389– 398, Hong Kong, China. Association for Computational Linguistics.
    Google ScholarLocate open access versionFindings
  • Patrick Huber and Giuseppe Carenini. 2019. Predicting discourse structure using distant supervision from sentiment. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLPIJCNLP), pages 2306–2316, Hong Kong, China. Association for Computational Linguistics.
    Google ScholarLocate open access versionFindings
  • Ayush Jain and Meenachi Ganesamoorty. 2020. Nukebert: A pre-trained language model for low resource nuclear domain. arXiv preprint arXiv:2003.13821.
    Findings
  • Wei Jia, Dai Dai, Xinyan Xiao, and Hua Wu. 2019. ARNOR: Attention regularization based noise reduction for distant supervision relation classification. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1399–1408, Florence, Italy. Association for Computational Linguistics.
  • Zhanming Jie, Pengjun Xie, Wei Lu, Ruixue Ding, and Linlin Li. 2019. Better modeling of incomplete annotations for named entity recognition. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 729–734, Minneapolis, Minnesota. Association for Computational Linguistics.
  • Armand Joulin, Piotr Bojanowski, Tomas Mikolov, Hervé Jégou, and Édouard Grave. 2018. Loss in translation: Learning bilingual word mapping with a retrieval criterion. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing.
  • Jakob Jungmaier, Nora Kassner, and Benjamin Roth. 2020. Dirichlet-smoothed word embeddings for low-resource settings. In Proceedings of The 12th Language Resources and Evaluation Conference, pages 3560–3565, Marseille, France. European Language Resources Association.
  • Karthikeyan K, Zihan Wang, Stephen Mayhew, and Dan Roth. 2020. Cross-lingual ability of multilingual BERT: An empirical study. In International Conference on Learning Representations.
  • Kushal Kafle, Mohammed Yousefhussien, and Christopher Kanan. 2017. Data augmentation for visual question answering. In Proceedings of the 10th International Conference on Natural Language Generation, pages 198–202, Santiago de Compostela, Spain. Association for Computational Linguistics.
  • Katharina Kann, Ophélie Lacroix, and Anders Søgaard. 2020. Weakly supervised POS taggers perform poorly on truly low-resource languages. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020, New York, NY, USA, February 7-12, 2020, pages 8066–8073. AAAI Press.
  • Jungo Kasai, Kun Qian, Sairam Gurajada, Yunyao Li, and Lucian Popa. 2019. Low-resource deep entity resolution with transfer and active learning. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5851–5861, Florence, Italy. Association for Computational Linguistics.
  • Talaat Khalil, Kornel Kiełczewski, Georgios Christos Chouliaras, Amina Keldibek, and Maarten Versteegh. 2019. Cross-lingual intent classification in a low resource industrial setting. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 6419–6424, Hong Kong, China. Association for Computational Linguistics.
  • Joo-Kyung Kim, Young-Bum Kim, Ruhi Sarikaya, and Eric Fosler-Lussier. 2017. Cross-lingual transfer learning for POS tagging without cross-lingual resources. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2832–2838, Copenhagen, Denmark. Association for Computational Linguistics.
  • Jan-Christoph Klie, Michael Bugert, Beto Boullosa, Richard Eckart de Castilho, and Iryna Gurevych. 2018. The INCEpTION platform: Machine-assisted and knowledge-oriented interactive annotation. In COLING 2018, The 27th International Conference on Computational Linguistics: System Demonstrations, Santa Fe, New Mexico, August 20-26, 2018, pages 5–9. Association for Computational Linguistics.
  • Sosuke Kobayashi. 2018. Contextual augmentation: Data augmentation by words with paradigmatic relations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 452–457, New Orleans, Louisiana. Association for Computational Linguistics.
  • Sandra Kübler and Desislava Zhekova. 2016. Multilingual coreference resolution. Language and Linguistics Compass, 10(11):614–631.
  • Varun Kumar, Ashutosh Choudhary, and Eunah Cho. 2020. Data augmentation using pre-trained transformer models.
  • Lukas Lange, Heike Adel, and Jannik Strötgen. 2019a. NLNDE: Enhancing neural sequence taggers with attention and noisy channel for robust pharmacological entity detection. In Proceedings of The 5th Workshop on BioNLP Open Shared Tasks, pages 26–32, Hong Kong, China. Association for Computational Linguistics.
  • Lukas Lange, Michael A. Hedderich, and Dietrich Klakow. 2019b. Feature-dependent confusion matrices for low-resource NER labeling with noisy labels. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3554–3559, Hong Kong, China. Association for Computational Linguistics.
  • Lukas Lange, Anastasiia Iurshina, Heike Adel, and Jannik Strötgen. 2020. Adversarial alignment of multilingual models for extracting temporal expressions from text. In Proceedings of the 5th Workshop on Representation Learning for NLP, pages 103–109, Online. Association for Computational Linguistics.
  • Anne Lauscher, Vinit Ravishankar, Ivan Vulić, and Goran Glavaš. 2020. From zero to hero: On the limitations of zero-shot cross-lingual transfer with multilingual transformers. arXiv preprint arXiv:2005.00633.
  • Phong Le and Ivan Titov. 2019. Distant learning for entity linking with automatic noise detection. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4081–4090, Florence, Italy. Association for Computational Linguistics.
  • Jieh-Sheng Lee and Jieh Hsiang. 2020. Patent classification by fine-tuning BERT language model. World Patent Information, 61:101965.
  • Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, Donghyeon Kim, Sunkyu Kim, Chan Ho So, and Jaewoo Kang. 2020. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics (Oxford, England), 36(4):1234–1240.
  • Kuang-Huei Lee, Xiaodong He, Lei Zhang, and Linjun Yang. 2018. CleanNet: Transfer learning for scalable image classifier training with label noise. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, pages 5447–5456. IEEE Computer Society.
  • Junnan Li, Richard Socher, and Steven C. H. Hoi. 2020. DivideMix: Learning with noisy labels as semi-supervised learning. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net.
  • Shen Li, João Graça, and Ben Taskar. 2012. Wiki-ly supervised part-of-speech tagging. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 1389–1398, Jeju Island, Korea. Association for Computational Linguistics.
  • Wen Li, Limin Wang, Wei Li, Eirikur Agustsson, and Luc Van Gool. 2017. WebVision database: Visual learning and understanding from web data. CoRR, abs/1708.02862.
  • Jindřich Libovický, Rudolf Rosa, and Alexander Fraser. 2020. On the language neutrality of pretrained multilingual representations. arXiv preprint arXiv:2004.05160.
  • Chen Lin, Steven Bethard, Dmitriy Dligach, Farig Sadeque, Guergana Savova, and Timothy A Miller. 2020. Does BERT need domain adaptation for clinical negation detection? Journal of the American Medical Informatics Association, 27(4):584–591.
  • Pierre Lison, Jeremy Barnes, Aliaksandr Hubin, and Samia Touileb. 2020. Named entity recognition without labelled data: A weak supervision approach. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1518–1533, Online. Association for Computational Linguistics.
  • D. Liu, N. Ma, F. Yang, and X. Yang. 2019. A survey of low resource neural machine translation. In 2019 4th International Conference on Mechanical, Control and Computer Engineering (ICMCCE), pages 39–393.
  • Pengfei Liu, Xipeng Qiu, and Xuanjing Huang. 2017. Adversarial multi-task learning for text classification. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017, Vancouver, Canada, July 30 - August 4, Volume 1: Long Papers, pages 1–10. Association for Computational Linguistics.
  • Qianchu Liu, Diana McCarthy, Ivan Vulić, and Anna Korhonen. 2019a. Investigating cross-lingual alignment methods for contextualized embeddings with token-level evaluation. In Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL).
  • Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019b. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.
  • Zihan Liu, Genta Indra Winata, and Pascale Fung. 2020. Zero-resource cross-domain named entity recognition. In Proceedings of the 5th Workshop on Representation Learning for NLP, pages 1–6, Online. Association for Computational Linguistics.
  • Loubser and Martin Puttkammer. 2020. Viability of neural networks for core technologies for resource-scarce languages. Information, 11:41.
  • Jose Lozano, Waldir Farfan, and Juan Cruz. 2013. Syntactic analyzer for Quechua language.
  • Bingfeng Luo, Yansong Feng, Zheng Wang, Zhanxing Zhu, Songfang Huang, Rui Yan, and Dongyan Zhao. 2017. Learning with noise: Enhance distantly supervised relation extraction with dynamic transition matrix. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 430–439, Vancouver, Canada. Association for Computational Linguistics.
  • Shuming Ma, Pengcheng Yang, Tianyu Liu, Peng Li, Jie Zhou, and Xu Sun. 2019. Key fact as pivot: A two-stage model for low resource table-to-text generation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2047–2057, Florence, Italy. Association for Computational Linguistics.
  • Manuel Mager, Ximena Gutierrez-Vasques, Gerardo Sierra, and Ivan Meza-Ruiz. 2018. Challenges of language technologies for the indigenous languages of the Americas. In Proceedings of the 27th International Conference on Computational Linguistics, pages 55–69, Santa Fe, New Mexico, USA. Association for Computational Linguistics.
  • Alexandre Magueresse, Vincent Carles, and Evan Heetderks. 2020. Low-resource languages: A review of past work and future challenges. arXiv preprint arXiv:2006.07264.
  • Dhruv Mahajan, Ross B. Girshick, Vignesh Ramanathan, Kaiming He, Manohar Paluri, Yixuan Li, Ashwin Bharambe, and Laurens van der Maaten. 2018. Exploring the limits of weakly supervised pretraining. CoRR, abs/1805.00932.
  • Carmen Martínez-Gil, Alejandro Zempoalteca-Pérez, Venustiano Soancatl-Aguilar, María de Jesús Estudillo-Ayala, José Edgar Lara-Ramírez, and Sayde Alcántara-Santiago. 2012. Computer systems for analysis of Nahuatl. Res. Comput. Sci., 47:11–16.
  • Thomas Mayer and Michael Cysouw. 2014. Creating a massively parallel Bible corpus. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), pages 3158–3163, Reykjavik, Iceland. European Language Resources Association (ELRA).
  • Stephen Mayhew, Snigdha Chaturvedi, Chen-Tse Tsai, and Dan Roth. 2019. Named entity recognition with partially annotated training data. In Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL), pages 645–655, Hong Kong, China. Association for Computational Linguistics.
  • Stephen Mayhew and Dan Roth. 2018. TALEN: Tool for annotation of low-resource ENtities. In Proceedings of ACL 2018, System Demonstrations, pages 80–86, Melbourne, Australia. Association for Computational Linguistics.
  • Stephen Mayhew, Chen-Tse Tsai, and Dan Roth. 2017. Cheap translation for cross-lingual named entity recognition. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2536–2545, Copenhagen, Denmark. Association for Computational Linguistics.
  • Oren Melamud, Mihaela Bornea, and Ken Barker. 2019. Combining unsupervised pre-training and annotator rationales to improve low-shot text classification. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3884–3893, Hong Kong, China. Association for Computational Linguistics.
  • Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013a. Efficient estimation of word representations in vector space. In 1st International Conference on Learning Representations, ICLR 2013, Scottsdale, Arizona, USA, May 2-4, 2013, Workshop Track Proceedings.
  • Tomas Mikolov, Quoc V. Le, and Ilya Sutskever. 2013b. Exploiting similarities among languages for machine translation. arXiv preprint arXiv:1309.4168.
  • Junghyun Min, R. Thomas McCoy, Dipanjan Das, Emily Pitler, and Tal Linzen. 2020. Syntactic data augmentation increases robustness to inference heuristics. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 2339–2352, Online. Association for Computational Linguistics.
  • Mike Mintz, Steven Bills, Rion Snow, and Daniel Jurafsky. 2009. Distant supervision for relation extraction without labeled data. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, pages 1003–1011, Suntec, Singapore. Association for Computational Linguistics.
  • Kaili Müürisep and Pilleriin Mutso. 2005. EstSum - Estonian newspaper texts summarizer. In Proceedings of The Second Baltic Conference on Human Language Technologies, pages 311–316.
  • Joakim Nivre, Marie-Catherine de Marneffe, Filip Ginter, Jan Hajič, Christopher D. Manning, Sampo Pyysalo, Sebastian Schuster, Francis M. Tyers, and Daniel Zeman. 2020. Universal dependencies v2: An evergrowing multilingual treebank collection. In Proceedings of The 12th Language Resources and Evaluation Conference, LREC 2020, Marseille, France, May 11-16, 2020, pages 4034–4043. European Language Resources Association.
  • Farhad Nooralahzadeh, Jan Tore Lønning, and Lilja Øvrelid. 2019. Reinforcement-based denoising of distantly supervised NER with partial annotation. In Proceedings of the 2nd Workshop on Deep Learning Approaches for Low-Resource NLP (DeepLo 2019), pages 225–233, Hong Kong, China. Association for Computational Linguistics.
  • Christopher Norman, Mariska Leeflang, René Spijker, Evangelos Kanoulas, and Aurélie Névéol. 2019. A distantly supervised dataset for automated data extraction from diagnostic studies. In Proceedings of the 18th BioNLP Workshop and Shared Task, pages 105–114, Florence, Italy. Association for Computational Linguistics.
  • Fredrik Olsson. 2009. A literature survey of active machine learning in the context of natural language processing.
  • Yasumasa Onoe and Greg Durrett. 2019. Learning to denoise distantly-labeled data for entity typing. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2407–2417, Minneapolis, Minnesota. Association for Computational Linguistics.
  • Hille Pajupuu, Rene Altrov, and Jaan Pajupuu. 2016. Identifying polarity in different text types. Folklore: Electronic Journal of Folklore, 64:126–138.
  • Xiaoman Pan, Boliang Zhang, Jonathan May, Joel Nothman, Kevin Knight, and Heng Ji. 2017. Cross-lingual name tagging and linking for 282 languages. In Proceedings of ACL 2017, pages 1946–1958.
  • Shantipriya Parida and Petr Motlicek. 2019. Abstract text summarization: A low resource challenge. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5994–5998, Hong Kong, China. Association for Computational Linguistics.
  • Debjit Paul, Mittul Singh, Michael A. Hedderich, and Dietrich Klakow. 2019. Handling noisy labels for robustly learning from self-training data for low-resource sequence labeling. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Student Research Workshop, pages 29–34, Minneapolis, Minnesota. Association for Computational Linguistics.
  • Minlong Peng, Xiaoyu Xing, Qi Zhang, Jinlan Fu, and Xuanjing Huang. 2019. Distantly supervised named entity recognition using positive-unlabeled learning. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2409–2419, Florence, Italy. Association for Computational Linguistics.
  • Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237, New Orleans, Louisiana. Association for Computational Linguistics.
  • Telmo Pires, Eva Schlinger, and Dan Garrette. 2019. How multilingual is multilingual BERT? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4996–5001, Florence, Italy. Association for Computational Linguistics.
  • Barbara Plank and Željko Agić. 2018. Distant supervision from disparate sources for low-resource part-of-speech tagging. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 614–620, Brussels, Belgium. Association for Computational Linguistics.
  • Barbara Plank, Anders Søgaard, and Yoav Goldberg. 2016. Multilingual part-of-speech tagging with bidirectional long short-term memory models and auxiliary loss. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016, August 7-12, 2016, Berlin, Germany, Volume 2: Short Papers. The Association for Computer Linguistics.
  • Xipeng Qiu, Tianxiang Sun, Yige Xu, Yunfan Shao, Ning Dai, and Xuanjing Huang. 2020. Pre-trained models for natural language processing: A survey. arXiv preprint arXiv:2003.08271.
  • Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI Blog, 1(8):9.
  • Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2019. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683.
  • Afshin Rahimi, Yuan Li, and Trevor Cohn. 2019. Massively multilingual transfer for NER. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 151–164, Florence, Italy. Association for Computational Linguistics.
  • Jonathan Raiman and John Miller. 2017. Globally normalized reader. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1059–1069, Copenhagen, Denmark. Association for Computational Linguistics.
  • Alan Ramponi and Barbara Plank. 2020. Neural unsupervised domain adaptation in NLP: A survey. arXiv preprint arXiv:2006.00632.
  • Alexander Ratner, Stephen H. Bach, Henry Ehrenberg, Jason Fries, Sen Wu, and Christopher Ré. 2017. Snorkel: Rapid training data creation with weak supervision. Proc. VLDB Endow., 11(3):269–282.
  • Ines Rehbein and Josef Ruppenhofer. 2017. Detecting annotation noise in automatically labelled data. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1160–1170, Vancouver, Canada. Association for Computational Linguistics.
  • Sebastian Riedel, Limin Yao, and Andrew McCallum. 2010. Modeling relations and their mentions without labeled text. In Machine Learning and Knowledge Discovery in Databases, pages 148–163, Berlin, Heidelberg. Springer Berlin Heidelberg.
  • Shruti Rijhwani, Shuyan Zhou, Graham Neubig, and Jaime Carbonell. 2020. Soft gazetteers for low-resource named entity recognition. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8118–8123, Online. Association for Computational Linguistics.
  • Anna Rogers, Olga Kovaleva, and Anna Rumshisky. 2020. A primer in BERTology: What we know about how BERT works. arXiv preprint arXiv:2002.12327.
  • Benjamin Roth, Tassilo Barth, Michael Wiegand, and Dietrich Klakow. 2013. A survey of noise reduction methods for distant supervision. In Proceedings of the 2013 workshop on Automated knowledge base construction, pages 73–78.
  • Guy Rotman and Roi Reichart. 2019. Deep contextualized self-training for low resource dependency parsing. Transactions of the Association for Computational Linguistics, 7:695–713.
  • Sebastian Ruder. 2019. The 4 biggest open problems in NLP.
  • Sebastian Ruder, Ivan Vulić, and Anders Søgaard. 2019. A survey of cross-lingual word embedding models. Journal of Artificial Intelligence Research, 65:569–631.
  • Gözde Gül Şahin and Mark Steedman. 2018. Data augmentation via dependency tree morphing for low-resource languages. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 5004–5009, Brussels, Belgium. Association for Computational Linguistics.
  • Tal Schuster, Ori Ram, Regina Barzilay, and Amir Globerson. 2019. Cross-lingual alignment of contextual word embeddings, with applications to zero-shot dependency parsing. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers).
  • Burr Settles. 2009. Active learning literature survey. Technical report, University of Wisconsin-Madison Department of Computer Sciences.
  • Yong Shi, Yang Xiao, and Lingfeng Niu. 2019. A brief survey of relation extraction based on distant supervision. In International Conference on Computational Science, pages 293–303. Springer.
  • Jasdeep Singh, Bryan McCann, Richard Socher, and Caiming Xiong. 2019. BERT is not an interlingua and the bias of tokenization. In Proceedings of the 2nd Workshop on Deep Learning Approaches for Low-Resource NLP (DeepLo 2019), pages 47–55, Hong Kong, China. Association for Computational Linguistics.
  • Alisa Smirnova and Philippe Cudré-Mauroux. 2018. Relation extraction using distant supervision: A survey. ACM Computing Surveys (CSUR), 51(5):1–35.
  • Samuel L. Smith, David H. P. Turban, Steven Hamblin, and Nils Y. Hammerla. 2017. Offline bilingual word vectors, orthogonal transformations and the inverted softmax. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings.
  • Ralf Steinberger. 2012. A survey of methods to ease the development of highly multilingual text mining applications. Language resources and evaluation, 46(2):155–176.
  • Stephanie Strassel and Jennifer Tracey. 2016. LORELEI language packs: Data, tools, and resources for technology development in low resource languages. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pages 3273–3280, Portorož, Slovenia. European Language Resources Association (ELRA).
  • Cong Sun and Zhihao Yang. 2019. Transfer learning in biomedical named entity recognition: An evaluation of BERT in the PharmaCoNER task. In Proceedings of The 5th Workshop on BioNLP Open Shared Tasks, pages 100–104, Hong Kong, China. Association for Computational Linguistics.
  • Mihai Surdeanu, Julie Tibshirani, Ramesh Nallapati, and Christopher D. Manning. 2012. Multi-instance multi-label learning for relation extraction. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 455–465, Jeju Island, Korea. Association for Computational Linguistics.
  • Oscar Täckström, Dipanjan Das, Slav Petrov, Ryan McDonald, and Joakim Nivre. 2013. Token and type constraints for cross-lingual part-of-speech tagging. Transactions of the Association for Computational Linguistics, 1:1–12.
  • Jörg Tiedemann. 2012. Parallel data, tools and interfaces in OPUS. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), pages 2214–2218, Istanbul, Turkey. European Language Resources Association (ELRA).
  • Tatiana Tsygankova, Francesca Marini, Stephen Mayhew, and Dan Roth. 2020. Building low-resource NER models using non-speaker annotation.
  • Clara Vania, Yova Kementchedjhieva, Anders Søgaard, and Adam Lopez. 2019. A systematic comparison of methods for low-resource dependency parsing on genuinely low-resource languages. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 1105–1116, Hong Kong, China. Association for Computational Linguistics.
  • Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 5998–6008. Curran Associates, Inc.
  • Hao Wang, Bing Liu, Chaozhuo Li, Yan Yang, and Tianrui Li. 2019. Learning with noisy labels for sentence-level sentiment classification. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 6286–6292, Hong Kong, China. Association for Computational Linguistics.
  • Jason Wei and Kai Zou. 2019. EDA: Easy data augmentation techniques for boosting performance on text classification tasks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 6382–6388, Hong Kong, China. Association for Computational Linguistics.
  • Garrett Wilson and Diane J Cook. 2018. A survey of unsupervised deep domain adaptation. arXiv preprint arXiv:1812.02849.
  • Guillaume Wisniewski, Nicolas Pécheux, Souhir Gahbiche-Braham, and François Yvon. 2014. Cross-lingual part-of-speech tagging through ambiguous learning. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1779–1785, Doha, Qatar. Association for Computational Linguistics.
  • Unknown intent detection using Gaussian mixture model with an application to zero-shot intent classification. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1050–1060, Online. Association for Computational Linguistics.
    Google ScholarLocate open access versionFindings
  • Yaosheng Yang, Wenliang Chen, Zhenghua Li, Zhengqiu He, and Min Zhang. 2018. Distantly supervised NER with partial annotation learning and reinforcement learning. In Proceedings of the 27th International Conference on Computational Linguistics, pages 2159–2169, Santa Fe, New Mexico, USA. Association for Computational Linguistics.
    Google ScholarLocate open access versionFindings
  • Ze Yang, Wei Wu, Jian Yang, Can Xu, and Zhoujun Li. 2019. Low-resource response generation with template prior. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLPIJCNLP), pages 1886–1897, Hong Kong, China. Association for Computational Linguistics.
    Google ScholarLocate open access versionFindings
  • David Yarowsky, Grace Ngai, and Richard Wicentowski. 2001. Inducing multilingual text analysis tools via robust projection across aligned corpora. In Proceedings of the First International Conference on Human Language Technology Research.
    Google ScholarLocate open access versionFindings
  • Shijie Wu, Alexis Conneau, Haoran Li, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Emerging cross-lingual structure in pretrained language models. arXiv preprint arXiv:1911.01464.
    Findings
  • Shijie Wu and Mark Dredze. 2019.
    Google ScholarFindings
  • Beto, bentz, becas: The surprising cross-lingual effectiveness of BERT. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 833–844, Hong Kong, China. Association for Computational Linguistics.
    Google ScholarLocate open access versionFindings
  • Shijie Wu and Mark Dredze. 2020. Are all languages created equal in multilingual BERT? In Proceedings of the 5th Workshop on Representation Learning for NLP, pages 120–130, Online. Association for Computational Linguistics.
    Google ScholarLocate open access versionFindings
  • Tong Xiao, Tian Xia, Yi Yang, Chang Huang, and Xiaogang Wang. 2015. Learning from massive noisy labeled data for image classification. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA, USA, June 7-12, 2015, pages 2691–2699. IEEE Computer Society.
  • Hu Xu, Bing Liu, Lei Shu, and Philip S. Yu. 2020. DomBERT: Domain-oriented language model for aspect-based sentiment analysis. arXiv preprint arXiv:2004.13816.
  • Guangfeng Yan, Lu Fan, Qimai Li, Han Liu, Xiaotong Zhang, Xiao-Ming Wu, and Albert Y.S. Lam. 2020.
  • Michihiro Yasunaga, Jungo Kasai, and Dragomir Radev. 2018. Robust multilingual part-of-speech tagging via adversarial training. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 976–986, New Orleans, Louisiana. Association for Computational Linguistics.
  • Qinyuan Ye, Liyuan Liu, Maosen Zhang, and Xiang Ren. 2019. Looking beyond label noise: Shifted label distribution matters in distantly supervised relation extraction. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3841–3850, Hong Kong, China. Association for Computational Linguistics.
  • Jihene Younes, Emna Souissi, Hadhemi Achour, and Ahmed Ferchichi. 2020. Language resources for Maghrebi Arabic dialects' NLP: A survey. Language Resources and Evaluation.
  • Mo Yu, Xiaoxiao Guo, Jinfeng Yi, Shiyu Chang, Saloni Potdar, Yu Cheng, Gerald Tesauro, Haoyu Wang, and Bowen Zhou. 2018. Diverse few-shot text classification with multiple metrics. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1206–1215, New Orleans, Louisiana. Association for Computational Linguistics.
  • Bi Yude. 2011. A brief survey of Korean natural language processing research. Journal of Chinese Information Processing, 6.
  • Meishan Zhang, Yue Zhang, and Guohong Fu. 2019a. Cross-lingual dependency parsing using code-mixed TreeBank. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 997–1006, Hong Kong, China. Association for Computational Linguistics.
  • Rui Zhang, Caitlin Westerfield, Sungrok Shim, Garrett Bingham, Alexander Fabbri, William Hu, Neha Verma, and Dragomir Radev. 2019b. Improving low-resource cross-lingual document retrieval by re-ranking with deep bilingual representations. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3173–3179, Florence, Italy. Association for Computational Linguistics.
  • as tokenization, to higher-level tasks, such as question answering. For this short study, we have chosen the following languages. The speaker numbers are the combined L1 and L2 speakers according to Eberhard et al. (2019).
  • (2) Yoruba: An African language spoken by ca. 40 million speakers and included in the XTREME benchmark (Hu et al., 2020). Even with that many speakers, it is often considered a low-resource language, and it is still debated whether it is also endangered (Fabuni and Salawu, 2005).
  • Shun Zheng, Xu Han, Yankai Lin, Peilin Yu, Lu Chen, Ling Huang, Zhiyuan Liu, and Wei Xu. 2019. DIAG-NRE: A neural pattern diagnosis framework for distantly supervised neural relation extraction. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1419–1429, Florence, Italy. Association for Computational Linguistics.
  • (3) Hausa: An African language with over 60 million speakers. It is covered neither in XTREME nor in the Universal Dependencies project (Nivre et al., 2020).
  • Joey Tianyi Zhou, Hao Zhang, Di Jin, Hongyuan Zhu, Meng Fang, Rick Siow Mong Goh, and Kenneth Kwok. 2019. Dual adversarial neural transfer for low-resource named entity recognition. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3461–3471, Florence, Italy. Association for Computational Linguistics.
  • Yi Zhu, Benjamin Heinzerling, Ivan Vulic, Michael Strube, Roi Reichart, and Anna Korhonen. 2019. On the importance of subword information for morphological tasks in truly low-resource languages. In Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL), pages 216–226, Hong Kong, China. Association for Computational Linguistics.
  • (5) Nahuatl and (6) Estonian: Both have between 1 and 2 million speakers, but are spoken in very different regions (North America & Europe).
  • All speaker numbers are according to Eberhard et al. (2019) and reflect the total number of users (L1 + L2). The tasks were chosen from a list of popular NLP tasks3. We selected two tasks from the lower-level groups and three tasks from the higher-level groups, which reflects the growing application diversity with increasing complexity. Table 2 shows which tasks were addressed for each language.
  • Word segmentation, lemmatization, part-of-speech tagging, sentence breaking and (semantic) parsing are covered for Yoruba and Estonian by treebanks from the Universal Dependencies project (Nivre et al., 2020). Cusco Quechua is listed as an upcoming language in the UD project, but no treebank is accessible at this moment. The WikiAnn corpus for named entity recognition (Pan et al., 2017) provides resources and tools for NER and sentence breaking for all six languages. Lemmatization
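  • As a concrete illustration of what such treebank coverage provides, the following minimal sketch reads part-of-speech annotations from a Universal Dependencies treebank in the CoNLL-U format. The parsing logic follows the published CoNLL-U column layout (UPOS is the fourth column); the sample sentence is a hypothetical stand-in for a real treebank file, not data from any of the six languages discussed.

```python
# Minimal sketch: counting universal POS tags in a CoNLL-U treebank.
from collections import Counter

def upos_counts(conllu_text):
    """Count universal POS tags (column 4 of CoNLL-U) in a treebank string."""
    counts = Counter()
    for line in conllu_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and sentence-level comment lines
        cols = line.split("\t")
        if "-" in cols[0] or "." in cols[0]:
            continue  # skip multiword-token ranges and empty-node rows
        counts[cols[3]] += 1
    return counts

# Hypothetical two-token sample in CoNLL-U format.
sample = (
    "# text = A sentence .\n"
    "1\tA\ta\tDET\t_\t_\t2\tdet\t_\t_\n"
    "2\tsentence\tsentence\tNOUN\t_\t_\t0\troot\t_\t_\n"
    "3\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_\n"
)
print(upos_counts(sample))  # Counter({'DET': 1, 'NOUN': 1, 'PUNCT': 1})
```

  The same loop applies unchanged to any downloaded UD treebank file, which makes it easy to compare annotation volume across high- and low-resource languages.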