Building a User-Generated Content North-African Arabizi Treebank: Tackling Hell

Farah Essaidi
Amal Fethi
Matthieu Futeral
Benjamin Muller
Abhishek Srivastava

ACL, pp. 1139-1150, 2020.


Abstract:

We introduce the first treebank for a romanized user-generated content variety of Algerian, a North-African Arabic dialect known for its frequent usage of code-switching. Made of 1500 sentences, fully annotated in morpho-syntax and Universal Dependency syntax, with full translation at both the word and the sentence levels, this treebank i...

Introduction
  • Until fully unsupervised techniques free the field from its addiction to annotated data, building useful data sets for under-resourced languages at a reasonable cost remains a crucial question.
  • In conjunction with the resource scarcity issue, the code-switching variability displayed by these languages challenges most, if not all, standard NLP pipelines.
  • What makes these dialects especially interesting is their widespread use in user-generated content found on social media platforms, where they are generally written using a romanized version of the Arabic script, called Arabizi, which is neither standardized nor formalized.
  • The absence of standardization for this script adds another layer of variation on top of well-known user-generated content idiosyncrasies, making the processing of this kind of text an even more challenging task.
Highlights
  • Until fully unsupervised techniques free our field from its addiction to annotated data, building useful data sets for under-resourced languages at a reasonable cost remains a crucial question.
  • The Arabic dialects spoken in North-Africa, mostly from Morocco to Tunisia, and sometimes called Maghribi or Darija, are a case in point: they notoriously contain various degrees of code-switching with languages of former colonial powers such as French, Spanish, and, to a much lesser extent, Italian, depending on the area of usage (Habash, 2010; Cotterell et al., 2014; Saadane and Habash, 2015).
  • We introduced the first treebank for an Arabic dialect spoken in North-Africa and written in romanized form, NArabizi.
  • Moreover, being made of user-generated content, this treebank covers a large variety of language variation among native speakers and displays a high level of code-switching.
  • Annotated with four standard morphosyntactic layers, two of them following the Universal Dependency annotation scheme, and provided with translation to French as well as glosses and word language identification, we believe that this corpus will be useful for the community at large, both for linguistic purposes and as training data for resource-scarce NLP in a high-variability scenario.
  • In addition to the annotated data, we provide around 1 million tokens of unlabeled NArabizi content, resulting in the largest dataset available for this dialect.
Results
  • As more than 36% of the words in the NArabizi treebank are French, it is of interest to use recent visualization methods to see how interleaved the two languages are.
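The French share mentioned above can be measured directly from word-level language labels. The following is a minimal sketch, assuming a CoNLL-U file whose MISC column carries a hypothetical `Lang=` attribute; the released treebank may encode language identification differently. The function name, the language codes `nar` and `fr`, and the placeholder rows are all invented for illustration and are not taken from the corpus.

```python
from collections import Counter


def language_shares(conllu_text, key="Lang"):
    """Return {language: fraction of tokens} for CoNLL-U formatted text.

    Assumption (hypothetical): each token's MISC column (10th field)
    holds an attribute such as `Lang=fr`. Tokens without it are
    counted under "unknown".
    """
    counts = Counter()
    for line in conllu_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        fields = line.split("\t")
        if len(fields) != 10:
            continue  # not a well-formed token line
        if "-" in fields[0] or "." in fields[0]:
            continue  # skip multiword-token ranges and empty nodes
        attrs = dict(
            item.split("=", 1) for item in fields[9].split("|") if "=" in item
        )
        counts[attrs.get(key, "unknown")] += 1
    total = sum(counts.values())
    return {lang: n / total for lang, n in counts.items()} if total else {}


def make_row(index, form, lang):
    """Build one placeholder CoNLL-U token line (illustrative only)."""
    return "\t".join([str(index), form, "_", "_", "_", "_", "_", "_", "_", f"Lang={lang}"])


# Placeholder sentence: three tokens labelled "nar", one labelled "fr".
SAMPLE = "\n".join([
    "# text = placeholder sentence",
    make_row(1, "tok1", "nar"),
    make_row(2, "tok2", "nar"),
    make_row(3, "tok3", "fr"),
    make_row(4, "tok4", "nar"),
])

if __name__ == "__main__":
    print(language_shares(SAMPLE))
```

Run over a whole file, the same function gives a corpus-level share; applied sentence by sentence, it gives the per-sentence code-mix proportion.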
Conclusion
  • How interleaved are French and NArabizi? As stated before, NArabizi has its roots in Classical Arabic and integrates material from multiple sources: French, MSA, and Berber (the Amazigh language).
  • Given the large degree of interleaving between French and NArabizi, it is interesting to assess the impact of the French vocabulary on the performance of a POS-tagger trained on French data only.
  • For these experiments, the authors use the StanfordNLP neural tagger (Qi et al., 2019), which ranked 1st in POS tagging at the 2018 UD shared task, trained on the UD French ParTUT treebank (see Table 7).
  • The authors' corpora are freely available under the CC-BY-SA license, and the NArabizi treebank is released as part of the Universal Dependencies project.
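The experiment summarized above relates POS tagging performance to the proportion of code-mixing (Table 7). A minimal sketch of that kind of breakdown follows, assuming each sentence is available as a list of (gold tag, predicted tag, is-French flag) triples and that accuracy is the metric of interest. The data layout, the bucket width, the function names, and the toy sentences are illustrative assumptions, not the authors' evaluation code or data.

```python
from collections import defaultdict


def french_ratio(sentence):
    """Fraction of tokens flagged as French in one non-empty sentence."""
    return sum(1 for _, _, is_french in sentence if is_french) / len(sentence)


def accuracy_by_codemix(sentences, n_buckets=4):
    """Group sentences by French-token ratio and return per-bucket accuracy.

    Bucket b covers ratios in [b / n_buckets, (b + 1) / n_buckets);
    a ratio of exactly 1.0 is placed in the last bucket.
    Only buckets that contain at least one token appear in the result.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for sentence in sentences:
        if not sentence:
            continue  # ignore empty sentences
        bucket = min(int(french_ratio(sentence) * n_buckets), n_buckets - 1)
        for gold, predicted, _ in sentence:
            total[bucket] += 1
            correct[bucket] += int(gold == predicted)
    return {b: correct[b] / total[b] for b in sorted(total)}


# Toy input (invented): (gold tag, predicted tag, token is French?)
TOY = [
    [("NOUN", "NOUN", False), ("VERB", "NOUN", False)],  # no French tokens
    [("DET", "DET", True), ("NOUN", "NOUN", True)],      # all French tokens
    [("ADJ", "ADJ", True), ("NOUN", "VERB", False)],     # half French tokens
]

if __name__ == "__main__":
    for bucket, acc in accuracy_by_codemix(TOY).items():
        print(f"bucket {bucket}: accuracy {acc:.2f}")
```

Bucketing by sentence-level ratio keeps the comparison simple; a token-level breakdown (accuracy on French versus non-French tokens) would be an equally reasonable alternative.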
Tables
  • Table 1: Examples of lexical variation in NArabizi
  • Table 2: F1-scores of both language classification models on the Arabizi class
  • Table 3: Corpus statistics
  • Table 4: BLEU score of both transliteration systems
  • Table 5: POS tagging results
  • Table 6: Results of UDPipe (trained 100 epochs) on the preliminary test set
  • Table 7: POS tagging performance with regard to code-mix proportion, for a tagger trained on the UD French ParTUT treebank
  • Table 8: Treebanking costs. The annotation phases are (i) Morphology/tokenization, (ii) Translation, (iii) Preannotation Syntax, (iv) Correction, (v) Final Syntax. P.M stands for person-month
Related work
  • Research on Arabic dialects is extensive, and space is lacking to describe it exhaustively. Regarding North-African dialects, we refer to the work of Samih (2017), who during his PhD covered a wide range of topics concerning the dialect spoken in Morocco specifically and, more generally, language identification (Samih et al., 2016) in code-switching scenarios for various Arabic dialects (Attia et al., 2019).

    Unlike for NArabizi, the resource situation for Arabic dialects in canonical written form can hardly be described as scarce, given the amount of resources produced by the Linguistic Data Consortium for these languages; see Diab et al. (2013) for details on those corpora. These data have been extensively covered in various NLP aspects by former members of the Columbia Arabic NLP team, among them Mona Diab, Nizar Habash, and Owen Rambow, in their respective subsequent lines of work. Many small- to medium-scale linguistic resources, such as morphological lexicons or bilingual dictionaries, have been produced (Shoufan and Alameri, 2015). Recently, in addition to the release of a small-range parallel corpus for some Arabic dialects (Bouamor et al., 2014), a larger corpus collection was released, covering 25 city dialects in the travel domain (Bouamor et al., 2018).
Funding
  • The work was partially funded by the French Research Agency projects ParSiTi (ANR-16-CE33-0021) and SoSweet (ANR-15-CE38-0011-01), and by the French Ministry of Industry and Ministry of Foreign Affairs via the PHC Maimonide France-Israel cooperation programme, as well as by Sagot’s chair in the PRAIRIE institute, funded by the French national agency ANR as part of the “Investissements d’avenir” programme under the reference ANR-19-P3IA-0001.
Reference
  • Ahmed Abdelali, Kareem Darwish, Nadir Durrani, and Hamdy Mubarak. 2016. Farasa: A fast and furious segmenter for Arabic. In HLT-NAACL Demos.
  • Mohammed Attia, Younes Samih, Ali Elkahky, Hamdy Mubarak, Ahmed Abdelali, and Kareem Darwish. 2019. POS tagging for improving code-switching identification in Arabic. In
  • Yonatan Belinkov and James Glass. 2015. Arabic diacritization with recurrent neural networks. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 2281–2285, Lisbon, Portugal. Association for Computational Linguistics.
  • Houda Bouamor, Nizar Habash, and Kemal Oflazer. 2014. A multidialectal parallel corpus of Arabic. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC’14), pages 1240–1245, Reykjavik, Iceland. European Language Resources Association (ELRA).
  • Houda Bouamor, Nizar Habash, Mohammad Salameh, Wajdi Zaghouani, Owen Rambow, Dana Abdulrahim, Ossama Obeid, Salam Khalifa, Fadhl Eryani, Alexander Erdmann, and Kemal Oflazer. 2018. The MADAR Arabic dialect corpus and lexicon. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan. European Language Resources Association (ELRA).
  • Ryan Cotterell, Adithya Renduchintala, Naomi Saphra, and Chris Callison-Burch. 2014. An Algerian Arabic-French code-switched corpus. In Workshop on Free/Open-Source Arabic Corpora and Corpora Processing Tools Workshop Programme, page 34.
  • Mona Diab, Nizar Habash, Owen Rambow, and Ryan Roth. 2013. LDC Arabic treebanks and associated corpora: Data divisions manual. Technical Report CCLS-13-02, Center for Computational Learning Systems, Columbia University.
  • Nizar Habash. 2010. Introduction to Arabic Natural Language Processing. Morgan and Claypool.
  • Armand Joulin, Edouard Grave, Piotr Bojanowski, Matthijs Douze, Hervé Jégou, and Tomas Mikolov. 2016. FastText.zip: Compressing text classification models. arXiv preprint arXiv:1612.03651.
  • Éric de La Clergerie, Benoît Sagot, and Djamé Seddah. 2017. The ParisNLP entry at the CoNLL UD shared task 2017: A tale of a #ParsingTragedy. In Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pages 243–252, Vancouver, Canada. Association for Computational Linguistics.
  • Marco Lui and Timothy Baldwin. 2012. langid.py: An off-the-shelf language identification tool. In The 50th Annual Meeting of the Association for Computational Linguistics, Proceedings of the System Demonstrations, July 10, 2012, Jeju Island, Korea, pages 25–30. The Association for Computer Linguistics.
  • Teresa Lynn and Kevin Scannell. 2019. Code-switching in Irish tweets: A preliminary analysis. In Proceedings of the Celtic Language Technology Workshop, pages 32–40.
  • Héctor Martínez Alonso, Djamé Seddah, and Benoît Sagot. 2016. From Noisy Questions to Minecraft Texts: Annotation Challenges in Extreme Syntax Scenarios. In The 2nd Workshop on Noisy User-generated Text (W-NUT).
  • Tomas Mikolov, Edouard Grave, Piotr Bojanowski, Christian Puhrsch, and Armand Joulin. 2018. Advances in pre-training distributed word representations. In Proceedings of the International Conference on Language Resources and Evaluation (LREC 2018).
  • Robert Munro. 2010. Crowdsourced translation for emergency response in Haiti: the global collaboration of local knowledge. In AMTA Workshop on Collaborative Crowdsourcing for Translation, pages 1–4.
  • Carol Myers-Scotton. 1993. Common and uncommon ground: Social and structural factors in codeswitching. Language in Society, 22(4):475–503.
  • Adam Nossiter. 2019. Algeria protests grow against president Bouteflika, ailing and out of sight. In New York Times (March 01, 2019).
  • Pedro Javier Ortiz Suárez, Benoît Sagot, and Laurent Romary. 2019. Asynchronous Pipeline for Processing Huge Corpora on Medium to Low Resource Infrastructures. In 7th Workshop on the Challenges in the Management of Large Corpora (CMLC-7), Cardiff, United Kingdom. Leibniz-Institut für Deutsche Sprache.
  • Kishore Papineni, Salim Roukos, Todd Ward, and Wei-jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318.
  • Slav Petrov, Dipanjan Das, and Ryan McDonald. 2011. A universal part-of-speech tagset. arXiv preprint arXiv:1104.2086.
  • Peng Qi, Timothy Dozat, Yuhao Zhang, and Christopher D Manning. 2019. Universal dependency parsing from scratch. arXiv preprint arXiv:1901.10457.
  • Houda Saadane and Nizar Habash. 2015. A conventional orthography for Algerian Arabic. In Proceedings of the Second Workshop on Arabic Natural Language Processing, pages 69–79.
  • Younes Samih. 2017. Dialectal Arabic processing Using Deep Learning. Ph.D. thesis, Düsseldorf, Germany.
  • Younes Samih, Suraj Maharjan, Mohammed Attia, Laura Kallmeyer, and Thamar Solorio. 2016. Multilingual code-switching identification via LSTM recurrent neural networks. In Proceedings of the Second Workshop on Computational Approaches to Code Switching, pages 50–59, Austin, Texas. Association for Computational Linguistics.
  • Djamé Seddah, Benoît Sagot, Marie Candito, Virginie Mouilleron, and Vanessa Combet. 2012. The French Social Media Bank: a Treebank of Noisy User Generated Content. In CoLing, Mumbai, India.
  • Abdulhadi Shoufan and Sumaya Alameri. 2015. Natural language processing for dialectical Arabic: A survey. In Proceedings of the Second Workshop on Arabic Natural Language Processing, pages 36–48, Beijing, China. Association for Computational Linguistics.
  • Abhishek Srivastava, Benjamin Muller, and Djamé Seddah. 2019. Unsupervised Learning for Handling Code-Mixed Data: A Case Study on POS Tagging of North-African Arabizi Dialect. EurNLP - First annual EurNLP. Poster.
  • Milan Straka and Jana Straková. 2017. Tokenizing, POS tagging, lemmatizing and parsing UD 2.0 with UDPipe. In Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pages 88–99, Vancouver, Canada. Association for Computational Linguistics.
  • Reut Tsarfaty, Djamé Seddah, Yoav Goldberg, Sandra Kübler, Marie Candito, Jennifer Foster, Yannick Versley, Ines Rehbein, and Lamia Tounsi. 2010. Statistical parsing of morphologically rich languages (SPMRL): what, how and whither. In Proceedings of the NAACL HLT 2010 First Workshop on Statistical Parsing of Morphologically-Rich Languages, pages 1–12. Association for Computational Linguistics.
  • Nasser Zalmout and Nizar Habash. 2019. Joint diacritization, lemmatization, normalization, and fine-grained morphological tagging. arXiv preprint arXiv:1910.02267.