Compact Transformer-based Language Models for the Moroccan Darija

Mohamed Aghzal, Mohamed Amine El Bouni, Saad Driouech, Asmaa Mourhir

2023 7th IEEE Congress on Information Science and Technology (CiSt)

Abstract
Over the past few years, pre-trained language models based on transformer architectures have revolutionized the field of natural language processing, achieving state-of-the-art performance on various tasks. However, because these models depend on enormous corpora for training, very few have been trained for under-resourced languages and dialects, such as Moroccan Darija. In this work, we introduce DarRoBERTa and DarELECTRA, two transformer-based language models for Darija. We evaluate the language models on the extrinsic tasks of text summarization and topic classification. On the text summarization task, DarELECTRA achieves state-of-the-art results, with scores of 19.25 for ROUGE-1, 5.79 for ROUGE-2, and 18.01 for ROUGE-L. On the topic classification task, DarRoBERTa achieved an F1 score of 0.84 and an accuracy of 0.86. While our Darija language models achieve results close to those of Arabic language models, they are much smaller and more efficient.
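The summarization results above are reported as ROUGE scores. As a reference point, here is a minimal sketch of computing ROUGE-1/2/L with Google's rouge-score package; the choice of package and the example strings are assumptions for illustration, since the abstract does not specify the evaluation tooling used.

```python
# Minimal sketch: computing ROUGE-1/2/L between a reference summary and a
# model-generated summary, using the rouge-score package
# (pip install rouge-score). Strings below are placeholders, not paper data.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"],
                                  use_stemmer=False)

reference = "the gold summary of the article"   # human-written summary
candidate = "a summary generated by the model"  # model output

scores = scorer.score(reference, candidate)
for name, score in scores.items():
    # Each entry holds precision, recall, and F-measure for that variant.
    print(f"{name}: F1 = {score.fmeasure:.4f}")
```

In practice, such scores would be averaged over every article in the test set before being reported as corpus-level ROUGE figures like those quoted above.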
Keywords
Darija, language modeling, deep learning, low resource language, transformer