BERT-Based Arabic Diacritization: A state-of-the-art approach for improving text accuracy and pronunciation

Ruba Kharsa, Ashraf Elnagar,Sane Yagi

EXPERT SYSTEMS WITH APPLICATIONS(2024)

引用 0|浏览0
暂无评分
摘要
In order to accurately represent the meaning and pronunciation of Arabic words and sentences, the presence of diacritics plays a crucial role. Over the years, researchers have dedicated significant efforts to enhancing automated diacritization systems. This paper introduces a novel approach for Arabic diacritization utilizing Bidirectional Encoder representations from Transformers (BERT) models. To evaluate the effectiveness of the proposed approach, two publicly available datasets, namely the Arabic Diacritization (AD) dataset and the Tashkeela Processed (TP) dataset, were employed. The performance of the models was assessed using various error metrics, including Diacritic Error Rate (DER) and Word Error Rate (WER). The findings demonstrate the superior performance of BERT in the diacritization process, surpassing all models employed in other diacritization systems. On the AD dataset, the proposed system achieved state-of-the-art (SOTA) syntactic DER and WER of 1.14% and 3.34%, respectively. For morphological diacritization, the best results yielded a DER of 0.92% and a WER of 1.91%. These outcomes reflect a remarkable relative error reduction of over 30% compared to previous research. Additionally, on the TP dataset, the BERT models exhibited a substantial decrease in DER, reducing the benchmark from 4.0% to 1.11%. Furthermore, this study introduces a realtime diacritization system called SUKOUN, which offers diacritized text through a user-friendly website. A comparison with existing automatic diacritization tools, using six example texts, reveals the superior prediction accuracy and preservation of input format provided by SUKOUN.
更多
查看译文
关键词
Automated diacritization system,Real-time diacritization,Word error rate,Diacritic error rate,BERT Arabic contextualized models
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要