FluSa-Tweet: A Benchmark Dataset for Influenza Detection in Saudi Arabia

2022 13th International Conference on Information and Communication Systems (ICICS) (2022)

Abstract
Automatic flu detection is an important task in specialized natural language processing and its applications. Most existing studies on influenza detection focus on English. Arabic, however, lacks significant resources, such as labeled datasets and NLP tools, needed to keep pace with current technologies. To this end, this paper presents the first dataset for detecting influenza in Arabic tweets, particularly in Saudi Arabia. The dataset is manually annotated and contains 9145 tweets, categorized into three classes: (i) Awareness, (ii) Infection, and (iii) Unrelated. To mitigate class imbalance in the original dataset, three data resampling techniques are applied: (i) undersampling, (ii) oversampling, and (iii) oversampling using SMOTE. Five machine learning algorithms (Naive Bayes, Logistic Regression, Linear Support Vector Classifier, Random Forest, and XGBoost) and three transformer-based models are trained and tested to evaluate the dataset. Accuracy, precision, recall, and F1-score are used to assess the models' performance. A transformer-based model achieves the best result, with a 97% F1-score. The dataset is publicly available for future research on automatically identifying influenza in Arab countries.
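The abstract names random undersampling and oversampling but does not describe how they were carried out. The following is a minimal sketch of the two ideas in plain Python, not the authors' procedure: the `resample` helper, the toy tweets, and the class counts are all hypothetical and chosen only for illustration. SMOTE is left out because it interpolates between numeric feature vectors rather than raw text; a library such as imbalanced-learn provides it.

```python
import random
from collections import Counter, defaultdict


def resample(texts, labels, strategy="over", seed=0):
    """Balance classes by random resampling (illustrative helper, not from the paper).

    strategy="over":  duplicate random samples until every class matches the largest class.
    strategy="under": keep a random subset so every class matches the smallest class.
    """
    if strategy not in ("over", "under"):
        raise ValueError("strategy must be 'over' or 'under'")

    rng = random.Random(seed)

    # Group the samples by class label.
    by_class = defaultdict(list)
    for text, label in zip(texts, labels):
        by_class[label].append(text)

    sizes = [len(items) for items in by_class.values()]
    target = max(sizes) if strategy == "over" else min(sizes)

    out_texts, out_labels = [], []
    for label, items in by_class.items():
        if len(items) >= target:
            # Enough samples: draw `target` of them without replacement.
            chosen = rng.sample(items, target)
        else:
            # Too few samples: keep all and add random duplicates.
            chosen = items + rng.choices(items, k=target - len(items))
        out_texts.extend(chosen)
        out_labels.extend([label] * target)
    return out_texts, out_labels


# Toy imbalanced data; these counts are made up and do not reflect the real dataset.
labels = ["Awareness"] * 6 + ["Infection"] * 2 + ["Unrelated"] * 10
texts = [f"tweet {i}" for i in range(len(labels))]

_, over_labels = resample(texts, labels, "over")
_, under_labels = resample(texts, labels, "under")
print(Counter(over_labels))
print(Counter(under_labels))
```

With this toy input, oversampling brings every class up to 10 samples and undersampling cuts every class down to 2. In practice, resampling is usually applied to the training split only, so that duplicated tweets do not leak into the test set.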
Keywords
Arabic Language, Influenza, Machine Learning, NLP, Saudi Dialect, Twitter