Audio-textual multi-label demographic recognition of Arabic speakers using deep learning

Journal of Intelligent & Fuzzy Systems (2024)

Abstract
Speaker demographic recognition and segmentation analytics play a key role in offering personalized experiences across automated industries and businesses. This paper aims to develop a multi-label demographic recognition system for Arabic speakers that uses audio and the associated textual modality. The system detects age group, gender, and dialect, and it can be readily extended to other demographic traits. The proposed method uses deep learning for both feature learning and recognition. Representations of the audio modality are learned from 3D spectrograms with an AlexNet-based CNN architecture, and an AraBERT transformer learns representations of the textual modality. A method is also provided for fusing the audio and textual representations. The method is evaluated on the Saudi Audio Dataset for Arabic (SADA), a recently published database of audio recordings of TV shows in different Arabic dialects. The experiments show that, among standalone-modality models for multi-label demographic classification, the textual modality with AraBERT performed better than the audio modality represented by 3D spectrograms with the AlexNet-based CNN architecture. Furthermore, combining the audio and textual modalities yielded a significant improvement for all demographic traits.
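The abstract does not say how the audio and textual representations are fused, so the following is only a minimal sketch of one common option: concatenating the two embeddings and applying one softmax head per demographic trait. The embedding sizes, class counts, and every name in it (`TASKS`, `init_heads`, `fuse_and_predict`) are hypothetical, and random vectors stand in for the CNN and AraBERT outputs. It is not the authors' implementation.

```python
import math
import random

# Hypothetical sizes: the abstract gives neither embedding widths nor class counts.
AUDIO_DIM, TEXT_DIM = 8, 12
TASKS = {"age_group": 3, "gender": 2, "dialect": 5}


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def init_heads(in_dim, tasks, rng):
    """Create one randomly initialised linear layer (weights, biases) per trait."""
    return {
        name: (
            [[rng.gauss(0.0, 0.1) for _ in range(in_dim)] for _ in range(n)],
            [0.0] * n,
        )
        for name, n in tasks.items()
    }


def fuse_and_predict(audio_emb, text_emb, heads):
    """Concatenate the two embeddings, then apply each trait's head."""
    fused = list(audio_emb) + list(text_emb)
    out = {}
    for name, (weights, biases) in heads.items():
        logits = [
            sum(w * x for w, x in zip(row, fused)) + b
            for row, b in zip(weights, biases)
        ]
        out[name] = softmax(logits)
    return out


rng = random.Random(0)
heads = init_heads(AUDIO_DIM + TEXT_DIM, TASKS, rng)
audio_emb = [rng.gauss(0.0, 1.0) for _ in range(AUDIO_DIM)]  # stand-in for CNN features
text_emb = [rng.gauss(0.0, 1.0) for _ in range(TEXT_DIM)]    # stand-in for AraBERT features
probs = fuse_and_predict(audio_emb, text_emb, heads)
```

Here `probs` maps each trait name to a probability distribution over that trait's classes. The heads are untrained, so the values are meaningless; the sketch shows only the data flow from two modality embeddings to several per-trait predictions.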