A comparison of data augmentation methods in voice pathology detection

Comput. Speech Lang.(2023)

引用 0|浏览7
暂无评分
摘要
To distinguish pathological voices from healthy voices, automatic voice pathology detection systems can be built using machine learning (ML) and deep learning (DL) techniques. To fully exploit such systems, large quantities of training data are typically required. The amount of training data is, however, small in the area of pathological voice, and therefore data augmentation (DA) becomes a potential technology to artificially increase the quantity of training data. This study presents a systematic comparison between various DA methods in the detection of pathological voice, including three time domain methods (noise addition, pitch shifting and time stretching), one time-frequency domain method (SpecAugment), and two vocoder-based methods (harmonic-to-noise ratio (HNR) modification and glottal pulse length modification). Detection systems were built using four popular spectral feature representations (static melfrequency cepstral coefficients (MFCCs), dynamic MFCCs, spectrogram and melspectrogram). As classifiers, two widely used ML models (support vector machine (SVM) and random forest (RF)) and two DL models (long short-term memory (LSTM) network and convolutional neural network (CNN) with 1-dimensional (1-D) and 2-dimensional (2-D) architectures) were used. These systems were trained using a small number of training samples from two popular databases of pathological voice (HUPA and SVD) to find the best feature/classifier combination for each database. As a result, one ML-based detection system (mel-spectrogram/SVM for HUPA and SVD) and two DL-based detection systems (dynamic MFCCs/2-D CNN for HUPA and melspectrogram/2-D CNN for SVD) were selected for the comparison of the DA methods. The results show that by using DA in the system training, detection accuracy increased compared to the baseline systems that were trained without using DA. This improvement in accuracy was, however, clearly larger for the 2D-CNN system than for the SVM system. Furthermore, all six DA methods improved accuracy of the 2-D CNN system compared to the baseline system for both databases. The highest improvements were achieved using the time-frequency domain SpecAugment DA method, which improved accuracy by 1.5% and 3.8% (absolute) for the HUPA and SVD database, respectively.
更多
查看译文
关键词
Voice pathology,Data augmentation,Deep learning,CNNs,Mel-spectrogram
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要