DSVAE: Disentangled Representation Learning for Synthetic Speech Detection.

Amit Kumar Singh Yadav,Kratika Bhagtani,Ziyue Xiang,Paolo Bestagini,Stefano Tubaro,Edward J. Delp

International Conference on Machine Learning and Applications（2023）

Cited 0|Views11

No score

Abstract

Tools to generate high quality synthetic speech that is perceptually indistinguishable from speech recorded from hu-man speakers are easily available. Many incidents report misuse of synthetic speech for spreading misinformation and committing financial fraud. Several approaches have been proposed for detecting synthetic speech. Many of these approaches use deep learning methods without providing reasoning for the decisions they make. This limits the explainability of these approaches. In this paper, we use disentangled representation learning for developing a synthetic speech detector. We propose Disentangled Spectrogram Variational Auto Encoder (DSVAE) which is a two stage trained variational autoencoder that processes spectrograms of speech to generate features that disentangle synthetic and bona fide speech. We evaluated DSVAE using the ASVspoof2019 dataset. Our experimental results show high accuracy (> 98%) on detecting synthetic speech from 6 known and 10 unknown speech synthesizers. Further, the visualization of disentangled features obtained from DSVAE provides rea-soning behind the working principle of DSVAE, improving its explainability. DSVAE performs well compared to several existing methods. Additionally, DSVAE works in practical scenarios such as detecting synthetic speech uploaded on social platforms and against simple attacks such as removing silence regions.

Translated text

Key words

disentangled representation learning,synthetic speech detection,explainable AI,autoencoder

AI Read Science

Must-Reading Tree

Example

Generate MRT to find the research sequence of this paper

Chat Paper

Summary is being generated by the instructions you defined