Alleviate Cross-chunk Permutation through Chunk-level Speaker Embedding for Blind Speech Separation
2019 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC) (2019)
Abstract
Speaker-independent speech separation (SI-SS) refers to recovering the speech of unknown speakers from multispeaker mixtures. The well-known deep clustering (DC) based SI-SS methods cast the speech separation problem as a clustering problem in an embedding space, where time-frequency (T-F) features are encoded as high-dimensional vectors (T-F embeddings). In the training stage, the T-F embeddings from the same speaker are trained to be close to each other, and those from different speakers to be far apart. In the prediction stage, the T-F embeddings are partitioned into clusters by K-means, where each cluster corresponds to an unknown speaker in the mixture. To reduce latency, the T-F embeddings are usually extracted on short speech chunks rather than whole utterances, which unfortunately leads to a cross-chunk permutation (CCP) problem. In this study, we focus on solving the CCP problem by using speaker labels as auxiliary supervision to train a deep model that maps the T-F embeddings of one cluster to one chunk-level speaker embedding (CL-SE). In the prediction stage, the generated CL-SEs are then used to calculate the similarity between clusters over consecutive chunks, and the speech chunks with the most similar CL-SEs are concatenated to yield the complete utterances. The evaluation is conducted on the well-known WSJ0-2mix dataset, with the signal-to-distortion ratio (SDR) as the performance metric. We obtain a 41% SDR gain over the DC baseline and up to 32% over other speaker-aware methods in open conditions.
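The cross-chunk matching step described in the abstract can be illustrated with a minimal sketch. K-means labels its clusters independently in every chunk, so cluster 0 in one chunk may be a different speaker than cluster 0 in the next; comparing CL-SEs across consecutive chunks resolves that ambiguity. The abstract does not specify the similarity measure or the matching rule, so the cosine similarity, the exhaustive search over permutations, and the names `cosine_similarity_matrix` and `align_chunks` below are all assumptions made for illustration, not the authors' implementation.

```python
import itertools

import numpy as np


def cosine_similarity_matrix(a, b):
    """Pairwise cosine similarity between the rows of a and the rows of b."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T


def align_chunks(chunk_embeddings):
    """Resolve the cluster order of each chunk against the previous chunk.

    chunk_embeddings: list of arrays of shape (num_speakers, dim); row k is
    the chunk-level speaker embedding of cluster k in that chunk.
    Returns one tuple per chunk, where perm[s] is the cluster index in that
    chunk assigned to output stream s.
    """
    num_speakers = chunk_embeddings[0].shape[0]
    # The first chunk defines the stream order.
    perms = [tuple(range(num_speakers))]
    reference = chunk_embeddings[0]
    for emb in chunk_embeddings[1:]:
        sim = cosine_similarity_matrix(reference, emb)
        # Assumption: exhaustive search, feasible for two or three speakers.
        best = max(
            itertools.permutations(range(num_speakers)),
            key=lambda p: sum(sim[s, p[s]] for s in range(num_speakers)),
        )
        perms.append(best)
        # Reorder this chunk's embeddings so they act as the next reference.
        reference = emb[list(best)]
    return perms


# Toy example: the two clusters are swapped in the middle chunk.
toy_chunks = [
    np.array([[1.0, 0.0], [0.0, 1.0]]),
    np.array([[0.1, 0.9], [0.9, 0.1]]),
    np.array([[0.8, 0.2], [0.2, 0.8]]),
]
print(align_chunks(toy_chunks))
```

In this toy case the second chunk is assigned the permutation `(1, 0)`, meaning its cluster 1 continues output stream 0. The separated chunk signals would then be concatenated per stream in that order.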
Keywords
deep clustering based SI-SS methods, T-F embeddings, training stage, prediction stage, cross-chunk permutation problem, CCP problem, speaker labels, consecutive chunks, speaker-aware methods, blind speech separation, speaker-independent speech separation, chunk-level speaker embedding space, high-dimensional vectors, K-means clustering, signal-to-distortion ratio, performance evaluation, multispeaker mixtures, auxiliary supervision information