Multitalker Speech Separation With Utterance-Level Permutation Invariant Training of Deep Recurrent Neural Networks.

IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25, no. 10 (2017): 1901–1913


Abstract

In this paper, we propose the utterance-level permutation invariant training (uPIT) technique. uPIT is a practically applicable, end-to-end, deep-learning-based solution for speaker-independent multitalker speech separation. Specifically, uPIT extends the recently proposed permutation invariant training (PIT) technique with an utterance-level cost function.
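The central quantity in uPIT is an utterance-level cost in which the output-to-target assignment is chosen once per utterance rather than per frame. The following is a hedged sketch of that cost using magnitude-spectrum targets (the paper also evaluates phase-sensitive variants); the notation here is introduced for illustration:

```latex
% Sketch of the utterance-level PIT cost for S speakers.
% \hat{M}_s(t): estimated mask for output stream s at frame t
% |Y(t)|: mixture magnitude spectrum; |X_s(t)|: magnitude spectrum of speaker s
% F: number of frequency bins; T: number of frames
% \mathcal{P}: the permutations of {1,...,S}; \phi is fixed over the whole utterance
J_{\mathrm{uPIT}} = \frac{1}{T \cdot S \cdot F}\,
  \min_{\phi \in \mathcal{P}} \sum_{s=1}^{S} \sum_{t=1}^{T}
  \bigl\| \hat{\mathbf{M}}_{\phi(s)}(t) \circ |\mathbf{Y}(t)| - |\mathbf{X}_{s}(t)| \bigr\|_{2}^{2}
```

Because the permutation is fixed for the entire utterance, each output stream stays assigned to the same speaker across frames, which is what removes the separate speaker-tracing step needed with frame-level PIT.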

Introduction
  • Having a conversation in a complex acoustic environment, with multiple noise sources and competing background speakers, is a task humans are remarkably good at [1], [2].
  • For multi-talker speech separation, both CASA and NMF have led to limited success [4], [5]. The most successful techniques before the deep learning era are based on probabilistic models [15]–[17], such as factorial GMM-HMMs [18], which model the temporal dynamics and the complex interactions of the target and competing speech signals.
  • These models, however, only work under closed-set speaker conditions, i.e., the identities of the speakers must be known a priori.
Highlights
  • Having a conversation in a complex acoustic environment, with multiple noise sources and competing background speakers, is a task humans are remarkably good at [1], [2].
  • We evaluated utterance-level Permutation Invariant Training (uPIT) on the WSJ0 and Danish two- and three-talker mixed-speech separation tasks and found that uPIT outperforms techniques based on Non-negative Matrix Factorization (NMF) and Computational Auditory Scene Analysis (CASA), and compares favorably with Deep Clustering (DPCL) and the Deep Attractor Network (DANet).
  • The models were evaluated on their ability to improve the Signal-to-Distortion Ratio (SDR) [44] and the Perceptual Evaluation of Speech Quality (PESQ) [49] score, both of which are metrics widely used to evaluate speech enhancement performance in multi-talker speech separation tasks.
  • We evaluated uPIT on the WSJ0-2mix, WSJ0-3mix and Danish-2mix datasets using 129-dimensional short-time Fourier transform (STFT) magnitude spectra computed with a sampling frequency of 8 kHz, a frame size of 32 ms and a 16 ms frame shift (see the feature-extraction sketch after this list).
  • We have introduced the utterance-level Permutation Invariant Training technique for speaker-independent multi-talker speech separation.
  • We consider utterance-level Permutation Invariant Training an interesting step towards solving the important cocktail party problem in a real-world setup, where the set of speakers is unknown during training.
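As a concrete illustration of the feature configuration in the bullet above, the sketch below computes 129-dimensional STFT magnitude spectra at 8 kHz with 32 ms frames and a 16 ms shift. It is a minimal example under assumptions: the window type and the placeholder signal are not details taken from the paper.

```python
import numpy as np
from scipy.signal import stft

fs = 8000                          # 8 kHz sampling rate
frame_len = int(0.032 * fs)        # 32 ms frame -> 256 samples
frame_shift = int(0.016 * fs)      # 16 ms shift -> 128 samples

x = np.random.randn(3 * fs)        # placeholder 3-second waveform

# A 256-point one-sided STFT yields 256/2 + 1 = 129 frequency bins per frame.
_, _, X = stft(x, fs=fs, window="hann", nperseg=frame_len,
               noverlap=frame_len - frame_shift, nfft=frame_len)
magnitude = np.abs(X)              # shape: (129, num_frames)
print(magnitude.shape)
```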
Methods
  • Table 1 (see Tables below) reports SDR improvements (dB) for different separation methods on the WSJ0-2mix dataset without additional tracing (i.e., default assignment); ‡ indicates curriculum training. It compares Oracle NMF [36], CASA [36], DPCL [36], DPCL+ [37], DANet [37], DANet‡ [37], DPCL++ [39] and DPCL++‡ [39] against PIT-DNN, PIT-CNN, uPIT-BLSTM and uPIT-BLSTM-ST (the uPIT models using PSM-ReLU outputs), with SDR and PESQ improvements listed for both closed-condition (CC) and open-condition (OC) test sets.
Results
  • The authors evaluated uPIT in various setups, and all models were implemented using the Microsoft Cognitive Toolkit (CNTK) [47], [48].
  • The 30h training set and the 10h validation set contain two-speaker mixtures generated by randomly selecting speakers (49 male and 51 female) and utterances from the WSJ0 training set si_tr_s, and mixing them at Signal-to-Noise Ratios (SNRs) chosen uniformly between 0 dB and 5 dB (a mixing sketch follows this list).
  • The 5h test set was generated using utterances from 16 speakers from the WSJ0 validation set si_dt_05 and evaluation set si_et_05.
  • The WSJ0-3mix dataset was generated using a similar approach but contains mixtures of speech from three talkers.
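The mixture-generation recipe summarized above (pairs of utterances mixed at an SNR drawn uniformly between 0 dB and 5 dB) can be sketched as below. The scaling convention, i.e., rescaling the second utterance relative to the first, is an assumption for illustration and not necessarily the exact script used to build WSJ0-2mix.

```python
import numpy as np

def mix_at_snr(s1, s2, snr_db):
    """Mix two equal-length utterances so that s1 is snr_db louder than s2."""
    p1 = np.mean(s1 ** 2)                        # power of utterance 1
    p2 = np.mean(s2 ** 2)                        # power of utterance 2
    gain = np.sqrt(p1 / (p2 * 10 ** (snr_db / 10.0)))
    return s1 + gain * s2

rng = np.random.default_rng(0)
s1 = rng.standard_normal(2 * 8000)               # placeholder 2-second utterances at 8 kHz
s2 = rng.standard_normal(2 * 8000)
snr_db = rng.uniform(0.0, 5.0)                   # SNR drawn uniformly from [0, 5] dB
mixture = mix_at_snr(s1, s2, snr_db)
```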
Conclusion
  • In this paper, the authors have introduced the utterance-level Permutation Invariant Training technique for speaker-independent multi-talker speech separation.
  • The authors' experiments on two- and three-talker mixed-speech separation tasks indicate that uPIT can effectively deal with the label permutation problem (a minimal sketch of the utterance-level permutation search follows this list).
  • These experiments show that bi-directional Long Short-Term Memory (LSTM) networks are well suited to this task.
  • The authors show that a single model can handle both two-speaker and three-speaker mixtures.
  • This indicates that it might be possible to train a universal speech separation model covering a variety of speaker, language and noise conditions.
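To make the label permutation problem concrete, the sketch below re-implements the utterance-level assignment search in plain NumPy: the per-pair error between every output stream and every reference is accumulated over all frames of an utterance, and the permutation with the smallest total error defines the training loss. This is an illustrative re-implementation under assumed array shapes, not the authors' CNTK code.

```python
import itertools
import numpy as np

def upit_mse(estimates, targets):
    """Utterance-level PIT loss for arrays of shape (S, T, F):
    S output/reference streams, T frames, F frequency bins.
    The output-to-reference assignment is chosen once per utterance."""
    S = estimates.shape[0]
    # Pairwise utterance-level MSE between output i and reference j.
    pair_mse = np.array([[np.mean((estimates[i] - targets[j]) ** 2)
                          for j in range(S)] for i in range(S)])
    # Evaluate every assignment and keep the cheapest one.
    best = min(sum(pair_mse[i, perm[i]] for i in range(S))
               for perm in itertools.permutations(range(S)))
    return best / S

est = np.random.randn(2, 100, 129)   # two estimated magnitude streams
ref = np.random.randn(2, 100, 129)   # two reference magnitude streams
print(upit_mse(est, ref))
```

Because the assignment is fixed per utterance during training, no permutation search or extra tracing step is needed at inference time: each output stream is simply taken as one separated speaker.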
Tables
  • Table 1: SDR improvements (dB) for different separation methods on the WSJ0-2mix dataset without additional tracing (i.e., default assignment); ‡ indicates curriculum training.
  • Table 2: SDR (dB) and PESQ improvements on WSJ0-2mix and Danish-2mix with uPIT-BLSTM-PSM-ReLU trained on …
  • Table 3: SDR (dB) improvements on test sets of WSJ0-2mix divided into …
  • Table 4: SDR (dB) and PESQ improvements for different separation …
  • Table 5: Further improvement on the WSJ0-2mix dataset with additional training epochs with reduced dropout (-RD) or …
  • Table 6: SDR improvements (dB) for different separation methods on the WSJ0-3mix dataset; ‡ indicates curriculum training.
Reference
  • S. Haykin and Z. Chen, “The Cocktail Party Problem,” Neural Comput., vol. 17, no. 9, pp. 1875–1902, 2005.
  • A. W. Bronkhorst, “The Cocktail Party Phenomenon: A Review of Research on Speech Intelligibility in Multiple-Talker Conditions,” Acta Acust united Ac, vol. 86, no. 1, pp. 117–128, 2000.
  • E. C. Cherry, “Some Experiments on the Recognition of Speech, with One and with Two Ears,” J. Acoust. Soc. Am., vol. 25, no. 5, pp. 975–979, Sep. 1953.
  • M. Cooke, J. R. Hershey, and S. J. Rennie, “Monaural Speech Separation and Recognition Challenge,” Comput. Speech Lang., vol. 24, no. 1, pp. 1–15, Jan. 2010.
  • P. Divenyi, Speech Separation by Humans and Machines. Springer, 2005.
  • D. P. W. Ellis, “Prediction-driven computational auditory scene analysis,” Ph.D. dissertation, Massachusetts Institute of Technology, 1996.
  • M. Cooke, Modelling Auditory Processing and Organisation. Cambridge University Press, 2005.
  • D. Wang and G. J. Brown, Computational Auditory Scene Analysis: Principles, Algorithms, and Applications. Wiley-IEEE Press, 2006.
  • Y. Shao and D. Wang, “Model-based sequential organization in cochannel speech,” IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 14, no. 1, pp. 289–298, Jan. 2006.
  • K. Hu and D. Wang, “An Unsupervised Approach to Cochannel Speech Separation,” IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 21, no. 1, pp. 122–131, Jan. 2013.
  • M. N. Schmidt and R. K. Olsson, “Single-Channel Speech Separation using Sparse Non-Negative Matrix Factorization,” in Proc. INTERSPEECH, 2006, pp. 2614–2617.
  • P. Smaragdis, “Convolutive Speech Bases and Their Application to Supervised Speech Separation,” IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 15, no. 1, pp. 1–12, Jan. 2007.
  • J. L. Roux, F. Weninger, and J. R. Hershey, “Sparse NMF – half-baked or well done?” Mitsubishi Electric Research Labs (MERL), Tech. Rep. TR2015-023, 2015.
  • D. D. Lee and H. S. Seung, “Algorithms for Non-negative Matrix Factorization,” in NIPS, 2000, pp. 556–562.
  • T. T. Kristjansson et al., “Super-human multi-talker speech recognition: the IBM 2006 speech separation challenge system,” in Proc. INTERSPEECH, 2006, pp. 97–100.
  • T. Virtanen, “Speech Recognition Using Factorial Hidden Markov Models for Separation in the Feature Space,” in Proc. INTERSPEECH, 2006.
  • M. Stark, M. Wohlmayr, and F. Pernkopf, “Source-Filter-Based Single-Channel Speech Separation Using Pitch Information,” IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 19, no. 2, pp. 242–255, Feb. 2011.
  • Z. Ghahramani and M. I. Jordan, “Factorial Hidden Markov Models,” Machine Learning, vol. 29, no. 2-3, pp. 245–273, 1997.
  • I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016. [Online]. Available: http://www.deeplearningbook.org
  • D. Yu, L. Deng, and G. E. Dahl, “Roles of Pre-Training and Fine-Tuning in Context-Dependent DBN-HMMs for Real-World Speech Recognition,” in NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2010.
  • G. E. Dahl, D. Yu, L. Deng, and A. Acero, “Context-Dependent Pre-Trained Deep Neural Networks for Large-Vocabulary Speech Recognition,” IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 20, no. 1, pp. 30–42, Jan. 2012.
  • F. Seide, G. Li, and D. Yu, “Conversational speech transcription using context-dependent deep neural networks,” in Proc. INTERSPEECH, 2011, pp. 437–440.
  • G. Hinton et al., “Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups,” IEEE Sig. Process. Mag., vol. 29, no. 6, pp. 82–97, Nov. 2012.
  • W. Xiong et al., “Achieving Human Parity in Conversational Speech Recognition,” arXiv:1610.05256 [cs], 2016.
  • G. Saon et al., “English Conversational Telephone Speech Recognition by Humans and Machines,” arXiv:1703.02136 [cs], 2017.
  • Y. Wang and D. Wang, “Towards Scaling Up Classification-Based Speech Separation,” IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 21, no. 7, pp. 1381–1390, Jul. 2013.
  • Y. Wang, A. Narayanan, and D. Wang, “On Training Targets for Supervised Speech Separation,” IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 22, no. 12, pp. 1849–1858, 2014.
  • Y. Xu, J. Du, L.-R. Dai, and C.-H. Lee, “An Experimental Study on Speech Enhancement Based on Deep Neural Networks,” IEEE Sig. Process. Let., vol. 21, no. 1, pp. 65–68, Jan. 2014.
  • F. Weninger et al., “Speech Enhancement with LSTM Recurrent Neural Networks and Its Application to Noise-Robust ASR,” in LVA/ICA. Springer, 2015, pp. 91–99.
  • P.-S. Huang, M. Kim, M. Hasegawa-Johnson, and P. Smaragdis, “Joint Optimization of Masks and Deep Recurrent Neural Networks for Monaural Source Separation,” IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 23, no. 12, pp. 2136–2147, 2015.
  • J. Chen et al., “Large-scale training to increase speech intelligibility for hearing-impaired listeners in novel noises,” J. Acoust. Soc. Am., vol. 139, no. 5, pp. 2604–2612, 2016.
  • M. Kolbæk, Z. H. Tan, and J. Jensen, “Speech Intelligibility Potential of General and Specialized Deep Neural Network Based Speech Enhancement Systems,” IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 25, no. 1, pp. 153–167, 2017.
  • J. Du et al., “Speech separation of a target speaker based on deep neural networks,” in ICSP, 2014, pp. 473–477.
  • T. Goehring et al., “Speech enhancement based on neural networks improves speech intelligibility in noise for cochlear implant users,” Hearing Research, vol. 344, pp. 183–194, 2017.
  • C. Weng, D. Yu, M. L. Seltzer, and J. Droppo, “Deep Neural Networks for Single-Channel Multi-Talker Speech Recognition,” IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 23, no. 10, pp. 1670–1679, 2015.
  • J. R. Hershey, Z. Chen, J. L. Roux, and S. Watanabe, “Deep clustering: Discriminative embeddings for segmentation and separation,” in Proc. ICASSP, 2016, pp. 31–35.
  • Z. Chen, Y. Luo, and N. Mesgarani, “Deep attractor network for singlemicrophone speaker separation,” in Proc. ICASSP, 2017, pp. 246–250.
  • D. Yu, M. Kolbæk, Z.-H. Tan, and J. Jensen, “Permutation Invariant Training of Deep Models for Speaker-Independent Multi-talker Speech Separation,” in Proc. ICASSP, 2017, pp. 241–245.
  • Y. Isik et al., “Single-Channel Multi-Speaker Separation Using Deep Clustering,” in Proc. INTERSPEECH, 2016, pp. 545–549.
  • Z. Chen, “Single Channel Auditory Source Separation with Neural Network,” Ph.D., Columbia University, United States – New York, 2017.
  • S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Comput, vol. 9, no. 8, pp. 1735–1780, 1997.
  • D. S. Williamson, Y. Wang, and D. Wang, “Complex Ratio Masking for Monaural Speech Separation,” IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 24, no. 3, pp. 483–492, Mar. 2016.
  • H. Erdogan, J. R. Hershey, S. Watanabe, and J. L. Roux, “Deep Recurrent Networks for Separation and Recognition of Single Channel Speech in Non-stationary Background Audio,” in New Era for Robust Speech Recognition: Exploiting Deep Learning. Springer, 2017.
  • E. Vincent, R. Gribonval, and C. Fevotte, “Performance measurement in blind audio source separation,” IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 14, no. 4, pp. 1462–1469, 2006.
  • H. Erdogan, J. R. Hershey, S. Watanabe, and J. L. Roux, “Phase-sensitive and recognition-boosted speech separation using deep recurrent neural networks,” in Proc. ICASSP, 2015, pp. 708–712.
  • Y. Tu et al., “Deep neural network based speech separation for robust speech recognition,” in ICSP, 2014, pp. 532–536.
  • D. Yu, K. Yao, and Y. Zhang, “The Computational Network Toolkit,” IEEE Sig. Process. Mag., vol. 32, no. 6, pp. 123–126, Nov. 2015.
  • A. Agarwal et al., “An introduction to computational networks and the computational network toolkit,” Microsoft Research, Tech. Rep. MSR-TR-2014-112, 2014.
  • A. Rix, J. Beerends, M. Hollier, and A. Hekstra, “Perceptual evaluation of speech quality (PESQ)-a new method for speech quality assessment of telephone networks and codecs,” in Proc. ICASSP, vol. 2, 2001, pp. 749–752.
  • J. Garofolo, D. Graff, D. Paul, and D. Pallett, “CSR-I (WSJ0) Complete LDC93S6A,” Linguistic Data Consortium, Philadelphia, 1993.
  • Y. Gal and Z. Ghahramani, “A Theoretically Grounded Application of Dropout in Recurrent Neural Networks,” arXiv:1512.05287, Dec. 2015.
  • X. L. Zhang and D. Wang, “A Deep Ensemble Learning Method for Monaural Speech Separation,” IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 24, no. 5, pp. 967–977, May 2016.
  • S. Nie, H. Zhang, X. Zhang, and W. Liu, “Deep stacking networks with time series for speech separation,” in Proc. ICASSP, 2014, pp. 6667–6671.
  • Z.-Q. Wang and D. Wang, “Recurrent Deep Stacking Networks for Supervised Speech Separation,” in Proc. ICASSP, 2017, pp. 71–75.
  • Y. Bengio, J. Louradour, R. Collobert, and J. Weston, “Curriculum Learning,” in ICML, 2009, pp. 41–48.
  • Morten Kolbæk received the B.Eng. degree in electronic design at Aarhus University, Business and Social Sciences, AU Herning, Denmark, in 2013 and the M.Sc. in signal processing and computing from Aalborg University, Denmark, in 2015. He is currently pursuing his Ph.D. degree at the section for Signal and Information Processing at the Department of Electronic Systems, Aalborg University, Denmark. His research interests include speech enhancement, deep learning, and intelligibility improvement of noisy speech.
  • Dong Yu (M’97-SM’06) is a distinguished scientist and vice general manager at Tencent AI Lab. Before joining Tencent, he was a principal researcher at Microsoft Research where he joined in 1998. His pioneer works on deep learning based speech recognition have been recognized by the prestigious IEEE Signal Processing Society 2013 and 2016 best paper award. He has served in various technical committees, editorial boards, and conference organization committees.
  • Jesper Jensen is a Senior Researcher with Oticon A/S, Denmark, where he is responsible for scouting and development of signal processing concepts for hearing instruments. He is also a Professor in Dept. Electronic Systems, Aalborg University. He is also a co-head of the Centre for Acoustic Signal Processing Research (CASPR) at Aalborg University. His work on speech intelligibility prediction received the 2017 IEEE Signal Processing Society’s best paper award. His main interests are in the area of acoustic signal processing, including signal retrieval from noisy observations, intelligibility enhancement of speech signals, signal processing for hearing aid applications, and perceptual aspects of signal processing.