Unsupervised Sound Separation Using Mixture Invariant Training

NeurIPS 2020.

Abstract:

In recent years, rapid progress has been made on the problem of single-channel sound separation using supervised training of deep neural networks. In such supervised approaches, the model is trained to predict the component sources from synthetic mixtures created by adding up isolated ground-truth sources. The reliance on this synthetic...

Introduction
  • Audio perception is fraught with a fundamental problem: individual sounds are convolved with unknown acoustic reverberation functions and mixed together at the acoustic sensor in a way that is impossible to disentangle without prior knowledge of the source characteristics
  • It is a hallmark of human hearing that we are able to hear the nuances of different sources, even when presented with a monaural mixture of sounds.
  • The more general “universal sound separation” problem of separating arbitrary classes of sound from each other has recently been addressed [19, 39]
Highlights
  • Audio perception is fraught with a fundamental problem: individual sounds are convolved with unknown acoustic reverberation functions and mixed together at the acoustic sensor in a way that is impossible to disentangle without prior knowledge of the source characteristics
  • Contributions: (1) we propose the first purely unsupervised learning method that is effective for audio-only single-channel separation tasks such as speech separation, and find that it can achieve competitive performance compared to supervised methods; (2) we provide extensive experiments on cross-domain adaptation, showing the effectiveness of mixture invariant training (MixIT) for adapting to different reverberation characteristics in semi-supervised settings; (3) the proposed method opens up the use of a wider variety of data, such as training speech enhancement models from noisy mixtures using only speech activity labels, or improving the performance of universal sound separation models by training on large amounts of unlabeled, in-the-wild data
  • The experiments show that MixIT works well for speech separation, speech enhancement, and universal sound separation
  • For universal sound separation and speech enhancement, the unsupervised training does not help as much, presumably because the synthetic test sets are well-matched to the supervised training domain
  • We have presented MixIT, a new paradigm for training sound separation models in a completely unsupervised manner where ground-truth source references are not required
  • We significantly improve reverberant speech separation performance by incorporating reverberant mixtures, train a speech enhancement system from noisy mixtures, and improve universal sound separation by incorporating a large amount of in-the-wild data
  • Across several tasks including speech separation, speech enhancement, and universal sound separation, we demonstrated that MixIT can approach the performance of supervised permutation-invariant training (PIT), and is especially helpful in a semi-supervised setup to adapt to mismatched domains
Methods
  • The authors generalize the permutation-invariant training framework to operate directly on unsupervised mixtures, as illustrated in Figure 1.
  • Figure 1: (a) supervised permutation invariant training (PIT), in which the estimated sources are permuted to best match the reference sources under an SNR loss; (b) unsupervised mixture invariant training (MixIT), in which the sources estimated from a mixture of mixtures are remixed with a binary mixing matrix to best approximate the original input mixtures under an SNR loss. (A minimal code sketch of the MixIT loss follows this list.)
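The sketch below makes the MixIT mechanism concrete for the two-mixture case: the model's estimated sources are assigned to the two reference mixtures by a brute-force search over binary mixing matrices, and the best assignment is scored with an SNR-style loss. The function names, the exhaustive search, and the soft-thresholded SNR formulation (with SNRmax around 30 dB, cf. Table 3) are illustrative assumptions, not the paper's exact implementation, which would be written in a differentiable training framework.

```python
import itertools

import numpy as np


def neg_thresholded_snr(ref, est, snr_max_db=30.0):
    """Negative soft-thresholded SNR in dB between a reference and an estimate.

    The tau term caps the achievable SNR at roughly snr_max_db, so that
    already well-fit examples do not dominate the loss (an assumed form,
    modeled on the SNRmax-thresholded loss referenced in Table 3).
    """
    tau = 10.0 ** (-snr_max_db / 10.0)
    ref_power = np.sum(ref ** 2) + 1e-8
    err_power = np.sum((ref - est) ** 2) + tau * ref_power
    return 10.0 * np.log10(err_power) - 10.0 * np.log10(ref_power)


def mixit_loss(mix1, mix2, est_sources, snr_max_db=30.0):
    """MixIT loss for a pair of reference mixtures.

    est_sources has shape [M, T]: the model's separated sources for the
    mixture of mixtures mix1 + mix2. The loss searches over all binary
    mixing matrices A in {0, 1}^{2 x M} whose columns each sum to one,
    i.e. every estimated source is assigned to exactly one reference
    mixture, and keeps the best (lowest) total SNR loss.
    """
    num_sources = est_sources.shape[0]
    best = np.inf
    for assignment in itertools.product([0, 1], repeat=num_sources):
        mask = np.asarray(assignment)
        remix1 = est_sources[mask == 0].sum(axis=0)  # sources routed to mix1
        remix2 = est_sources[mask == 1].sum(axis=0)  # sources routed to mix2
        loss = (neg_thresholded_snr(mix1, remix1, snr_max_db)
                + neg_thresholded_snr(mix2, remix2, snr_max_db))
        best = min(best, loss)
    return best
```

In training, the separation model would consume the mixture of mixtures mix1 + mix2 and output est_sources; minimizing this loss over the best remixing assignment is what lets the model learn to separate without any ground-truth source references.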
Results
  • The authors significantly improve reverberant speech separation performance by incorporating reverberant mixtures, train a speech enhancement system from noisy mixtures, and improve universal sound separation by incorporating a large amount of in-the-wild data.
Conclusion
  • The experiments show that MixIT works well for speech separation, speech enhancement, and universal sound separation.
  • For universal sound separation and speech enhancement, the unsupervised training does not help as much, presumably because the synthetic test sets are well-matched to the supervised training domain.
  • Unsupervised performance is at its worst in the single-source mixture case of the FUSS task
  • This may be because MixIT does not discourage further separation of single sources.
  • Across several tasks including speech separation, speech enhancement, and universal sound separation, the authors demonstrated that MixIT can approach the performance of supervised PIT, and is especially helpful in a semi-supervised setup to adapt to mismatched domains.
  • MixIT opens new lines of research where massive amounts of previously untapped in-the-wild data can be leveraged to train sound separation systems
Objectives
  • Though the goal is to explore MixIT for less-supervised learning, the resulting models are competitive on anechoic datasets with state-of-the-art approaches that do not exploit additional information such as speaker identity.
Tables
  • Table 1: Multi-source SI-SNR improvement (MSi) and single-source SI-SNR (SS) in dB on FUSS
  • Table 2: Separation network with TDCN++ architecture configuration. Variables are the number of encoder basis coefficients N = 256, the encoder basis kernel size L (40 for 16 kHz data, 20 for 8 kHz data), the number of waveform samples T, the number of coefficient frames F, and the number of separated sources M
  • Table 3: SI-SNRi in dB as a function of SNRmax for unsupervised MixIT on WSJ0-2mix 2-source mixtures
  • Table 4: Effect of incorporating the zero loss L0 (Eq. 7) on supervised separation performance on the WSJ0-2mix and FUSS validation sets without additional reverb, after 200k training steps
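To make the Table 2 variables concrete, here is a small shape-bookkeeping sketch for a Conv-TasNet-style learnable basis of the kind TDCN++ builds on; the stride of L/2 (50% frame overlap), the no-padding framing, and the example values of T and M are assumptions for illustration rather than values stated in the table.

```python
def num_encoder_frames(num_samples, kernel_size, stride=None):
    """Coefficient frames F produced by a 1-D conv encoder with basis kernel
    size L and hop L // 2 (50% overlap, no padding) -- an assumed framing."""
    stride = stride or kernel_size // 2
    return (num_samples - kernel_size) // stride + 1


N = 256                 # encoder basis coefficients (Table 2)
L = 40                  # basis kernel size for 16 kHz data (20 at 8 kHz)
T = 16000 * 10          # example: a 10-second clip at 16 kHz
M = 8                   # example number of separated output sources
F = num_encoder_frames(T, L)

# Encoder coefficients have shape [N, F]; the TDCN++ masker predicts M masks
# of the same shape, and the decoder returns M waveforms of roughly T samples.
print(f"F = {F}, coefficients [N, F] = [{N}, {F}], masks [M, N, F] = [{M}, {N}, {F}]")
```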
References
  • [1] N. Alamdari, A. Azarang, and N. Kehtarnavaz. Self-supervised deep learning-based speech denoising. arXiv preprint arXiv:1904.12069, 2019.
  • [2] J. B. Allen and D. A. Berkley. Image method for efficiently simulating small-room acoustics. The Journal of the Acoustical Society of America, 65(4):943–950, 1979.
  • [3] D. Berthelot, N. Carlini, E. D. Cubuk, A. Kurakin, K. Sohn, H. Zhang, and C. Raffel. ReMixMatch: Semi-supervised learning with distribution alignment and augmentation anchoring. arXiv preprint arXiv:1911.09785, 2019.
  • [4] D. Berthelot, N. Carlini, I. Goodfellow, N. Papernot, A. Oliver, and C. A. Raffel. MixMatch: A holistic approach to semi-supervised learning. In Advances in Neural Information Processing Systems, pages 5050–5060, 2019.
  • [5] K. Bousmalis, N. Silberman, D. Dohan, D. Erhan, and D. Krishnan. Unsupervised pixel-level domain adaptation with generative adversarial networks. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, pages 3722–3731, 2017.
  • [6] K. Bousmalis, G. Trigeorgis, N. Silberman, D. Krishnan, and D. Erhan. Domain separation networks. In Advances in Neural Information Processing Systems, pages 343–351, 2016.
  • [7] J. Cosentino, M. Pariente, S. Cornell, A. Deleforge, and E. Vincent. LibriMix: An open-source dataset for generalizable speech separation. arXiv preprint arXiv:2005.11262, 2020.
  • [8] L. Drude, D. Hasenklever, and R. Haeb-Umbach. Unsupervised training of a deep clustering model for multichannel blind source separation. In Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages 695–699, 2019.
  • [9] E. Fonseca, X. Favory, J. Pons, F. Font, and X. Serra. FSD50K: An open dataset of human-labeled sound events. arXiv preprint, 2020.
  • [10] E. Fonseca, J. Pons Puig, X. Favory, F. Font Corbera, D. Bogdanov, A. Ferraro, S. Oramas, A. Porter, and X. Serra. Freesound Datasets: A platform for the creation of open audio datasets. In Proc. International Society for Music Information Retrieval Conference (ISMIR), pages 486–493, 2017.
  • [11] Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. Lempitsky. Domain-adversarial training of neural networks. Journal of Machine Learning Research, 17(1):2096–2030, 2016.
  • [12] R. Gao and K. Grauman. Co-separating sounds of visual objects. In Proc. IEEE International Conference on Computer Vision, pages 3879–3888, 2019.
  • [13] J. F. Gemmeke, D. P. W. Ellis, D. Freedman, A. Jansen, W. Lawrence, R. C. Moore, M. Plakal, and M. Ritter. Audio Set: An ontology and human-labeled dataset for audio events. In Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages 776–780, 2017.
  • [14] J. R. Hershey and M. Casey. Audio-visual sound separation via hidden Markov models. In Advances in Neural Information Processing Systems, pages 1173–1180, 2002.
  • [15] J. R. Hershey, Z. Chen, J. Le Roux, and S. Watanabe. Deep clustering: Discriminative embeddings for segmentation and separation. In Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages 31–35, 2016.
  • [16] Y. Hoshen. Towards unsupervised single-channel blind source separation using adversarial pair unmix-and-remix. In Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages 3272–3276, 2019.
  • [17] P.-S. Huang, M. Kim, M. Hasegawa-Johnson, and P. Smaragdis. Deep learning for monaural speech separation. In Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages 1562–1566, 2014.
  • [18] Y. Isik, J. Le Roux, Z. Chen, S. Watanabe, and J. R. Hershey. Single-channel multi-speaker separation using deep clustering. arXiv preprint arXiv:1607.02173, 2016.
  • [19] I. Kavalerov, S. Wisdom, H. Erdogan, B. Patton, K. Wilson, J. Le Roux, and J. R. Hershey. Universal sound separation. In Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2019.
  • [20] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In Proc. International Conference on Learning Representations (ICLR), 2015.
  • [21] T. Kristjansson, J. Hershey, P. Olsen, S. Rennie, and R. Gopinath. Super-human multi-talker speech recognition: The IBM 2006 speech separation challenge system. In Proc. Ninth International Conference on Spoken Language Processing, 2006.
  • [22] M. W. Lam, J. Wang, D. Su, and D. Yu. Mixup-Breakdown: A consistency training method for improving generalization of speech separation models. In Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages 6374–6378, 2020.
  • [23] J. Le Roux, S. Wisdom, H. Erdogan, and J. R. Hershey. SDR – half-baked or well done? In Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages 626–630, 2019.
  • [24] J. Lehtinen, J. Munkberg, J. Hasselgren, S. Laine, T. Karras, M. Aittala, and T. Aila. Noise2Noise: Learning image restoration without clean data. In Proc. International Conference on Machine Learning (ICML), pages 2965–2974, 2018.
  • [25] Y. Luo, Z. Chen, and T. Yoshioka. Dual-path RNN: Efficient long sequence modeling for time-domain single-channel speech separation. In Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages 46–50, 2020.
  • [26] Y. Luo and N. Mesgarani. Conv-TasNet: Surpassing ideal time–frequency magnitude masking for speech separation. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 27(8):1256–1266, 2019.
  • [27] M. Maciejewski, G. Sell, L. P. Garcia-Perera, S. Watanabe, and S. Khudanpur. Building corpora for single-channel speech separation across multiple domains. arXiv preprint arXiv:1811.02641, 2018.
  • [28] E. Manilow, G. Wichern, P. Seetharaman, and J. Le Roux. Cutting music source separation some Slakh: A dataset to study the impact of training data quality and quantity. In Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pages 45–49, 2019.
  • [29] E. Nachmani, Y. Adi, and L. Wolf. Voice separation with an unknown number of multiple speakers. arXiv preprint arXiv:2003.01531, 2020.
  • [30] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur. LibriSpeech: An ASR corpus based on public domain audio books. In Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages 5206–5210, 2015.
  • [31] M. Pariente, S. Cornell, A. Deleforge, and E. Vincent. Filterbank design for end-to-end speech separation. In Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages 6364–6368, 2020.
  • [32] F. Pishdadian, G. Wichern, and J. Le Roux. Finding strength in weakness: Learning to separate sounds with weak supervision. arXiv preprint arXiv:1911.02182, 2019.
  • [33] S. T. Roweis. One microphone source separation. In Advances in Neural Information Processing Systems, pages 793–799, 2001.
  • [34] M. N. Schmidt and R. K. Olsson. Single-channel speech separation using sparse non-negative matrix factorization. In Proc. Ninth International Conference on Spoken Language Processing, 2006.
  • [35] P. Seetharaman, G. Wichern, J. Le Roux, and B. Pardo. Bootstrapping single-channel source separation via unsupervised spatial clustering on stereo mixtures. In Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages 356–360, 2019.
  • [36] P. Smaragdis. Non-negative matrix factor deconvolution; extraction of multiple sound sources from monophonic inputs. In Proc. International Conference on Independent Component Analysis and Signal Separation, pages 494–499, 2004.
  • [37] E. Tzeng, J. Hoffman, K. Saenko, and T. Darrell. Adversarial discriminative domain adaptation. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, pages 7167–7176, 2017.
  • [38] E. Tzinis, S. Venkataramani, and P. Smaragdis. Unsupervised deep clustering for source separation: Direct learning from mixtures using spatial information. In Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages 81–85, 2019.
  • [39] E. Tzinis, S. Wisdom, J. R. Hershey, A. Jansen, and D. P. W. Ellis. Improving universal sound separation using sound classification. In Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages 96–100, 2020.
  • [40] Z.-Q. Wang, J. Le Roux, and J. R. Hershey. Multi-channel deep clustering: Discriminative spectral and spatial embeddings for speaker-independent speech separation. In Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2018.
  • [41] R. J. Weiss and D. P. W. Ellis. Speech separation using speaker-adapted eigenvoice speech models. Computer Speech & Language, 24(1):16–29, 2010.
  • [42] F. Weninger, H. Erdogan, S. Watanabe, E. Vincent, J. Le Roux, J. R. Hershey, and B. Schuller. Speech enhancement with LSTM recurrent neural networks and its application to noise-robust ASR. In Proc. International Conference on Latent Variable Analysis and Signal Separation, pages 91–99, 2015.
  • [43] S. Wisdom, H. Erdogan, D. P. W. Ellis, and J. R. Hershey. Free Universal Sound Separation (FUSS) dataset, 2020. https://github.com/google-research/sound-separation/tree/master/datasets/fuss.
  • [44] S. Wisdom, H. Erdogan, D. P. W. Ellis, R. Serizel, N. Turpault, E. Fonseca, J. Salamon, P. Seetharaman, and J. R. Hershey. What’s all the FUSS about free universal sound separation data? In preparation, 2020.
  • [45] S. Wisdom, J. R. Hershey, K. Wilson, J. Thorpe, M. Chinen, B. Patton, and R. A. Saurous. Differentiable consistency constraints for improved deep speech enhancement. In Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages 900–904, 2019.
  • [46] D. Yu, M. Kolbæk, Z.-H. Tan, and J. Jensen. Permutation invariant training of deep models for speaker-independent multi-talker speech separation. In Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages 241–245, 2017.
  • [47] N. Zeghidour and D. Grangier. Wavesplit: End-to-end speech separation by speaker clustering. arXiv preprint arXiv:2002.08933, 2020.
  • [48] H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz. mixup: Beyond empirical risk minimization. In Proc. International Conference on Learning Representations (ICLR), 2018.