# The Cone of Silence: Speech Separation by Localization

NIPS 2020, (2020)


Abstract

Given a multi-microphone recording of an unknown number of speakers talking concurrently, we simultaneously localize the sources and separate the individual speakers. At the core of our method is a deep network, in the waveform domain, which isolates sources within an angular region $\theta \pm w/2$, given an angle of interest $\theta$ ...
Introduction
• The ability of humans to separate and localize sounds in noisy environments is a remarkable phenomenon known as the “cocktail party effect.” However, our natural ability only goes so far – we may still have trouble hearing a conversation partner in a noisy restaurant or during a call with other speakers in the background.
• As a step towards overcoming this limitation, the authors introduce a deep neural network technique that can be steered to any direction at run time, cancelling all audio sources outside a specified angular window, aka cone of silence (CoS) [1].
• The authors further show that this directionally sensitive CoS network can be used as a building block to yield simple yet powerful solutions to 1) sound localization, and 2) audio source separation.
• Audio demos can be found at the project website.
Highlights
• The ability of humans to separate and localize sounds in noisy environments is a remarkable phenomenon known as the “cocktail party effect.” However, our natural ability only goes so far – we may still have trouble hearing a conversation partner in a noisy restaurant or during a call with other speakers in the background.
• As a step towards overcoming this limitation, we introduce a deep neural network technique that can be steered to any direction at run time, cancelling all audio sources outside a specified angular window, aka cone of silence (CoS) [1].
• How do you know what direction to listen to? We further show that this directionally sensitive CoS network can be used as a building block to yield simple yet powerful solutions to 1) sound localization, and 2) audio source separation.
• We report the SI-SDR improvement (SI-SDRi) and median angular error, together with the precision and recall of localizing the voices within 15° of the ground truth when the algorithm has no information about the number of speakers.
• We stop the algorithm at a coarser window size (23°) and use inputs corresponding to 1.5 seconds of audio.
• We concatenate sources that are in adjacent regions from one time step to the next.
Methods
• The authors describe the Cone of Silence network for angle based separation.
• The center of the coordinate system is always the center of the microphone array, and the angular position of each source, θi, is defined based on this coordinate system.
• The waveform-based methods compared are Conv-TasNet [18], TAC [40], the authors' method with binary search, and the authors' method with oracle location.
• The authors' network can accept explicitly known source locations, allowing the separation performance to improve further when the source positions are given.
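The geometry described above (angles measured from the center of the microphone array, and a window $\theta \pm w/2$) can be illustrated with a small sketch. The function names are hypothetical and this is not the authors' code; it only shows one way to test whether a source angle falls inside the angular window, including wraparound at 360°.

```python
import math


def angular_difference(a: float, b: float) -> float:
    """Smallest absolute difference between two angles (radians), in [0, pi]."""
    d = (a - b) % (2 * math.pi)
    return min(d, 2 * math.pi - d)


def source_angle(x: float, y: float) -> float:
    """Angle of a source at (x, y), with the array center taken as the origin."""
    return math.atan2(y, x)


def in_cone(angle: float, theta: float, w: float) -> bool:
    """True if `angle` lies inside the window theta +/- w/2 (all in radians)."""
    return angular_difference(angle, theta) <= w / 2
```

For example, with a 45° window centered at 10°, a source at 350° is inside the window (20° away across the wraparound), while a source at 40° is not.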
Results
• The authors stop the algorithm at a coarser window size (23°) and use inputs corresponding to 1.5 seconds of audio.
• With these parameters, the authors find that it is possible to handle substantial movement because the angular window size captures each source for the duration of the input.
• Because the real captured data does not have precise ground truth positions or perfectly clean source signals, numerical results are not as reliable as the synthetic experiment.
Conclusion
• The authors introduced a novel method for joint localization and separation of audio sources in the waveform domain.
• The authors described how to create a network that separates sources within a specific angular region, and how to use that network for a binary search approach to separation and localization.
• Examples on real world data show that the proposed method is applicable to real-life scenarios.
• The authors' work has the potential to be extended beyond speech to perform separation and localization of arbitrary sound types
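The binary search approach to localization mentioned above can be pictured roughly as follows. This is a toy sketch under stated assumptions, not the authors' implementation: `has_source` is a hypothetical stand-in for running the network at angle θ with window w and checking whether the output contains a source, and the 90° starting window and halving schedule are assumptions chosen for illustration. Only the idea of stopping at a window of about 23° comes from the summary.

```python
def angular_difference(a, b):
    """Smallest absolute difference between two angles, in degrees."""
    d = (a - b) % 360.0
    return min(d, 360.0 - d)


def make_toy_detector(true_angles):
    """Hypothetical stand-in for the network: is any source in theta +/- w/2?"""
    def has_source(theta, w):
        return any(angular_difference(a, theta) <= w / 2 for a in true_angles)
    return has_source


def binary_search_localize(has_source, start_window=90.0, stop_window=23.0):
    """Keep only windows that contain a source, then halve and repeat."""
    n = int(round(360.0 / start_window))
    # Initial window centers tile the full circle.
    candidates = [(i * start_window + start_window / 2) % 360.0 for i in range(n)]
    w = start_window
    while True:
        active = [t for t in candidates if has_source(t, w)]
        if w <= stop_window:
            return active, w
        w /= 2
        # Each surviving window is split into two half-width children.
        candidates = [(t + s * w / 2) % 360.0 for t in active for s in (-1, 1)]


detector = make_toy_detector([30.0, 200.0])
centers, final_window = binary_search_localize(detector)
```

Because empty regions are discarded at every level, the number of network evaluations grows with the number of sources and the logarithm of the angular resolution rather than with the number of fine-grained angular bins.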
Tables
• Table 1: Separation performance. Larger SI-SDRi is better. The SI-SDRi is computed by finding the median of SI-SDR increases from Figure 3.
• Table 2: Localization performance.
• Table 3: Generalization to arbitrarily many speakers. We report the separation and localization performance as the number of speakers varies.
• Table 4: Separation performance on the real dataset.
• Table 5: Separation and localization performance on datasets with different sampling rates.
Related work
• Source separation has seen tremendous progress in recent years, particularly with the increasing popularity of learning methods, which improve over traditional methods such as [2, 3]. In particular, unsupervised source modeling methods train a model for each source type and apply the model to the mixture for separation by using methods like NMF [4, 5], clustering [6, 7, 8], or Bayesian methods [9, 10, 11]. Supervised source modeling methods train a model for each source from annotated isolated signals of each source type, e.g., pitch information for music [12]. Separation-based training methods like [13, 14, 15] employ deep neural networks to learn source separation from mixtures given the ground truth signals as training data, also known as the mix-and-separate framework.

Recent trends include the move to operating directly on waveforms [16, 17, 18] yielding performance improvements over frequency-domain spectrogram techniques such as [19, 20, 21, 6, 22, 23, 24]. A second trend is increasing the numbers of microphones, as methods based on multi-channel microphone arrays [22, 23, 25] and binaural recordings [24, 26] perform better than single-channel source separation techniques. Combined audio-visual techniques [27, 28] have also shown promise.
Funding
• This work was supported by the UW Reality Lab, Facebook, Google, Futurewei, and Amazon.
• Broader Impact Statement: We believe that our method has the potential to help people hear better in a variety of everyday scenarios.
Study subjects and analysis
4.2 Source Separation. To evaluate the source separation performance of our method, we create mixtures consisting of 2 voices (N = 2) and 1 background, allowing comparisons with deep learning methods that require a fixed number of foreground sources. We use the popular metric scale-invariant signal-to-distortion ratio (SI-SDR) [63].
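For readers unfamiliar with the metric, the following is a minimal sketch of SI-SDR and its improvement (SI-SDRi) using the standard definition: the estimate is projected onto the reference, and the energy of that projection is compared with the energy of the residual. It uses plain Python lists for self-containment and assumes equal-length, non-silent signals with a non-zero residual; mean removal, which some implementations apply first, is omitted here.

```python
import math


def si_sdr(estimate, reference):
    """Scale-invariant signal-to-distortion ratio in dB."""
    dot = sum(e * r for e, r in zip(estimate, reference))
    ref_energy = sum(r * r for r in reference)
    alpha = dot / ref_energy  # optimal scaling of the reference
    target = [alpha * r for r in reference]
    noise = [e - t for e, t in zip(estimate, target)]
    target_energy = sum(t * t for t in target)
    noise_energy = sum(n * n for n in noise)
    return 10 * math.log10(target_energy / noise_energy)


def si_sdr_improvement(estimate, mixture, reference):
    """SI-SDRi: gain of the separated estimate over the unprocessed mixture."""
    return si_sdr(estimate, reference) - si_sdr(mixture, reference)
```

Rescaling the estimate by any non-zero constant leaves the value unchanged, which is what makes the metric scale-invariant.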

4.4 Varying Number of Speakers. To show that our method generalizes to an arbitrary number of speakers, we evaluate separation and localization on mixtures containing up to 8 speakers with no background. We train the network with mixtures of 1 background and up to 4 voices and evaluate the separation results with median SI-SDRi and the localization performance with median angular error.
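The localization metrics can be sketched as below. The median angular error assumes predictions have already been paired with their ground-truth sources. The precision and recall computation is an assumption about how "within 15° of the ground truth" might be scored (a simple any-match rule), not the authors' exact evaluation protocol.

```python
import statistics


def angular_error_deg(predicted, truth):
    """Absolute angular error in degrees, accounting for wraparound."""
    d = (predicted - truth) % 360.0
    return min(d, 360.0 - d)


def median_angular_error(predictions, truths):
    """Median error over already-paired predictions and ground-truth angles."""
    return statistics.median(
        angular_error_deg(p, t) for p, t in zip(predictions, truths)
    )


def precision_recall(predictions, truths, tolerance_deg=15.0):
    """Assumed scoring rule: a hit is any match within the tolerance."""
    hit_pred = sum(
        any(angular_error_deg(p, t) <= tolerance_deg for t in truths)
        for p in predictions
    )
    hit_truth = sum(
        any(angular_error_deg(p, t) <= tolerance_deg for p in predictions)
        for t in truths
    )
    precision = hit_pred / len(predictions) if predictions else 0.0
    recall = hit_truth / len(truths) if truths else 0.0
    return precision, recall
```

Precision and recall are useful when the number of speakers is unknown, since the algorithm can then report too many or too few sources, which a paired angular error alone would not capture.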

Reference
• https://en.wikipedia.org/wiki/Cone_of_Silence_(Get_Smart).
• J-F Cardoso. Blind signal separation: statistical principles. Proceedings of the IEEE, 86(10):2009–2025, 1998.
• Francesco Nesta, Piergiorgio Svaizer, and Maurizio Omologo. Convolutive BSS of short mixtures by ICA recursively regularized across frequencies. IEEE Transactions on Audio, Speech, and Language Processing, 19(3):624–639, 2010.
• Bhiksha Raj, Tuomas Virtanen, Sourish Chaudhuri, and Rita Singh. Non-negative matrix factorization based compensation of music for automatic speech recognition. In Eleventh Annual Conference of the International Speech Communication Association, 2010.
• Nasser Mohammadiha, Paris Smaragdis, and Arne Leijon. Supervised and unsupervised speech enhancement using nonnegative matrix factorization. IEEE Transactions on Audio, Speech, and Language Processing, 21(10):2140–2151, 2013.
• Efthymios Tzinis, Shrikant Venkataramani, and Paris Smaragdis. Unsupervised deep clustering for source separation: Direct learning from mixtures using spatial information. In 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 81–85. IEEE, 2019.
• Hiroshi Sawada, Shoko Araki, and Shoji Makino. Underdetermined convolutive blind source separation via frequency bin-wise clustering and permutation alignment. IEEE Transactions on Audio, Speech, and Language Processing, 19(3):516–527, 2010.
• Y. Luo, Z. Chen, J. R. Hershey, J. Le Roux, and N. Mesgarani. Deep clustering and conventional networks for music separation: Stronger together. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 61–65, 2017.
• Kousuke Itakura, Yoshiaki Bando, Eita Nakamura, Katsutoshi Itoyama, Kazuyoshi Yoshii, and Tatsuya Kawahara. Bayesian multichannel audio source separation based on integrated source and spatial models. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 26(4):831–846, 2018.
• Vivek Jayaram and John Thickstun. Source separation with deep generative priors. arXiv preprint arXiv:2002.07942, 2020.
• Laurent Benaroya, Frédéric Bimbot, and Rémi Gribonval. Audio source separation with a single sensor. IEEE Transactions on Audio, Speech, and Language Processing, 14(1):191–199, 2005.
• Nancy Bertin, Roland Badeau, and Emmanuel Vincent. Enforcing harmonicity and smoothness in Bayesian non-negative matrix factorization applied to polyphonic music transcription. IEEE Transactions on Audio, Speech, and Language Processing, 18(3):538–549, 2010.
• Tavi Halperin, Ariel Ephrat, and Yedid Hoshen. Neural separation of observed and unobserved distributions. arXiv preprint arXiv:1811.12739, 2018.
• Andreas Jansson, Eric Humphrey, Nicola Montecchio, Rachel Bittner, Aparna Kumar, and Tillman Weyde. Singing voice separation with deep U-Net convolutional networks. 2017.
• Aditya Arie Nugraha, Antoine Liutkus, and Emmanuel Vincent. Multichannel audio source separation with deep neural networks. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 24(9):1652–1664, 2016.
• Daniel Stoller, Sebastian Ewert, and Simon Dixon. Wave-U-Net: A multi-scale neural network for end-to-end audio source separation. arXiv preprint arXiv:1806.03185, 2018.
• Yi Luo and Nima Mesgarani. TasNet: time-domain audio separation network for real-time, single-channel speech separation. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 696–700. IEEE, 2018.
• Yi Luo and Nima Mesgarani. Conv-TasNet: Surpassing ideal time–frequency magnitude masking for speech separation. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 27(8):1256–1266, 2019.
• John R Hershey, Zhuo Chen, Jonathan Le Roux, and Shinji Watanabe. Deep clustering: Discriminative embeddings for segmentation and separation. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 31–35. IEEE, 2016.
• Chenglin Xu, Wei Rao, Xiong Xiao, Eng Siong Chng, and Haizhou Li. Single channel speech separation with constrained utterance level permutation invariant training using grid LSTM. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6–10. IEEE, 2018.
• Chao Weng, Dong Yu, Michael L Seltzer, and Jasha Droppo. Deep neural networks for single-channel multi-talker speech recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 23(10):1670–1679, 2015.
• Takuya Yoshioka, Hakan Erdogan, Zhuo Chen, and Fil Alleva. Multi-microphone neural speech separation for far-field multi-talker speech recognition. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5739–5743. IEEE, 2018.
• Zhuo Chen, Xiong Xiao, Takuya Yoshioka, Hakan Erdogan, Jinyu Li, and Yifan Gong. Multichannel overlapped speech recognition with location guided speech extraction network. In 2018 IEEE Spoken Language Technology Workshop (SLT), pages 558–565. IEEE, 2018.
• Xueliang Zhang and DeLiang Wang. Deep learning based binaural speech separation in reverberant environments. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 25(5):1075–1084, 2017.
• Rongzhi Gu, Shi-Xiong Zhang, Lianwu Chen, Yong Xu, Meng Yu, Dan Su, Yuexian Zou, and Dong Yu. Enhancing end-to-end multi-channel speech separation via spatial feature learning. arXiv preprint arXiv:2003.03927, 2020.
• Cong Han, Yi Luo, and Nima Mesgarani. Real-time binaural speech separation with preserved spatial cues. arXiv preprint arXiv:2002.06637, 2020.
• Hang Zhao, Chuang Gan, Andrew Rouditchenko, Carl Vondrick, Josh McDermott, and Antonio Torralba. The sound of pixels. In Proceedings of the European Conference on Computer Vision (ECCV), pages 570–586, 2018.
• Andrew Rouditchenko, Hang Zhao, Chuang Gan, Josh McDermott, and Antonio Torralba. Self-supervised audio-visual co-segmentation. In 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 2357–2361. IEEE, 2019.
• François Grondin and James Glass. Multiple sound source localization with SVD-PHAT. arXiv preprint arXiv:1906.11913, 2019.
• Or Nadiri and Boaz Rafaely. Localization of multiple speakers under high reverberation using a spherical microphone array and the direct-path dominance test. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 22(10):1494–1505, 2014.
• Despoina Pavlidi, Anthony Griffin, Matthieu Puigt, and Athanasios Mouchtaris. Real-time multiple sound source localization and counting using a circular microphone array. IEEE Transactions on Audio, Speech, and Language Processing, 21(10):2193–2206, 2013.
• Joseph Hector DiBiase. A high-accuracy, low-latency technique for talker localization in reverberant environments using microphone arrays. Brown University Providence, RI, 2000.
• Ralph Schmidt. Multiple emitter location and signal parameter estimation. IEEE Transactions on Antennas and Propagation, 34(3):276–280, 1986.
• Hong Wang and Mostafa Kaveh. Coherent signal-subspace processing for the detection and estimation of angles of arrival of multiple wide-band sources. IEEE Transactions on Acoustics, Speech, and Signal Processing, 33(4):823–831, 1985.
• Elio D Di Claudio and Raffaele Parisi. WAVES: Weighted average of signal subspaces for robust wideband direction finding. IEEE Transactions on Signal Processing, 49(10):2179–2191, 2001.
• Yeo-Sun Yoon, Lance M Kaplan, and James H McClellan. TOPS: New DOA estimator for wideband signals. IEEE Transactions on Signal Processing, 54(6):1977–1989, 2006.
• Hanjie Pan, Robin Scheibler, Eric Bezzam, Ivan Dokmanic, and Martin Vetterli. FRIDA: FRI-based DOA estimation for arbitrary array layouts. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 3186–3190. IEEE, 2017.
• Weipeng He, Petr Motlicek, and Jean-Marc Odobez. Deep neural networks for multiple speaker detection and localization. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 74–79. IEEE, 2018.
• Sharath Adavanne, Archontis Politis, Joonas Nikunen, and Tuomas Virtanen. Sound event localization and detection of overlapping sources using convolutional recurrent neural networks. IEEE Journal of Selected Topics in Signal Processing, 13(1):34–48, 2018.
• Yi Luo, Zhuo Chen, Nima Mesgarani, and Takuya Yoshioka. End-to-end microphone permutation and number invariant multi-channel speech separation. In 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6394–6398. IEEE, 2020.
• Takuya Higuchi, Keisuke Kinoshita, Marc Delcroix, Katerina Zmolíková, and Tomohiro Nakatani. Deep clustering-based beamforming for separation with unknown number of sources. In Interspeech, pages 1183–1187, 2017.
• Naoya Takahashi, Sudarsanam Parthasaarathy, Nabarun Goswami, and Yuki Mitsufuji. Recursive speech separation for unknown number of speakers. arXiv preprint arXiv:1904.03065, 2019.
• Eliya Nachmani, Yossi Adi, and Lior Wolf. Voice separation with an unknown number of multiple speakers. arXiv preprint arXiv:2003.01531, 2020.
• Hidri Adel, Meddeb Souad, Abdulqadir Alaqeeli, and Amiri Hamid. Beamforming techniques for multichannel audio signal separation. arXiv preprint arXiv:1212.6080, 2012.
• J. Traa and P. Smaragdis. Multichannel source separation and tracking with RANSAC and directional statistics. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 22(12):2233–2243, 2014.
• Michael I Mandel, Ron J Weiss, and Daniel PW Ellis. Model-based expectation-maximization source separation and localization. IEEE Transactions on Audio, Speech, and Language Processing, 18(2):382–394, 2009.
• Futoshi Asano and Hideki Asoh. Sound source localization and separation based on the EM algorithm. In ISCA Tutorial and Research Workshop (ITRW) on Statistical and Perceptual Audio Processing, 2004.
• Yuval Dorfan, Dani Cherkassky, and Sharon Gannot. Speaker localization and separation using incremental distributed expectation-maximization. In 2015 23rd European Signal Processing Conference (EUSIPCO), pages 1256–1260. IEEE, 2015.
• Antoine Deleforge, Florence Forbes, and Radu Horaud. Acoustic space learning for sound-source separation and localization on binaural manifolds. International Journal of Neural Systems, 25(01):1440003, 2015.
• Michael I Mandel, Daniel P Ellis, and Tony Jebara. An EM algorithm for localizing multiple sound sources in reverberant environments. In Advances in Neural Information Processing Systems, pages 953–960, 2007.
• Johannes Traa, Paris Smaragdis, Noah D Stein, and David Wingate. Directional NMF for joint source localization and separation. In 2015 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pages 1–5. IEEE, 2015.
• Daniel Johnson, Daniel Gorelik, Ross E Mawhorter, Kyle Suver, Weiqing Gu, Steven Xing, Cody Gabriel, and Peter Sankhagowit. Latent Gaussian activity propagation: using smoothness and structure to separate and localize sounds in large noisy environments. In Advances in Neural Information Processing Systems, pages 3465–3474, 2018.
• Alexandre Défossez, Nicolas Usunier, Léon Bottou, and Francis Bach. Demucs: Deep extractor for music sources with extra unlabeled data remixed. arXiv preprint arXiv:1909.01174, 2019.
• J-M Valin, François Michaud, Jean Rouat, and Dominic Létourneau. Robust sound source localization using a microphone array on a mobile robot. In Proceedings 2003 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2003)(Cat. No. 03CH37453), volume 2, pages 1228–1233. IEEE, 2003.
• Don H. Johnson and Dan E. Dudgeon. Array Signal Processing: Concepts and Techniques. Simon & Schuster, Inc., USA, 1992.
• Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, 2016.
• Christophe Veaux, Junichi Yamagishi, Kirsten MacDonald, et al. Superseded - CSTR VCTK corpus: English multi-speaker corpus for CSTR voice cloning toolkit. 2016.
• Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. LibriSpeech: an ASR corpus based on public domain audio books. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5206–5210. IEEE, 2015.
• John Garofalo, David Graff, Doug Paul, and David Pallett. CSR-I (WSJ0) complete. Linguistic Data Consortium, Philadelphia, 2007.
• Jont B Allen and David A Berkley. Image method for efficiently simulating small-room acoustics. The Journal of the Acoustical Society of America, 65(4):943–950, 1979.
• Robin Scheibler, Eric Bezzam, and Ivan Dokmanic. Pyroomacoustics: A python package for audio room simulation and array processing algorithms. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 351–355. IEEE, 2018.
• Michael Vorländer. Auralization: fundamentals of acoustics, modelling, simulation, algorithms and acoustic virtual reality. Springer Science & Business Media, 2007.
• Jonathan Le Roux, Scott Wisdom, Hakan Erdogan, and John R Hershey. SDR – half-baked or well done? In 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 626–630. IEEE, 2019.
• Fabian-Robert Stöter, Antoine Liutkus, and Nobutaka Ito. The 2018 signal separation evaluation campaign. In International Conference on Latent Variable Analysis and Signal Separation, pages 293–305.
• DeLiang Wang. On ideal binary mask as the computational goal of auditory scene analysis. In Speech separation by humans and machines, pages 181–197.
• Antoine Liutkus and Roland Badeau. Generalized Wiener filtering with fractional power spectrograms. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 266–270. IEEE, 2015.
• Ngoc QK Duong, Emmanuel Vincent, and Rémi Gribonval. Under-determined reverberant audio source separation using a full-rank spatial covariance model. IEEE Transactions on Audio, Speech, and Language Processing, 18(7):1830–1840, 2010.
• https://wiki.seeedstudio.com/ReSpeaker_Mic_Array_v2.0/.
• J. Traa and P. Smaragdis. A wrapped Kalman filter for azimuthal speaker tracking. IEEE Signal Processing Letters, 20(12):1257–1260, 2013.
• Xinyuan Qian, Alessio Brutti, Maurizio Omologo, and Andrea Cavallaro. 3D audio-visual speaker tracking with an adaptive particle filter. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 2896–2900. IEEE, 2017.
• F. Keyrouz. Robotic binaural localization and separation of multiple simultaneous sound sources. In 2017 IEEE 11th International Conference on Semantic Computing (ICSC), pages 188–195, 2017.
• Ning Ma, Tobias May, and Guy J Brown. Exploiting deep neural networks and head movements for robust binaural localization of multiple sources in reverberant environments. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 25(12):2444–2453, 2017.
• Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
Author
Teerapat Jenrungrot