Digital Voicing of Silent Speech

David Gaddy

EMNLP 2020.

Keywords:
silent speech, speech-to-text system, transcription word error rate, silent EMG, canonical correlation analysis

Abstract:

In this paper, we consider the task of digitally voicing silent speech, where silently mouthed words are converted to audible speech based on electromyography (EMG) sensor measurements that capture muscle impulses. While prior work has focused on training speech synthesis models from EMG collected during vocalized speech, we are the first to train from EMG collected during silently articulated speech.

Code: https://github.com/dgaddy/silent_speech

Data: https://doi.org/10.5281/zenodo.4064408
Introduction
  • The authors are interested in enabling speech-like communication without requiring sound to be produced.
  • It could be used to create a device analogous to a Bluetooth headset that allows people to carry on phone conversations without disrupting those around them
  • Such a device could be useful in settings where the environment is too loud to capture audible speech or where maintaining silence is important.
  • In addition to these direct uses of digital voicing for silent speech, it may be useful as a component technology for creating silent speech-to-text systems (Schultz and Wand, 2010), making silent speech accessible to devices and digital assistants by leveraging existing high-quality audio-based speech-to-text systems
Highlights
  • In this paper, we are interested in enabling speech-like communication without requiring sound to be produced
  • In addition to these direct uses of digital voicing for silent speech, it may be useful as a component technology for creating silent speech-to-text systems (Schultz and Wand, 2010), making silent speech accessible to our devices and digital assistants by leveraging existing high-quality audio-based speech-to-text systems
  • We extend digital voicing to train on silent EMG (E_S) rather than only vocalized EMG (E_V)
  • Our use of canonical correlation analysis (CCA) for dynamic time warping (DTW) is similar to Zhou and Torre (2009), which combined the two methods to align human pose data, but we found their iterative approach did not improve performance compared to a single application of CCA in our setting (a sketch of this alignment step follows this list)
  • Our results show that digital voicing of silent speech, while still challenging in open domain settings, shows promise as an achievable technology
  • We significantly improve intelligibility in an open vocabulary condition, with a relative error reduction of over 20%
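As a concrete illustration of the CCA-refined alignment, here is a minimal sketch (not the paper's released implementation; the feature dimensions, component count, and use of scikit-learn's CCA are assumptions, and the later refinement with predicted audio is omitted): fit CCA on frame pairs from an initial DTW pass, project both EMG feature sequences into the shared canonical space, and run DTW again there.

    import numpy as np
    from sklearn.cross_decomposition import CCA

    def dtw_path(X, Y):
        # Plain dynamic time warping between sequences X (T1, d) and Y (T2, d);
        # returns the optimal list of (i, j) frame pairings.
        dist = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=-1)
        T1, T2 = dist.shape
        cost = np.full((T1 + 1, T2 + 1), np.inf)
        cost[0, 0] = 0.0
        for i in range(1, T1 + 1):
            for j in range(1, T2 + 1):
                cost[i, j] = dist[i - 1, j - 1] + min(
                    cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1])
        path, i, j = [], T1, T2
        while i > 0 and j > 0:
            path.append((i - 1, j - 1))
            step = int(np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]]))
            if step == 0:
                i, j = i - 1, j - 1
            elif step == 1:
                i -= 1
            else:
                j -= 1
        return path[::-1]

    def cca_refined_alignment(E_silent, E_vocal, n_components=8):
        # Initial alignment directly between the two EMG feature sequences.
        init = dtw_path(E_silent, E_vocal)
        pairs_s = np.stack([E_silent[i] for i, _ in init])
        pairs_v = np.stack([E_vocal[j] for _, j in init])
        # A single application of CCA on the paired frames (n_components must
        # not exceed the feature dimension), then DTW in the canonical space.
        cca = CCA(n_components=n_components).fit(pairs_s, pairs_v)
        S_c, V_c = cca.transform(E_silent, E_vocal)
        return dtw_path(S_c, V_c)

The returned path can then be used to transfer time-aligned audio targets from a vocalized recording onto the silent one.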
Methods
  • The authors' method is built around a recurrent neural transduction model from EMG features to time-aligned speech features (Section 3.1).
  • The authors denote the featurized versions of the signals used by the transduction model E_{S/V} for EMG (silent/vocalized) and A_V for vocalized audio.
  • A core contribution of the work is a method of training the transducer model on silent EMG signals, which no longer have time-aligned audio to use as training targets.
  • The alignment is initially found using dynamic time warping between EMG signals and is refined using canonical correlation analysis (CCA) and predicted audio from a partially trained model
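To make the transduction step concrete, here is a minimal PyTorch sketch (the layer sizes, feature dimensions, and bidirectional-LSTM choice are placeholder assumptions, not the paper's exact architecture): a recurrent network regresses per-frame speech features from per-frame EMG features.

    import torch
    import torch.nn as nn

    class EMGToSpeech(nn.Module):
        # Recurrent transducer: (batch, T, emg_dim) EMG feature frames in,
        # (batch, T, audio_dim) time-aligned speech feature frames out.
        def __init__(self, emg_dim=80, audio_dim=26, hidden=256):
            super().__init__()
            self.rnn = nn.LSTM(emg_dim, hidden, num_layers=3,
                               batch_first=True, bidirectional=True)
            self.proj = nn.Linear(2 * hidden, audio_dim)

        def forward(self, emg_feats):
            h, _ = self.rnn(emg_feats)
            return self.proj(h)

    model = EMGToSpeech()
    emg = torch.randn(4, 200, 80)      # dummy batch: 4 utterances, 200 frames
    targets = torch.randn(4, 200, 26)  # time-aligned speech feature targets
    loss = nn.functional.mse_loss(model(emg), targets)
    loss.backward()

In the full pipeline, the predicted speech features would then be passed to a vocoder such as WaveNet (cited below) to produce a waveform; that stage is omitted here.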
Results
  • The authors' use of CCA for DTW is similar to Zhou and Torre (2009), which combined the two methods to align human pose data, but the authors found their iterative approach did not improve performance compared to a single application of CCA in this setting.
  • The authors significantly improve intelligibility in an open vocabulary condition, with a relative error reduction of over 20%
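For reference, relative error reduction compares transcription word error rates as

    relative error reduction = (WER_baseline - WER_model) / WER_baseline

so a reduction of over 20% means the model's WER is less than 80% of the baseline's.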
Conclusion
  • The authors' results show that digital voicing of silent speech, while still challenging in open domain settings, shows promise as an achievable technology.
  • On silent EMG recordings from closed vocabulary data, the speech outputs achieve high intelligibility, with a 3.6% transcription word error rate (computed as sketched after this list) and a relative error reduction of 95% from the baseline.
  • The authors significantly improve intelligibility in an open vocabulary condition, with a relative error reduction of over 20%.
  • The authors hope that the public release of data will encourage others to further improve models for this task.
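The transcription word error rate used throughout is the standard edit-distance metric; a minimal reference implementation for intuition (not the paper's evaluation code):

    def word_error_rate(reference, hypothesis):
        # WER = (substitutions + insertions + deletions) / reference length,
        # i.e. word-level Levenshtein distance normalized by reference length.
        ref, hyp = reference.split(), hypothesis.split()
        dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            dp[i][0] = i
        for j in range(len(hyp) + 1):
            dp[0][j] = j
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
                dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
        return dp[len(ref)][len(hyp)] / max(len(ref), 1)

    print(word_error_rate("set the alarm for six", "set alarm for sticks"))  # 0.4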
Objectives
  • By using muscular sensor measurements of speech articulator movement, the authors aim to capture silent speech: utterances that have been articulated without producing sound.
Tables
  • Table 1: Closed vocabulary data summary
  • Table 2: Open vocabulary data summary
  • Table 3: Electrode locations
  • Table 4: Results of a human intelligibility evaluation on the closed vocabulary data. Lower WER is better. Our model greatly outperforms both variants of the direct transfer baseline
  • Table 5: Results of an automatic intelligibility evaluation on open vocabulary data. Lower WER is better
Funding
  • This material is based upon work supported by the National Science Foundation under Grant No. 1618460.
References
  • Srinivas Desai, E. Veera Raghavendra, B. Yegnanarayana, Alan W. Black, and Kishore Prahallad. 2009. Voice conversion using artificial neural networks. In 2009 IEEE International Conference on Acoustics, Speech and Signal Processing, pages 3893–3896. IEEE.
  • L. Diener, G. Felsch, M. Angrick, and T. Schultz. 2018. Session-independent array-based EMG-to-speech conversion using convolutional neural networks. In Speech Communication; 13th ITG Symposium, pages 1–5.
  • Awni Hannun, Carl Case, Jared Casper, Bryan Catanzaro, Greg Diamos, Erich Elsen, Ryan Prenger, Sanjeev Satheesh, Shubho Sengupta, Adam Coates, et al. 2014. Deep Speech: Scaling up end-to-end speech recognition. arXiv preprint arXiv:1412.5567.
  • Harold Hotelling. 1936. Relations between two sets of variates. Biometrika, 28(3/4):321–377.
  • M. Janke and L. Diener. 2017. EMG-to-speech: Direct generation of speech from facial electromyographic signals. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 25(12):2375–2385.
  • Szu-Chen Stan Jou, Tanja Schultz, Matthias Walliczek, Florian Kraft, and Alexander H. Waibel. 2006. Towards continuous speech recognition using surface electromyography. In INTERSPEECH.
  • Arnav Kapur, Shreyas Kapur, and Pattie Maes. 2018. AlterEgo: A personalized wearable silent speech interface. In 23rd International Conference on Intelligent User Interfaces, pages 43–53.
  • Kazuhiro Kobayashi and Tomoki Toda. 2018. sprocket: Open-source voice conversion software. In Odyssey, pages 203–210.
  • Geoffrey S. Meltzner, James T. Heaton, Yunbin Deng, Gianluca De Luca, Serge H. Roy, and Joshua C. Kline. 2017. Silent speech recognition as an alternative communication device for persons with laryngectomy. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 25(12):2386–2398.
  • Geoffrey S. Meltzner, James T. Heaton, Yunbin Deng, Gianluca De Luca, Serge H. Roy, and Joshua C. Kline. 2018. Development of sEMG sensors and algorithms for silent speech recognition. Journal of Neural Engineering, 15(4):046031.
  • Aäron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew W. Senior, and Koray Kavukcuoglu. 2016. WaveNet: A generative model for raw audio. arXiv, abs/1609.03499.
  • Lawrence Rabiner and Biing-Hwang Juang. 1993. Fundamentals of Speech Recognition. Prentice Hall.
  • Tanja Schultz and Michael Wand. 2010. Modeling coarticulation in EMG-based continuous speech recognition. Speech Communication, 52(4):341–353.
  • Arthur R. Toth, Michael Wand, and Tanja Schultz. 2009. Synthesizing speech from electromyography using voice transformation techniques. In INTERSPEECH.
  • Michael Wand, Matthias Janke, and Tanja Schultz. 2014. The EMG-UKA corpus for electromyographic speech processing. In INTERSPEECH.
  • Feng Zhou and Fernando Torre. 2009. Canonical time warping for alignment of human behavior. In Advances in Neural Information Processing Systems, pages 2286–2294.