Raise to Speak: An Accurate, Low-power Detector for Activating Voice Assistants on Smartwatches

pp. 2736–2744 (2019)


Abstract

The two most common ways to activate intelligent voice assistants (IVAs) are button presses and trigger phrases. This paper describes a new way to invoke IVAs on smartwatches: simply raise your hand and speak naturally. To achieve this experience, we designed an accurate, low-power detector that works in a wide range of environments and a ...

Introduction
  • The two most common ways to invoke IVAs are using physical buttons or issuing specific trigger phrases such as "Hey Siri".
  • In this paper, the authors propose a more natural way to invoke IVAs on smartwatches: raise the device and speak to it.
  • To enable this interaction, the authors designed an accurate, low-power detector that uses only an accelerometer and a microphone.
  • The detector is designed to run mostly on-device to preserve user privacy and provide a low-latency experience for users.
Highlights
  • Intelligent voice assistants (IVAs) have become ubiquitous [15].
  • In this paper, we present an accurate, low-power detector to facilitate interacting with IVAs on smartwatches.
  • A four-component detector is presented, consisting of an on-device gesture detector (GestureCNN), an on-device speech detector (SpeechCNN), a policy model, and an off-device false trigger mitigator (FTM); a minimal composition sketch follows this list.
  • In Section 5.2 we evaluate three other baseline approaches for gesture detection in addition to GestureCNN.
  • Experimentation shows that the closest matches to the GestureCNN in model performance are the gradient boosting tree (GBT) models.
  • Other common approaches are either too computationally expensive for an embedded system (e.g. dynamic time warping (DTW)) or not complex enough to capture the non-linearity in the data.
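To make the four-component design concrete, here is a minimal sketch of how the two on-device detectors, the policy model, and the off-device FTM might be composed. The names, interfaces, and thresholds are placeholders invented for illustration; the paper does not publish code, and its actual models and policy differ.

```python
# Hypothetical composition of the four RTS components described above.
# All names, interfaces, and thresholds are placeholders for illustration.

from dataclasses import dataclass


@dataclass
class OnDeviceScores:
    gesture: float  # score from the on-device gesture detector (GestureCNN)
    speech: float   # score from the on-device speech detector (SpeechCNN)


def policy_model(scores: OnDeviceScores,
                 gesture_threshold: float = 0.8,
                 speech_threshold: float = 0.7) -> bool:
    """Fuse the two on-device scores into a single trigger decision.
    The thresholds are illustrative, not values from the paper."""
    return (scores.gesture >= gesture_threshold
            and scores.speech >= speech_threshold)


def detect(accel_window, audio_window,
           gesture_cnn, speech_cnn, false_trigger_mitigator) -> bool:
    """One detection step: run the cheap on-device models first, and only
    if they fire consult the off-device false trigger mitigator (FTM)."""
    scores = OnDeviceScores(gesture=gesture_cnn(accel_window),
                            speech=speech_cnn(audio_window))
    if not policy_model(scores):
        return False  # no trigger; audio never leaves the device
    return false_trigger_mitigator(audio_window)  # server-side second check
```

The point of the sketch is the ordering: audio is consulted off-device only after the on-device policy has already fired, which is consistent with the privacy and latency goals stated in the Introduction.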
Methods
  • One of the advantages of the RTS detector is that the four loosely coupled components (GestureCNN, SpeechCNN, Policy Model, and FTM) can mostly be trained independently (see the illustrative training sketch after this list).
  • The authors outline the experiments that led to the final architecture of the RTS detector by going through each of the components individually.
  • The data collection process is divided into three stages.
  • The first stage (S1) consists of data collected in the most ideal environments.
  • In this stage, the majority (53.2%) of sessions consist of users who are either sitting or standing.
  • The second stage (S2) consists of users walking while performing RTS.
  • In total, the authors collected 4,228 sessions from 92 different users.
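Because the components are loosely coupled, a gesture detector analogous to GestureCNN can be trained on labeled accelerometer windows alone. The following is a hypothetical PyTorch sketch of such an independent training step; the window length, layer sizes, and optimizer settings are invented and are not the architecture or hyperparameters from the paper.

```python
# Illustrative, independently trained gesture model on accelerometer windows.
# Shapes and hyperparameters are invented; this is not the paper's GestureCNN.

import torch
import torch.nn as nn


class TinyGestureCNN(nn.Module):
    """Small 1D CNN over (batch, 3 axes, window_len) accelerometer windows."""

    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(3, 16, kernel_size=5, padding=2),  # 3 accelerometer axes
            nn.BatchNorm1d(16),
            nn.ReLU(),
            nn.MaxPool1d(2),
            nn.Conv1d(16, 32, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        self.classifier = nn.Linear(32, 1)  # logit for "raise gesture present"

    def forward(self, x):
        return self.classifier(self.features(x).squeeze(-1))


def train_step(model, optimizer, accel_batch, labels):
    """One supervised step using only accelerometer data and gesture labels."""
    optimizer.zero_grad()
    logits = model(accel_batch).squeeze(-1)
    loss = nn.functional.binary_cross_entropy_with_logits(logits, labels)
    loss.backward()
    optimizer.step()
    return loss.item()


model = TinyGestureCNN()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
```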
Results
  • There are four distinct stages related to an RTS gesture: raising, raised, dropping, and dropped (see the state-machine sketch after this list).
  • In each of these stages, the authors identify constraints on the gesture that can help improve the accuracy of the detector while still ensuring that the gesture feels natural to most users.
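As an illustration of how per-stage constraints could be enforced, the sketch below tracks the four stages with a small state machine and rejects raises that are too slow or not held long enough. The specific constraints and thresholds are invented for the example and are not those identified in the paper.

```python
# Illustrative state machine over the four RTS gesture stages. The
# constraints (maximum raise duration, minimum hold time) and thresholds
# are invented to show the idea; they are not the paper's values.

RAISING, RAISED, DROPPING, DROPPED = range(4)


class GestureStageTracker:
    MAX_RAISE_SECONDS = 1.5   # a raise that takes too long is rejected
    MIN_HOLD_SECONDS = 0.3    # the wrist must stay raised at least this long

    def __init__(self):
        self.stage = DROPPED
        self.time_in_stage = 0.0

    def update(self, stage: int, dt: float) -> bool:
        """Advance by one frame; return True once a valid raise completes."""
        if stage == self.stage:
            self.time_in_stage += dt
        else:
            if (stage == RAISED and self.stage == RAISING
                    and self.time_in_stage > self.MAX_RAISE_SECONDS):
                stage = DROPPED  # raising took too long: not a natural raise
            self.stage = stage
            self.time_in_stage = 0.0
        return (self.stage == RAISED
                and self.time_in_stage >= self.MIN_HOLD_SECONDS)
```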
Conclusion
  • In this paper the authors present an accurate and low-power detector to facilitate interacting with IVAs on smartwatches.
  • A four-component detector is presented, consisting of an on-device gesture detector (GestureCNN), an on-device speech detector (SpeechCNN), a policy model, and an off-device false trigger mitigator (FTM).
  • In Section 5.2 the authors evaluate three other baseline approaches for gesture detection in addition to GestureCNN.
  • Experimentation shows that the closest matches to the GestureCNN in model performance are the GBT models (see the illustrative GBT baseline sketch after this list).
  • SpeechCNN outperforms the GBT model by a large margin (see Section 5.3).
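For context on the GBT comparison, a baseline of this kind could be built with XGBoost [3] on hand-crafted temporal features of the accelerometer window (Table 1 lists the 31 gesture temporal features actually used). The feature set and hyperparameters below are illustrative stand-ins, not the paper's configuration.

```python
# Illustrative GBT baseline for gesture detection using XGBoost [3].
# The simple per-axis statistics below stand in for the 31 temporal
# features of Table 1; the real feature set and hyperparameters differ.

import numpy as np
import xgboost as xgb


def temporal_features(accel_window: np.ndarray) -> np.ndarray:
    """Summarize a (window_len, 3) accelerometer window into a flat vector
    of per-axis mean, standard deviation, min, max, and range."""
    return np.concatenate([
        accel_window.mean(axis=0),
        accel_window.std(axis=0),
        accel_window.min(axis=0),
        accel_window.max(axis=0),
        accel_window.max(axis=0) - accel_window.min(axis=0),
    ])


def fit_gbt_baseline(windows, labels):
    """Fit a binary GBT classifier on featurized accelerometer windows."""
    X = np.stack([temporal_features(w) for w in windows])
    model = xgb.XGBClassifier(n_estimators=200, max_depth=4,
                              learning_rate=0.1, objective="binary:logistic")
    model.fit(X, np.asarray(labels))
    return model
```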
Tables
  • Table 1: Details of 31 gesture temporal features
  • Table 2: Distributions of scenarios in positive sessions
  • Table 3: Distributions of acoustic environment
  • Table 4: Model performance
  • Table 5: FRR measured under different scenarios in the validation data set
Related work
  • Combining gesture and speech signals to facilitate human-computer interaction is an area of active research [2, 16, 20]; [20] provides a more detailed review of this topic. The two most relevant systems are from [16] and [2]. [16] describes a system that combines gesture, as captured by a camera, with speech, as captured by a head-mounted microphone, to estimate cues in conversational interactions. Shake2Talk is another experience in which users can send audio messages via simple gestures [2]. These gestures were captured using a SHAKE device with multiple motion-capturing sensors such as accelerometers, gyroscopes, and capacitive sensors. One of the key differences between these systems and RTS is that RTS is deployed on a resource-constrained embedded device using only sensors (an accelerometer and a microphone) that are available today on almost all smartwatches.

    The goal of the RTS system is to enable a gestural activation method whereby the IVA on a smartwatch can be activated without a trigger phrase or button press. The system is designed to trigger on a gesture in which the user raises their smartwatch to their mouth and speaks into the watch to converse with the IVA. A conversation can include issuing commands (e.g. "Set a timer for 2 minutes"), asking questions (e.g. "What is the weather today?"), replying to messages (e.g. "Reply: I am running late"), or performing any action that could otherwise be done without needing to press buttons or say trigger phrases.
References
  • Ahmad Akl and Shahrokh Valaee. 2010. Accelerometer-based gesture recognition via dynamic-time warping, affinity propagation, & compressive sensing. In Acoustics, Speech and Signal Processing (ICASSP), 2010 IEEE International Conference on. IEEE, 2270–2273.
  • Lorna M. Brown and John Williamson. 2007. Shake2Talk: multimodal messaging for interpersonal communication. In International Workshop on Haptic and Audio Interaction Design. Springer, 44–55.
  • Tianqi Chen and Carlos Guestrin. 2016. XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '16). ACM, New York, NY, USA, 785–794. https://doi.org/10.1145/2939672.2939785
  • Florian Eyben, Felix Weninger, Stefano Squartini, and Björn Schuller. 2013. Real-life voice activity detection with LSTM recurrent neural networks and an application to Hollywood movies. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on. IEEE, 483–487.
  • Brendan J. Frey and Delbert Dueck. 2007. Clustering by passing messages between data points. Science 315, 5814 (2007), 972–976.
  • Eleftheria Georganti, Tobias May, Steven van de Par, Aki Harma, and John Mourjopoulos. 2011. Speaker distance detection using a single microphone. IEEE Transactions on Audio, Speech, and Language Processing 19, 7 (2011), 1949–1961.
  • Amir Gholami, Kiseok Kwon, Bichen Wu, Zizheng Tai, Xiangyu Yue, Peter Jin, Sicheng Zhao, and Kurt Keutzer. 2018. SqueezeNext: Hardware-Aware Neural Network Design. arXiv preprint arXiv:1803.10615 (2018).
  • Thad Hughes and Keir Mierle. 2013. Recurrent neural networks for voice activity detection. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on. IEEE, 7378–7382.
  • Forrest N. Iandola, Song Han, Matthew W. Moskewicz, Khalid Ashraf, William J. Dally, and Kurt Keutzer. 2016. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size. arXiv preprint arXiv:1602.07360 (2016).
  • Sergey Ioffe and Christian Szegedy. 2015. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. In International Conference on Machine Learning. 448–456. http://proceedings.mlr.press/v37/ioffe15.html
  • Eamonn Keogh and Chotirat Ann Ratanamahatana. 2005. Exact indexing of dynamic time warping. Knowledge and Information Systems 7, 3 (2005), 358–386.
  • Diederik P. Kingma and Jimmy Ba. 2014. Adam: A Method for Stochastic Optimization. (Dec. 2014). https://arxiv.org/abs/1412.6980
  • Jiayang Liu, Lin Zhong, Jehan Wickramasuriya, and Venu Vasudevan. 2009. uWave: Accelerometer-based personalized gesture recognition and its applications. Pervasive and Mobile Computing 5, 6 (2009), 657–675.
  • Ewa Luger and Abigail Sellen. 2016. Like having a really bad PA: the gulf between user expectation and experience of conversational agents. In Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems. ACM, 5286–5297.
  • Michael McTear, Zoraida Callejas, and David Griol. 2016. The Conversational Interface: Talking to Smart Devices. Springer.
  • Francis Quek, David McNeill, Robert Bryll, Susan Duncan, Xin-Feng Ma, Cemil Kirbas, Karl E. McCullough, and Rashid Ansari. 2002. Multimodal human discourse: gesture and speech. ACM Transactions on Computer-Human Interaction (TOCHI) 9, 3 (2002), 171–193.
  • Javier Ramirez, Juan Manuel Górriz, and José Carlos Segura. 2007. Voice activity detection: fundamentals and speech recognition system robustness. In Robust Speech Recognition and Understanding. InTech.
  • Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research 15, 1 (2014), 1929–1958.
  • Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. 2015. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1–9.
  • Matthew Turk. 2014. Multimodal interaction: A review. Pattern Recognition Letters 36 (2014), 189–195.
  • Xiao-Lei Zhang and Ji Wu. 2013. Deep belief networks based voice activity detection. IEEE Transactions on Audio, Speech, and Language Processing 21, 4 (2013), 697–710.