Neural Generation of Dialogue Response Timings

Matthew Roddy

ACL, pp. 2442-2452, 2020.

Abstract:

The timings of spoken response offsets in human dialogue have been shown to vary based on contextual elements of the dialogue. We propose neural models that simulate the distributions of these response offsets, taking into account the response turn as well as the preceding turn. The models are designed to be integrated into the pipeline…
Introduction
  • The components needed for the design of spoken dialogue systems (SDSs) that can communicate in a realistic human fashion have seen rapid advancements in recent years (e.g. Li et al (2016); Zhou et al (2018); Skerry-Ryan et al (2018)).
  • A common approach in SDSs is to trigger a system response once a fixed silence threshold has been exceeded (Raux and Eskenazi, 2009).
  • This approach does not emulate naturalistic response offsets, since in human-human conversation the distributions of response timing offsets have been shown to differ based on the context of the first speaker’s turn and the context of the addressee’s response (Sacks et al., 1974; Levinson and Torreira, 2015; Heeman and Lunsford, 2017).
  • If the authors wish to realistically generate offset distributions in SDSs, they need to design response timing models that take into account the context of the user’s speech and the upcoming system response.
  • Offsets where the first speaker’s turn is a backchannel occur in overlap more frequently (Levinson and Torreira, 2015).
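The contrast drawn above, between a fixed silence-threshold policy and context-dependent offset generation, can be sketched as follows. This is a toy illustration, not the paper's model: the 700 ms threshold, the Gaussian parameters, and the function names are all assumed for the example.

```python
import random

FIXED_THRESHOLD_MS = 700  # hypothetical silence threshold for the baseline policy

def fixed_offset(_context):
    # threshold policy: always respond the same fixed time after silence begins,
    # regardless of what either speaker said
    return FIXED_THRESHOLD_MS

def sampled_offset(context, rng):
    # toy stand-in for a learned, context-dependent offset distribution:
    # offsets following a backchannel are shifted toward overlap (negative values)
    if context == "backchannel":
        return rng.gauss(-100, 250)
    return rng.gauss(300, 250)

rng = random.Random(0)
offsets = [sampled_offset("backchannel", rng) for _ in range(1000)]
overlap_rate = sum(o < 0 for o in offsets) / len(offsets)
print(f"overlap rate after backchannels: {overlap_rate:.2f}")
```

Unlike the fixed policy, the sampled offsets vary from turn to turn and can be negative (i.e. overlap the user's turn), mirroring the observation that offsets after backchannels occur in overlap more frequently.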
Highlights
  • The components needed for the design of spoken dialogue systems (SDSs) that can communicate in a realistic human fashion have seen rapid advancements in recent years (e.g. Li et al (2016); Zhou et al (2018); Skerry-Ryan et al (2018))
  • We propose an extension of the Response Timing Network (RTNet) that uses a variational autoencoder (VAE) (Kingma and Welling, 2014) to train an interpretable latent space, which can be used to bypass the encoding process at inference time
  • We define interpausal units (IPUs) as segments of speech by a person that are separated by pauses of 200 ms or greater
  • The offset distribution for the full Response Timing Network (RTNet) model is shown in Fig. 6a
  • The RTNet model is better able to replicate many of the features of the true distribution than offsets predicted using the best possible fixed probability, shown in Fig. 6b
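The inference-time shortcut mentioned in the highlights can be sketched as follows. This is a minimal illustration of the general VAE mechanism, not the paper's architecture: the toy encoder, decoder, and latent size are invented for the example.

```python
import random

LATENT_DIM = 4  # illustrative latent size, not the paper's

def encode(features, rng):
    # training-time path: the encoder produces an approximate posterior
    # q(z | response) = N(mu, sigma), sampled via the reparameterization trick
    mu = [sum(features) / len(features)] * LATENT_DIM  # toy "encoder"
    sigma = [0.1] * LATENT_DIM
    return [m + s * rng.gauss(0, 1) for m, s in zip(mu, sigma)]

def sample_prior(rng):
    # inference-time path: bypass the encoder entirely and draw z from the
    # standard-normal prior (or pick an interpretable point in latent space)
    return [rng.gauss(0, 1) for _ in range(LATENT_DIM)]

def decode(z):
    # toy "decoder": maps a latent vector to a single offset parameter
    return sum(z) / len(z)

rng = random.Random(1)
z_train = encode([0.2, 0.8, 0.5], rng)   # requires a ground-truth response
z_infer = sample_prior(rng)              # requires no response at all
```

The practical point is the second path: once the latent space is trained, generation no longer needs the ground-truth response that the encoder consumes, so timing can be produced before the system response is fully realized.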
Methods
  • The data used is the Switchboard-1 Release 2 corpus (Godfrey and Holliman, 1997).
  • Switchboard has 2438 dyadic telephone conversations with a total length of approximately 260 hours.
  • Turn pairs are automatically extracted from orthographic annotations using the following procedure: the authors extract frame-based speech-activity labels for each speaker using a frame step size of 50 ms.
  • The frame-based representation is used to partition each person’s speech signal into interpausal units (IPUs).
  • The authors define IPUs as segments of speech by a person that are separated by pauses of 200 ms or greater.
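The IPU segmentation described by the bullets above can be sketched directly from the frame-based labels. This is a minimal sketch: the function name and toy inputs are illustrative; only the 50 ms frame step and the 200 ms pause threshold come from the text.

```python
FRAME_MS = 50          # frame step size used for the speech-activity labels
MIN_PAUSE_MS = 200     # pauses of 200 ms or greater separate IPUs
MIN_PAUSE_FRAMES = MIN_PAUSE_MS // FRAME_MS  # = 4 silent frames

def extract_ipus(vad):
    """Group frames of speech (label 1) into IPUs, merging across pauses
    shorter than MIN_PAUSE_FRAMES. Returns (start, end) frame index pairs,
    end inclusive."""
    ipus = []
    start = None        # start frame of the IPU being built
    last_speech = None  # most recent frame that contained speech
    for i, active in enumerate(vad):
        if active:
            if start is None:
                start = i
            elif i - last_speech > MIN_PAUSE_FRAMES:
                # the silent gap reached 200 ms: close the IPU, open a new one
                ipus.append((start, last_speech))
                start = i
            last_speech = i
    if start is not None:
        ipus.append((start, last_speech))
    return ipus

extract_ipus([1, 1, 0, 0, 0, 1, 1])     # one IPU: the pause is only 150 ms
extract_ipus([1, 1, 0, 0, 0, 0, 1])     # two IPUs: the pause reaches 200 ms
```

The edge case worth noting is the strictness of the comparison: a gap of exactly four silent frames (200 ms) splits two IPUs, while three silent frames (150 ms) are absorbed into a single unit.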
Conclusion
  • RTNet performance: the offset distribution for the full RTNet model is shown in Fig. 6a.
  • Encoder ablation (figure panels: (a) BC/Statement, (b) Yes/No): the RTNet model is better able to replicate many of the features of the true distribution than offsets predicted using the best possible fixed probability, shown in Fig. 6b.
  • In Fig. 6a, the model has the most trouble reproducing the distribution of offsets between -500 ms and 0 ms.
  • This part of the distribution is the most demanding because it requires that the model anticipate the user’s turn ending.
  • The ADAPT Centre for Digital Content Technology is funded under the SFI Research Centres Programme (Grant 13/RC/2106) and is co-funded under the European Regional Development Fund.
Tables
  • Table 1: Experimental results on our test set. Lower is better in all cases. Best results are shown in bold.
Reference
  • Sara Bögels, Kobin H. Kendrick, and Stephen C. Levinson. 2019. Conversational expectations get revised as response latencies unfold. Language, Cognition and Neuroscience, pages 1–14.
  • Sara Bögels and Stephen C. Levinson. 2017. The brain behind the response: Insights into turn-taking in conversation from neuroimaging. Research on Language and Social Interaction, 50(1):71–89.
  • Samuel R. Bowman, Luke Vilnis, Oriol Vinyals, Andrew M. Dai, Rafal Jozefowicz, and Samy Bengio. 2016. Generating Sentences from a Continuous Space. In Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning, pages 10–21, Berlin, Germany. Association for Computational Linguistics.
  • Sasha Calhoun, Jean Carletta, Jason M. Brenier, Neil Mayo, Dan Jurafsky, Mark Steedman, and David Beaver. 2010. The NXT-format Switchboard Corpus: A rich resource for investigating the syntax, semantics, pragmatics and prosody of dialogue. Language Resources and Evaluation, 44(4):387–419.
  • Nina Dethlefs, Helen Hastie, Verena Rieser, and Oliver Lemon. 2012. Optimising incremental dialogue decisions using information density for interactive systems. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 82–93. Association for Computational Linguistics.
  • David DeVault, Kenji Sagae, and David Traum. 2011. Incremental interpretation and prediction of utterance meaning for interactive dialogue. Dialogue & Discourse, 2(1):143–170.
  • Florian Eyben, Klaus R. Scherer, Björn W. Schuller, Johan Sundberg, Elisabeth André, Carlos Busso, Laurence Y. Devillers, Julien Epps, Petri Laukka, Shrikanth S. Narayanan, and Khiet P. Truong. 2016. The Geneva Minimalistic Acoustic Parameter Set (GeMAPS) for voice research and affective computing. IEEE Transactions on Affective Computing, 7(2):190–202.
  • John J. Godfrey and Edward Holliman. 1997. Switchboard-1 Release 2. Linguistic Data Consortium, Philadelphia.
  • David Ha and Douglas Eck. 2018. A neural representation of sketch drawings. In International Conference on Learning Representations.
  • Peter A. Heeman and Rebecca Lunsford. 2017. Turn-taking offsets and dialogue context. In Proc. Interspeech 2017, pages 1671–1675.
  • Kobin H. Kendrick and Francisco Torreira. 2015. The timing and construction of preference: A quantitative study. Discourse Processes, 52(4):255–289.
  • Diederik P. Kingma and Max Welling. 2014. Auto-encoding variational Bayes. In 2nd International Conference on Learning Representations.
  • Divesh Lala, Pierrick Milhorat, Koji Inoue, Masanari Ishida, Katsuya Takanashi, and Tatsuya Kawahara. 2017. Attentive listening system with backchanneling, response generation and flexible turn-taking. In Proceedings of the 18th Annual SIGdial Meeting on Discourse and Dialogue, pages 127–136, Saarbrücken, Germany. Association for Computational Linguistics.
  • Divesh Lala, Shizuka Nakamura, and Tatsuya Kawahara. 2019. Analysis of effect and timing of fillers in natural turn-taking. In Interspeech 2019.
  • Stephen C. Levinson and Francisco Torreira. 2015. Timing in turn-taking and its implications for processing models of language. Frontiers in Psychology, 6:731.
  • Jiwei Li, Michel Galley, Chris Brockett, Georgios Spithourakis, Jianfeng Gao, and Bill Dolan. 2016. A Persona-Based Neural Conversation Model. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 994–1003, Berlin, Germany. Association for Computational Linguistics.
  • Laurens van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov):2579–2605.
  • Raveesh Meena, Gabriel Skantze, and Joakim Gustafson. 2014. Data-driven models for timing feedback responses in a Map Task dialogue system. Computer Speech & Language.
  • Louis-Philippe Morency, Iwan de Kok, and Jonathan Gratch. 2010. A probabilistic multimodal approach for predicting listener backchannels. Autonomous Agents and Multi-Agent Systems, 20(1):70–84.
  • Ryosuke Nakanishi, Koji Inoue, Shizuka Nakamura, Katsuya Takanashi, and Tatsuya Kawahara. 2018.
  • Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global Vectors for Word Representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, Doha, Qatar. Association for Computational Linguistics.
  • Antoine Raux and Maxine Eskenazi. 2009. A finite-state turn-taking model for spoken dialog systems. In HLT-NAACL, pages 629–637. ACL.
  • Adam Roberts, Jesse Engel, Colin Raffel, Curtis Hawthorne, and Douglas Eck. 2018. A hierarchical latent vector model for learning long-term structure in music. In ICML, pages 4361–4370.
  • Matthew Roddy, Gabriel Skantze, and Naomi Harte. 2018. Multimodal Continuous Turn-Taking Prediction Using Multiscale RNNs. In Proceedings of the 2018 International Conference on Multimodal Interaction (ICMI ’18), pages 186–190, Boulder, CO, USA. ACM Press.
  • Harvey Sacks, Emanuel A. Schegloff, and Gail Jefferson. 1974. A Simplest Systematics for the Organization of Turn-Taking for Conversation. Language, 50(4):696–735.
  • David Schlangen and Gabriel Skantze. 2011. A general, abstract model of incremental dialogue processing. In Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics, pages 710–718.
  • Gabriel Skantze. 2017. Towards a General, Continuous Model of Turn-taking in Spoken Dialogue using LSTM Recurrent Neural Networks. In Proceedings of SigDial, Saarbrücken, Germany.
  • RJ Skerry-Ryan, Eric Battenberg, Ying Xiao, Daisy Stanton, Joel Shor, Ron J. Weiss, Rob Clark, and Rif A. Saurous. 2018. Towards End-to-End Prosody Transfer for Expressive Speech Synthesis with Tacotron. In Proceedings of the 35th International Conference on Machine Learning.
  • Andreas Stolcke, Klaus Ries, Noah Coccaro, Elizabeth Shriberg, Rebecca Bates, Daniel Jurafsky, Paul Taylor, Rachel Martin, Carol Van Ess-Dykema, and Marie Meteer. 2000. Dialogue act modeling for automatic tagging and recognition of conversational speech. Computational Linguistics, 26(3):339–373.
  • Tiancheng Zhao, Ran Zhao, and Maxine Eskenazi. 2017. Learning Discourse-level Diversity for Neural Dialog Models using Conditional Variational Autoencoders. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 654–664, Vancouver, Canada. Association for Computational Linguistics.
  • Li Zhou, Jianfeng Gao, Di Li, and Heung-Yeung Shum. 2018. The Design and Implementation of XiaoIce, an Empathetic Social Chatbot. arXiv preprint arXiv:1812.08989.