Probing Task Oriented Dialogue Representation from Language Models

EMNLP 2020, pp. 5036–5051

Abstract

This paper investigates pre-trained language models to find out which model intrinsically carries the most informative representation for task-oriented dialogue tasks. We approach the problem from two aspects: supervised classifier probe and unsupervised mutual information probe. We fine-tune a feed-forward layer as the classifier probe […]

Introduction
Highlights
  • Task-oriented dialogue systems achieve specific user goals within a limited number of dialogue turns via natural language
  • We are interested in answering the following questions: which language model has the most informative representations, and for which task-oriented dialogue task? Does pre-training with dialogue-specific data or different objectives make any difference? We investigate how good these pre-trained representations are for a task-oriented dialogue system, ignoring model architectures and training strategies by probing only their final representations without fine-tuning the full models
  • When comparing the rankings of the GPT2 and DialoGPT models in Figure 1 and Figure 2, we found that they obtain almost the worst adjusted normalized mutual information (ANMI) scores but perform quite well in classification accuracy
  • We investigate representations from pre-trained language models for task-oriented dialogue tasks, including domain identification, intent detection, slot tagging, and dialogue act prediction
  • From the ranking results of the two probing methods, we present a list of interesting observations to provide model selection guidelines and shed light on future research towards more advanced language model learning for dialogue applications
Methods
  • For every utterance U_t or S_t, the authors have human-annotated domain, user intent, slot, and dialogue act labels.
  • The authors first feed all the utterances to a pre-trained model and obtain user and system representations.
  • For domain identification and intent detection, the authors use a Softmax layer and backpropagate with the cross-entropy loss.
  • For dialogue slot and act prediction, the authors use a Sigmoid layer and the binary cross-entropy loss, since these are multi-label classification tasks (a minimal probe sketch follows below)
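
A minimal sketch of such a classifier probe, assuming PyTorch, a frozen 768-dimensional utterance representation, and illustrative layer sizes, optimizer, and variable names (this is not the authors' exact code):

    import torch
    import torch.nn as nn

    class ClassifierProbe(nn.Module):
        """One feed-forward layer trained on top of frozen utterance representations."""
        def __init__(self, hidden_dim, num_labels, multi_label=False):
            super().__init__()
            self.linear = nn.Linear(hidden_dim, num_labels)
            # Single-label tasks (domain, intent): cross-entropy (softmax folded into the loss).
            # Multi-label tasks (slot, act): binary cross-entropy (sigmoid folded into the loss).
            self.loss_fn = nn.BCEWithLogitsLoss() if multi_label else nn.CrossEntropyLoss()

        def forward(self, reps, labels):
            logits = self.linear(reps)
            return self.loss_fn(logits, labels)

    # Example: intent detection as single-label classification over 150 intent classes.
    probe = ClassifierProbe(hidden_dim=768, num_labels=150)
    optimizer = torch.optim.AdamW(probe.parameters(), lr=1e-4)
    reps = torch.randn(32, 768)            # placeholder batch of frozen representations
    labels = torch.randint(0, 150, (32,))  # placeholder intent labels
    loss = probe(reps, labels)
    loss.backward()
    optimizer.step()

Only the probe parameters receive gradients here; the pre-trained encoder that produced the representations stays fixed, so the probe's accuracy reflects the quality of the representation itself.
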
Results
  • Classifier results are shown in Figure 1.
  • The authors can observe that ConveRT, TOD-BERT-jnt, and TOD-GPT2 achieve the best performance, implying that pre-training with dialogue-related data captures better representations, at least in these sub-tasks.
  • The performance of ConveRT and TOD-BERT-jnt suggests that it is helpful to pre-train with a response selection contrastive objective, especially when comparing TOD-BERT-jnt to TOD-BERT-mlm.
  • Most of the pre-trained models achieve similarly high micro-F1 scores in (d) system dialogue act prediction.
  • [Figure 1, panels (a) MWOZ Domain (System) and (b) OOS Intent (User): model rankings from the classifier probe.]
Conclusion
  • The authors investigate representations from pre-trained language models for task-oriented dialogue tasks, including domain identification, intent detection, slot tagging, and dialogue act prediction.
  • From the ranking results of the two probing methods, the authors present a list of interesting observations to provide model selection guidelines and shed light on future research towards more advanced language model learning for dialogue applications
Tables
  • Table 1: An overview of the selected pre-trained language models (details in Section 2)
  • Table 2: Label classes in the MWOZ data
  • Table 3: Clustering results of the ConveRT model. Samples are picked from each of five randomly selected clusters (K=32). We can roughly label a topic for each cluster
  • Table 4: OOS intent
  • Table 5: The data statistics are from Wu et al. (2020)
  • Table 6: Clustering results of the TOD-BERT-jnt model. Samples are randomly picked from each of five randomly selected clusters (K=32)
  • Table 7: Clustering results of the GPT2 model. Samples are randomly picked from each of five randomly selected clusters (K=32)
  • Table 8: Clustering results of the DialoGPT model. Samples are randomly picked from each of five randomly selected clusters (K=32)
Funding
  • Dialogue slot information (panel (c)), meanwhile, is not well captured by these representations, resulting in a micro-F1 lower than 30%
Study subjects and analysis
datasets: 9
In this paper, following TOD-BERT's idea, we train a task-oriented GPT2 model (TOD-GPT2) built on the GPT2 model and further pre-trained with task-oriented datasets. We use the same dataset collection, which contains nine datasets in total, as shown in Wu et al. (2020), to pre-train the model as a reference. We define a dialogue corpus D = {D_1, ..., D_M} with M dialogue samples, where each dialogue sample D_m has T turns of conversational exchange
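
Written out, a notation sketch consistent with the utterance symbols U_t and S_t used in the Methods summary (the pairing of user and system turns is an assumption here; the paper's exact formulation may differ):

    D = \{D_1, \ldots, D_M\}, \qquad D_m = \{(U_1, S_1), \ldots, (U_T, S_T)\}

where U_t denotes the user utterance and S_t the system response at turn t, each carrying human-annotated domain, intent, slot, and dialogue-act labels.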

datasets: 9
To obtain each representation, we run most of the pre-trained models using the HuggingFace library (Wolf et al., 2019a), except ConveRT and TOD-BERT. We fine-tune GPT2 using its default hyper-parameters and the same nine datasets as shown in Wu et al. (2020) to train the TOD-GPT2 model. For classifier probing, we fine-tune the top layer with a consistent hyper-parameter setting
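
A minimal sketch of this representation-extraction step with the HuggingFace Transformers library, assuming mean pooling over the final hidden states (the checkpoint name, example utterances, and pooling choice are illustrative; the authors' exact settings may differ):

    import torch
    from transformers import AutoModel, AutoTokenizer

    model_name = "bert-base-uncased"  # stand-in for any checkpoint under study
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name).eval()

    utterances = [
        "I am sorry, the booking was unsuccessful.",
        "I need a train ticket to Cambridge.",
    ]

    with torch.no_grad():
        batch = tokenizer(utterances, padding=True, truncation=True, return_tensors="pt")
        hidden = model(**batch).last_hidden_state             # (batch, seq_len, hidden_dim)
        mask = batch["attention_mask"].unsqueeze(-1).float()  # zero out padding positions
        reps = (hidden * mask).sum(dim=1) / mask.sum(dim=1)   # mean-pooled utterance vectors

    print(reps.shape)  # torch.Size([2, 768]) for bert-base-uncased

These fixed vectors are what the classifier probe and the clustering analysis below operate on.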

clusters and samples per cluster: 5
What utterances are clustered together? In Table 3, we show clustering examples of system responses from the top-performing model, ConveRT. We use K = 32 clusters and randomly select five clusters with five samples each. We found that most of the utterances in cluster 1 are related to an unsuccessful booking, containing "I am sorry," "solidly booked," or "booking was unsuccessful." We also found other clusters showing good clustering results, such as selecting a departure or arrival time for a train ticket or requesting more user preferences for a restaurant reservation
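
A minimal sketch of this clustering analysis with scikit-learn, using its chance-adjusted mutual information score as a stand-in for the paper's ANMI metric; the placeholder data and variable names are illustrative only:

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.metrics import adjusted_mutual_info_score

    # Placeholder inputs standing in for pooled utterance representations,
    # their raw text, and gold labels (e.g., dialogue acts) used for scoring.
    rng = np.random.default_rng(0)
    reps = rng.standard_normal((200, 768))
    utterances = [f"utterance {i}" for i in range(200)]
    gold_labels = rng.integers(0, 10, size=200)

    K = 32
    assignments = KMeans(n_clusters=K, n_init=10, random_state=0).fit_predict(reps)

    # Unsupervised probe: agreement between the representation clustering and the
    # clustering induced by the gold labels, adjusted for chance.
    print("Adjusted MI:", adjusted_mutual_info_score(gold_labels, assignments))

    # Qualitative check: sample five clusters and print five member utterances from each.
    for cluster_id in rng.choice(K, size=5, replace=False):
        members = np.where(assignments == cluster_id)[0]
        for i in rng.choice(members, size=min(5, len(members)), replace=False):
            print(f"cluster {cluster_id}: {utterances[i]}")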

References
  • Yossi Adi, Einat Kermany, Yonatan Belinkov, Ofer Lavi, and Yoav Goldberg. 2016. Fine-grained analysis of sentence embeddings using auxiliary prediction tasks. arXiv preprint arXiv:1608.04207.
  • Siqi Bao, Huang He, Fan Wang, and Hua Wu. 2019. PLATO: Pre-trained dialogue generation model with discrete latent variable. arXiv preprint arXiv:1910.07931.
  • Yonatan Belinkov, Nadir Durrani, Fahim Dalvi, Hassan Sajjad, and James Glass. 2017. What do neural machine translation models learn about morphology? In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 861–872, Vancouver, Canada. Association for Computational Linguistics.
  • Paweł Budzianowski and Ivan Vulic. 2019.
  • Paweł Budzianowski, Tsung-Hsien Wen, Bo-Hsiang Tseng, Inigo Casanueva, Stefan Ultes, Osman Ramadan, and Milica Gasic. 2018. MultiWOZ: A large-scale multi-domain Wizard-of-Oz dataset for task-oriented dialogue modelling. arXiv preprint arXiv:1810.00278.
  • Kevin Clark, Minh-Thang Luong, Quoc V Le, and Christopher D Manning. 2020. ELECTRA: Pre-training text encoders as discriminators rather than generators. arXiv preprint arXiv:2003.10555.
  • Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
  • Matthew Henderson, Inigo Casanueva, Nikola Mrksic, Pei-Hao Su, Ivan Vulic, et al. 2019. ConveRT: Efficient and accurate conversational representations from transformers. arXiv preprint arXiv:1911.03688.
  • John Hewitt and Christopher D Manning. 2019. A structural probe for finding syntax in word representations. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4129–4138.
  • Jeff Johnson, Matthijs Douze, and Herve Jegou. 2017. Billion-scale similarity search with GPUs. arXiv preprint arXiv:1702.08734.
  • Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2019. ALBERT: A lite BERT for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942.
  • Stefan Larson, Anish Mahendran, Joseph J Peper, Christopher Clarke, Andrew Lee, Parker Hill, Jonathan K Kummerfeld, Kevin Leach, Michael A Laurenzano, Lingjia Tang, et al. 2019. An evaluation dataset for intent classification and out-of-scope prediction. arXiv preprint arXiv:1909.02027.
  • Zhaojiang Lin, Andrea Madotto, Genta Indra Winata, and Pascale Fung. 2020. MinTL: Minimalist transfer learning for task-oriented dialogue systems. arXiv preprint arXiv:2009.12005.
  • Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.
  • Stuart Lloyd. 1982. Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2):129–137.
  • Ilya Loshchilov and Frank Hutter. 2017. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101.
  • Tiago Pimentel, Josef Valvoda, Rowan Hall Maudslay, Ran Zmigrod, Adina Williams, and Ryan Cotterell. 2020. Information-theoretic probing for linguistic structure. arXiv preprint arXiv:2004.03061.
  • Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners.
  • Douglas A Reynolds. 2009. Gaussian mixture models. Encyclopedia of Biometrics, 741.
  • Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2019. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108.
  • Hannes Schulz, Jeremie Zumer, Layla El Asri, and Shikhar Sharma. 2017. A frame tracking model for memory-enhanced dialogue systems. arXiv preprint arXiv:1706.01690.
  • Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.
  • Nguyen Xuan Vinh, Julien Epps, and James Bailey. 2010. Information theoretic measures for clusterings comparison: Variants, properties, normalization and correction for chance. The Journal of Machine Learning Research, 11:2837–2854.
  • Zhongyu Wei, Qianlong Liu, Baolin Peng, Huaixiao Tou, Ting Chen, Xuan-Jing Huang, Kam-Fai Wong, and Xiang Dai. 2018. Task-oriented dialogue system for automatic diagnosis. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 201–207.
  • Tsung-Hsien Wen, Milica Gasic, Nikola Mrksic, Lina Maria Rojas-Barahona, Pei-Hao Su, Stefan Ultes, David Vandyke, and Steve J. Young. 2017. A network-based end-to-end trainable task-oriented dialogue system. In EACL.
  • Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. 2019a. HuggingFace's Transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771.
  • Thomas Wolf, Victor Sanh, Julien Chaumond, and Clement Delangue. 2019b. TransferTransfo: A transfer learning approach for neural network based conversational agents. arXiv preprint arXiv:1901.08149.
  • Chien-Sheng Wu, Steven Hoi, Richard Socher, and Caiming Xiong. 2020. TOD-BERT: Pre-trained natural language understanding for task-oriented dialogues. arXiv preprint arXiv:2004.06871.
  • Chien-Sheng Wu, Andrea Madotto, Ehsan Hosseini-Asl, Caiming Xiong, Richard Socher, and Pascale Fung. 2019a. Transferable multi-domain state generator for task-oriented dialogue systems. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics.
  • Chien-Sheng Wu, Richard Socher, and Caiming Xiong. 2019b. Global-to-local memory pointer networks for task-oriented dialogue. In Proceedings of the 7th International Conference on Learning Representations.
  • Yizhe Zhang, Siqi Sun, Michel Galley, Yen-Chun Chen, Chris Brockett, Xiang Gao, Jianfeng Gao, Jingjing Liu, and Bill Dolan. 2019. DialoGPT: Large-scale generative pre-training for conversational response generation. arXiv preprint arXiv:1911.00536.