The Dialog State Tracking Challenge Series: A Review.
Dialogue & Discourse, no. 3 (2016): 4–33
In a spoken dialog system, dialog state tracking deduces information about the user’s goal as the dialog progresses, synthesizing evidence such as dialog acts over multiple turns with external data sources. Recent approaches have been shown to overcome ASR and SLU errors in some applications. However, there are currently no common testbeds or evaluation measures for this task.
- Speech recognition (ASR) and spoken language understanding (SLU) errors are common, and can cause the system to misunderstand the user’s needs.
- Most commercial systems use hand-crafted heuristics for state tracking, selecting the SLU result with the highest confidence score, and discarding alternatives.
- Statistical approaches compute scores for many hypotheses for the dialog state (Figure 1).
- By exploiting correlations between turns and information from external data sources – such as maps, bus timetables, or models of past dialogs – statistical approaches can overcome some SLU errors
- Teams were asked to process the test dialogs online – i.e., to make a single pass over the data, as if the tracker were being run in deployment
- The data, evaluation tools, and baselines will continue to be freely available to the research community (DSTC, 2013)
- The results of the challenge show that the suite of performance metrics cluster into 4 natural groups
- We find that larger gains over conventional rule-based baselines are present in dialog systems where the speech recognition confidence score has poor discrimination
- We observe substantial limitations on generalization: in mismatched conditions, around half of the trackers entered did not exceed the performance of two simple baselines
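As a rough illustration of the statistical approach described in the bullets above (not the algorithm of any particular tracker entered in the challenge), the sketch below accumulates SLU confidence for slot-value hypotheses across turns, so that a value heard repeatedly with moderate confidence can overtake a one-off misrecognition. The function name and bus-route values are invented for the example.

```python
# Hypothetical sketch of multi-hypothesis dialog state tracking: each item
# on the SLU N-best list contributes evidence to a slot-value hypothesis,
# and scores accumulate across turns.
from collections import defaultdict

def update_beliefs(beliefs, slu_nbest):
    """Accumulate SLU evidence into per-value beliefs for one slot.

    beliefs: mapping of value -> unnormalized score carried over from prior turns
    slu_nbest: list of (value, confidence) pairs from the current turn's SLU
    """
    for value, conf in slu_nbest:
        beliefs[value] += conf
    # Renormalize so the scores behave like a distribution over hypotheses.
    total = sum(beliefs.values()) or 1.0
    return {v: s / total for v, s in beliefs.items()}

# Turn 1: an ASR error puts the wrong route on top of the N-best list.
beliefs = update_beliefs(defaultdict(float), [("61a", 0.5), ("61c", 0.4)])
# Turn 2: the user repeats themselves; "61c" is recognized again.
beliefs = update_beliefs(defaultdict(float, beliefs), [("61c", 0.5), ("61b", 0.3)])
# "61c" has now accumulated the most evidence across the two turns.
best = max(beliefs, key=beliefs.get)
```

A hand-crafted heuristic that keeps only the single highest-confidence SLU result would have committed to "61a" at turn 1; the accumulated scores recover from that error.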
Results and discussion
Logistically, the training data and labels, bus timetable database, scoring scripts, and baseline system were publicly released in late December 2012.
- The test data was released on 22 March 2013, and teams were given a week to run their trackers and send results back to the organizers for evaluation.
- Here the authors see 4 natural clusters emerge: a cluster for correctness with Accuracy, MRR, and the ROC.V1.CA measures; a cluster for probability quality with L2 and Average score; and two clusters for score discrimination – one with ROC.V1.EER and the other with the three ROC.V2 metrics.
- Results in Figure 4 emphasize that different trackers are tuned for different performance measures, and the optimal tracking algorithm depends crucially on the target performance measure
- The dialog state tracking challenge has provided the first common testbed for this task.
- The details of the trackers themselves will be published at SIGDIAL 2013.
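To make the performance measures concrete, here is a minimal sketch of simplified versions of three metrics named above – Accuracy, L2, and MRR – computed over per-turn hypothesis scores. The exact DSTC definitions differ (e.g., in how the REST meta-hypothesis and evaluation schedules are handled), and the data here is invented.

```python
# Simplified per-turn tracker metrics. Each turn is a pair
# (scores, label): a dict of hypothesis -> score, and the correct hypothesis.
import math

def accuracy(turns):
    """Fraction of turns where the top-scoring hypothesis is correct."""
    return sum(max(scores, key=scores.get) == label for scores, label in turns) / len(turns)

def l2(turns):
    """Mean L2 distance between the score vector and the 0/1 correctness vector."""
    total = 0.0
    for scores, label in turns:
        total += math.sqrt(sum((s - (h == label)) ** 2 for h, s in scores.items()))
    return total / len(turns)

def mrr(turns):
    """Mean reciprocal rank of the correct hypothesis."""
    total = 0.0
    for scores, label in turns:
        ranked = sorted(scores, key=scores.get, reverse=True)
        total += 1.0 / (ranked.index(label) + 1)
    return total / len(turns)

turns = [
    ({"61c": 0.7, "61a": 0.3}, "61c"),  # correct hypothesis ranked first
    ({"61c": 0.4, "61a": 0.6}, "61c"),  # correct hypothesis ranked second
]
```

Note how the two example turns already pull the measures apart: Accuracy penalizes the second turn fully, while MRR gives partial credit for ranking the correct hypothesis second – one reason trackers tuned for different measures behave differently.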
- Table 1: Summary of the datasets. One turn includes a system output and a user response. Slots are named entity types such as bus route, origin neighborhood, date, time, etc. N-best SLU Recall indicates the fraction of concepts which appear anywhere on the SLU N-best list
- The organizers also thank Ian Lane for his support for transcription, and Microsoft and Honda Research Institute USA for funding the challenge
When a transcription exactly and unambiguously matched a recognized slot value, such as the bus route “sixty one c”, labels were assigned automatically. The remainder were assigned using crowdsourcing, where three workers were shown the true words spoken and the recognized concept, and asked to indicate if the recognized concept was correct – even if it did not match the recognized words exactly. Workers were also shown dialog history, which helps decipher the user’s meaning when their speech was ambiguous. If the 3 workers were not unanimous in their labels (about 4% of all turns), the item was labeled manually by the organizers. The REST meta-hypothesis was not explicitly labeled; rather, it was deemed to be correct if none of the prior SLU results were labeled as correct.
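The adjudication logic just described can be sketched in a few lines; the function names are invented for illustration, and the actual labeling was done with crowdsourcing tooling rather than code like this.

```python
# Hypothetical sketch of the label-adjudication rules: a crowd label is
# accepted only when the three workers agree; otherwise the item goes to
# manual labeling. The REST meta-hypothesis is derived, not labeled
# directly: it is correct exactly when no explicit SLU hypothesis is.

def resolve_label(worker_votes):
    """worker_votes: three booleans answering 'is the recognized concept correct?'

    Returns the agreed label, or None to signal manual labeling is needed.
    """
    if len(set(worker_votes)) == 1:
        return worker_votes[0]  # unanimous: accept the shared label
    return None                 # disagreement (~4% of turns): label manually

def rest_is_correct(slu_labels):
    """REST is correct iff none of the explicit SLU hypotheses is correct."""
    return not any(slu_labels)
```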
- AW Black, S Burger, B Langner, G Parent, and M Eskenazi. 2010. Spoken dialog challenge 2010. In Proc SLT, Berkeley.
- D Bohus and AI Rudnicky. 2006. A ‘K hypotheses + other’ belief updating model. In Proc AAAI Workshop on Statistical and Empirical Approaches for Spoken Dialogue Systems, Boston.
- Dialog State Tracking Challenge Homepage. 2013. http://research.microsoft.com/events/dstc/.
- H Higashinaka, M Nakano, and K Aikawa. 2003. Corpus-based discourse understanding in spoken dialogue systems. In Proc ACL, Sapporo.
- D Huggins-Daines, M Kumar, A Chan, A W Black, M Ravishankar, and A I Rudnicky. 2006. PocketSphinx: A Free, Real-Time Continuous Speech Recognition System for Hand-Held Devices. In Proc ICASSP, Toulouse.
- M Kendall. 1938. A new measure of rank correlation. Biometrika, 30(1-2):81–89.
- Y Ma, A Raux, D Ramachandran, and R Gupta. 2012. Landmark-based location belief tracking in a spoken dialog system. In Proc SigDial, Seoul.
- N Mehta, R Gupta, A Raux, D Ramachandran, and S Krawczyk. 2010. Probabilistic ontology trees for belief tracking in dialog systems. In Proc SigDial, Tokyo.
- T Paek and E Horvitz. 2000. Conversation as action under uncertainty. In Proc UAI, Stanford, pages 455–464.
- G Parent and M Eskenazi. 2010. Toward Better Crowdsourced Transcription: Transcription of a Year of the Let’s Go Bus Information System Data. In Proc SLT, Berkeley.
- B Thomson and SJ Young. 2010. Bayesian update of dialogue state: A POMDP framework for spoken dialogue systems. Computer Speech and Language, 24(4):562–588.
- JD Williams and SJ Young. 2007. Partially observable Markov decision processes for spoken dialog systems. Computer Speech and Language, 21(2):393–422.
- JD Williams, A Raux, D Ramachandran, and AW Black. 2012. Dialog state tracking challenge handbook. Technical report, Microsoft Research.
- JD Williams. 2010. Incremental partition recombination for efficient tracking of multiple dialogue states. In Proc ICASSP.
- SJ Young, M Gasic, S Keizer, F Mairesse, J Schatzmann, B Thomson, and K Yu. 2010. The hidden information state model: a practical framework for POMDP-based spoken dialogue management. Computer Speech and Language, 24(2):150–174.