The TechQA Dataset

Vittorio Castelli, Rishav Chakravarti, Saswati Dana, Anthony Ferritto, Dinesh Garg, Dinesh Khandelwal, Mike McCawley, Mohamed Nasr

ACL, pp. 1269-1278, 2020.

Abstract:

We introduce TechQA, a domain-adaptation question answering dataset for the technical support domain. The TechQA corpus highlights two real-world issues from the automated customer support domain. First, it contains actual questions posed by users on a technical forum, rather than questions generated specifically for a competition or a …

Introduction
  • There is a tension between the development of novel capabilities in the early phases of the technology lifecycle, using unlimited data and compute power, and the later development of practical solutions as that technology matures.
  • The challenges of creating practical solutions are twofold: developing robust, efficient algorithms and curating appropriate training data.
  • The authors have manually selected question-answer pairs that are suitable for machine reading comprehension techniques, and have reserved for future datasets questions whose answers span multiple separate passages or documents, as well as those that require reasoning or substantial real-world knowledge.
  • The authors release 600 questions for training purposes, of which 150 are not answerable from the provided data, as well as 160 answerable and 150 non-answerable questions as a development set.
  • The authors have reserved 490 questions, with answerable/non-answerable statistics similar to the development set, as a blind test set.
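The split sizes above can be sanity-checked against the released question files. Below is a minimal sketch; the file names and the "ANSWERABLE" field are illustrative assumptions, not the official TechQA schema.

```python
import json
from collections import Counter

def split_stats(path):
    """Count answerable vs. non-answerable questions in one split.

    Assumes a JSON file containing a list of question records, each with
    an "ANSWERABLE" field set to "Y" or "N" (hypothetical field name).
    """
    with open(path, encoding="utf-8") as f:
        questions = json.load(f)
    counts = Counter(q.get("ANSWERABLE", "N") for q in questions)
    return {"total": len(questions),
            "answerable": counts.get("Y", 0),
            "non_answerable": counts.get("N", 0)}

# Expected per the paper: train -> 600 total, 150 non-answerable;
# dev -> 160 answerable + 150 non-answerable.
for name, path in [("train", "training_Q_A.json"), ("dev", "dev_Q_A.json")]:
    print(name, split_stats(path))
```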
Highlights
  • There is a tension between the development of novel capabilities in the early phases of the technology lifecycle, using unlimited data and compute power, and the later development of practical solutions as that technology matures
  • We have manually selected question-answer pairs that are suitable for machine reading comprehension techniques, and have reserved for future datasets questions whose answers span multiple separate passages or documents, as well as those that require reasoning or substantial real-world knowledge
  • In preparation for TECHQA, the annotators were trained to annotate Technotes for mention detection according to an unreleased type system we developed for IT technical support
  • Our baselines are a model trained on SQuAD 2.0, a model trained on NQ, and the Technical Answer Prediction system submitted to the HOTPOTQA leaderboard
  • We have introduced TECHQA, a question-answering dataset for the IT technical support domain
  • The overall size of the released data (600 training questions) is in line with real-world scenarios, where the high cost of domain expert time limits the amount of quality data that can reasonably be collected
Results
  • The authors use a model pre-trained on SQuAD 2.0 and one pre-trained on NQ, which is SOTA on the NQ short-answer leaderboard.
  • Both models start from the BERT-LARGE language model (Devlin et al., 2019), which takes as input a word-piece token sequence X = (x_1, ..., x_n) of at most 512 tokens.
  • For the QA or MRC task, X consists of a [CLS] token, followed by the question tokens, a [SEP] token, the document tokens, and a final [SEP] token.
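As a concrete illustration of this input layout, the sketch below builds the [CLS] question [SEP] document [SEP] sequence with a word-piece tokenizer. It assumes the Hugging Face transformers library and a generic BERT-large checkpoint; it is not the authors' exact baseline code, and the question/document strings are made up.

```python
from transformers import BertTokenizerFast

# Generic BERT-large word-piece tokenizer (illustrative; the released
# baselines start from their own pre-trained checkpoints).
tokenizer = BertTokenizerFast.from_pretrained("bert-large-uncased")

question = "Why does the server fail to start after applying the fix pack?"
document = "Technote body text describing the problem and its resolution ..."

# Builds X = [CLS] question tokens [SEP] document tokens [SEP],
# truncating the document so the pair fits in 512 word-piece tokens.
encoding = tokenizer(
    question,
    document,
    max_length=512,
    truncation="only_second",    # keep the question, truncate the document
    return_offsets_mapping=True  # character offsets, useful for span extraction
)

tokens = tokenizer.convert_ids_to_tokens(encoding["input_ids"])
print(tokens[:8], "...", tokens[-3:])
```

In practice, Technotes are much longer than 512 word pieces, so documents are typically split into overlapping windows and each window is paired with the question in this format.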
Conclusion
  • Discussion and Future Work: The authors have introduced TECHQA, a question-answering dataset for the IT technical support domain. The overall size of the released data (600 training questions) is in line with real-world scenarios, where the high cost of domain expert time limits the amount of quality data that can reasonably be collected.
  • The dataset is meant to stimulate research in domain adaptation, as well as the development of algorithms that handle longer questions and answers than those in current leaderboard datasets.
  • The authors have created a leaderboard to evaluate systems against a blind dataset of 490 questions with a ratio of answerable to non-answerable questions similar to that of the development set.
  • The leaderboard ranks submissions according to a metric consisting of the character-overlap F1 measure for answerable questions and the zero-one metric for non-answerable questions.
  • The leaderboard reports the F1 of the top result and the F1 for the top 5 results, computed over the answerable test questions.
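The following is a minimal sketch of this evaluation logic, combining character-overlap F1 for answerable questions with a zero-one score for non-answerable ones. The span representation and aggregation are simplifying assumptions, not the official scorer.

```python
def char_overlap_f1(pred_span, gold_span):
    """Character-overlap F1 between two (start, end) character spans
    in the same document; end is exclusive."""
    ps, pe = pred_span
    gs, ge = gold_span
    overlap = max(0, min(pe, ge) - max(ps, gs))
    if overlap == 0:
        return 0.0
    precision = overlap / (pe - ps)
    recall = overlap / (ge - gs)
    return 2 * precision * recall / (precision + recall)

def question_score(pred, gold):
    """Score one question.

    pred/gold are None for "no answer", otherwise (doc_id, start, end).
    Non-answerable questions get a zero-one score; answerable ones get
    character-overlap F1 (zero if the predicted document is wrong).
    """
    if gold is None:                      # non-answerable question
        return 1.0 if pred is None else 0.0
    if pred is None or pred[0] != gold[0]:
        return 0.0
    return char_overlap_f1(pred[1:], gold[1:])

# Example: a prediction covering half of a 100-character gold answer.
print(question_score(("doc1", 0, 50), ("doc1", 0, 100)))  # ~0.667
```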
Tables
  • Table1: Statistics of questions from the forums. The questions with a Technote in the accepted link were manually annotated by our annotators
  • Table2: Statistics of the question and answer lengths in white-space-separated tokens for SQuAD 2.0, HOTPOTQA and TECHQA
  • Table3: Our baseline systems on the dev set. Here, ‘−FT’ indicates no fine-tuning, i.e., we use the pre-trained SQuAD 2.0 and NQ models as-is, while ‘+FT’ indicates further fine-tuning on the TECHQA corpus. Entries marked with ‘∗’ use a threshold tuned on the development set using the F1 metric; hence, F1 equals BEST F1
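The ‘∗’ entries correspond to picking a no-answer confidence threshold on the development set. Below is a minimal sketch of that tuning loop, assuming each dev prediction carries a confidence score and using a per-question metric like the one sketched in the Conclusion section above; all names are illustrative, not the authors' code.

```python
def tune_no_answer_threshold(dev_predictions, score_fn):
    """Pick the confidence threshold that maximizes dev-set F1.

    dev_predictions: list of (confidence, pred_span_or_None, gold_span_or_None);
    predictions whose confidence falls below the threshold are converted
    to "no answer".
    score_fn: per-question metric, e.g. question_score above.
    """
    candidate_thresholds = sorted({conf for conf, _, _ in dev_predictions})
    best_threshold, best_f1 = None, -1.0
    for t in candidate_thresholds:
        scores = [score_fn(pred if conf >= t else None, gold)
                  for conf, pred, gold in dev_predictions]
        f1 = sum(scores) / len(scores)
        if f1 > best_f1:
            best_threshold, best_f1 = t, f1
    return best_threshold, best_f1
```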
Related work
  • Recent notable datasets for Machine Reading Comprehension (henceforth, MRC) include SQuAD 1.1 (Rajpurkar et al., 2016), SQuAD 2.0 (Rajpurkar et al., 2018), NarrativeQA (Kocisky et al., 2018) and HOTPOTQA (Yang et al., 2018). They have stimulated a tremendous amount of research, and the associated leaderboards have seen broad participation across the MRC field. A common problem of the earlier MRC datasets is observation bias. Specifically, these datasets contain questions and answers written by annotators who first read the paragraph that may contain an answer and then wrote the corresponding questions. Hence, the question and the paragraph have substantial lexical overlap. Additionally, systems trained on SQuAD 1.1 could be easily fooled by the insertion of distractor sentences that should not change the answer, as shown by Jia and Liang (2017). As a result, SQuAD 2.0 added “unanswerable” questions. However, large pre-trained language models (Devlin et al., 2019; Liu et al., 2019) were able to achieve super-human performance on that dataset as well in less than a year; this suggests that the evidence needed to correctly identify unanswerable questions is also present in the paragraphs as specific patterns, such as antonyms.
Funding
  • Introduces TECHQA, a domain-adaptation question answering dataset for the technical support domain
  • Presents statistics of the dataset in Section 4, introduces the associated leaderboard task in Section 5, and presents baseline results obtained by fine-tuning MRC systems built for Natural Questions and HOTPOTQA in Section 6
Reference
  • InsuranceQA, a question answering corpus in the insurance domain. https://github.com/shuzi/insuranceQA. Last commit: January 16, 2017.
  • Chris Alberti, Kenton Lee, and Michael Collins. 2019. A BERT baseline for the natural questions. arXiv preprint arXiv:1901.08634, pages 1–4.
  • Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan Majumder, Andrew McNamara, Bhaskar Mitra, Tri Nguyen, et al. 2016. MS MARCO: A human generated machine reading comprehension dataset. arXiv preprint arXiv:1611.09268.
  • Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT.
  • Robin Jia and Percy Liang. 2017. Adversarial examples for evaluating reading comprehension systems. arXiv preprint arXiv:1707.07328.
  • Tomas Kocisky, Jonathan Schwarz, Phil Blunsom, Chris Dyer, Karl Moritz Hermann, Gabor Melis, and Edward Grefenstette. 2018. The NarrativeQA reading comprehension challenge. TACL, 6:317–328.
  • Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Matthew Kelcey, Jacob Devlin, Kenton Lee, Kristina N. Toutanova, Llion Jones, Ming-Wei Chang, Andrew Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. 2019. Natural Questions: a benchmark for question answering research. TACL.
  • Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. CoRR, abs/1907.11692.
  • Lin Pan, Rishav Chakravarti, Anthony Ferritto, Michael Glass, Alfio Gliozzo, Salim Roukos, Radu Florian, and Avirup Sil. 2019. Frustratingly easy natural question answering.
  • Pranav Rajpurkar, Robin Jia, and Percy Liang. 2018. Know what you don’t know: Unanswerable questions for SQuAD. arXiv preprint arXiv:1806.03822.
  • Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. EMNLP.
  • George Tsatsaronis, Georgios Balikas, Prodromos Malakasiotis, Ioannis Partalas, Matthias Zschunke, Michael R Alvers, Dirk Weissenborn, Anastasia Krithara, Sergios Petridis, Dimitris Polychronopoulos, et al. 2015. An overview of the BioASQ large-scale biomedical semantic indexing and question answering competition. BMC Bioinformatics, 16(1):138.
  • Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008. Curran Associates, Inc.
  • Georg Wiese, Dirk Weissenborn, and Mariana Neves. 2017. Neural domain adaptation for biomedical question answering. CoNLL.
  • Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W Cohen, Ruslan Salakhutdinov, and Christopher D Manning. 2018. HotpotQA: A dataset for diverse, explainable multi-hop question answering. arXiv preprint arXiv:1809.09600.