Look at the First Sentence: Position Bias in Question Answering

EMNLP 2020, pp. 1109–1121.


Abstract

Many extractive question answering models are trained to predict start and end positions of answers. The choice of predicting answers as positions is mainly due to its simplicity and effectiveness. In this study, we hypothesize that when the distribution of the answer positions is highly skewed in the training set (e.g., answers lie only ...

Introduction
  • Question answering (QA) is the task of answering questions given a passage. Large-scale QA datasets have attracted many researchers to build effective QA models, and with the advent of deep learning, recent QA models outperform humans on some datasets (Rajpurkar et al., 2016; Devlin et al., 2019; Yang et al., 2019).
  • Extractive QA is the task that assumes that answers always lie in the passage.
  • Based on this task assumption, various QA models are trained to predict the start and end positions of the answers.
  • Following the structure of earlier deep learning-based QA models (Wang and Jiang, 2016), recent models adopt the same position-prediction formulation.
  • [Figure: example of biased training data in which all answers lie in the k-th sentence; Example #1 shows a context with its 1st through (k+1)-th sentences.]
Highlights
  • Question answering (QA) is the task of answering questions given a passage.
  • We demonstrate that models predicting answer positions can be severely biased when trained on datasets that have a very skewed answer position distribution.
  • The word-level answer prior does not seem to provide strong signals of position bias, as its distribution is much softer than the sentence-level answer prior.
  • While exploiting the positional distribution of the training set could be more helpful when evaluating on a development set with a similar positional distribution, our method achieves a nontrivial improvement (+1.5% EM), showing that (1) our method works safely when the positional distribution does not change much, and (2) position bias might be harmful for the generalization of QA models.
  • As shown in Table 3, all three models suffer from position bias at every sentence position, while the learned-mixin method (+Learned-Mixin) successfully resolves the bias (a sketch of the ensembling objectives follows this list).
  • Our study shows that most QA models fail to generalize over different positions when trained on datasets having answers in a specific position.
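The +Bias Product and +Learned-Mixin methods referenced above follow the ensembling of Clark et al. (2019). Below is a minimal PyTorch-style sketch of the two training objectives, where `start_logits` comes from the QA model, `bias_log_probs` is a fixed log-prior over answer positions, and `gate_scalar` is a learned scalar; the names and shapes are our illustrative assumptions, not the authors' code:

```python
import torch.nn.functional as F

def bias_product_loss(start_logits, bias_log_probs, start_targets):
    """Bias-product ensembling (Clark et al., 2019): the QA model is trained on
    the sum of its own log-probabilities and a fixed log-prior of the position
    bias, so it does not need to re-learn what the bias already explains.
    `bias_log_probs` is assumed to be the log of an answer-position prior
    broadcast to token positions, shape (batch, seq_len)."""
    ensembled = F.log_softmax(start_logits, dim=-1) + bias_log_probs
    # cross_entropy re-normalizes, giving the NLL of softmax(log p_model + log p_bias)
    return F.cross_entropy(ensembled, start_targets)

def learned_mixin_loss(start_logits, bias_log_probs, start_targets, gate_scalar):
    """Learned-mixin variant: a learned, non-negative gate g(x) controls how much
    of the bias is mixed in. `gate_scalar` (shape (batch, 1)) is assumed to come
    from a small linear head on the pooled representation."""
    g = F.softplus(gate_scalar)          # keep the gate non-negative
    ensembled = F.log_softmax(start_logits, dim=-1) + g * bias_log_probs
    return F.cross_entropy(ensembled, start_targets)
```

The end-position loss is analogous. At inference time the bias term is dropped and only the QA model's own logits are used, which is what allows the de-biased model to generalize to unseen answer positions.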
Methods
  • To prevent models from learning a direct correlation between word positions and answers, the authors first introduce simple baselines for BERT such as randomized positions and entropy regularization.
  • Randomized Position: To avoid learning the direct correlation between word positions and answers, the authors randomly perturb the input positions.
  • They first randomly sample t indices from the range 0 to the maximum sequence length of BERT.
  • Always sorting the indices in ascending order could bias the models to learn that low position indices are more suitable for answers in the case of SQuAD_{k=1}^{train}.
  • The authors therefore randomly choose between ascending and descending order for each sample during training (see the sketch after this list).
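The two baselines above can be sketched as follows. This is a minimal illustration assuming a HuggingFace-style BERT that accepts explicit position_ids; the exact form of the entropy penalty is our assumption rather than the paper's implementation:

```python
import torch
import torch.nn.functional as F

def randomized_position_ids(seq_len: int, max_position: int = 512) -> torch.Tensor:
    """Sample `seq_len` distinct position indices from [0, max_position), sort them,
    and randomly flip to descending order half of the time. The result replaces
    BERT's default 0..seq_len-1 position ids so that absolute positions no longer
    correlate with answer locations."""
    ids = torch.randperm(max_position)[:seq_len]
    ids, _ = torch.sort(ids)                 # ascending
    if torch.rand(1).item() < 0.5:           # randomly switch to descending
        ids = torch.flip(ids, dims=[0])
    return ids

def entropy_regularized_loss(start_logits, start_targets, lam=0.1):
    """One plausible form of the entropy-regularization baseline: reward high
    entropy of the predicted start-position distribution so it cannot collapse
    onto the biased positions. `lam` is an illustrative weight, not the paper's."""
    log_p = F.log_softmax(start_logits, dim=-1)
    entropy = -(log_p.exp() * log_p).sum(dim=-1).mean()
    return F.cross_entropy(start_logits, start_targets) - lam * entropy

# Usage (assuming a HuggingFace-style BERT that accepts explicit position_ids):
# position_ids = randomized_position_ids(input_ids.size(1)).unsqueeze(0)
# outputs = model(input_ids, attention_mask=mask, position_ids=position_ids)
```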
Results
  • SQuAD_{k=1}^{train}: The results of applying various de-biasing methods to three models trained on SQuAD_{k=1}^{train} are shown in Table 2.
  • Without any de-biasing method, the performance of all models is very low on SQuAD_{k=2,3,...}^{dev} but fairly high on SQuAD_{k=1}^{dev}.
  • In Table 4 and Table 5, the authors show the results of applying the same methods to NewsQA and NaturalQuestions.
  • On both datasets, BERT trained on the biased subsets (k = 1 and k = 2, 3, ...) suffers significantly from position bias.
Conclusion
  • Most QA studies use the start and end positions of answers as training targets without much consideration.
  • The authors' study shows that most QA models fail to generalize over different positions when trained on datasets having answers in a specific position.
  • The authors introduce several de-biasing methods that make models ignore the spurious positional cues, and find that the sentence-level answer prior is very useful (a sketch of estimating this prior follows the list).
  • The authors' findings generalize to different positions and different datasets.
  • One limitation of the approach is that the method and analysis are based on a single-paragraph setting, which should be extended to a multi-paragraph setting to be more practically useful.
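The conclusion highlights the sentence-level answer prior as the most useful de-biasing signal. Below is a minimal sketch of how such a prior could be estimated from a training set; the field names `sentences` and `answer_start`, and the offset convention, are our assumptions, not the paper's data format:

```python
from collections import Counter

def answer_sentence_index(sentences, answer_start):
    """Index of the sentence containing a character-level answer offset.
    Offsets are assumed to be relative to the sentences joined back together."""
    offset = 0
    for i, sent in enumerate(sentences):
        offset += len(sent)
        if answer_start < offset:
            return i
    return len(sentences) - 1

def sentence_level_answer_prior(train_examples):
    """Empirical prior P(answer is in sentence k), estimated over the training set.
    Each example is assumed to carry pre-split `sentences` and an `answer_start`."""
    counts = Counter(
        answer_sentence_index(ex["sentences"], ex["answer_start"])
        for ex in train_examples
    )
    total = sum(counts.values())
    return {k: c / total for k, c in counts.items()}
```

In the ensembling sketched earlier, the log of this prior (expanded from sentence level to each sentence's token positions) would play the role of `bias_log_probs`.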
Tables
  • Table 1: Performance of QA models trained on the biased SQuAD dataset (SQuAD_{k=1}^{train}) and tested on SQuAD^{dev}. ∆ denotes the difference in F1 score from SQuAD^{train}. See Section 2.1 for more details.
  • Table 2: Results of applying de-biasing methods. Each model is evaluated on SQuAD^{dev} and two subsets: SQuAD_{k=1}^{dev} and SQuAD_{k=2,3,...}^{dev}.
  • Table 3: Position bias at different positions. Each model is trained on a biased SQuAD dataset (SQuAD_{k}^{train}) and evaluated on SQuAD^{dev}.
  • Table 4: F1 scores on NewsQA. Models are evaluated on the original development set (NewsQA^{dev}).
  • Table 5: F1 scores on NaturalQuestions. Models are evaluated on the original development set (NQ^{dev}).
Related Work
Funding
  • This research was supported by the National Research Foundation of Korea (NRF-2017R1A2A1A17069645, NRF-2017M3C4A7065887).
Data and Analysis
  • Sizes of the biased SQuAD training subsets in which all answers lie in the k-th sentence (from a table reporting F1 for XLNet with +Bias Product and +Learned-Mixin): k = 2: 20,593 samples; k = 3: 15,567 samples; k = 4: 10,379 samples; k = 5, 6, ...: 12,610 samples.
  • Implementation details: From NewsQA and NaturalQuestions, two sub-training datasets are constructed, one containing only the samples whose answers are annotated in the first sentence (D_{k=1}^{train}) and one containing the remaining samples (D_{k=2,3,...}^{train}). For a fair comparison, the two sub-training sets are fixed to 17,000 samples (NewsQA) and 40,000 samples (NaturalQuestions). More details on the preprocessing of NewsQA and NaturalQuestions are in Appendix A. A sketch of this split appears below.
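Below is a minimal sketch of constructing the biased sub-training sets D_{k=1}^{train} and D_{k=2,3,...}^{train} described above. The example format (pre-split `sentences`, character-level `answer_start`) and all function names are illustrative assumptions; only the fixed subset sizes come from the paper:

```python
import random

def answer_sentence_index(sentences, answer_start):
    """Index of the sentence containing a character-level answer offset
    (offsets assumed relative to the concatenated sentences)."""
    offset = 0
    for i, sent in enumerate(sentences):
        offset += len(sent)
        if answer_start < offset:
            return i
    return len(sentences) - 1

def split_by_answer_position(examples, subset_size, seed=42):
    """Build the two biased sub-training sets: answers only in the first sentence
    (k = 1) vs. all remaining positions (k = 2, 3, ...), each downsampled to the
    same `subset_size` for a fair comparison (17,000 for NewsQA and 40,000 for
    NaturalQuestions in the paper)."""
    first = [ex for ex in examples
             if answer_sentence_index(ex["sentences"], ex["answer_start"]) == 0]
    rest = [ex for ex in examples
            if answer_sentence_index(ex["sentences"], ex["answer_start"]) > 0]
    rng = random.Random(seed)
    k1 = rng.sample(first, min(subset_size, len(first)))
    k_rest = rng.sample(rest, min(subset_size, len(rest)))
    return k1, k_rest

# e.g. newsqa_k1, newsqa_rest = split_by_answer_position(newsqa_train, subset_size=17000)
```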

References
  • Aishwarya Agrawal, Dhruv Batra, and Devi Parikh. 2016. Analyzing the behavior of visual question answering models. In EMNLP.
  • Aishwarya Agrawal, Dhruv Batra, Devi Parikh, and Aniruddha Kembhavi. 2018. Don't just assume; look and answer: Overcoming priors for visual question answering. In CVPR.
  • Yonatan Belinkov, Adam Poliak, Stuart M Shieber, Benjamin Van Durme, and Alexander Rush. 2019. On adversarial removal of hypothesis-only bias in natural language inference. In NAACL-HLT.
  • Danqi Chen, Jason Bolton, and Christopher D Manning. 2016. A thorough examination of the CNN/Daily Mail reading comprehension task. In ACL.
  • Jifan Chen and Greg Durrett. 2019. Understanding dataset design choices for multi-hop reasoning. In NAACL-HLT.
  • Christopher Clark, Mark Yatskar, and Luke Zettlemoyer. 2019. Don't take the easy way out: Ensemble based methods for avoiding known dataset biases. In EMNLP-IJCNLP.
  • Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL.
  • Pedro Domingos. 1999. MetaCost: A general method for making classifiers cost-sensitive. In SIGKDD.
  • Dheeru Dua, Yizhong Wang, Pradeep Dasigi, Gabriel Stanovsky, Sameer Singh, and Matt Gardner. 2019. DROP: A reading comprehension benchmark requiring discrete reasoning over paragraphs. In NAACL-HLT.
  • Adam Fisch, Alon Talmor, Robin Jia, Minjoon Seo, Eunsol Choi, and Danqi Chen. 2019. MRQA 2019 shared task: Evaluating generalization in reading comprehension. In Proceedings of the 2nd Workshop on Machine Reading for Question Answering.
  • Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. 2017. Making the V in VQA matter: Elevating the role of image understanding in visual question answering. In CVPR.
  • He He, Sheng Zha, and Haohan Wang. 2019. Unlearn dataset bias in natural language inference by fitting the residual. EMNLP-IJCNLP 2019.
  • Geoffrey E Hinton. 2002. Training products of experts by minimizing contrastive divergence. Neural Computation.
  • Chen Huang, Yining Li, Chen Change Loy, and Xiaoou Tang. 2016. Learning deep representation for imbalanced classification. In CVPR.
  • Nathalie Japkowicz and Shaju Stephen. 2002. The class imbalance problem: A systematic study. Intelligent Data Analysis.
  • Robin Jia and Percy Liang. 2017. Adversarial examples for evaluating reading comprehension systems. In EMNLP.
  • Justin Johnson, Bharath Hariharan, Laurens van der Maaten, Li Fei-Fei, C Lawrence Zitnick, and Ross Girshick. 2017. CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning. In CVPR.
  • Mandar Joshi, Eunsol Choi, Daniel S Weld, and Luke Zettlemoyer. 2017. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. In ACL.
  • Dongyeop Kang, Tushar Khot, Ashish Sabharwal, and Eduard Hovy. 2018. AdvEntuRe: Adversarial training for textual entailment with knowledge-guided examples. In ACL.
  • Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Matthew Kelcey, Jacob Devlin, Kenton Lee, Kristina N. Toutanova, Llion Jones, Ming-Wei Chang, Andrew Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. 2019. Natural Questions: A benchmark for question answering research. TACL.
  • Kenton Lee, Shimi Salant, Tom Kwiatkowski, Ankur Parikh, Dipanjan Das, and Jonathan Berant. 2016. Learning recurrent span representations for extractive question answering. arXiv preprint arXiv:1611.01436.
  • M Lewis and A Fan. 2019. Generative question answering: Learning to answer the whole question. In ICLR.
  • Sewon Min, Eric Wallace, Sameer Singh, Matt Gardner, Hannaneh Hajishirzi, and Luke Zettlemoyer. 2019. Compositional questions do not necessitate multi-hop reasoning. In ACL.
  • Sewon Min, Victor Zhong, Richard Socher, and Caiming Xiong. 2018. Efficient and robust question answering from minimal context over documents. In ACL.
  • Pasquale Minervini and Sebastian Riedel. 2018. Adversarially regularising neural NLI models to integrate logical background knowledge. In CoNLL.
  • Timothy Niven and Hung-Yu Kao. 2019. Probing neural network comprehension of natural language arguments. In ACL.
  • Adam Poliak, Jason Naradowsky, Aparajita Haldar, Rachel Rudinger, and Benjamin Van Durme. 2018. Hypothesis only baselines in natural language inference. In Proceedings of the Seventh Joint Conference on Lexical and Computational Semantics.
  • Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2019. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683.
  • Adams Wei Yu, David Dohan, Minh-Thang Luong, Rui Zhao, Kai Chen, Mohammad Norouzi, and Quoc V Le. 2018. QANet: Combining local convolution with global self-attention for reading comprehension. In ICLR.
  • Pranav Rajpurkar, Robin Jia, and Percy Liang. 2018. Know what you don't know: Unanswerable questions for SQuAD. In ACL.
  • Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250.
  • Peng Zhang, Yash Goyal, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. 2016. Yin and Yang: Balancing and answering binary visual questions. In CVPR.
  • Zhi-Hua Zhou and Xu-Ying Liu. 2006. On multi-class cost-sensitive learning. In AAAI.
  • Minjoon Seo, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. 2017. Bidirectional attention flow for machine comprehension. In ICLR.
  • Minjoon Seo, Jinhyuk Lee, Tom Kwiatkowski, Ankur Parikh, Ali Farhadi, and Hannaneh Hajishirzi. 2019. Real-time open-domain question answering with dense-sparse phrase index. In ACL.
  • Alon Talmor and Jonathan Berant. 2019. MultiQA: An empirical investigation of generalization and transfer in reading comprehension. In ACL.
  • Adam Trischler, Tong Wang, Xingdi Yuan, Justin Harris, Alessandro Sordoni, Philip Bachman, and Kaheer Suleman. 2017. NewsQA: A machine comprehension dataset. In Proceedings of the 2nd Workshop on Representation Learning for NLP.
  • Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly. 2015. Pointer networks. In NIPS.
  • Shuohang Wang and Jing Jiang. 2016. Machine comprehension using Match-LSTM and answer pointer. arXiv preprint arXiv:1608.07905.
  • Wenhui Wang, Nan Yang, Furu Wei, Baobao Chang, and Ming Zhou. 2017. Gated self-matching networks for reading comprehension and question answering. In ACL.
  • Johannes Welbl, Pontus Stenetorp, and Sebastian Riedel. 2018. Constructing datasets for multi-hop reading comprehension across documents. TACL.
  • Caiming Xiong, Victor Zhong, and Richard Socher. 2017. DCN+: Mixed objective and deep residual coattention for question answering. In ICLR.
  • Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. arXiv preprint arXiv:1906.08237.
  • Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D Manning. 2018. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In EMNLP, pages 2369–2380.
  • Appendix A (pre-processing of NewsQA and NaturalQuestions): For NewsQA, each paragraph is truncated so that the length of each context is less than 300 words, and training and development samples that become unanswerable due to the truncation are removed. For NaturalQuestions, the pre-processed dataset provided by the MRQA shared task (Fisch et al., 2019) is used. The first occurring answer is chosen for training extractive QA models, which is a common approach in the weakly supervised setting (Joshi et al., 2017; Talmor and Berant, 2019).
Authors
Ko Miyoung
Kim Hyunjae
Kim Gangwoo