Dynamic Sampling Strategies for Multi-Task Reading Comprehension

Ananth Gottumukkala
Dheeru Dua

ACL, pp. 920-924, 2020.


Abstract:

Building general reading comprehension systems, capable of solving multiple datasets at the same time, is a recent aspirational goal in the research community. Prior work has focused on model architectures or generalization to held-out datasets, and largely passed over the particulars of the multi-task learning setup. We show that a simple dynamic sampling strategy, which selects instances from each dataset with probability proportional to the gap between the multi-task model's current performance and its single-task performance on that dataset, mitigates the catastrophic forgetting seen with standard sampling strategies, and that interleaving instances from different tasks within each epoch and batch is crucial for multi-task performance. Our final model shows greatly increased performance over the previous best model on ORB, a recently released multi-task reading comprehension benchmark.

Introduction
  • Building multi-task reading comprehension systems has been a focus of active research (Talmor and Berant, 2019; Xu et al., 2019)
  • These approaches mostly focus on model architecture improvements or generalizability to new tasks or domains.
  • Prior work has typically made the sampling distribution over datasets either uniform, proportional to dataset sizes, or some combination of the two
  • Because these sampling strategies favor some datasets over others, they can lead to catastrophic forgetting on the non-favored datasets.
  • By adjusting the sampling distribution over the course of training according to what the model is learning, dynamic sampling mitigates the catastrophic forgetting observed with other sampling strategies
Highlights
  • Building multi-task reading comprehension systems has been a focus of active research (Talmor and Berant, 2019; Xu et al., 2019). These approaches mostly focus on model architecture improvements or generalizability to new tasks or domains
  • We investigate the importance of how training instances are sampled and scheduled by training a multi-task model on the 8 datasets from ORB (Dua et al., 2019b), a recent multi-task reading comprehension benchmark
  • Prior work has typically made the sampling distribution over datasets either uniform, proportional to dataset sizes, or some combination of the two. Because these sampling strategies favor some datasets over others, they can lead to catastrophic forgetting on the non-favored datasets
  • We introduce a dynamic sampling strategy that selects instances from a dataset with probability proportional to the gap between the multi-task model's current performance on that dataset and the measured single-task performance of the same model (see the sketch after this list)
  • We show that interleaving instances from different tasks within each epoch and forming heterogeneous batches is crucial for optimizing multi-task performance
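
The sampling strategy above can be made concrete with a short sketch. The following is a minimal illustration, not the authors' exact implementation: it assumes F1 as the gap metric, clamps negative gaps to zero, and falls back to uniform sampling once every gap closes; the function name and the mid-training scores in the example are hypothetical, while the single-task F1 values come from Table 2's Single Task row.

```python
import numpy as np

def dynamic_sampling_distribution(current_f1, single_task_f1):
    """Weight each dataset by how far the multi-task model currently lags
    behind the same model's single-task performance on that dataset."""
    names = sorted(current_f1)
    # Clamp at zero so datasets already at single-task level get no extra mass.
    gaps = np.array([max(single_task_f1[n] - current_f1[n], 0.0) for n in names])
    if gaps.sum() == 0.0:
        # All gaps closed: fall back to sampling datasets uniformly.
        probs = np.full(len(names), 1.0 / len(names))
    else:
        probs = gaps / gaps.sum()
    return dict(zip(names, probs))

# Single-task F1 from Table 2; the "current" mid-training scores are made up.
single_task = {"Quoref": 58.6, "ROPES": 72.1, "DuoRC": 30.8, "NarrQA": 50.3}
current = {"Quoref": 55.0, "ROPES": 60.0, "DuoRC": 29.0, "NarrQA": 48.0}
print(dynamic_sampling_distribution(current, single_task))
# ROPES lags its single-task score the most, so it gets the largest share.
```

By contrast, the uniform and by-size baselines discussed above keep this distribution fixed for the whole run, which is what lets favored datasets crowd out the others.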
Methods
  • Table 2 (instance sampling strategies with heterogeneous batch scheduling) compared Single Task, Uniform, By Size, Uni→Size, and Dynamic; only the Single Task reference row survived extraction:

        Strategy      Average EM/F1   Quoref EM/F1   ROPES EM/F1   DuoRC EM/F1   NarrQA EM/F1
        Single Task   --              53.0 / 58.6    67.5 / 72.1   23.3 / 30.8   - / 50.3

  • Table 3 (epoch scheduling strategies with dynamic sampling):

        Strategy      Average EM/F1   Quoref EM/F1   ROPES EM/F1   DuoRC EM/F1   NarrQA EM/F1
        Partition     46.1 / 53.2     50.7 / 55.3    58.1 / 65.4   22.1 / 30.7   - / 50.9
        Homo          48.8 / 54.7     53.3 / 56.8    61.5 / 66.6   21.6 / 29.6   - / 49.9
        Hetero        51.7 / 58.1     56.3 / 60.4    65.1 / 71.9   23.1 / 31.5   - / 52.9

  • Table 4 fragment (results on ORB test sets; the two recovered rows are labeled here per the surrounding text, and columns for the other ORB datasets, including DROP, were lost in extraction):

        Model             Average EM/F1   Quoref EM/F1   ROPES EM/F1   DuoRC EM/F1   NarrQA EM/F1
        NABERT baseline   34.4 / 42.1     35.0 / 44.7    31.1 / 37.3   25.4 / 34.1   - / 36.6
        Dynamic (ours)    47.6 / 54.5     59.4 / 63.9    36.5 / 44.8   23.0 / 31.5   - / 52.0
  • Dynamic sampling achieves the highest average performance and fully addresses both problems mentioned above, since at each epoch the sampling distribution can be readjusted toward whichever datasets are currently performing poorly (see the epoch sketch below).
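
To show how per-epoch dynamic sampling combines with heterogeneous batches, here is a minimal sketch of one training epoch. It reuses `dynamic_sampling_distribution` from the earlier sketch; `evaluate_f1`, `model.train_on_batch`, and the `datasets` layout (name mapped to a (train, dev) pair of instance lists) are hypothetical stand-ins, not an API from the paper.

```python
import random

def train_one_epoch_hetero(model, datasets, single_task_f1,
                           instances_per_epoch=10_000, batch_size=32):
    # Re-estimate the sampling distribution from current dev performance,
    # shifting mass toward whichever datasets the model is forgetting.
    current_f1 = {name: evaluate_f1(model, dev)              # hypothetical hook
                  for name, (train, dev) in datasets.items()}
    probs = dynamic_sampling_distribution(current_f1, single_task_f1)
    names, weights = list(probs), list(probs.values())

    batch = []
    for _ in range(instances_per_epoch):
        # Drawing the source dataset independently per instance interleaves
        # tasks within the epoch and makes every batch heterogeneous
        # (Table 3's "Hetero"); drawing once per batch would give
        # homogeneous batches instead.
        name = random.choices(names, weights=weights)[0]
        train_split, _dev = datasets[name]
        batch.append(random.choice(train_split))
        if len(batch) == batch_size:
            model.train_on_batch(batch)                      # hypothetical hook
            batch = []
```

On this reading of Table 3's scheduling names, Partition keeps tasks separate at the epoch level, Homo interleaves tasks across an epoch but keeps each batch single-task, and Hetero interleaves at both levels.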
Results
  • The authors' final model shows greatly increased performance over the best model on ORB, a recently released multi-task reading comprehension benchmark.
  • ORB Evaluation: Table 4 shows that the model trained with dynamic sampling and heterogeneous batches significantly outperforms the previous ORB state of the art, the NABERT baseline model
Conclusion
  • The authors' goal was to investigate which instance sampling method and epoch scheduling strategy gives optimal performance in a multi-task reading comprehension setting.
  • It is worth noting that for the DuoRC, NarrativeQA, SQuAD, and Quoref datasets there are cases where the multi-task model outperforms the single-task model.
  • This suggests that, in these cases, multi-task training provides an effect similar to data augmentation, but this needs to be explored further.
  • The authors hope that future work experiments further with dynamic sampling, for example by modifying the gap metric or by adjusting other quantities, such as the number of instances per epoch, based on performance metrics
Tables
  • Table1: Open Reading Benchmark (ORB) Datasets
  • Table2: Effect of using different instance sampling strategies with heterogeneous batch scheduling
  • Table3: Effect of using different epoch scheduling strategies with dynamic sampling
  • Table4: Results on ORB test sets
Funding
  • This work was supported in part by funding from the Allen Institute for Artificial Intelligence, in part by Amazon, and in part by the National Science Foundation (NSF) grant #CNS-1730158.
References
  • Gail A. Carpenter and Stephen Grossberg. 1988. The ART of adaptive pattern recognition by a self-organizing neural network. Computer, 21(3):77–88.
  • Rich Caruana. 1997. Multitask learning. Machine Learning, 28(1):41–75.
  • Pradeep Dasigi, Nelson Liu, Ana Marasovic, Noah Smith, and Matt Gardner. 2019. Quoref: A reading comprehension dataset with questions requiring coreferential reasoning. In EMNLP.
  • D. Dua, Y. Wang, P. Dasigi, G. Stanovsky, S. Singh, and M. Gardner. 2019a. DROP: A reading comprehension benchmark requiring discrete reasoning over paragraphs. In North American Association for Computational Linguistics (NAACL).
  • Dheeru Dua, Ananth Gottumukkala, Alon Talmor, Sameer Singh, and Matt Gardner. 2019b. ORB: An open reading benchmark for comprehensive evaluation of machine reading comprehension. In Proceedings of the Second Workshop on Machine Reading for Question Answering, pages 147–153.
  • Adam Fisch, Alon Talmor, Robin Jia, Minjoon Seo, Eunsol Choi, and Danqi Chen. 2019. MRQA 2019 shared task: Evaluating generalization in reading comprehension. arXiv preprint arXiv:1910.09753.
  • Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger. 2017. On calibration of modern neural networks. In International Conference on Machine Learning (ICML).
  • T. Kocisky, J. Schwarz, P. Blunsom, C. Dyer, K. M. Hermann, G. Melis, and E. Grefenstette. 2017. The NarrativeQA reading comprehension challenge. arXiv preprint arXiv:1712.07040.
  • Kevin Lin, Oyvind Tafjord, Peter Clark, and Matt Gardner. 2019. Reasoning over paragraph effects in situations. arXiv preprint arXiv:1908.05852.
  • P. Rajpurkar, R. Jia, and P. Liang. 2018. Know what you don't know: Unanswerable questions for SQuAD. In Association for Computational Linguistics (ACL).
  • P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Empirical Methods in Natural Language Processing (EMNLP).
  • A. Saha, R. Aralikatte, M. Khapra, and K. Sankaranarayanan. 2018. DuoRC: Towards complex language understanding with paraphrased reading comprehension. In Association for Computational Linguistics (ACL).
  • Victor Sanh, Thomas Wolf, and Sebastian Ruder. 2019. A hierarchical multi-task approach for learning embeddings from semantic tasks. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 6949–6956.
  • Sahil Sharma and Balaraman Ravindran. 2017. Online multi-task learning using active sampling.
  • Alon Talmor and Jonathan Berant. 2019. MultiQA: An empirical investigation of generalization and transfer in reading comprehension. In Association for Computational Linguistics (ACL).
  • A. Trischler, T. Wang, X. Yuan, J. Harris, A. Sordoni, P. Bachman, and K. Suleman. 2017. NewsQA: A machine comprehension dataset. In Workshop on Representation Learning for NLP.
  • Yichong Xu, Xiaodong Liu, Yelong Shen, Jingjing Liu, and Jianfeng Gao. 2019. Multi-task learning with sample re-weighting for machine reading comprehension. In North American Chapter of the Association for Computational Linguistics (NAACL).