Dynamic Sampling Strategies for Multi-Task Reading Comprehension
ACL 2020, pp. 920–924.
Abstract:
Building general reading comprehension systems, capable of solving multiple datasets at the same time, is a recent aspirational goal in the research community. Prior work has focused on model architectures or generalization to held-out datasets, and largely passed over the particulars of the multi-task learning setup. We show that a simple dynamic sampling strategy, which selects training instances from a dataset with probability proportional to the gap between the model's current performance on that dataset and its measured single-task performance, mitigates the catastrophic forgetting observed with other sampling strategies, and that interleaving instances from different tasks into heterogeneous batches is crucial for multi-task performance.
Introduction
- Building multi-task reading comprehension systems has received significant attention and been a focus of active research (Talmor and Berant, 2019; Xu et al., 2019).
- These approaches mostly focus on model architecture improvements or generalizability to new tasks or domains.
- Prior work has typically made the distribution from which training instances are sampled either uniform over datasets, proportional to dataset size, or some combination of the two.
- Because these sampling strategies favor some datasets over others, they can lead to catastrophic forgetting on the non-favored datasets.
- By adjusting the sampling distribution over the course of training according to what the model is learning, dynamic sampling mitigates the catastrophic forgetting observed with other sampling strategies (a sketch of the distribution follows this list).
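The gap-based distribution is cheap to recompute at the start of each epoch. Below is a minimal sketch in Python; the function name, the use of F1 as the gap metric, and the small sampling floor are illustrative assumptions, not the authors' exact implementation.

    import numpy as np

    def dynamic_sampling_distribution(single_task_f1, current_f1, floor=1e-3):
        """Per-dataset sampling probabilities, proportional to the gap between
        measured single-task performance and current multi-task performance.

        single_task_f1, current_f1: dicts mapping dataset name -> F1 score.
        floor: assumed minimum gap, so well-learned datasets keep a small
        chance of being sampled (not specified in the paper).
        """
        names = list(single_task_f1)
        # Datasets still far below their single-task ceiling get large gaps,
        # and therefore more training instances in the next epoch.
        gaps = np.array([max(single_task_f1[n] - current_f1[n], floor)
                         for n in names])
        probs = gaps / gaps.sum()  # normalize gaps into a distribution
        return dict(zip(names, probs))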
Highlights
- Building multi-task reading comprehension systems has received significant attention and been a focus of active research (Talmor and Berant, 2019; Xu et al., 2019). These approaches mostly focus on model architecture improvements or generalizability to new tasks or domains.
- We investigate the importance of this structuring by training a multi-task model on the 8 datasets from ORB (Dua et al., 2019b), a recent multi-task reading comprehension benchmark.
- Prior work has typically either made this a uniform distribution over datasets, a distribution proportional to the sizes of the datasets, or some combination of the two. Because these sampling strategies favor some datasets over others, they can lead to catastrophic forgetting in the non-favored datasets
- We introduce a dynamic sampling strategy that selects instances from a dataset with probability proportional to the gap between the model's current performance on that dataset (on some metric) and the measured single-task performance of the same model on the same dataset.
- We show that interleaving instances from different tasks within each epoch, forming heterogeneous batches, is crucial for optimizing multi-task performance (see the batching sketch after this list).
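One way to realize this interleaving is to draw each instance's source dataset independently per position, so every batch mixes tasks. A minimal sketch, assuming each dataset's instances fit in memory as a list; all names here are illustrative, not from the paper's code.

    import random

    def heterogeneous_batches(datasets, probs, instances_per_epoch, batch_size):
        """Yield heterogeneous batches: each instance's source dataset is drawn
        from the sampling distribution, so tasks are interleaved within every
        batch rather than grouped by epoch or by batch.

        datasets: dict mapping dataset name -> list of instances.
        probs: dict mapping dataset name -> sampling probability.
        """
        names = list(datasets)
        weights = [probs[n] for n in names]
        batch = []
        for _ in range(instances_per_epoch):
            name = random.choices(names, weights=weights, k=1)[0]
            batch.append(random.choice(datasets[name]))
            if len(batch) == batch_size:
                yield batch
                batch = []
        if batch:  # flush the final, possibly smaller batch
            yield batch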
Methods
- Table 2 (instance sampling strategies, with heterogeneous batch scheduling) compares Uniform, By Size, Uni→Size, and Dynamic sampling against single-task training. The single-task reference scores (EM / F1) are: Quoref 53.0 / 58.6, ROPES 67.5 / 72.1, DuoRC 23.3 / 30.8, NarrativeQA - / 50.3.
- Table 3 (epoch scheduling strategies, with dynamic sampling; EM / F1; the Average column also covers ORB datasets not shown here, e.g. DROP):

  Scheduling     Average        Quoref         ROPES          DuoRC          NarrQA
  Partition      46.1 / 53.2    50.7 / 55.3    58.1 / 65.4    22.1 / 30.7     -  / 50.9
  Homogeneous    48.8 / 54.7    53.3 / 56.8    61.5 / 66.6    21.6 / 29.6     -  / 49.9
  Heterogeneous  51.7 / 58.1    56.3 / 60.4    65.1 / 71.9    23.1 / 31.5     -  / 52.9

- Table 4 (ORB test sets; EM / F1) compares the dynamic-sampling model against the NABERT baseline:

  Model          Average        Quoref         ROPES          DuoRC          NarrQA
  NABERT         34.4 / 42.1    35.0 / 44.7    31.1 / 37.3    25.4 / 34.1     -  / 36.6
  Dynamic        47.6 / 54.5    59.4 / 63.9    36.5 / 44.8    23.0 / 31.5     -  / 52.0
- Dynamic sampling achieves the highest average performance and fully cures both problems mentioned above, since the sampling distribution can be adjusted each epoch based on which datasets the model is performing poorly on (a worked example follows this list).
- ORB evaluation: Table 4 shows that the model trained with dynamic sampling and heterogeneous batches significantly outperforms the previous ORB state of the art, the NABERT baseline model.
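To make the per-epoch update concrete, the worked example below plugs in the single-task F1 scores recoverable from Table 2; the mid-training multi-task scores are hypothetical, chosen only to illustrate the arithmetic.

    # Single-task F1 from Table 2; multi-task F1 partway through training
    # (the current values are hypothetical).
    single_task = {"Quoref": 58.6, "ROPES": 72.1, "DuoRC": 30.8, "NarrQA": 50.3}
    current     = {"Quoref": 52.0, "ROPES": 60.0, "DuoRC": 29.5, "NarrQA": 48.0}

    gaps = {n: single_task[n] - current[n] for n in single_task}
    # gaps: Quoref 6.6, ROPES 12.1, DuoRC 1.3, NarrQA 2.3 (sum 22.3)
    total = sum(gaps.values())
    probs = {n: round(g / total, 2) for n, g in gaps.items()}
    # ROPES lags its single-task ceiling the most, so it is sampled most often:
    # probs ~= {"Quoref": 0.30, "ROPES": 0.54, "DuoRC": 0.06, "NarrQA": 0.10}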
Results
- The authors' final model shows greatly increased performance over the best model on ORB, a recently released multi-task reading comprehension benchmark.
Conclusion
- The authors' goal was to investigate which instance sampling method and epoch scheduling strategy gives optimal performance in a multi-task reading comprehension setting.
- It is worth noting that for the DuoRC, NarrativeQA, SQuAD, and Quoref datasets there are cases where the multi-task model outperforms the single-task model.
- This suggests that in specific cases the authors observe an effect similar to data augmentation, but this needs to be explored further.
- The authors hope that future work experiments further with dynamic sampling, for example by modifying the performance metric, or by adjusting other quantities, such as the number of instances per epoch, based on performance metrics.
Tables
- Table 1: Open Reading Benchmark (ORB) Datasets
- Table 2: Effect of using different instance sampling strategies with heterogeneous batch scheduling
- Table 3: Effect of using different epoch scheduling strategies with dynamic sampling
- Table 4: Results on ORB test sets
Funding
- This work was supported in part by funding from the Allen Institute for Artificial Intelligence, in part by Amazon, and in part by National Science Foundation (NSF) grant #CNS-1730158.
References
- Gail A Carpenter and Stephen Grossberg. 1988. The art of adaptive pattern recognition by a selforganizing neural network. Computer, 21(3):77–88.
- Rich Caruana. 1997. Multitask learning. Machine learning, 28(1):41–75.
- Pradeep Dasigi, Nelson Liu, Ana Marasovic, Noah Smith, and Matt Gardner. 2019. Quoref: A reading comprehension dataset with questions requiring coreferential reasoning. In EMNLP.
- D. Dua, Y. Wang, P. Dasigi, G. Stanovsky, S. Singh, and M. Gardner. 2019a. DROP: A reading comprehension benchmark requiring discrete reasoning over paragraphs. In North American Chapter of the Association for Computational Linguistics (NAACL).
- Dheeru Dua, Ananth Gottumukkala, Alon Talmor, Sameer Singh, and Matt Gardner. 2019b. ORB: An open reading benchmark for comprehensive evaluation of machine reading comprehension. In Proceedings of the Second Workshop on Machine Reading for Question Answering, pages 147–153.
- Adam Fisch, Alon Talmor, Robin Jia, Minjoon Seo, Eunsol Choi, and Danqi Chen. 2019. MRQA 2019 shared task: Evaluating generalization in reading comprehension. arXiv preprint arXiv:1910.09753.
- Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger. 2017. On calibration of modern neural networks. In International Conference on Machine Learning (ICML).
- T. Kocisky, J. Schwarz, P. Blunsom, C. Dyer, K. M. Hermann, G. Melis, and E. Grefenstette. 2017. The NarrativeQA reading comprehension challenge. arXiv preprint arXiv:1712.07040.
- Kevin Lin, Oyvind Tafjord, Peter Clark, and Matt Gardner. 2019. Reasoning over paragraph effects in situations. arXiv preprint arXiv:1908.05852.
- P. Rajpurkar, R. Jia, and P. Liang. 2018. Know what you don’t know: Unanswerable questions for SQuAD. In Association for Computational Linguistics (ACL).
- P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Empirical Methods in Natural Language Processing (EMNLP).
- A. Saha, R. Aralikatte, M. Khapra, and K. Sankaranarayanan. 2018. DuoRC: Towards complex language understanding with paraphrased reading comprehension. In Association for Computational Linguistics (ACL).
- Victor Sanh, Thomas Wolf, and Sebastian Ruder. 2019. A hierarchical multi-task approach for learning embeddings from semantic tasks. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 6949–6956.
- Sahil Sharma and Balaraman Ravindran. 2017. Online multi-task learning using active sampling.
- Alon Talmor and Jonathan Berant. 2019. MultiQA: An empirical investigation of generalization and transfer in reading comprehension. In Association for Computational Linguistics (ACL).
- A. Trischler, T. Wang, X. Yuan, J. Harris, A. Sordoni, P. Bachman, and K. Suleman. 2017. NewsQA: A machine comprehension dataset. In Workshop on Representation Learning for NLP.
- Yichong Xu, Xiaodong Liu, Yelong Shen, Jingjing Liu, and Jianfeng Gao. 2019. Multi-task learning with sample re-weighting for machine reading comprehension. In North American Chapter of the Association for Computational Linguistics (NAACL).