
EXAMS: A Multi-Subject High School Examinations Dataset for Cross-Lingual and Multilingual Question Answering

EMNLP 2020, pp. 5427–5444.


Abstract

We propose EXAMS – a new benchmark dataset for cross-lingual and multilingual question answering for high school examinations. We collected more than 24,000 high-quality high school exam questions in 16 languages, covering 8 language families and 24 school subjects from Natural Sciences and Social Sciences, among others. EXAMS offers a fine-grained evaluation framework across multiple languages and subjects, which allows precise analysis and comparison of various models.

Introduction
  • Research on science question answering has attracted a lot of attention in recent years (Clark, 2015; Schoenick et al., 2017; Clark et al., 2019).
  • A combination of these approaches was required to achieve noticeable performance gains (Clark et al., 2016).
  • This inevitably made research in school-level science Question Answering (QA) hard for languages other than English, due to the scarcity of resources (Clark et al., 2014; Khot et al., 2017, 2018; Bhakthavatsalam et al., 2020).
  • [Figure 1 (subject distribution): Geography, History, Professional, Forestry, Geology, Psychology, Politics, Social, Landscaping, Religion, Chemistry]
Highlights
  • Research on science question answering has attracted a lot of attention in recent years (Clark, 2015; Schoenick et al., 2017; Clark et al., 2019)
  • Information Retrieval (IR) performs better than random guessing, but it is clear that most questions require reasoning beyond simple word matching
  • We evaluate the knowledge contained in the models before and after the Question Answering (QA) fine-tuning
  • Our results show that initial fine-tuning on a large monolingual out-of-domain multi-choice machine reading comprehension dataset (RACE; Lai et al., 2017) performs much better than baselines with no training for answering multilingual EXAMS questions
  • We presented EXAMS, a new challenging cross-lingual and multilingual benchmark for science QA in 16 languages and 24 subjects from high school examinations
  • We see a 2.4% improvement with multilingual fine-tuning on EXAMS, and a +0.5% improvement for English
  • We hope that our publicly available data and code will enable work on multilingual models that can reason about question answering in the challenging science domain
Results
  • The authors evaluate the performance of the baseline models described in Section 4 on the EXAMS dataset.
  • In Table 4, the authors show the overall per-language performance of the evaluated models.
  • The first group shows simple baselines: random guessing and IR over Wikipedia articles (a minimal sketch of such an IR baseline follows this list).
  • The authors evaluate the knowledge contained in the models before and after the QA fine-tuning.
  • The authors evaluate XLM-R as a knowledge base, using the Full model but with the question–option pair only.
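
A minimal sketch of such an IR-style baseline, assuming a BM25 ranker over a toy in-memory corpus: each answer option is scored by the strength of the documents retrieved for the combined question–option query. The corpus, example question, and helper names below are illustrative only; the paper instead queries per-language Wikipedia indices (Table 8).

    from rank_bm25 import BM25Okapi  # pip install rank-bm25

    # Toy "background knowledge" corpus standing in for the per-language Wikipedia indices.
    corpus = [
        "photosynthesis is the process by which plants convert light energy into chemical energy",
        "the mitochondrion is the powerhouse of the cell",
        "plate tectonics describes the large scale motion of the lithosphere",
    ]
    bm25 = BM25Okapi([doc.split() for doc in corpus])

    def ir_score(question, option):
        # Score an option by the best BM25 match for the combined "question + option" query.
        query = f"{question} {option}".lower().split()
        return float(max(bm25.get_scores(query)))

    def answer(question, options):
        # Pick the option whose combined query retrieves the strongest evidence.
        return max(range(len(options)), key=lambda i: ir_score(question, options[i]))

    question = "which process do plants use to turn light energy into chemical energy"
    options = ["photosynthesis", "condensation", "magnetism", "evaporation"]
    print(options[answer(question, options)])  # "photosynthesis" wins in this toy corpus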
Conclusion
  • The authors' results show that initial fine-tuning on a large monolingual out-of-domain multi-choice machine reading comprehension dataset (RACE; Lai et al., 2017) performs much better than baselines with no training for answering multilingual EXAMS questions.
  • Additional training on English science QA at lower school levels has no significant effect on the overall accuracy.
  • These results suggest that further investigation of fine-tuning with other multilingual datasets (Gupta et al., 2018; Lewis et al., 2020; Clark et al., 2020; Efimov et al., 2020; d’Hoffschmidt et al., 2020; Artetxe et al., 2020; Longpre et al., 2020) is needed in order to understand the domain-transfer benefits to science QA in EXAMS, even if those datasets are not in a multi-choice setting (Khashabi et al., 2020).
Tables
  • Table 1: Statistics about EXAMS. The average length of the question (Question Len) and of the choices (Choice Len) are measured in number of tokens, and the vocabulary size (Vocab) is measured in number of words
  • Table 2: Parallel questions for different language pairs
  • Table 3: Number of examples in the data splits based on the experimental setup
  • Table 4: Overall per-language evaluation. The first three columns show the results on ARC Easy (E), ARC Challenge (C), and Regents 12 LivEnv (en). The following columns show the per-language and the overall results (the last column, All) for all languages. All is the score averaged over all EXAMS questions (see the aggregation sketch after this list)
  • Table 5: Cross-lingual zero-shot performance on EXAMS. The first three columns show the performance on the test set of the AI2 science datasets (English), followed by per-language evaluation. The underlined values mark languages that have parallel data with the source language, and the ones marked with an asterisk (*) are from the same language family
  • Table 6: The hyper-parameter values we used for fine-tuning
  • Table 7: Per-subject statistics. The grade is either High (H) or Middle (M). The average length of the question (Q Len) and of the choices (Ch Len) are measured in number of tokens, and the vocabulary size (Vocab) is shown in number of words
  • Table 8: Description of the per-language indices used as a source of background knowledge in our experiments
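
As noted in the Table 4 caption, the overall All column is averaged over all EXAMS questions, i.e., micro-averaged over questions rather than averaged over the per-language scores. The following is a minimal sketch of that aggregation; the prediction records are hypothetical.

    from collections import defaultdict

    # Hypothetical prediction records: one entry per EXAMS question.
    predictions = [
        {"lang": "bg", "correct": True},
        {"lang": "bg", "correct": False},
        {"lang": "mk", "correct": True},
        {"lang": "tr", "correct": True},
    ]

    per_lang = defaultdict(lambda: [0, 0])  # language -> [num_correct, num_total]
    for p in predictions:
        per_lang[p["lang"]][0] += int(p["correct"])
        per_lang[p["lang"]][1] += 1

    for lang, (hit, total) in sorted(per_lang.items()):
        print(f"{lang}: {hit / total:.3f}")

    # "All" is micro-averaged over questions, not the mean of the per-language scores.
    overall = sum(h for h, _ in per_lang.values()) / sum(t for _, t in per_lang.values())
    print(f"All: {overall:.3f}")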
Related work
  • Science QA: Work on Science Question Answering emerged in recent years with the development of several challenging datasets. The most notable is ARC (Clark et al., 2018), a QA reasoning challenge that contains both Easy and Challenge questions from 4th to 8th grade examinations in the Natural Science domain. As in EXAMS, the questions in ARC are created by experts; however, our dataset covers a wide variety of high school (8th–12th grade) subjects, including, but not limited to, Natural Sciences, Social Sciences, Applied Studies, Arts, and Religion (see Section 3.2 for details). We provide definitions of the lesser-known subjects in EXAMS in Appendix B.1.

    The early versions of ARC (Clark, 2015; Schoenick et al., 2017) inspired several crowdsourced datasets: Welbl et al. (2017) proposed a scalable approach for crowdsourcing science questions given a set of basic supporting science facts; Dalvi et al. (2019) focused on specific phenomena, including understanding science procedural texts; Mihaylov et al. (2018) and Khot et al. (2020) studied multi-step reasoning given a set of science facts and commonsense knowledge; and Tafjord et al. (2019) and Mitra et al. (2019) worked on reasoning about qualitative relationships and declarative texts, among others. Unlike these English-only datasets, EXAMS offers questions in 16 languages. Moreover, it contains questions about multiple subjects, which are presumably harder, as they were extracted mostly from matriculation examinations (8th–12th grade). Finally, EXAMS contains over 24,000 questions, which is more than three times as many as in ARC.
Funding
  • We thank the AI2 Aristo Team for providing the data splits used for pre-training on SciEN datasets. This research is partially supported by Project UNITe BG05M2OP001-1.001-0004, funded by the OP “Science and Education for Smart Growth” and co-funded by the EU through the ESI Funds.
Study subjects and analysis
school subjects: 24
We propose EXAMS – a new benchmark dataset for cross-lingual and multilingual question answering for high school examinations. We collected more than 24,000 high-quality high school exam questions in 16 languages, covering 8 language families and 24 school subjects from Natural Sciences and Social Sciences, among others. EXAMS offers a fine-grained evaluation framework across multiple languages and subjects, which allows precise analysis and comparison of various models.

parallel question pairs: 9857
Such parallel examinations also exist in our dataset. In particular, there are 9,857 parallel question pairs spread across seven languages, as shown in Table 2. The parallel pairs come from Croatia (Croatian, Serbian, Italian, Hungarian), Hungary (Hungarian, German, French, Spanish, Croatian, Serbian, Italian), and North Macedonia (Macedonian, Albanian, Turkish).

subjects: 24
We repeated the aforementioned steps until there were no suitable merge candidates. As a result, we ended up with a total of 24 subjects (see Appendix B for more details), which we further grouped into three major categories, based on the main branches of science: Natural Sciences – “the study of natural phenomena”, Social Sciences – “the study of human behavior and societies”, and Other – Applied Studies, Arts, Religion, etc. (see Figure 1). The distribution of the major categories is Natural Sciences (40.0%), Social Sciences (44.0%), and Other (16.0%).
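
A minimal sketch of how such a category distribution can be computed from per-question subject labels. The subject-to-category mapping shown here is a small illustrative subset, not the paper's full 24-subject grouping (see Appendix B), and the question records are hypothetical.

    from collections import Counter

    # Illustrative subset of the subject-to-category grouping.
    SUBJECT_TO_CATEGORY = {
        "Biology": "Natural Sciences",
        "Chemistry": "Natural Sciences",
        "Physics": "Natural Sciences",
        "History": "Social Sciences",
        "Geography": "Social Sciences",
        "Religion": "Other",
    }

    # Hypothetical per-question records.
    questions = [
        {"id": "q1", "subject": "Biology"},
        {"id": "q2", "subject": "History"},
        {"id": "q3", "subject": "Religion"},
        {"id": "q4", "subject": "Physics"},
    ]

    counts = Counter(SUBJECT_TO_CATEGORY[q["subject"]] for q in questions)
    total = sum(counts.values())
    for category, n in counts.most_common():
        print(f"{category}: {100.0 * n / total:.1f}%")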

subjects: 24
We presented EXAMS, a new challenging cross-lingual and multilingual benchmark for science QA in 16 languages and 24 subjects from high school examinations. We further proposed a new fine-grained evaluation framework that allows precise comparison across different languages and school subjects.

References
  • Mikel Artetxe, Sebastian Ruder, and Dani Yogatama. 2020. On the cross-lingual transferability of monolingual representations. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL ’20, pages 4623–4637.
  • Pratyay Banerjee, Kuntal Kumar Pal, Arindam Mitra, and Chitta Baral. 2019. Careful selection of knowledge to solve open book question answering. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, ACL ’19, pages 6120–6129, Florence, Italy.
  • Sumithra Bhakthavatsalam, Chloe Anastasiades, and Peter Clark. 2020. GenericsKB: A knowledge base of generic statements. ArXiv, abs/2005.00660.
  • Michael Boratko, Harshit Padigela, Divyendra Mikkilineni, Pritish Yuvraj, Rajarshi Das, Andrew McCallum, Maria Chang, Achille Fokoue, Pavan Kapanipathi, Nicholas Mattei, Ryan Musa, Kartik Talamadupula, and Michael Witbrock. 2018. An interface for annotating science questions. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, EMNLP ’18, pages 102–107, Brussels, Belgium.
  • Casimiro Pio Carrino, Marta R. Costa-jussa, and Jose A. R. Fonollosa. 2020. Automatic Spanish translation of SQuAD dataset for multi-lingual question answering. In Proceedings of the 12th Language Resources and Evaluation Conference, LREC ’20, pages 5515–5523, Marseille, France.
  • Jonathan H. Clark, Eunsol Choi, Michael Collins, Dan Garrette, Tom Kwiatkowski, Vitaly Nikolaev, and Jennimaria Palomaki. 2020. TyDi QA: A benchmark for information-seeking question answering in typologically diverse languages. Transactions of the Association for Computational Linguistics, 8:454– 470.
  • Peter Clark. 2015. Elementary school science and math tests as a driver for AI: Take the Aristo challenge! In Proceedings of the Twenty-Ninth Conference on Artificial Intelligence, AAAI ’15, pages 4019–4021, Austin, Texas, USA.
  • Peter Clark, Oren Etzioni, Daniel Khashabi, Tushar Khot, Bhavana Dalvi Mishra, Kyle Richardson, Ashish Sabharwal, Carissa Schoenick, Oyvind Tafjord, Niket Tandon, Sumithra Bhakthavatsalam, Dirk Groeneveld, Michal Guerquin, and Michael Schmitz. 2019. From ‘F’ to ‘A’ on the N.Y. regents science exams: An overview of the Aristo project. ArXiv, abs/1909.01958.
  • Peter Clark, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Oyvind Tafjord, Peter Turney, and Daniel Khashabi. 2016. Combining retrieval, statistics, and inference to answer elementary science questions. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, AAAI ’16, page 2580–2586, Phoenix, Arizona, USA.
  • Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzman, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2020. Unsupervised cross-lingual representation learning at scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL ’20, pages 8440–8451.
  • Bhavana Dalvi, Niket Tandon, Antoine Bosselut, Wentau Yih, and Peter Clark. 2019. Everything happens for a reason: Discovering the purpose of actions in procedural text. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLPIJCNLP ’19, pages 4496–4505, Hong Kong, China.
  • Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT ’19, pages 4171–4186, Minneapolis, Minnesota, USA.
  • Martin d’Hoffschmidt, Maxime Vidal, Wacim Belblidia, and Tom Brendle. 2020. FQuAD: French question answering dataset. ArXiv, abs/2002.06071.
  • Pavel Efimov, Andrey Chertok, Leonid Boytsov, and Pavel Braslavski. 2020. SberQuAD – Russian reading comprehension dataset: Description and analysis. In Proceedings of the 11th International Conference of the CLEF Association: Experimental IR Meets Multilinguality, Multimodality, and Interaction, CLEF ’20, pages 3–15, Thessaloniki, Greece.
  • Deepak Gupta, Surabhi Kumari, Asif Ekbal, and Pushpak Bhattacharyya. 2018. MMQA: A multi-domain multi-lingual question-answering framework for English and Hindi. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation, LREC ’18, pages 2777–2784, Miyazaki, Japan.
  • Suchin Gururangan, Ana Marasovic, Swabha Swayamdipta, Kyle Lo, Iz Beltagy, Doug Downey, and Noah A. Smith. 2020. Don’t stop pretraining: Adapt language models to domains and tasks. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL ’20, pages 8342–8360.
  • Momchil Hardalov, Ivan Koychev, and Preslav Nakov. 2019. Beyond English-only reading comprehension: Experiments in zero-shot multilingual transfer for Bulgarian. In Proceedings of the International Conference on Recent Advances in Natural Language Processing, RANLP ’19, pages 447–459, Varna, Bulgaria.
  • Junjie Hu, Sebastian Ruder, Aditya Siddhant, Graham Neubig, Orhan Firat, and Melvin Johnson. 2020. XTREME: A massively multilingual multitask benchmark for evaluating cross-lingual generalization. In Proceedings of Machine Learning Research, ICML ’20, Online.
  • Yimin Jing, Deyi Xiong, and Zhen Yan. 2019. BiPaR: A bilingual parallel dataset for multilingual and cross-lingual reading comprehension on novels. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP ’19, pages 2452–2462, Hong Kong, China.
  • Daniel Khashabi, Tushar Khot, Ashish Sabharwal, Peter Clark, Oren Etzioni, and Dan Roth. 2016. Question answering via integer programming over semi-structured knowledge. In Proceedings of the Twenty-fifth International Joint Conferences on Artificial Intelligence Organization, IJCAI ’16, pages 1145–1152, New York, New York.
  • Daniel Khashabi, Tushar Khot, Ashish Sabharwal, and Dan Roth. 2018. Question answering as global reasoning over semantic abstractions. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, AAAI ’18, pages 1905–1914, New Orleans, Louisiana, USA.
  • Daniel Khashabi, Tushar Khot, Ashish Sabharwal, Oyvind Tafjord, Peter Clark, and Hannaneh Hajishirzi. 2020. UnifiedQA: Crossing format boundaries with a single QA system. In Findings of EMNLP 2020.
  • Tushar Khot, Peter Clark, Michal Guerquin, Peter Jansen, and Ashish Sabharwal. 2020. QASC: A dataset for question answering via sentence composition. In Proceedings of the Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI ’20, pages 8082–8090, New York, New York, USA.
  • Tushar Khot, Ashish Sabharwal, and Peter Clark. 2017. Answering complex questions using open information extraction. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, ACL ’17, pages 311–316, Vancouver, Canada.
  • Tushar Khot, Ashish Sabharwal, and Peter Clark. 2018. SciTail: A textual entailment dataset from science question answering. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, AAAI ’18, pages 5189–5197, New Orleans, Louisiana, USA.
  • Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang, and Eduard Hovy. 2017. RACE: Large-scale ReAding comprehension dataset from examinations. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, EMNLP ’17, pages 785–794, Copenhagen, Denmark.
  • Guillaume Lample and Francois Charton. 2020. Deep learning for symbolic mathematics. In Proceedings of the 8th International Conference on Learning Representations, ICLR ’20.
  • Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2020. ALBERT: A lite BERT for self-supervised learning of language representations. In Proceedings of the 8th International Conference on Learning Representations, ICLR ’20.
  • Patrick Lewis, Barlas Oguz, Ruty Rinott, Sebastian Riedel, and Holger Schwenk. 2020. MLQA: Evaluating cross-lingual extractive question answering. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL ’20, pages 7315–7330.
  • Seungyoung Lim, Myungji Kim, and Jooyoul Lee. 2019. KorQuAD1.0: Korean QA dataset for machine reading comprehension. ArXiv, abs/1909.07005.
  • Jiahua Liu, Yankai Lin, Zhiyuan Liu, and Maosong Sun. 2019a. XQA: A cross-lingual open-domain question answering dataset. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, ACL ’19, pages 2358–2368, Florence, Italy.
  • Pengyuan Liu, Yuning Deng, Chenghao Zhu, and Han Hu. 2019b. XCMRC: Evaluating cross-lingual machine reading comprehension. In Proceedings of the International Conference on Natural Language Processing and Chinese Computing, NLPCC ’19, pages 552–564, Dunhuang, China.
  • Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019c. RoBERTa: A robustly optimized BERT pretraining approach. ArXiv, abs/1907.11692.
  • Shayne Longpre, Yi Lu, and Joachim Daiber. 2020. MKQA: A linguistically diverse benchmark for multilingual open domain question answering. ArXiv, abs/2007.15207.
  • Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. 2018. Can a suit of armor conduct electricity? A new dataset for open book question answering. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP ’18, pages 2381–2391, Brussels, Belgium.
  • Todor Mihaylov and Anette Frank. 2019. Discourseaware semantic self-attention for narrative reading comprehension. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP ’19, pages 2541–2552, Hong Kong, China.
  • Arindam Mitra, Peter Clark, Oyvind Tafjord, and Chitta Baral. 2019. Declarative question answering over knowledge bases containing natural language text with answer set programming. In Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence, AAAI ’19, pages 3003–3010, Honolulu, Hawaii, USA.
  • Jianmo Ni, Chenguang Zhu, Weizhu Chen, and Julian McAuley. 2019. Learning to attend on essential terms: An enhanced retriever-reader model for open-domain question answering. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT ’19, pages 335–344, Minneapolis, Minnesota, USA.
  • Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training.
  • Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI Blog.
  • Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67.
  • Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, EMNLP ’16, pages 2383– 2392, Austin, Texas, USA.
  • Anna Rogers, Olga Kovaleva, and Anna Rumshisky. 2020. A primer in BERTology: What we know about how BERT works. ArXiv, abs/2002.12327.
  • David Saxton, Edward Grefenstette, Felix Hill, and Pushmeet Kohli. 2019. Analysing mathematical reasoning abilities of neural models. In Proceedings of the 7th International Conference on Learning Representations, ICLR ’19, New Orleans, Louisiana, USA.
  • Carissa Schoenick, Peter Clark, Oyvind Tafjord, Peter D. Turney, and Oren Etzioni. 2017. Moving beyond the Turing test with the Allen AI Science Challenge. Communications of the ACM, 60:60 – 64.
  • Xiaoman Pan, Kai Sun, Dian Yu, Jianshu Chen, Heng Ji, Claire Cardie, and Dong Yu. 2019. Improving question answering with external knowledge. In Proceedings of the 2nd Workshop on Machine Reading for Question Answering, MRQA ’19, pages 27– 37, Hong Kong, China.
  • Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT ’18, pages 2227–2237, New Orleans, Louisiana, USA.
  • Kai Sun, Dian Yu, Dong Yu, and Claire Cardie. 2019. Improving machine reading comprehension with general reading strategies. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT ’19, pages 2633–2643, Minneapolis, Minnesota, USA.
  • Oyvind Tafjord, Peter Clark, Matt Gardner, Wen-tau Yih, and Ashish Sabharwal. 2019. QuaRel: A dataset and models for answering questions about qualitative relationships. In Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence, AAAI ’19, pages 7064–7071, Honolulu, Hawaii, USA.
  • Fabio Petroni, Tim Rocktaschel, Sebastian Riedel, Patrick Lewis, Anton Bakhtin, Yuxiang Wu, and Alexander Miller. 2019. Language models as knowledge bases? In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP ’19, Hong Kong, China.
  • Kiet Van Nguyen, Khiem Vinh Tran, Son T. Luu, and Anh Gia-Tuan Nguyen. 2020. Enhancing lexical-based approach with external knowledge for Vietnamese multiple-choice reading comprehension. ArXiv, abs/2001.05687.
  • Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proceedings of the Annual Conference on Neural Information Processing Systems, NIPS ’17, pages 5998–6008, Long Beach, California, USA.
  • Johannes Welbl, Nelson F. Liu, and Matt Gardner. 2017. Crowdsourcing multiple choice science questions. In Proceedings of the 3rd Workshop on Noisy User-generated Text, W-NUT ’17, pages 94–106, Copenhagen, Denmark.
  • Johannes Welbl, Pontus Stenetorp, and Sebastian Riedel. 2018. Constructing datasets for multi-hop reading comprehension across documents. Transactions of the Association for Computational Linguistics, 6:287–302.
  • Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Remi Louf, Morgan Funtowicz, and Jamie Brew. 2019. HuggingFace’s Transformers: State-of-the-art natural language processing. ArXiv, abs/1910.03771.
  • Dongfang Xu, Peter Jansen, Jaycie Martin, Zhengnan Xie, Vikas Yadav, Harish Tayyar Madabushi, Oyvind Tafjord, and Peter Clark. 2020. Multiclass hierarchical question classification for multiple choice science exams. In Proceedings of the 12th Language Resources and Evaluation Conference, LREC ’20, pages 5370–5382, Marseille, France.
  • Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R Salakhutdinov, and Quoc V Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. In Advances in Neural Information Processing Systems 32, NIPS ’19, pages 5753–5763.
  • Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. 2018. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, EMNLP ’18, pages 2369–2380, Brussels, Belgium.
  • In this work, we are interested in the cross-lingual transferability of multilingual models such as mBERT (Devlin et al., 2019) and XLM-RoBERTa (Conneau et al., 2020), each of which comes pre-trained on more than 100 languages. We evaluated the QA capabilities of these models following the established protocol (Devlin et al., 2019; Liu et al., 2019c; Sun et al., 2019): we fine-tuned them to predict the correct answer in a multi-choice setting, given a selected context. In this setup, the pre-trained model is fed a text, processed with the model’s tokenizer, that pairs the selected context with each question–option combination.
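
A minimal sketch of this multiple-choice setup with Hugging Face transformers and XLM-RoBERTa. Pairing the selected context with each "question + option" sequence follows the standard multiple-choice encoding and is an assumption here rather than the paper's exact template; the Bulgarian example and the untrained classification head are for illustration only.

    import torch
    from transformers import AutoModelForMultipleChoice, AutoTokenizer

    model_name = "xlm-roberta-base"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    # The multiple-choice head is randomly initialized here; it only becomes
    # useful after fine-tuning (e.g., on RACE and then on multilingual EXAMS).
    model = AutoModelForMultipleChoice.from_pretrained(model_name)

    context = "Фотосинтезата превръща светлинната енергия в химична енергия."  # retrieved context
    question = "Кой процес превръща светлинната енергия в химична енергия?"
    options = ["Фотосинтеза", "Дишане", "Ферментация", "Кондензация"]

    # One (context, question + option) pair per answer choice.
    encoding = tokenizer(
        [context] * len(options),
        [f"{question} {opt}" for opt in options],
        padding=True,
        truncation=True,
        return_tensors="pt",
    )
    # The model expects tensors of shape (batch_size, num_choices, seq_len).
    inputs = {k: v.unsqueeze(0) for k, v in encoding.items()}

    with torch.no_grad():
        logits = model(**inputs).logits  # shape: (1, num_choices)
    print(options[logits.argmax(dim=-1).item()])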
Authors
Momchil Hardalov
Dimitrina Zlatkova
Yoan Dinkov