Multilingual Transfer Learning for QA Using Translation as Data Augmentation

Mihaela Bornea, Lin Pan, Sara Rosenthal
Abstract

Prior work on multilingual question answering has mostly focused on using large multilingual pre-trained language models (LM) to perform zero-shot language-wise learning: train a QA model on English and test on other languages. In this work, we explore strategies that improve cross-lingual transfer by bringing the multilingual embeddings closer…
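To make the zero-shot (ZS) setting described in the abstract concrete, here is a minimal sketch of zero-shot inference with a multilingual extractive QA model using the HuggingFace transformers API; the checkpoint name and the German example are illustrative placeholders, not artifacts from the paper.

```python
# ZS setting: a multilingual LM fine-tuned on English SQuAD only is asked
# to answer in another language. The checkpoint name is hypothetical.
import torch
from transformers import AutoModelForQuestionAnswering, AutoTokenizer

MODEL = "your-org/mbert-finetuned-on-english-squad"  # placeholder name

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForQuestionAnswering.from_pretrained(MODEL)
model.eval()

question = "Was war Hannas Gefängnisstrafe?"            # German question
context = "... Hanna wurde zu lebenslanger Haft verurteilt ..."

inputs = tok(question, context, return_tensors="pt", truncation=True)
with torch.no_grad():
    out = model(**inputs)

# Take the highest-scoring start/end positions as the answer span.
start = int(out.start_logits.argmax())
end = int(out.end_logits.argmax())
print(tok.decode(inputs["input_ids"][0][start : end + 1]))
```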

Introduction
  • Recent advances in open domain question answering (QA) have mostly revolved around machine reading comprehension (MRC) where the task is to read and comprehend a given text and answer questions based on it.
  • Most recent work in MRC has only been in English, e.g. SQuAD (Rajpurkar et al. 2016; Rajpurkar, Jia, and Liang 2018), HotpotQA (Yang et al. 2018) and Natural Questions (Kwiatkowski et al. 2019).
  • The authors focus on multilingual QA and, in particular, on two recent large-scale datasets: MLQA (Lewis et al. 2020) and TyDi QA (Clark et al. 2020).
  • Example question: “What was Hanna’s prison sentence?” Prior work predictions: “four years and three months each”; this work’s prediction: “life”.
Highlights
  • Recent advances in open domain question answering (QA) have mostly revolved around machine reading comprehension (MRC) where the task is to read and comprehend a given text and answer questions based on it
  • When we extend the scope of the model to look at all languages together, we get the best-performing MLQA system so far, with 61.2 F1 (G-XLT) and 65.2 F1 (XLT)
  • Our Produce the Same Answer (PSA)+QS is weaker than PSA on ‘en-zh’ alone, suggesting again that choosing only one extra language in the Language Arbitration Framework (LAF) setting improves over the ZS baseline but is not as beneficial as adding all languages together
  • Adversarial training (AT) outperforms T(Q), and the best results are obtained with cross-lingual LAF, with an average increase of 5.3 F1 points over ZS
  • We produce several novel strategies for multilingual QA that go beyond zero-shot training and outperform the previous baseline built on top of Multilingual BERT (MBERT)
  • Our AT and LAF strategies use translation as data augmentation to bring the language-specific embeddings of the language model (LM) closer to each other; a minimal sketch of one common AT instantiation follows this list
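The AT strategy above aligns language-specific embeddings adversarially. Below is a minimal sketch of one common instantiation, a language discriminator trained through a gradient-reversal layer (in the spirit of Chen et al. 2018 and Keung et al. 2019, cited in the references); module names, sizes, and the loss wiring are assumptions, not necessarily the authors' exact design.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; flips (and scales) gradients backward."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

class LanguageDiscriminator(nn.Module):
    """Guesses the input language from the encoder's [CLS] embedding."""
    def __init__(self, hidden_size: int = 768, n_langs: int = 7):
        super().__init__()
        self.clf = nn.Sequential(
            nn.Linear(hidden_size, 256), nn.ReLU(), nn.Linear(256, n_langs)
        )

    def forward(self, cls_emb: torch.Tensor, lam: float = 1.0):
        # Reversed gradients push the encoder toward language-invariant
        # representations while the discriminator still learns to classify.
        return self.clf(GradReverse.apply(cls_emb, lam))

# Illustrative training step (not the authors' exact loss wiring):
#   qa_loss     = span-extraction loss on English + translated examples
#   disc_logits = discriminator(cls_embeddings, lam)
#   disc_loss   = nn.functional.cross_entropy(disc_logits, lang_labels)
#   (qa_loss + disc_loss).backward()
```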
Methods
  • [Flattened results table: per-language F1 on the MLQA test set (ar, de, en, es, hi, vi, zh) plus G-XLT and XLT averages for the ZS baseline and the proposed models; row labels were lost in extraction. The best model reaches 61.2 (±0.1) G-XLT and 65.2 (±0.1) XLT F1; see Table 2.]
Results
  • The authors first experiment with the same models they trained for MLQA, built by translating SQuAD into the MLQA languages.
  • In this setting, the authors evaluate cross-lingual transfer beyond translation, since en and ar are the only languages the two datasets have in common.
  • LAF has the best cross-lingual transfer performance, improving Indonesian, Swahili, Russian as well as ar compared to the ZS baseline.
  • The authors then tested models trained by translating SQuAD into the TyDiQA languages.
  • In this case, the authors observe trends consistent with the MLQA results.
  • The authors' improvements over ZS and T(Q) are statistically significant under the Fisher randomization test (a minimal sketch of the test follows this list).
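A minimal sketch of the Fisher randomization (permutation) test referenced above, applied to per-example F1 scores of two systems; the function name and the p-value smoothing are illustrative.

```python
import numpy as np

def fisher_randomization_test(f1_a, f1_b, n_perm=10_000, seed=0):
    """Two-sided paired randomization test on per-example F1 scores.

    Under the null hypothesis the two scores for each example are
    exchangeable, so we randomly swap scores within pairs and count how
    often the absolute mean difference matches or exceeds the observed one.
    """
    rng = np.random.default_rng(seed)
    a, b = np.asarray(f1_a), np.asarray(f1_b)
    observed = abs((a - b).mean())
    count = 0
    for _ in range(n_perm):
        swap = rng.random(len(a)) < 0.5       # coin flip per example
        diff = np.where(swap, b - a, a - b)
        count += abs(diff.mean()) >= observed
    return (count + 1) / (n_perm + 1)         # smoothed p-value

# e.g. p = fisher_randomization_test(laf_f1, zs_f1); significant if p < 0.05
```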
Conclusion
  • The authors highlight open challenges in the existing multilingual approaches of Lewis et al. (2020) and Clark et al. (2020).
  • The authors produce several novel strategies for multilingual QA that go beyond zero-shot training and outperform the previous baseline built on top of MBERT.
  • The authors' AT and LAF strategies use translation as data augmentation to bring the language-specific embeddings of the LM closer to each other (a sketch of a PSA-style consistency objective follows this list).
  • These approaches significantly improve cross-lingual transfer.
  • The authors' models demonstrate strong results, and all approaches improve over the previous baseline.
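The LAF:PSA idea ("produce the same answer" on an example and its translation) can be caricatured as a consistency objective between the span distributions of the two views. The sketch below is an assumption-laden illustration, not the paper's arbitration loss: it pretends the two views share a tokenization, which real translations do not, so the actual method arbitrates over predicted answers instead.

```python
import torch.nn.functional as F

def psa_consistency_loss(start_en, end_en, start_tr, end_tr):
    """KL divergence between the span distributions of an English example
    and its translation, nudging both views toward the same answer.

    All arguments are logits of shape (batch, seq_len). Treat this purely
    as an illustration of the objective's spirit.
    """
    return (
        F.kl_div(F.log_softmax(start_tr, dim=-1),
                 F.softmax(start_en, dim=-1), reduction="batchmean")
        + F.kl_div(F.log_softmax(end_tr, dim=-1),
                   F.softmax(end_en, dim=-1), reduction="batchmean")
    )
```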
Tables
  • Table 1: Comparing our original training data, SQuAD v1.1, with our augmented training data using translation techniques. The question type is based on the first word in the question (a sketch of this typing follows these captions)
  • Table 2: Our results on the MLQA test set, averaged over 3 runs. We compare our models against the previous baseline (Lewis et al. 2020): the ZS setting with MBERTQA. Best numbers within each method are in bold. The best LAF and AT models are statistically significantly better than the best Trans model
  • Table 3: G-XLT F1 scores of the LAF:PSA+QS (en-all) model on the overall test set, broken down by cross-language pair. XLT F1 is 65.7, averaged across the diagonal, as shown with the G-XLT results in the last row of Table 2
  • Table 4: XLT F1 scores of ZS and LAF with MBERT
  • Table 5: Our results on the TyDiQA dev set. We compare our models against the previous baseline (Clark et al. 2020): the ZS setting with MBERTQA. T(Q)*, AT*, and LAF* are the MLQA models. The LAF* and AT* models are statistically significantly better than ZS. The LAF and AT models are statistically significantly better than the T(Q) model and ZS
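The first-word question typing used in Table 1 is straightforward; here is a minimal sketch (the set of question words and the catch-all "other" bucket are assumptions inferred from the caption, not taken from the paper).

```python
from collections import Counter

QUESTION_WORDS = {"what", "who", "when", "where", "why", "how", "which"}

def question_type(question: str) -> str:
    """Type a question by its first word, as in Table 1."""
    first = question.strip().split()[0].lower().rstrip("?,")
    return first if first in QUESTION_WORDS else "other"

counts = Counter(question_type(q) for q in [
    "What was Hanna's prison sentence?",
    "Which year did it happen?",
])
print(counts)  # Counter({'what': 1, 'which': 1})
```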
Related Work
Funding
  • The reduction per question type ranges from 14% (Which) to 20% (Why)
  • For MLQA, we report separate F1 scores on the G-XLT and XLT tasks (a sketch of the two aggregations follows)
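For concreteness, a minimal sketch of the two aggregations, assuming a 7×7 matrix of F1 scores indexed by (question language, context language) as in Table 3: XLT averages the matched-language diagonal, while G-XLT averages over all language pairs.

```python
import numpy as np

LANGS = ["ar", "de", "en", "es", "hi", "vi", "zh"]

def gxlt_xlt(f1_matrix: np.ndarray):
    """f1_matrix[i, j] = F1 with question in LANGS[i], context in LANGS[j].

    XLT averages the diagonal (question and context in the same language);
    G-XLT averages over all question/context language pairs.
    """
    xlt = float(np.trace(f1_matrix) / len(LANGS))
    gxlt = float(f1_matrix.mean())
    return gxlt, xlt
```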
References
  • Alberti, C.; Andor, D.; Pitler, E.; Devlin, J.; and Collins, M. 2019. Synthetic QA Corpora Generation with Roundtrip Consistency. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 6168–6173.
  • Arivazhagan, N.; Bapna, A.; Firat, O.; Lepikhin, D.; Johnson, M.; Krikun, M.; Chen, M. X.; Cao, Y.; Foster, G.; Cherry, C.; et al. 2019. Massively Multilingual Neural Machine Translation in the Wild: Findings and Challenges. arXiv preprint arXiv:1907.05019.
  • Artetxe, M.; Ruder, S.; and Yogatama, D. 2019. On the Cross-lingual Transferability of Monolingual Representations. arXiv preprint arXiv:1910.11856.
  • Asai, A.; Eriguchi, A.; Hashimoto, K.; and Tsuruoka, Y. 2018. Multilingual Extractive Reading Comprehension by Runtime Machine Translation. arXiv preprint arXiv:1809.03275.
  • Bonadiman, D.; Uva, A.; and Moschitti, A. 2017. Effective Shared Representations with Multitask Learning for Community Question Answering. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, 726–732.
  • Chen, D.; Fisch, A.; Weston, J.; and Bordes, A. 2017. Reading Wikipedia to Answer Open-Domain Questions. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 1870–1879.
  • Chen, X.; Sun, Y.; Athiwaratkun, B.; Cardie, C.; and Weinberger, K. 2018. Adversarial Deep Averaging Networks for Cross-lingual Sentiment Classification. Transactions of the Association for Computational Linguistics 6: 557–570.
  • Clark, J.; Choi, E.; Collins, M.; Garrette, D.; Kwiatkowski, T.; Nikolaev, V.; and Palomaki, J. 2020. TyDi QA: A Benchmark for Information-Seeking Question Answering in Typologically Diverse Languages. Transactions of the Association for Computational Linguistics 8: 454–470.
  • Conneau, A.; Khandelwal, K.; Goyal, N.; Chaudhary, V.; Wenzek, G.; Guzmán, F.; Grave, E.; Ott, M.; Zettlemoyer, L.; and Stoyanov, V. 2019. Unsupervised Cross-lingual Representation Learning at Scale. arXiv preprint arXiv:1911.02116.
  • Conneau, A.; Rinott, R.; Lample, G.; Williams, A.; Bowman, S.; Schwenk, H.; and Stoyanov, V. 2018. XNLI: Evaluating Cross-lingual Sentence Representations. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 2475–2485.
  • Croce, D.; Zelenanska, A.; and Basili, R. 2019. Enabling Deep Learning for Large Scale Question Answering in Italian. Intelligenza Artificiale 13(1): 49–61.
  • Cui, Y.; Che, W.; Liu, T.; Qin, B.; Wang, S.; and Hu, G. 2019a. Cross-Lingual Machine Reading Comprehension. In EMNLP.
  • Cui, Y.; Liu, T.; Che, W.; Xiao, L.; Chen, Z.; Ma, W.; Wang, S.; and Hu, G. 2019b. A Span-Extraction Dataset for Chinese Machine Reading Comprehension. In Proceedings of EMNLP-IJCNLP 2019, 5882–5888.
  • Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In NAACL-HLT.
  • Gao, H.; Mao, J.; Zhou, J.; Huang, Z.; Wang, L.; and Xu, W. 20[?]. Multilingual Image Question Answering. US Patent App. 15/137,179.
  • Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; and Bengio, Y. 2014. Generative Adversarial Nets. In Advances in Neural Information Processing Systems 27, 2672–2680.
  • Gupta, D.; Kumari, S.; Ekbal, A.; and Bhattacharyya, P. 2018. MMQA: A Multi-domain Multi-lingual Question-Answering Framework for English and Hindi. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018).
  • He, W.; Liu, K.; Liu, J.; Lyu, Y.; Zhao, S.; Xiao, X.; Liu, Y.; Wang, Y.; Wu, H.; She, Q.; Liu, X.; Wu, T.; and Wang, H. 2018. DuReader: A Chinese Machine Reading Comprehension Dataset from Real-world Applications. In Proceedings of the Workshop on Machine Reading for Question Answering, 37–46.
  • IBM. 20[?]. IBM Watson Language Translator. URL https://www.ibm.com/watson/services/language-translator/.
  • Joshi, M.; Choi, E.; Weld, D.; and Zettlemoyer, L. 2017. TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 1601–1611.
  • Keung, P.; Lu, Y.; and Bhardwaj, V. 2019. Adversarial Learning with Contextual Embeddings for Zero-resource Cross-lingual Classification and NER. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 1355–1360.
  • Kwiatkowski, T.; Palomaki, J.; Redfield, O.; Collins, M.; Parikh, A.; Alberti, C.; Epstein, D.; Polosukhin, I.; Kelcey, M.; Devlin, J.; Lee, K.; Toutanova, K. N.; Jones, L.; Chang, M.-W.; Dai, A.; Uszkoreit, J.; Le, Q.; and Petrov, S. 2019. Natural Questions: A Benchmark for Question Answering Research. TACL.
  • Lee, K.; Yoon, K.; Park, S.; and Hwang, S.-w. 2018. Semi-supervised Training Data Generation for Multilingual Question Answering. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018).
  • Lewis, P.; Oguz, B.; Rinott, R.; Riedel, S.; and Schwenk, H. 2020. MLQA: Evaluating Cross-lingual Extractive Question Answering. In ACL.
  • Li, J.; Tu, Z.; Yang, B.; Lyu, M. R.; and Zhang, T. 2018. Multi-Head Attention with Disagreement Regularization. In EMNLP, 2897–2903.
  • McCann, B.; Keskar, N. S.; Xiong, C.; and Socher, R. 2018. The Natural Language Decathlon: Multitask Learning as Question Answering. arXiv preprint arXiv:1806.08730.
  • Miyato, T.; Dai, A. M.; and Goodfellow, I. J. 2017. Adversarial Training Methods for Semi-Supervised Text Classification. In 5th International Conference on Learning Representations, ICLR.
  • Mozannar, H.; Maamary, E.; El Hajal, K.; and Hajj, H. 2019. Neural Arabic Question Answering. In Proceedings of the Fourth Arabic Natural Language Processing Workshop, 108–118.
  • Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; and Sutskever, I. 2019. Language Models are Unsupervised Multitask Learners. OpenAI Blog 1(8): 9.
  • Rajpurkar, P.; Jia, R.; and Liang, P. 2018. Know What You Don’t Know: Unanswerable Questions for SQuAD. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), 784–789.
  • Rajpurkar, P.; Zhang, J.; Lopyrev, K.; and Liang, P. 2016. SQuAD: 100,000+ Questions for Machine Comprehension of Text. In EMNLP.
  • Shao, C. C.; Liu, T.; Lai, Y.; Tseng, Y.; and Tsai, S. 2018. DRCD: A Chinese Machine Reading Comprehension Dataset. arXiv preprint arXiv:1806.00920.
  • Sønderby, C. K.; Caballero, J.; Theis, L.; Shi, W.; and Huszár, F. 2017. Amortised MAP Inference for Image Super-resolution. In 5th International Conference on Learning Representations, ICLR.
  • Trischler, A.; Wang, T.; Yuan, X.; Harris, J.; Sordoni, A.; Bachman, P.; and Suleman, K. 2017. NewsQA: A Machine Comprehension Dataset. In Proceedings of the 2nd Workshop on Representation Learning for NLP, 191–200.
  • Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I. 2017. Attention is All You Need. In Advances in Neural Information Processing Systems, 5998–6008.
  • Wallace, E.; Feng, S.; Kandpal, N.; Gardner, M.; and Singh, S. 2019. Universal Adversarial Triggers for Attacking and Analyzing NLP. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2153–2162.
  • Wang, Y.; and Bansal, M. 2018. Robust Machine Comprehension Models via Adversarial Training. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), 575–581.
  • Wu, S.; and Dredze, M. 2019. Beto, Bentz, Becas: The Surprising Cross-Lingual Effectiveness of BERT. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 833–844.
  • Yang, Z.; Cui, Y.; Che, W.; Liu, T.; Wang, S.; and Hu, G. 2019a. Improving Machine Reading Comprehension via Adversarial Training. arXiv preprint arXiv:1911.03614.
  • Yang, Z.; Dai, Z.; Yang, Y.; Carbonell, J.; Salakhutdinov, R. R.; and Le, Q. V. 2019b. XLNet: Generalized Autoregressive Pretraining for Language Understanding. In Advances in Neural Information Processing Systems 32, 5753–5763.
  • Yang, Z.; Qi, P.; Zhang, S.; Bengio, Y.; Cohen, W.; Salakhutdinov, R.; and Manning, C. D. 2018. HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 2369–2380.
  • Yarowsky, D.; Ngai, G.; and Wicentowski, R. 2001. Inducing Multilingual Text Analysis Tools via Robust Projection across Aligned Corpora. In Proceedings of the First International Conference on Human Language Technology Research, 1–8.
  • Yu, A. W.; Dohan, D.; Luong, M.; Zhao, R.; Chen, K.; Norouzi, M.; and Le, Q. V. 2018. QANet: Combining Local Convolution with Global Self-Attention for Reading Comprehension. In 6th International Conference on Learning Representations, ICLR.
  • Yuan, F.; Shou, L.; Bai, X.; Gong, M.; Liang, Y.; Duan, N.; Fu, Y.; and Jiang, D. 2020. Enhancing Answer Boundary Detection for Multilingual Machine Reading Comprehension. In ACL.
  • Zhu, C.; Cheng, Y.; Gan, Z.; Sun, S.; Goldstein, T.; and Liu, J. 2020. FreeLB: Enhanced Adversarial Training for Natural Language Understanding. In 8th International Conference on Learning Representations, ICLR.