Cross-Thought for Sentence Encoder Pre-training

EMNLP 2020, pp. 412-421.

DOI: https://doi.org/10.18653/V1/2020.EMNLP-MAIN.30

Abstract:

In this paper, we propose Cross-Thought, a novel approach to pre-training a sequence encoder, which is instrumental in building reusable sequence embeddings for large-scale NLP tasks such as question answering. Instead of using the original signals of full sentences, we train a Transformer-based sequence encoder over a large set of short sequences...

Introduction
  • Encoding sentences into embeddings (Kiros et al., 2015; Subramanian et al., 2018; Reimers and Gurevych, 2019) is a critical step in many Natural Language Processing (NLP) tasks.
  • Skip-Thought (Kiros et al., 2015) uses the encoded embedding of a sentence to generate its neighboring sentences (Figure 2(a)).
  • The Inverse Cloze Task (Lee et al., 2019) defines pseudo labels to pre-train a sentence encoder (Figure 2(b)).
  • Pseudo labels may have low accuracy, and the rich linguistic information that can be well learned through generic language modeling is often lost in these unsupervised methods.
  • The authors propose a novel unsupervised approach that fully exploits the strength of language modeling for sentence encoder pre-training.
Highlights
  • Encoding sentences into embeddings (Kiros et al., 2015; Subramanian et al., 2018; Reimers and Gurevych, 2019) is a critical step in many Natural Language Processing (NLP) tasks.
  • We propose a novel unsupervised approach that fully exploits the strength of language modeling for sentence encoder pre-training.
  • Results show that our Cross-Thought model achieves much better performance than Masked LM-1-64, as well as the Transformer pre-trained on 160G of data (10 times larger than Wikipedia).
  • We evaluate our proposed method on how well the finetuned sentence embeddings can be utilized in the first step of information retrieval (IR), with the re-ranker and answer extractor fixed (a minimal retrieval sketch follows this list). "Masked LM-1-64" and "Cross-Thought-1-64" in Table 3 show that our pre-trained model achieves better performance than the baseline model pre-trained on single sequences.
  • We propose a novel approach, Cross-Thought, to pre-train a sentence encoder.
  • Our proposed approach achieves new state of the art on HotpotQA by improving intermediate information retrieval performance.
  • Experiments demonstrate that Cross-Thought trained with short sequences can effectively improve sentence embeddings.
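As a toy illustration of the first-step retrieval setting mentioned above, the sketch below ranks candidate passages by cosine similarity between a question embedding and passage embeddings. The encode function is a placeholder (it returns random vectors so the snippet runs end to end); in practice it would be replaced by the finetuned sentence encoder, whose interface is not specified here.

    import numpy as np

    def encode(texts, dim=768, seed=0):
        # Placeholder encoder: one random vector per input text, just to make
        # the sketch runnable; swap in a real finetuned sentence encoder here.
        rng = np.random.default_rng(seed)
        return rng.normal(size=(len(texts), dim)).astype(np.float32)

    def retrieve(question, passages, top_k=20):
        # Embed the question and all candidate passages, then rank passages
        # by cosine similarity (dot product of L2-normalised embeddings).
        q = encode([question])[0]
        p = encode(passages)
        q = q / np.linalg.norm(q)
        p = p / np.linalg.norm(p, axis=1, keepdims=True)
        scores = p @ q
        order = np.argsort(-scores)[:top_k]
        return [(int(i), float(scores[i])) for i in order]

    top = retrieve("example question?", ["passage A ...", "passage B ..."], top_k=2)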
Methods
  • The authors conduct experiments based on the pre-trained models and provide additional detailed analysis.
  • 4.1 Datasets: the authors conduct experiments on five datasets, the statistics of which are shown in Table 1.
  • MNLI (Williams et al., 2018): Multi-Genre Natural Language Inference matched (MNLI-m) and mismatched (MNLI-mm) are textual entailment tasks.
  • The goal is to classify the relation between the premise and hypothesis sentences into three classes (entailment, neutral, contradiction).
  • Table 1 (excerpt):

        Dataset   #train  #test  #seq  Goal
        MNLI-m    373K    10K    2     classification
        MNLI-mm   373K    10K    2     classification
        SNLI      549K    10K    2     classification
Results
  • Results on the classification and ranking tasks are summarized in Table 2.
  • Effect of Pre-training Tasks: among all the pre-training tasks, the proposed Cross-Thought achieves the best performance.
  • LM pre-training tasks work better than the Skip-Thought and ICT methods, which are designed specifically for learning sentence embeddings.
  • Results show that the Cross-Thought model achieves much better performance than Masked LM-1-64, as well as the Transformer pre-trained on 160G of data (10 times larger than Wikipedia).
Conclusion
  • The authors propose a novel approach, Cross-Thought, to pre-train a sentence encoder.
  • Experiments demonstrate that Cross-Thought trained with short sequences can effectively improve sentence embeddings.
  • The authors' pre-trained sentence encoder, with further finetuning, can beat several strong baselines on many NLP tasks.
Summary
  • Objectives:

    The authors' pre-training task is inspired by Masked Language Modeling (Devlin et al., 2019; Liu et al., 2019); the key difference is the way sequences are constructed for pre-training.
  • As the goal is sentence embedding learning, the pre-training task is designed so that masked words in one short sequence are recovered by attending to the embeddings of other sequences; a minimal sketch of this idea follows.
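A minimal PyTorch sketch of this pre-training idea, under stated assumptions: each short sequence is encoded independently, its prepended special-token embeddings act as the sequence embedding, and a cross-sequence attention layer lets the masked-LM head of one sequence condition on the embeddings of the other sequences. The layer sizes, the single cross-attention layer, the omission of positional embeddings and of the masking/loss code are simplifications for illustration, not the paper's exact architecture.

    import torch
    import torch.nn as nn

    class CrossThoughtSketch(nn.Module):
        def __init__(self, vocab_size=30522, dim=256, n_special=3):
            super().__init__()
            self.n_special = n_special
            self.embed = nn.Embedding(vocab_size, dim)
            enc_layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
            self.seq_encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
            self.cross_attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
            self.lm_head = nn.Linear(dim, vocab_size)

        def forward(self, token_ids):
            # token_ids: (n_seq, seq_len); the first n_special positions are
            # special tokens prepended to every short sequence.
            n_seq, seq_len = token_ids.shape
            h = self.seq_encoder(self.embed(token_ids))    # per-sequence encoding
            sent_emb = h[:, :self.n_special, :]            # special-token embeddings
            # Pool all sequences' special-token embeddings into one shared memory
            # that every sequence can attend to (in the paper, attention is across
            # *other* sequences; including a sequence's own embedding simplifies the sketch).
            memory = sent_emb.reshape(1, n_seq * self.n_special, -1).expand(n_seq, -1, -1)
            ctx, attn = self.cross_attn(h, memory, memory)  # attn can be reused for ranking
            return self.lm_head(h + ctx), attn              # masked-LM logits

    model = CrossThoughtSketch()
    logits, attn = model(torch.randint(0, 30522, (4, 64)))  # 4 short sequences of 64 tokens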
Tables
  • Table 1: Statistics of the datasets. #train and #test are the numbers of samples for training and testing. #seq is the number of sequences used for each sample. 5M stands for 5 million.
  • Table 2: Results when only the sentence embedding is used for classification and ranking. Cross-Thought-3-64 trains Cross-Thought by prepending 3 special tokens to sequences segmented into 64 tokens (a toy illustration of this construction follows this list). For HotpotQA, we only evaluate how well the model can retrieve the gold paragraphs. Results for HotpotQA(u) are without finetuning. Acc: accuracy. Recall@20: recall for the top 20 ranked paragraphs.
  • Table 3: Results on HotpotQA (full-wiki setting). We use sentence embeddings from the finetuned Cross-Thought or Masked LM model as the information retriever (IR) to collect candidate paragraphs. Pas EM: exact match of gold paragraphs; Ans EM/F1: exact match/F1 on the short answer; Sup EM/F1: exact match/F1 on supporting facts.
  • Table 4: Case study on unsupervised passage ranking. The attention weights are learned by the cross-sequence Transformer during pre-training. The examples on the left come from HotpotQA and are the passages ranked from 200 candidates for answering the question. The examples on the right are in the format of the Masked Language Modeling task, where Cross-Thought needs to recover the masked words by leveraging other sequences. C: the passages ranked by attention weights.
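As referenced in the Table 2 note, the sketch below illustrates the "3-64" construction: text is split into 64-token segments and 3 special tokens are prepended to each segment. The whitespace tokenisation and the [S1]/[S2]/[S3] token names are assumptions made for this example, not the paper's tokenizer or vocabulary.

    def make_segments(text, n_special=3, seg_len=64):
        # Stand-in for a real subword tokenizer: split on whitespace.
        tokens = text.split()
        specials = [f"[S{i + 1}]" for i in range(n_special)]
        segments = []
        for start in range(0, len(tokens), seg_len):
            chunk = tokens[start:start + seg_len]   # one 64-token segment
            segments.append(specials + chunk)       # 3 special tokens prepended
        return segments

    segments = make_segments("a long Wikipedia passage ...")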
Related work
  • Sequence Encoder: Many studies have explored different ways to improve sequence embeddings. Huang et al. (2013) proposes deep structured semantic encoders for web search. Tan et al. (2015) uses LSTM as the encoder for non-factoid answer selection, and Tai et al. (2015) proposes Tree-LSTM to compute semantic relatedness between sentences. Mou et al. (2016) also uses a tree-based CNN as the encoder for textual entailment tasks. Cheng et al. (2016) proposes Long Short-Term Memory-Networks (LSTMN) for inferring the relation between sentences, and Lin et al. (2017) combines LSTM with a self-attention mechanism to improve sentence embeddings. Multi-task learning (Subramanian et al., 2018; Cer et al., 2018) has also been applied to train better sentence embeddings. Recently, in addition to supervised learning, models pre-trained with unsupervised methods have begun to dominate the field.
Funding
  • Our proposed approach also achieves new state of the art on HotpotQA (full-wiki setting) by improving intermediate information retrieval performance.
  • The attention weights of the pre-trained cross-sequence Transformer can also be used directly for ranking tasks (see the ranking sketch after this list).
  • Our model achieves the best performance on multiple sequence-pair classification and answer-selection tasks, compared to state-of-the-art baselines.
  • A larger sentence embedding size significantly improves performance on the ranking tasks but not on the classification tasks.
  • The pipeline integrating our sentence embedding achieves new state of the art on HotpotQA (full-wiki).
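The bullet above on using attention weights directly for ranking can be pictured with the short sketch below. It assumes the cross-sequence attention weights for a question sequence have shape (question_length, n_passages * n_special), i.e. attention from every question token to each candidate passage's special-token embeddings; summing over question tokens and special tokens then gives one score per passage. This aggregation rule is an illustrative assumption, not necessarily the authors' exact scoring.

    import numpy as np

    def rank_by_attention(attn, n_passages, n_special=3, top_k=5):
        # attn: (question_len, n_passages * n_special) cross-sequence attention.
        per_passage = attn.reshape(attn.shape[0], n_passages, n_special)
        scores = per_passage.sum(axis=(0, 2))     # one scalar score per passage
        order = np.argsort(-scores)[:top_k]
        return [(int(i), float(scores[i])) for i in order]

    attn = np.random.rand(16, 200 * 3)            # e.g. 200 candidate passages
    ranking = rank_by_attention(attn, n_passages=200)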
Study subjects and analysis
datasets: 5
Our proposed approach also achieves new state of the art on HotpotQA (full-wiki setting) by improving intermediate information retrieval performance. In this section, we conduct experiments based on our pre-trained models and provide additional detailed analysis.

4.1 Datasets

We conduct experiments on five datasets, the statistics of which are shown in Table 1.

MNLI (Williams et al., 2018) [2]: Multi-Genre Natural Language Inference matched (MNLI-m) and mismatched (MNLI-mm) are textual entailment tasks. The goal is to classify the relation between the premise and hypothesis sentences into three classes (entailment, neutral, contradiction).

[2] https://gluebenchmark.com/tasks

Dataset   #train  #test  #seq  Goal
MNLI-m    373K    10K    2     classification
MNLI-mm   373K    10K    2     classification
SNLI      549K    10K    2     classification


Reference
  • Akari Asai, Kazuma Hashimoto, Hannaneh Hajishirzi, Richard Socher, and Caiming Xiong. 2020. Learning to retrieve reasoning paths over wikipedia graph for question answering. In International Conference on Learning Representations (ICLR).
  • Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In Empirical Methods in Natural Language Processing (EMNLP).
  • Daniel Cer, Yinfei Yang, Sheng-yi Kong, Nan Hua, Nicole Limtiaco, Rhomni St John, Noah Constant, Mario Guajardo-Cespedes, Steve Yuan, Chris Tar, et al. 2018. Universal sentence encoder. arXiv preprint arXiv:1803.11175.
  • Jianpeng Cheng, Li Dong, and Mirella Lapata. 2016. Long short-term memory-networks for machine reading. In Empirical Methods in Natural Language Processing (EMNLP).
  • Kyunghyun Cho, Bart Van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using rnn encoder-decoder for statistical machine translation. In Empirical Methods in Natural Language Processing (EMNLP).
  • Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. Bert: Pre-training of deep bidirectional transformers for language understanding. In North American Chapter of the Association for Computational Linguistics (NAACL).
  • Bhuwan Dhingra, Kathryn Mazaitis, and William W. Cohen. 2017. Quasar: Datasets for question answering by search and reading. arXiv preprint arXiv:1707.03904.
  • Ming Ding, Chang Zhou, Qibin Chen, Hongxia Yang, and Jie Tang. 2019. Cognitive graph for multi-hop reading comprehension at scale. In Association for Computational Linguistics (ACL).
  • Li Dong, Nan Yang, Wenhui Wang, Furu Wei, Xiaodong Liu, Yu Wang, Jianfeng Gao, Ming Zhou, and Hsiao-Wuen Hon. 2019. Unified language model pre-training for natural language understanding and generation. In Advances in Neural Information Processing Systems (NeurIPS).
  • Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Ming-Wei Chang. 2020. Realm: Retrieval-augmented language model pre-training. arXiv preprint arXiv:2002.08909.
  • Felix Hill, Kyunghyun Cho, and Anna Korhonen. 2016. Learning distributed representations of sentences from unlabelled data. In North American Chapter of the Association for Computational Linguistics (NAACL).
  • Po-Sen Huang, Xiaodong He, Jianfeng Gao, Li Deng, Alex Acero, and Larry Heck. 2013. Learning deep structured semantic models for web search using clickthrough data. In International Conference on Information & Knowledge Management (CIKM).
  • Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR).
  • Ryan Kiros, Yukun Zhu, Ruslan R. Salakhutdinov, Richard Zemel, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Skip-thought vectors. In Advances in Neural Information Processing Systems (NeurIPS).
  • Kenton Lee, Ming-Wei Chang, and Kristina Toutanova. 2019. Latent retrieval for weakly supervised open domain question answering. In Association for Computational Linguistics (ACL).
  • Mike Lewis, Marjan Ghazvininejad, Gargi Ghosh, Armen Aghajanyan, Sida Wang, and Luke Zettlemoyer. 2020. Pre-training via paraphrasing. arXiv preprint arXiv:2006.15020.
  • Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, and Luke Zettlemoyer. 2019. Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461.
  • Zhouhan Lin, Minwei Feng, Cicero Nogueira dos Santos, Mo Yu, Bing Xiang, Bowen Zhou, and Yoshua Bengio. 2017. A structured self-attentive sentence embedding. In International Conference on Learning Representations (ICLR).
  • Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692.
  • Lili Mou, Rui Men, Ge Li, Yan Xu, Lu Zhang, Rui Yan, and Zhi Jin. 2016. Natural language inference by tree-based convolution and heuristic matching. In Empirical Methods in Natural Language Processing (EMNLP).
  • Yixin Nie, Songhe Wang, and Mohit Bansal. 2019. Revealing the importance of semantic retrieval for machine reading at scale. In Empirical Methods in Natural Language Processing (EMNLP).
  • Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In North American Chapter of the Association for Computational Linguistics (NAACL).
  • Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. URL https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf.
  • Nils Reimers and Iryna Gurevych. 2019. Sentence-bert: Sentence embeddings using siamese bert-networks. In Empirical Methods in Natural Language Processing (EMNLP).
  • Sandeep Subramanian, Adam Trischler, Yoshua Bengio, and Christopher J. Pal. 2018. Learning general purpose distributed sentence representations via large scale multi-task learning. In International Conference on Learning Representations (ICLR).
  • Kai Sheng Tai, Richard Socher, and Christopher D. Manning. 2015. Improved semantic representations from tree-structured long short-term memory networks. In Association for Computational Linguistics (ACL).
  • Ming Tan, Cicero dos Santos, Bing Xiang, and Bowen Zhou. 2015. Lstm-based deep learning models for non-factoid answer selection. arXiv preprint arXiv:1511.04108.
  • Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2018. Glue: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461.
  • Adina Williams, Nikita Nangia, and Samuel Bowman. 2018. A broad-coverage challenge corpus for sentence understanding through inference. In North American Chapter of the Association for Computational Linguistics (NAACL).
  • Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W. Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. 2018. Hotpotqa: A dataset for diverse, explainable multi-hop question answering. In Empirical Methods in Natural Language Processing (EMNLP).