Poly-encoders: Architectures and Pre-training Strategies for Fast and Accurate Multi-sentence Scoring
In this paper we present new architectures and pre-training strategies for deep bidirectional transformers in candidate selection tasks
The use of deep pre-trained transformers has led to remarkable progress in a number of applications (Devlin et al., 2019). For tasks that make pairwise comparisons between sequences, matching a given input with a corresponding label, two approaches are common: Cross-encoders, which perform full self-attention over the pair, and Bi-encoders, which encode the pair separately.
- Recently, substantial improvements to state-of-the-art benchmarks on a variety of language understanding tasks have been achieved through the use of deep pre-trained language models followed by fine-tuning (Devlin et al., 2019).
- In this work we explore improvements to this approach for the class of tasks that require multi-sentence scoring: given an input context, score a set of candidate labels, a setup common in retrieval and dialogue tasks, amongst others
- To pre-train our architectures, we show that choosing abundant data more similar to our downstream task brings significant gains over BERT pre-training
- We summarize all four datasets and their statistics in Table 1
- In this paper we present new architectures and pre-training strategies for deep bidirectional transformers in candidate selection tasks
- In terms of training these architectures, we showed that pre-training strategies more closely related to the downstream task bring strong improvements
- Substantial improvements to state-of-the-art benchmarks on a variety of language understanding tasks have been achieved through the use of deep pre-trained language models followed by fine-tuning (Devlin et al., 2019).
- Urbanek et al. (2019) employed pre-trained BERT models and fine-tuned both Bi- and Cross-encoders, explicitly comparing them on dialogue and action tasks, and finding that Cross-encoders perform better.
- Transformers: Our Bi-, Cross-, and Poly-encoders, described in Sections 4.2, 4.3 and 4.4 respectively, are based on large pre-trained transformer models with the same architecture and dimension as BERT-base (Devlin et al., 2019), which has 12 layers, 12 attention heads, and a hidden size of 768.
- The former is performed to verify that reproducing a BERT-like setting gives us the same results as reported previously, while the latter tests whether pre-training on data more similar to the downstream tasks of interest helps.
- The Cross-encoder allows for rich interactions between the input context and candidate label, as they are jointly encoded to obtain a final representation.
- Similar to the procedure in pre-training, the context and candidate are surrounded by the special token [S] and concatenated into a single vector, which is encoded using one transformer.
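As a toy illustration of this joint encoding, the sketch below builds the [S]-delimited concatenation and scores the pair from the first-token representation. All names and dimensions here are illustrative (a stub stands in for the 12-layer transformer), not the paper's actual code:

```python
import numpy as np

rng = np.random.default_rng(0)
HIDDEN, VOCAB, S_TOKEN = 16, 100, 1   # toy sizes; the paper uses hidden size 768

emb = rng.normal(size=(VOCAB, HIDDEN))   # toy token embeddings
w_score = rng.normal(size=HIDDEN)        # final linear scoring head

def transformer_stub(ids):
    # Stand-in for the shared transformer: here it just embeds tokens;
    # a real encoder would apply full self-attention over all of them jointly.
    return emb[ids]

def cross_encoder_score(context_ids, candidate_ids):
    # Surround context and candidate with the special token [S] and
    # concatenate, so the model attends over the pair as a single sequence.
    ids = np.array([S_TOKEN] + context_ids + [S_TOKEN] + candidate_ids + [S_TOKEN])
    h = transformer_stub(ids)
    return float(h[0] @ w_score)   # score from the first-token representation
```

Because the candidate appears inside the input sequence, every candidate must be re-encoded together with the context, which is what makes Cross-encoders slow at inference time.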
- With the setups described above, we fine-tune the Bi- and Cross-encoders on the datasets, and report the results in Table 4.
- We do not report fine-tuning of BERT for Wikipedia IR as we cannot guarantee the test set is not part of the pre-training for that dataset.
- We note that since reporting our results, the authors of Li et al. (2019) have conducted a human evaluation study on ConvAI2, in which our Poly-encoder architecture outperformed all other models compared against, both generative and retrieval-based, including the winners of the competition.
- We show that pre-training on Reddit gives further state-of-the-art performance over our previous results with BERT, a finding that holds for all three dialogue tasks and all three architectures.
- The results obtained with fine-tuning our own transformers pre-trained on Toronto Books + Wikipedia are very similar to those obtained with the original BERT weights, indicating that it is the choice of pre-training dataset that drives the final results, not some other detail of our training.
- We introduced the Poly-encoder method, which provides a mechanism for attending over the context using the label candidate while maintaining the ability to precompute each candidate's representation, allowing fast real-time inference in a production setup and giving an improved trade-off between accuracy and speed.
- Pre-training from scratch on Reddit allows us to outperform the results we obtain with BERT, a result that holds for all three model architectures and all three dialogue datasets we tried.
- The methods introduced in this work are not specific to dialogue, and can be used for any task where one is scoring a set of candidates, which we showed for an information retrieval task as well
- Table 1: Datasets used in this paper
- Table 2: Validation performance on ConvAI2 after fine-tuning a Bi-encoder pre-trained with BERT, averaged over 5 runs. The batch size is the number of training negatives + 1, as we use the other elements of the batch as negatives during training
- Table 3: Validation performance (R@1/20) on ConvAI2 using pre-trained weights of BERT-base with different parameters fine-tuned. Average over 5 runs (Bi-encoders) or 3 runs (Cross-encoders)
- Table 4: Test performance of Bi-, Poly- and Cross-encoders on our selected tasks
- Table 5: Average time in milliseconds to predict the next dialogue utterance from C possible candidates on ConvAI2. * are inferred
- Table 6: Training time in hours
- Table 7: Bi-encoder results on the ConvAI2 valid set for different choices of function red(·)
- Table 8: Validation and test performance of Poly-encoder variants, with weights initialized from (Devlin et al., 2019). Scores are shown for ConvAI2 and DSTC 7 Track 1. Bold numbers indicate the highest-performing variant within that number of codes
- Table 9: Average time in milliseconds to predict the next dialogue utterance from N possible candidates. * are inferred
- Table 10: Validation and test performance of Bi-, Poly- and Cross-encoders. Scores are shown for ConvAI2, DSTC 7 Track 1 and Ubuntu v2, and the previous state-of-the-art models in the literature
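The batching scheme noted in Table 2's caption (the other elements of the batch act as negatives, so batch size = number of negatives + 1) can be sketched as a softmax cross-entropy over an in-batch score matrix. This is a toy numpy illustration with made-up names and dimensions, not the paper's training code:

```python
import numpy as np

rng = np.random.default_rng(0)
B, D = 4, 8   # batch size and embedding dim (toy); each example gets B-1 negatives

ctx = rng.normal(size=(B, D))    # stand-ins for the B encoded contexts
cand = rng.normal(size=(B, D))   # stand-ins for the B encoded gold candidates

def in_batch_negative_loss(ctx, cand):
    # Score every context against every candidate in the batch; the
    # diagonal entries are the positives, the off-diagonals are negatives.
    scores = ctx @ cand.T                                  # (B, B)
    scores = scores - scores.max(axis=1, keepdims=True)    # numerical stability
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))                    # softmax cross-entropy

loss = in_batch_negative_loss(ctx, cand)
```

Reusing batch elements as negatives means the candidate encodings computed for the positives are recycled for free, which is why larger batches directly translate into more negatives per example.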
- The task of scoring candidate labels given an input context is a classical problem in machine learning. While multi-class classification is a special case, the more general task involves candidates as structured objects rather than discrete classes; in this work we consider the inputs and the candidate labels to be sequences of text.
There is a broad class of models that map the input and a candidate label separately into a common feature space wherein typically a dot product, cosine or (parameterized) non-linearity is used to measure their similarity. We refer to these models as Bi-encoders. Such methods include vector space models (Salton et al., 1975), LSI (Deerwester et al., 1990), supervised embeddings (Bai et al., 2009; Wu et al., 2018) and classical siamese networks (Bromley et al., 1994). For the next utterance prediction tasks we consider in this work, several Bi-encoder neural approaches have been considered, in particular Memory Networks (Zhang et al., 2018a) and Transformer Memory networks (Dinan et al., 2019) as well as LSTMs (Lowe et al., 2015) and CNNs (Kadlec et al., 2015) which encode input and candidate label separately. A major advantage of Bi-encoder methods is their ability to cache the representations of a large, fixed candidate set. Since the candidate encodings are independent of the input, Bi-encoders are very efficient during evaluation.
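This caching advantage can be illustrated with a small numpy sketch: random vectors stand in for the candidate encoder's outputs, and all names and sizes are illustrative rather than taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
D, N = 8, 10_000   # embedding dim and candidate pool size (toy numbers)

# Candidate representations are independent of the input, so a fixed pool
# can be encoded once and cached offline.
cand_cache = rng.normal(size=(N, D))

def rank_candidates(ctx_vec, cache, k=5):
    # At query time only the context is encoded; scoring the whole pool
    # is a single matrix-vector product followed by a top-k sort.
    scores = cache @ ctx_vec
    return np.argsort(-scores)[:k]

top = rank_candidates(rng.normal(size=D), cand_cache)
```

A Cross-encoder, by contrast, must re-run the full model for each of the N candidates, since it encodes the pair jointly.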
- Develops a new transformer architecture, the Poly-encoder, that learns global rather than token level self-attention features
- Shows our models achieve state-of-the-art results on four tasks; that Poly-encoders are faster than Cross-encoders and more accurate than Bi-encoders; and that the best results are obtained by pre-training on large datasets similar to the downstream tasks
- Explores improvements to this approach for the class of tasks that require multi-sentence scoring: given an input context, score a set of candidate labels, a setup common in retrieval and dialogue tasks, amongst others
- Provides novel contributions that improve both the quality and speed axes over the current state-of-the-art
- Introduces the Poly-encoder, an architecture with an additional learnt attention mechanism that represents more global features from which to perform self-attention, resulting in performance gains over Bi-encoders and large speed gains over Cross-encoders
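A minimal numpy sketch of that mechanism, with toy dimensions and illustrative names (the paper's actual models are full pre-trained transformers): m learnt "codes" attend over the context tokens to produce m global features, the precomputed candidate embedding then attends over those features, and the final score is a dot product.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
T, D, M = 12, 8, 4   # context length, hidden size, number of codes (toy)

h_ctx = rng.normal(size=(T, D))   # per-token context encoder outputs
codes = rng.normal(size=(M, D))   # m learnt codes (global attention queries)
y_cand = rng.normal(size=D)       # cached candidate embedding

# Step 1: each code attends over all context tokens -> m global features.
att = softmax(codes @ h_ctx.T, axis=1)      # (M, T)
global_feats = att @ h_ctx                  # (M, D)

# Step 2: the candidate attends over the m global features to build
# the final context representation, then scores it with a dot product.
w = softmax(y_cand @ global_feats.T)        # (M,)
y_ctx = w @ global_feats                    # (D,)
score = float(y_ctx @ y_cand)
```

Because only steps 1-2 depend on the candidate, and the candidate vector itself is precomputable, the per-candidate cost is a few small matrix products rather than a full transformer pass.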
Study subjects and analysis
existing datasets: 4
This is true across all different architecture choices and downstream tasks we try. We conduct experiments comparing the new approaches, in addition to analysis of what works best for various setups of existing methods, on four existing datasets in the domains of dialogue and information retrieval (IR), with pre-training strategies based on Reddit (Mazaré et al., 2018) compared to Wikipedia/Toronto Books (i.e., BERT).
datasets with our best architectures and pre-training strategies: 4
We obtain a new state-of-the-art on all four datasets with our best architectures and pre-training strategies, as well as providing practical implementations for real-time use. Our code and models will be released open-source.
The best reported method is the learning-to-rank embedding model, StarSpace, which outperforms fastText, SVMs, and other baselines. We summarize all four datasets and their statistics in Table 1.