s few-shot capabilities. Inspired by the recent success of leveraging a retrieval module to augment large-scale neural network models, we propose to retrieve examples that are semantically-similar to a test sample to formulate its corresponding prompt. Intuitively, the in-context examples selected with such a strategy may serve as more informative inputs to unleash GPT-$3 s extensive knowledge. We evaluate the proposed approach on several natural language understanding and generation benchmarks, where the retrieval-based prompt selection approach consistently outperforms the random baseline. Moreover, it is observed that the sentence encoders fine-tuned on task-related datasets yield even more helpful retrieval results. Notably, significant gains are observed on tasks such as table-to-text generation (41.9% on the ToTTo dataset) and open-domain question answering (45.5% on the NQ dataset). We hope our investigation could help understand the behaviors of GPT-$3$ and large-scale pre-trained LMs in general and enhance their few-shot capabilities. ","authors":[{"name":"Jiachang Liu"},{"id":"561d7d0145cedb33980841c8","name":"Dinghan Shen"},{"id":"562f456b45cedb33995dbe96","name":"Yizhe Zhang"},{"id":"53f43f9cdabfaedd74ddb705","name":"Bill Dolan"},{"id":"53f58452dabfaeaca9f8045b","name":"Lawrence Carin"},{"id":"53f44b16dabfaedf435df98d","name":"Weizhu Chen"}],"id":"6006bb4891e0111a1b6a2346","num_citation":0,"order":5,"pdf":"https:\u002F\u002Fstatic.aminer.cn\u002Fstorage\u002Fpdf\u002Farxiv\u002F21\u002F2101\u002F2101.06804.pdf","title":"What Makes Good In-Context Examples for GPT-$3$?","urls":["https:\u002F\u002Farxiv.org\u002Fabs\u002F2101.06804"],"versions":[{"id":"6006bb4891e0111a1b6a2346","sid":"2101.06804","src":"arxiv","year":2021}],"year":2021},{"abstract":" Current open-domain question answering (QA) systems often follow a Retriever-Reader (R2) architecture, where the retriever first retrieves relevant passages and the reader then reads the retrieved passages to form an answer. In this paper, we propose a simple and effective passage reranking method, Reader-guIDEd Reranker (Rider), which does not involve any training and reranks the retrieved passages solely based on the top predictions of the reader before reranking. We show that Rider, despite its simplicity, achieves 10 to 20 absolute gains in top-1 retrieval accuracy and 1 to 4 Exact Match (EM) score gains without refining the retriever or reader. In particular, Rider achieves 48.3 EM on the Natural Questions dataset and 66.4 on the TriviaQA dataset when only 1,024 tokens (7.8 passages on average) are used as the reader input. ","authors":[{"id":"542a9c8fdabfae5346b027e9","name":"Yuning Mao"},{"id":"54307d8cdabfaea2f5554f4c","name":"Pengcheng He"},{"id":"5429f74fdabfaec7081d080e","name":"Xiaodong Liu"},{"id":"53f43ddedabfaedd74dd7eac","name":"Yelong Shen"},{"id":"53f428e8dabfaec22b9e1c5d","name":"Jianfeng Gao"},{"id":"53f42f36dabfaedce54dcd0c","name":"Jiawei Han"},{"id":"53f44b16dabfaedf435df98d","name":"Weizhu Chen"}],"id":"5ff4379d91e01130648dc35c","num_citation":1,"order":6,"pages":{"end":"350","start":"344"},"pdf":"https:\u002F\u002Fstatic.aminer.cn\u002Fstorage\u002Fpdf\u002Farxiv\u002F21\u002F2101\u002F2101.00294.pdf","title":"Reader-Guided Passage Reranking for Open-Domain Question Answering.","urls":["https:\u002F\u002Farxiv.org\u002Fabs\u002F2101.00294","https:\u002F\u002Fdblp.org\u002Frec\u002Fconf\u002Facl\u002FMaoHLSGHC21","https:\u002F\u002Faclanthology.org\u002F2021.findings-acl.29"],"venue":{"info":{"name":"ACL\u002FIJCNLP"}},"versions":[{"id":"5ff4379d91e01130648dc35c","sid":"2101.00294","src":"arxiv","year":2021},{"id":"6103d7ba91e01159791b214a","sid":"conf\u002Facl\u002FMaoHLSGHC21","src":"dblp","vsid":"conf\u002Facl","year":2021}],"year":2021},{"abstract":" To date, most of recent work under the retrieval-reader framework for open-domain QA focuses on either extractive or generative reader exclusively. In this paper, we study a hybrid approach for leveraging the strengths of both models. We apply novel techniques to enhance both extractive and generative readers built upon recent pretrained neural language models, and find that proper training methods can provide large improvement over previous state-of-the-art models. We demonstrate that a simple hybrid approach by combining answers from both readers can efficiently take advantages of extractive and generative answer inference strategies and outperforms single models as well as homogeneous ensembles. Our approach outperforms previous state-of-the-art models by 3.3 and 2.7 points in exact match on NaturalQuestions and TriviaQA respectively. ","authors":[{"name":"Hao Cheng"},{"id":"53f43ddedabfaedd74dd7eac","name":"Yelong Shen"},{"id":"5429f74fdabfaec7081d080e","name":"Xiaodong Liu"},{"id":"54307d8cdabfaea2f5554f4c","name":"Pengcheng He"},{"id":"53f44b16dabfaedf435df98d","name":"Weizhu Chen"},{"id":"53f428e8dabfaec22b9e1c5d","name":"Jianfeng Gao"}],"id":"5ff432cc91e01130648dc2e8","num_citation":0,"order":4,"pdf":"https:\u002F\u002Fstatic.aminer.cn\u002Fstorage\u002Fpdf\u002Farxiv\u002F21\u002F2101\u002F2101.00178.pdf","title":"UnitedQA: A Hybrid Approach for Open Domain Question Answering","urls":["https:\u002F\u002Farxiv.org\u002Fabs\u002F2101.00178"],"versions":[{"id":"5ff432cc91e01130648dc2e8","sid":"2101.00178","src":"arxiv","year":2021}],"year":2021},{"abstract":"Recent progress in pre-trained neural language models has significantly improved the performance of many natural language processing (NLP) tasks. In this paper we propose a new model architecture DeBERTa (Decoding-enhanced BERT with disentangled attention) that improves the BERT and RoBERTa models using two novel techniques. The first is the disentangled attention mechanism, where each word is represented using two vectors that encode its content and position, respectively, and the attention weights among words are computed using disentangled matrices on their contents and relative positions. Second, an enhanced mask decoder is used to replace the output softmax layer to predict the masked tokens for model pretraining. We show that these two techniques significantly improve the efficiency of model pre-training and performance of downstream tasks. Compared to RoBERTaLarge, a DeBERTa model trained on half of the training data performs consistently better on a wide range of NLP tasks, achieving improvements on MNLI by +0.9% (90.2% vs. 91.1%), on SQuAD v2.0 by +2.3% (88.4% vs. 90.7%) and RACE by +3.6% (83.2% vs. 86.8%). The DeBERTa code and pre-trained models will be made publicly available https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FDeBERTa.","authors":[{"id":"54307d8cdabfaea2f5554f4c","name":"He Pengcheng"},{"id":"5429f74fdabfaec7081d080e","name":"Liu Xiaodong"},{"id":"53f428e8dabfaec22b9e1c5d","name":"Gao Jianfeng"},{"id":"53f44b16dabfaedf435df98d","name":"Chen Weizhu"}],"flags":[{"flag":"affirm_author","person_id":"53f44b16dabfaedf435df98d"}],"id":"5edf5dd891e011bc656deb7d","num_citation":17,"order":3,"pdf":"https:\u002F\u002Fstatic.aminer.cn\u002Fupload\u002Fpdf\u002F419\u002F467\u002F128\u002F5edf5dd891e011bc656deb7d_0.pdf","title":"DeBERTa: Decoding-enhanced BERT with Disentangled Attention","urls":["https:\u002F\u002Farxiv.org\u002Fabs\u002F2006.03654","https:\u002F\u002Fwww.microsoft.com\u002Fen-us\u002Fresearch\u002Fpublication\u002Fdeberta-decoding-enhanced-bert-with-disentangled-attention-2\u002F","https:\u002F\u002Fdblp.uni-trier.de\u002Fdb\u002Fjournals\u002Fcorr\u002Fcorr2006.html#abs-2006-03654","https:\u002F\u002Fopenreview.net\u002Fpdf?id=XPZIaotutsD","https:\u002F\u002Fopenreview.net\u002Fforum?id=XPZIaotutsD","https:\u002F\u002Farxiv.org\u002Fpdf\u002F2006.03654","http:\u002F\u002Fui.adsabs.harvard.edu\u002Fabs\u002F2020arXiv200603654H\u002Fabstract","https:\u002F\u002Fwww.arxiv-vanity.com\u002Fpapers\u002F2006.03654\u002F","https:\u002F\u002Fdblp.org\u002Frec\u002Fconf\u002Ficlr\u002FHeLGC21"],"venue":{"info":{"name":"ICLR"}},"versions":[{"id":"5edf5dd891e011bc656deb7d","sid":"2006.03654","src":"arxiv","year":2020},{"id":"603768e7d3485cfff1dddd83","sid":"3122890974","src":"mag","vsid":"2584161585","year":2020},{"id":"60d5b36f91e011c8cebefd36","sid":"conf\u002Ficlr\u002FHeLGC21","src":"dblp","vsid":"conf\u002Ficlr","year":2021}],"year":2021},{"abstract":"Large-scale language models have demonstrated impressive empirical performance in recent years. Nevertheless, the improved results are attained at the price of bigger size, more power consumption, and slower inference, which hinder their applicability to low-resource (memory and computation) platforms. Knowledge distillation (KD) has been demonstrated as an effective framework for compressing such big models. However, large-scale neural network systems are prone to memorizing training instances, and thus tend to make inconsistent predictions when the data distribution is slightly altered. Moreover, the student model has few opportunities to request useful information from teacher model when there is limited task-specific data available. To address these issues, we propose MixKD, a data-agnostic distillation framework that leverages Mixup, a simple yet efficient data augmentation approach, to endow the resulting model with stronger generalization ability. Concretely, in addition to the original training examples, the student model is encouraged to mimic teacher\\u0027s behaviour on the linear interpolations of example pairs as well. We prove, from a theoretical perspective, that MixKD gives rise to a smaller gap between the generalization error and the empirical error. To verify its effectiveness, we conduct extensive experiments on the GLUE benchmark, where MixKD consistently leads to significant gains over the standard KD training, and outperforms several competitive baselines. Experiments under a limited-data setting and ablation studies further demonstrate the advantages of the proposed approach.","authors":[{"id":"53f453f6dabfaedf43602478","name":"Kevin J Liang"},{"name":"Weituo Hao"},{"id":"561d7d0145cedb33980841c8","name":"Dinghan Shen"},{"id":"562d520a45cedb3398dc3be2","name":"Yufan Zhou"},{"id":"53f44b16dabfaedf435df98d","name":"Weizhu Chen"},{"id":"542dbb37dabfae489b98a547","name":"Changyou Chen"},{"id":"53f58452dabfaeaca9f8045b","name":"Lawrence Carin"}],"flags":[{"flag":"affirm_author","person_id":"53f44b16dabfaedf435df98d"}],"id":"5fa14ed291e011f3c66576e2","num_citation":0,"order":4,"pdf":"https:\u002F\u002Fstatic.aminer.cn\u002Fupload\u002Fpdf\u002F2023\u002F1032\u002F1371\u002F5fa14ed291e011f3c66576e2_0.pdf","title":"MixKD: Towards Efficient Distillation of Large-scale Language Models","urls":["https:\u002F\u002Farxiv.org\u002Fabs\u002F2011.00593","https:\u002F\u002Fdblp.uni-trier.de\u002Fdb\u002Fjournals\u002Fcorr\u002Fcorr2011.html#abs-2011-00593","https:\u002F\u002Fopenreview.net\u002Fpdf?id=UFGEelJkLu5","https:\u002F\u002Fopenreview.net\u002Fforum?id=UFGEelJkLu5","https:\u002F\u002Fdblp.org\u002Frec\u002Fconf\u002Ficlr\u002FLiangHSZCCC21"],"venue":{"info":{"name":"ICLR"}},"versions":[{"id":"5fa14ed291e011f3c66576e2","sid":"2011.00593","src":"arxiv","year":2020},{"id":"605aa35ae4510cd7c86d0827","sid":"3129779966","src":"mag","vsid":"2584161585","year":2021},{"id":"60d5b36f91e011c8cebefdb9","sid":"conf\u002Ficlr\u002FLiangHSZCCC21","src":"dblp","vsid":"conf\u002Ficlr","year":2021}],"year":2021},{"abstract":"In this work, we aim at equipping pre-trained language models with structured knowledge. We present two self-supervised tasks learning over raw text with the guidance from knowledge graphs. Building upon entity-level masked language models, our first contribution is an entity masking scheme that exploits relational knowledge underlying the text. This is fulfilled by using a linked knowledge graph to select informative entities and then masking their mentions. In addition, we use knowledge graphs to obtain distractors for the masked entities, and propose a novel distractor-suppressed ranking objective that is optimized jointly with masked language model. In contrast to existing paradigms, our approach uses knowledge graphs implicitly, only during pre-training, to inject language models with structured knowledge via learning from raw text. It is more efficient than retrieval-based methods that perform entity linking and integration during finetuning and inference, and generalizes more effectively than the methods that directly learn from concatenated graph triples. Experiments show that our proposed model achieves improved performance on five benchmarks, including question answering and knowledge base completion.","authors":[{"id":"5631596745cedb3399d9f9b1","name":"Shen Tao"},{"id":"53f7ea7edabfae9060af7bb9","name":"Mao Yi"},{"id":"54307d8cdabfaea2f5554f4c","name":"He Pengcheng"},{"id":"53f45318dabfaee0d9be6e0a","name":"Long Guodong"},{"id":"563193f445cedb3399e786cb","name":"Trischler Adam"},{"id":"53f44b16dabfaedf435df98d","name":"Chen Weizhu"}],"doi":"10.18653\u002FV1\u002F2020.EMNLP-MAIN.722","flags":[{"flag":"affirm_author","person_id":"53f44b16dabfaedf435df98d"}],"id":"5eaaa1d691e011fa9e15ea8e","num_citation":1,"order":5,"pages":{"end":"8994","start":"8980"},"pdf":"https:\u002F\u002Fstatic.aminer.cn\u002Fupload\u002Fpdf\u002F527\u002F1895\u002F1558\u002F5eaaa1d691e011fa9e15ea8e_0.pdf","title":"Exploiting Structured Knowledge in Text via Graph Guided Representation Learning","urls":["https:\u002F\u002Farxiv.org\u002Fabs\u002F2004.14224","https:\u002F\u002F2020.emnlp.org\u002Fpapers\u002Fmain","https:\u002F\u002Fdblp.uni-trier.de\u002Fdb\u002Fjournals\u002Fcorr\u002Fcorr2004.html#abs-2004-14224","https:\u002F\u002Fwww.aclweb.org\u002Fanthology\u002F2020.emnlp-main.722\u002F","https:\u002F\u002Fwww.microsoft.com\u002Fen-us\u002Fresearch\u002Fpublication\u002Fexploiting-structured-knowledge-in-text-via-graph-guided-representation-learning\u002F","http:\u002F\u002Fui.adsabs.harvard.edu\u002Fabs\u002F2020arXiv200414224S\u002Fabstract"],"venue":{"info":{"name":"EMNLP 2020"}},"versions":[{"id":"5eaaa1d691e011fa9e15ea8e","sid":"2004.14224","src":"arxiv","year":2020},{"id":"5f7fe6d80205f07f689731e2","sid":"emnlp2020#202","src":"conf_emnlp","year":2020},{"id":"5ff68d47d4150a363cd4fa99","sid":"3105082862","src":"mag","vsid":"1192655580","year":2020}],"year":2020},{"abstract":" Data augmentation has been demonstrated as an effective strategy for improving model generalization and data efficiency. However, due to the discrete nature of natural language, designing label-preserving transformations for text data tends to be more challenging. In this paper, we propose a novel data augmentation framework dubbed CoDA, which synthesizes diverse and informative augmented examples by integrating multiple transformations organically. Moreover, a contrastive regularization objective is introduced to capture the global relationship among all the data samples. A momentum encoder along with a memory bank is further leveraged to better estimate the contrastive loss. To verify the effectiveness of the proposed framework, we apply CoDA to Transformer-based models on a wide range of natural language understanding tasks. On the GLUE benchmark, CoDA gives rise to an average improvement of 2.2% while applied to the RoBERTa-large model. More importantly, it consistently exhibits stronger results relative to several competitive data augmentation and adversarial training base-lines (including the low-resource settings). Extensive experiments show that the proposed contrastive objective can be flexibly combined with various data augmentation approaches to further boost their performance, highlighting the wide applicability of the CoDA framework. ","authors":[{"id":"5432ce0ddabfae8cc1c1277f","name":"Yanru Qu"},{"id":"561d7d0145cedb33980841c8","name":"Dinghan Shen"},{"id":"53f43ddedabfaedd74dd7eac","name":"Yelong Shen"},{"name":"Sandra Sajeev"},{"id":"53f42f36dabfaedce54dcd0c","name":"Jiawei Han"},{"id":"53f44b16dabfaedf435df98d","name":"Weizhu Chen"}],"flags":[{"flag":"affirm_author","person_id":"53f44b16dabfaedf435df98d"}],"id":"5f8eabd391e01153024c4bac","num_citation":0,"order":5,"pdf":"https:\u002F\u002Fstatic.aminer.cn\u002Fstorage\u002Fpdf\u002Farxiv\u002F20\u002F2010\u002F2010.08670.pdf","title":"CoDA: Contrast-enhanced and Diversity-promoting Data Augmentation for Natural Language Understanding","urls":["https:\u002F\u002Farxiv.org\u002Fabs\u002F2010.08670","https:\u002F\u002Fdblp.uni-trier.de\u002Fdb\u002Fjournals\u002Fcorr\u002Fcorr2010.html#abs-2010-08670","https:\u002F\u002Fopenreview.net\u002Fforum?id=Ozk9MrX1hvA","https:\u002F\u002Fopenreview.net\u002Fpdf?id=Ozk9MrX1hvA"],"venue":{"info":{"name":"international conference on learning representations"}},"versions":[{"id":"5f8eabd391e01153024c4bac","sid":"2010.08670","src":"arxiv","year":2020},{"id":"60376736d3485cfff1db8bcc","sid":"3120793869","src":"mag","vsid":"2584161585","year":2020}],"year":2020},{"abstract":" Masked Language Model (MLM) framework has been widely adopted for self-supervised language pre-training. In this paper, we argue that randomly sampled masks in MLM would lead to undesirably large gradient variance. Thus, we theoretically quantify the gradient variance via correlating the gradient covariance with the Hamming distance between two different masks (given a certain text sequence). To reduce the variance due to the sampling of masks, we propose a fully-explored masking strategy, where a text sequence is divided into a certain number of non-overlapping segments. Thereafter, the tokens within one segment are masked for training. We prove, from a theoretical perspective, that the gradients derived from this new masking schema have a smaller variance and can lead to more efficient self-supervised training. We conduct extensive experiments on both continual pre-training and general pre-training from scratch. Empirical results confirm that this new masking strategy can consistently outperform standard random masking. Detailed efficiency analysis and ablation studies further validate the advantages of our fully-explored masking strategy under the MLM framework. ","authors":[{"name":"Mingzhi Zheng"},{"name":"Dinghan Shen"},{"id":"53f43ddedabfaedd74dd7eac","name":"Yelong Shen"},{"id":"53f44b16dabfaedf435df98d","name":"Weizhu Chen"},{"id":"542a3b1ddabfae61d495d058","name":"Lin Xiao"}],"id":"5f86c32891e011dbc7eba20d","num_citation":0,"order":3,"pdf":"https:\u002F\u002Fstatic.aminer.cn\u002Fstorage\u002Fpdf\u002Farxiv\u002F20\u002F2010\u002F2010.06040.pdf","title":"Improving Self-supervised Pre-training via a Fully-Explored Masked Language Model","urls":["https:\u002F\u002Farxiv.org\u002Fabs\u002F2010.06040"],"versions":[{"id":"5f86c32891e011dbc7eba20d","sid":"2010.06040","src":"arxiv","year":2020}],"year":2020},{"abstract":"The learning rate warmup heuristic achieves remarkable success in stabilizing training, accelerating convergence and improving generalization for adaptive stochastic optimization algorithms like RMSprop and Adam. Pursuing the theory behind warmup, we identify a problem of the adaptive learning rate -- its variance is problematically large in the early stage, and presume warmup works as a variance reduction technique. We provide both empirical and theoretical evidence to verify our hypothesis. We further propose Rectified Adam (RAdam), a novel variant of Adam, by introducing a term to rectify the variance of the adaptive learning rate. Experimental results on image classification, language modeling, and neural machine translation verify our intuition and demonstrate the efficacy and robustness of RAdam. ","authors":[{"id":"562dc5e745ce1e5967abb31a","name":"Liyuan Liu"},{"id":"561d722845ce1e59647f2f9e","name":"Haoming Jiang"},{"id":"54307d8cdabfaea2f5554f4c","name":"Pengcheng He"},{"id":"53f44b16dabfaedf435df98d","name":"Weizhu Chen"},{"id":"5429f74fdabfaec7081d080e","name":"Xiaodong Liu"},{"id":"53f428e8dabfaec22b9e1c5d","name":"Jianfeng Gao"},{"id":"53f42f36dabfaedce54dcd0c","name":"Jiawei Han"}],"doi":"","flags":[{"flag":"affirm_author","person_id":"53f44b16dabfaedf435df98d"}],"id":"5e5e189993d709897ce1e202","num_citation":434,"order":3,"pages":{"end":"","start":""},"pdf":"https:\u002F\u002Fstatic.aminer.cn\u002Fupload\u002Fpdf\u002Fprogram\u002F5e5e189993d709897ce1e202_0.pdf","title":"On the Variance of the Adaptive Learning Rate and Beyond","urls":["https:\u002F\u002Fopenreview.net\u002Fforum?id=rkgz2aEKDr","https:\u002F\u002Fopenreview.net\u002Fpdf?id=rkgz2aEKDr","http:\u002F\u002Fdblp.uni-trier.de\u002Fdb\u002Fjournals\u002Fcorr\u002Fcorr1908.html#abs-1908-03265","https:\u002F\u002Fdblp.org\u002Frec\u002Fconf\u002Ficlr\u002FLiuJHCLG020","https:\u002F\u002Farxiv.org\u002Fabs\u002F1908.03265"],"venue":{"info":{"name":"ICLR"},"issue":"","volume":""},"versions":[{"id":"5e5e189993d709897ce1e202","sid":"2994689640","src":"mag","vsid":"2584161585","year":2020},{"id":"5eb52f5791e01138ffc2fd2a","sid":"conf\u002Ficlr\u002FLiuJHCLG020","src":"dblp","vsid":"conf\u002Ficlr","year":2020},{"id":"5d513ad13a55ac3e91dfe390","sid":"1908.03265","src":"arxiv","year":2019}],"year":2020},{"abstract":" Transfer learning has fundamentally changed the landscape of natural language processing (NLP) research. Many existing state-of-the-art models are first pre-trained on a large text corpus and then fine-tuned on downstream tasks. However, due to limited data resources from downstream tasks and the extremely large capacity of pre-trained models, aggressive fine-tuning often causes the adapted model to overfit the data of downstream tasks and forget the knowledge of the pre-trained model. To address the above issue in a more principled manner, we propose a new computational framework for robust and efficient fine-tuning for pre-trained language models. Specifically, our proposed framework contains two important ingredients: 1. Smoothness-inducing regularization, which effectively manages the capacity of the model; 2. Bregman proximal point optimization, which is a class of trust-region methods and can prevent knowledge forgetting. Our experiments demonstrate that our proposed method achieves the state-of-the-art performance on multiple NLP benchmarks. ","authors":[{"id":"561d722845ce1e59647f2f9e","name":"Jiang Haoming"},{"id":"54307d8cdabfaea2f5554f4c","name":"He Pengcheng"},{"id":"53f44b16dabfaedf435df98d","name":"Chen Weizhu"},{"id":"5429f74fdabfaec7081d080e","name":"Liu Xiaodong"},{"id":"53f428e8dabfaec22b9e1c5d","name":"Gao Jianfeng"},{"id":"54487665dabfae87b7e291c8","name":"Zhao Tuo"}],"flags":[{"flag":"affirm_author","person_id":"53f44b16dabfaedf435df98d"}],"id":"5dc9327d3a55acc104249aaf","num_citation":29,"order":2,"pages":{"end":"2190","start":"2177"},"pdf":"https:\u002F\u002Fstatic.aminer.cn\u002Fupload\u002Fpdf\u002F1225\u002F1077\u002F899\u002F5dc9327d3a55acc104249aaf_0.pdf","title":"SMART: Robust and Efficient Fine-Tuning for Pre-trained Natural Language Models through Principled Regularized Optimization","urls":["https:\u002F\u002Farxiv.org\u002Fabs\u002F1911.03437","https:\u002F\u002Facl2020.org\u002Fprogram\u002Faccepted\u002F","https:\u002F\u002Fdblp.org\u002Frec\u002Fconf\u002Facl\u002FJiangHCLGZ20","https:\u002F\u002Fwww.aclweb.org\u002Fanthology\u002F2020.acl-main.197\u002F","https:\u002F\u002Facl2020.org\u002Fprogram\u002Faccepted\u002F#474","https:\u002F\u002Farxiv.org\u002Fpdf\u002F1911.03437.pdf","https:\u002F\u002Fdblp.uni-trier.de\u002Fdb\u002Fjournals\u002Fcorr\u002Fcorr1911.html#abs-1911-03437","https:\u002F\u002Fwww.microsoft.com\u002Fen-us\u002Fresearch\u002Fpublication\u002Fsmart-robust-and-efficient-fine-tuning-for-pre-trained-natural-language-models-through-principled-regularized-optimization\u002F","http:\u002F\u002Farxiv.org\u002Fpdf\u002F1911.03437.pdf"],"venue":{"info":{"name":"ACL"}},"versions":[{"id":"5dc9327d3a55acc104249aaf","sid":"1911.03437","src":"arxiv","year":2019},{"id":"5ec49a639fced0a24b4de8ee","sid":"acl2020#475","src":"conf_acl","year":2020},{"id":"5ef5c81691e011b33003b6ed","sid":"conf\u002Facl\u002FJiangHCLGZ20","src":"dblp","vsid":"conf\u002Facl","year":2020},{"id":"5fae6dc2d4150a363cec2901","sid":"3035204084","src":"mag","vsid":"1188739475","year":2020}],"year":2020},{"abstract":" We present MT-DNN, an open-source natural language understanding (NLU) toolkit that makes it easy for researchers and developers to train customized deep learning models. Built upon PyTorch and Transformers, MT-DNN is designed to facilitate rapid customization for a broad spectrum of NLU tasks, using a variety of objectives (classification, regression, structured prediction) and text encoders (e.g., RNNs, BERT, RoBERTa, UniLM). A unique feature of MT-DNN is its built-in support for robust and transferable learning using the adversarial multi-task learning paradigm. To enable efficient production deployment, MT-DNN supports multi-task knowledge distillation, which can substantially compress a deep neural model without significant performance drop. We demonstrate the effectiveness of MT-DNN on a wide range of NLU applications across general and biomedical domains. The software and pre-trained models will be publicly available at https:\u002F\u002Fgithub.com\u002Fnamisan\u002Fmt-dnn. ","authors":[{"id":"5429f74fdabfaec7081d080e","name":"Liu Xiaodong"},{"id":"54407c9cdabfae7f9b33cc0b","name":"Wang Yu"},{"name":"Ji Jianshu"},{"name":"Cheng Hao"},{"name":"Zhu Xueyun"},{"name":"Awa Emmanuel"},{"id":"54307d8cdabfaea2f5554f4c","name":"He Pengcheng"},{"id":"53f44b16dabfaedf435df98d","name":"Chen Weizhu"},{"id":"53f44623dabfaee43ec7cddc","name":"Poon Hoifung"},{"name":"Cao Guihong"},{"id":"53f428e8dabfaec22b9e1c5d","name":"Gao Jianfeng"}],"id":"5e4e5ac53a55ac305df4b5d8","num_citation":6,"order":7,"pages":{"end":"126","start":"118"},"pdf":"https:\u002F\u002Fstatic.aminer.cn\u002Fstorage\u002Fpdf\u002Farxiv\u002F20\u002F2002\u002F2002.07972.pdf","title":"The Microsoft Toolkit of Multi-Task Deep Neural Networks for Natural Language Understanding","urls":["https:\u002F\u002Farxiv.org\u002Fabs\u002F2002.07972","https:\u002F\u002Fdblp.org\u002Frec\u002Fconf\u002Facl\u002FLiuWJCZAHCPCG20","https:\u002F\u002Fwww.aclweb.org\u002Fanthology\u002F2020.acl-demos.16\u002F","https:\u002F\u002Fdblp.uni-trier.de\u002Fdb\u002Fjournals\u002Fcorr\u002Fcorr2002.html#abs-2002-07972","https:\u002F\u002Farxiv.org\u002Fpdf\u002F2002.07972","http:\u002F\u002Fui.adsabs.harvard.edu\u002Fabs\u002F2020arXiv200207972L\u002Fabstract","https:\u002F\u002Fwww.arxiv-vanity.com\u002Fpapers\u002F2002.07972\u002F"],"venue":{"info":{"name":"ACL"}},"versions":[{"id":"5e4e5ac53a55ac305df4b5d8","sid":"2002.07972","src":"arxiv","year":2020},{"id":"5ef876eb91e0115941835dee","sid":"conf\u002Facl\u002FLiuWJCZAHCPCG20","src":"dblp","vsid":"conf\u002Facl","year":2020},{"id":"5fae6f21d4150a363ceea362","sid":"3037624666","src":"mag","vsid":"1188739475","year":2020}],"year":2020},{"abstract":"Transformers have proved effective in many NLP tasks. However, their training requires non-trivial efforts regarding carefully designing cutting-edge optimizers and learning rate schedulers (e.g., conventional SGD fails to train Transformers effectively). Our objective here is to understand __what complicates Transformer training__ from both empirical and theoretical perspectives. Our analysis reveals that unbalanced gradients are not the root cause of the instability of training. Instead, we identify an amplification effect that influences training substantially—for each layer in a multi-layer Transformer model, heavy dependency on its residual branch makes training unstable, since it amplifies small parameter perturbations (e.g., parameter updates) and results in significant disturbances in the model output. Yet we observe that a light dependency limits the model potential and leads to inferior trained models. Inspired by our analysis, we propose Admin (Adaptive model initialization) to stabilize the early stage’s training and unleash its full potential in the late stage. Extensive experiments show that Admin is more stable, converges faster, and leads to better performance","authors":[{"id":"562dc5e745ce1e5967abb31a","name":"Liu Liyuan"},{"id":"5429f74fdabfaec7081d080e","name":"Liu Xiaodong"},{"id":"53f428e8dabfaec22b9e1c5d","name":"Gao Jianfeng"},{"id":"53f44b16dabfaedf435df98d","name":"Chen Weizhu"},{"id":"53f42f36dabfaedce54dcd0c","name":"Han Jiawei"}],"doi":"10.18653\u002FV1\u002F2020.EMNLP-MAIN.463","flags":[{"flag":"affirm_author","person_id":"53f44b16dabfaedf435df98d"}],"id":"5e9d72b391e0117173ad2c33","num_citation":21,"order":3,"pages":{"end":"5763","start":"5747"},"pdf":"https:\u002F\u002Fstatic.aminer.cn\u002Fupload\u002Fpdf\u002F1915\u002F151\u002F820\u002F5e9d72b391e0117173ad2c33_1.pdf","title":"Understanding the Difficulty of Training Transformers","urls":["https:\u002F\u002Farxiv.org\u002Fabs\u002F2004.08249","https:\u002F\u002F2020.emnlp.org\u002Fpapers\u002Fmain","https:\u002F\u002Fdblp.uni-trier.de\u002Fdb\u002Fjournals\u002Fcorr\u002Fcorr2004.html#abs-2004-08249","https:\u002F\u002Farxiv.org\u002Fpdf\u002F2004.08249","https:\u002F\u002Fwww.aclweb.org\u002Fanthology\u002F2020.emnlp-main.463\u002F","https:\u002F\u002Fwww.arxiv-vanity.com\u002Fpapers\u002F2004.08249\u002F"],"venue":{"info":{"name":"EMNLP 2020"}},"versions":[{"id":"5e9d72b391e0117173ad2c33","sid":"2004.08249","src":"arxiv","year":2020},{"id":"5f7fe6d80205f07f68973337","sid":"emnlp2020#543","src":"conf_emnlp","year":2020},{"id":"5ff68cbed4150a363cd3446d","sid":"3103334733","src":"mag","vsid":"1192655580","year":2020}],"year":2020},{"abstract":" Self-attention mechanisms have achieved great success on a variety of NLP tasks due to its flexibility of capturing dependency between arbitrary positions in a sequence. For problems such as query-based summarization (Qsumm) and knowledge graph reasoning where each input sequence is associated with an extra query, explicitly modeling such conditional contextual dependencies can lead to a more accurate solution, which however cannot be captured by existing self-attention mechanisms. In this paper, we propose \\textit{conditional self-attention} (CSA), a neural network module designed for conditional dependency modeling. CSA works by adjusting the pairwise attention between input tokens in a self-attention module with the matching score of the inputs to the given query. Thereby, the contextual dependencies modeled by CSA will be highly relevant to the query. We further studied variants of CSA defined by different types of attention. Experiments on Debatepedia and HotpotQA benchmark datasets show CSA consistently outperforms vanilla Transformer and previous models for the Qsumm problem. ","authors":[{"id":"56285d2345ce1e5965e40f47","name":"Xie Yujia"},{"id":"53f43a89dabfaee2a1d0e14d","name":"Zhou Tianyi"},{"name":"Mao Yi"},{"id":"53f44b16dabfaedf435df98d","name":"Chen Weizhu"}],"flags":[{"flag":"affirm_author","person_id":"53f44b16dabfaedf435df98d"}],"id":"5e4d083f3a55ac8cfd770bd3","num_citation":1,"order":3,"pdf":"https:\u002F\u002Fstatic.aminer.cn\u002Fstorage\u002Fpdf\u002Farxiv\u002F20\u002F2002\u002F2002.07338.pdf","title":"Conditional Self-Attention for Query-based Summarization","urls":["https:\u002F\u002Farxiv.org\u002Fabs\u002F2002.07338"],"versions":[{"id":"5e4d083f3a55ac8cfd770bd3","sid":"2002.07338","src":"arxiv","year":2020}],"year":2020},{"abstract":" Adversarial training has been shown effective at endowing the learned representations with stronger generalization ability. However, it typically requires expensive computation to determine the direction of the injected perturbations. In this paper, we introduce a set of simple yet effective data augmentation strategies dubbed cutoff, where part of the information within an input sentence is erased to yield its restricted views (during the fine-tuning stage). Notably, this process relies merely on stochastic sampling and thus adds little computational overhead. A Jensen-Shannon Divergence consistency loss is further utilized to incorporate these augmented samples into the training objective in a principled manner. To verify the effectiveness of the proposed strategies, we apply cutoff to both natural language understanding and generation problems. On the GLUE benchmark, it is demonstrated that cutoff, in spite of its simplicity, performs on par or better than several competitive adversarial-based approaches. We further extend cutoff to machine translation and observe significant gains in BLEU scores (based upon the Transformer Base model). Moreover, cutoff consistently outperforms adversarial training and achieves state-of-the-art results on the IWSLT2014 German-English dataset. ","authors":[{"name":"Dinghan Shen"},{"name":"Mingzhi Zheng"},{"id":"53f43ddedabfaedd74dd7eac","name":"Yelong Shen"},{"name":"Yanru Qu"},{"id":"53f44b16dabfaedf435df98d","name":"Weizhu Chen"}],"id":"5f75d5e991e0111c1eb4d7e2","num_citation":0,"order":4,"pdf":"https:\u002F\u002Fstatic.aminer.cn\u002Fstorage\u002Fpdf\u002Farxiv\u002F20\u002F2009\u002F2009.13818.pdf","title":"A Simple but Tough-to-Beat Data Augmentation Approach for Natural Language Understanding and Generation","urls":["https:\u002F\u002Farxiv.org\u002Fabs\u002F2009.13818"],"versions":[{"id":"5f75d5e991e0111c1eb4d7e2","sid":"2009.13818","src":"arxiv","year":2020}],"year":2020},{"abstract":" Conventional sparse retrieval methods such as TF-IDF and BM25 are simple and efficient, but solely rely on lexical overlap and fail to conduct semantic matching. Recent dense retrieval methods learn latent representations to tackle the lexical mismatch problem, while being more computationally expensive and sometimes insufficient for exact matching as they embed the entire text sequence into a single vector with limited capacity. In this paper, we present Generation-Augmented Retrieval (GAR), a query expansion method that augments a query with relevant contexts through text generation. We demonstrate on open-domain question answering (QA) that the generated contexts significantly enrich the semantics of the queries and thus GAR with sparse representations (BM25) achieves comparable or better performance than the current state-of-the-art dense method DPR \\cite{karpukhin2020dense}. We show that generating various contexts of a query is beneficial as fusing their results consistently yields a better retrieval accuracy. Moreover, GAR achieves the state-of-the-art performance of extractive QA on the Natural Questions and TriviaQA datasets when equipped with an extractive reader. ","authors":[{"id":"542a9c8fdabfae5346b027e9","name":"Yuning Mao"},{"id":"54307d8cdabfaea2f5554f4c","name":"Pengcheng He"},{"id":"5429f74fdabfaec7081d080e","name":"Xiaodong Liu"},{"id":"53f43ddedabfaedd74dd7eac","name":"Yelong Shen"},{"id":"53f428e8dabfaec22b9e1c5d","name":"Jianfeng Gao"},{"id":"53f42f36dabfaedce54dcd0c","name":"Jiawei Han"},{"id":"53f44b16dabfaedf435df98d","name":"Weizhu Chen"}],"id":"5f68708291e011c23f13b4d8","num_citation":7,"order":6,"title":"Generation-Augmented Retrieval for Open-domain Question Answering","urls":["https:\u002F\u002Farxiv.org\u002Fabs\u002F2009.08553"],"versions":[{"id":"5f68708291e011c23f13b4d8","sid":"2009.08553","src":"arxiv","year":2020}],"year":2020},{"abstract":" We present a novel approach to named entity recognition (NER) in the presence of scarce data that we call example-based NER. Our train-free few-shot learning approach takes inspiration from question-answering to identify entity spans in a new and unseen domain. In comparison with the current state-of-the-art, the proposed method performs significantly better, especially when using a low number of support examples. ","authors":[{"name":"Morteza Ziyadi"},{"name":"Yuting Sun"},{"name":"Abhishek Goswami"},{"name":"Jade Huang"},{"id":"53f44b16dabfaedf435df98d","name":"Weizhu Chen"}],"flags":[{"flag":"affirm_author","person_id":"53f44b16dabfaedf435df98d"}],"id":"5f44fde291e011872f85efc0","num_citation":0,"order":4,"pdf":"https:\u002F\u002Fstatic.aminer.cn\u002Fstorage\u002Fpdf\u002Farxiv\u002F20\u002F2008\u002F2008.10570.pdf","title":"Example-Based Named Entity Recognition","urls":["https:\u002F\u002Farxiv.org\u002Fabs\u002F2008.10570"],"versions":[{"id":"5f44fde291e011872f85efc0","sid":"2008.10570","src":"arxiv","year":2020}],"year":2020},{"abstract":" This paper presents a comprehensive study to efficiently build named entity recognition (NER) systems when a small number of in-domain labeled data is available. Based upon recent Transformer-based self-supervised pre-trained language models (PLMs), we investigate three orthogonal schemes to improve the model generalization ability for few-shot settings: (1) meta-learning to construct prototypes for different entity types, (2) supervised pre-training on noisy web data to extract entity-related generic representations and (3) self-training to leverage unlabeled in-domain data. Different combinations of these schemes are also considered. We perform extensive empirical comparisons on 10 public NER datasets with various proportions of labeled data, suggesting useful insights for future research. Our experiments show that (i) in the few-shot learning setting, the proposed NER schemes significantly improve or outperform the commonly used baseline, a PLM-based linear classifier fine-tuned on domain labels; (ii) We create new state-of-the-art results on both few-shot and training-free settings compared with existing methods. We will release our code and pre-trained models for reproducible research. ","authors":[{"id":"562e056f45ce1e5967bd5558","name":"Jiaxin Huang"},{"id":"53f42d9adabfaee1c0a36753","name":"Chunyuan Li"},{"name":"Krishan Subudhi"},{"name":"Damien Jose"},{"name":"Shobana Balakrishnan"},{"id":"53f44b16dabfaedf435df98d","name":"Weizhu Chen"},{"id":"562d021645cedb3398d293a1","name":"Baolin Peng"},{"id":"53f428e8dabfaec22b9e1c5d","name":"Jianfeng Gao"},{"id":"53f42f36dabfaedce54dcd0c","name":"Jiawei Han"}],"id":"5feefe4691e0113b2659ff37","num_citation":0,"order":5,"pdf":"https:\u002F\u002Fstatic.aminer.cn\u002Fstorage\u002Fpdf\u002Farxiv\u002F20\u002F2012\u002F2012.14978.pdf","title":"Few-Shot Named Entity Recognition: A Comprehensive Study","urls":["https:\u002F\u002Farxiv.org\u002Fabs\u002F2012.14978"],"versions":[{"id":"5feefe4691e0113b2659ff37","sid":"2012.14978","src":"arxiv","year":2020}],"year":2020},{"abstract":" Generalization and robustness are both key desiderata for designing machine learning methods. Adversarial training can enhance robustness, but past work often finds it hurts generalization. In natural language processing (NLP), pre-training large neural language models such as BERT have demonstrated impressive gain in generalization for a variety of tasks, with further improvement from adversarial fine-tuning. However, these models are still vulnerable to adversarial attacks. In this paper, we show that adversarial pre-training can improve both generalization and robustness. We propose a general algorithm ALUM (Adversarial training for large neural LangUage Models), which regularizes the training objective by applying perturbations in the embedding space that maximizes the adversarial loss. We present the first comprehensive study of adversarial training in all stages, including pre-training from scratch, continual pre-training on a well-trained model, and task-specific fine-tuning. ALUM obtains substantial gains over BERT on a wide range of NLP tasks, in both regular and adversarial scenarios. Even for models that have been well trained on extremely large text corpora, such as RoBERTa, ALUM can still produce significant gains from continual pre-training, whereas conventional non-adversarial methods can not. ALUM can be further combined with task-specific fine-tuning to attain additional gains. The ALUM code is publicly available at https:\u002F\u002Fgithub.com\u002Fnamisan\u002Fmt-dnn. ","authors":[{"id":"5429f74fdabfaec7081d080e","name":"Liu Xiaodong"},{"name":"Cheng Hao"},{"id":"54307d8cdabfaea2f5554f4c","name":"He Pengcheng"},{"id":"53f44b16dabfaedf435df98d","name":"Chen Weizhu"},{"id":"54407c9cdabfae7f9b33cc0b","name":"Wang Yu"},{"id":"53f44623dabfaee43ec7cddc","name":"Poon Hoifung"},{"id":"53f428e8dabfaec22b9e1c5d","name":"Gao Jianfeng"}],"id":"5eabf33691e011664ffd248c","num_citation":17,"order":3,"pdf":"https:\u002F\u002Fstatic.aminer.cn\u002Fstorage\u002Fpdf\u002Farxiv\u002F20\u002F2004\u002F2004.08994.pdf","title":"Adversarial Training for Large Neural Language Models","urls":["https:\u002F\u002Farxiv.org\u002Fabs\u002F2004.08994"],"versions":[{"id":"5eabf33691e011664ffd248c","sid":"2004.08994","src":"arxiv","year":2020}],"year":2020}],"profilePubsTotal":91,"profilePatentsPage":1,"profilePatents":[],"profilePatentsTotal":0,"profilePatentsEnd":true,"profileProjectsPage":0,"profileProjects":null,"profileProjectsTotal":null,"newInfo":null,"checkDelPubs":[]}};