Unsupervised Domain Clusters in Pretrained Language Models

ACL, pp. 7747-7763, 2020.

Abstract:

The notion of "in-domain data" in NLP is often over-simplistic and vague, as textual data varies in many nuanced linguistic aspects such as topic, style or level of formality. In addition, domain labels are many times unavailable, making it challenging to build domain-specific systems. We show that massive pre-trained language models implicitly learn sentence representations that cluster by domain without supervision, suggesting a simple data-driven definition of domains in textual data. We harness this property and propose domain data selection methods based on such models, which require only a small set of in-domain monolingual data. We evaluate our data selection methods for neural machine translation across five diverse domains, where they outperform an established approach as measured by both BLEU and by precision and recall of sentence selection with respect to an oracle.

Introduction
  • It is common knowledge in modern NLP that using large amounts of high-quality training data is a key aspect in building successful machine-learning based systems.
  • Domain labels are usually unavailable – e.g. in large-scale web-crawled data like Common Crawl, which was recently used to train state-of-the-art pretrained language models for various tasks (Raffel et al., 2019).
  • The authors propose methods to leverage these emergent domain clusters for domain data selection in two ways: distance-based ranking in vector space (Domain-Cosine) and pretrained LM fine-tuning (Domain-Finetune).
Highlights
  • It is common knowledge in modern NLP that using large amounts of high-quality training data is a key aspect in building successful machine-learning based systems
  • As pretrained LMs demonstrate state-of-the-art performance across many NLP tasks after being trained on massive amounts of data, we hypothesize that the robust representations they learn can be useful for mapping sentences to domains using an unsupervised, data-driven approach. We show that these models learn to cluster sentence representations by domain without further supervision (e.g. Figure 1), and quantify this phenomenon by fitting Gaussian Mixture Models (GMMs) to the learned representations and measuring the purity of the resulting unsupervised clustering (a sketch of this procedure follows this list)
  • We evaluate our method on data selection for neural machine translation (NMT) using the multi-domain German-English parallel corpus composed by Koehn and Knowles (2017)
  • It is nice to see that all selection methods performed better than using all the available data or the oracle-selected data when averaged across all domains, showing again that more data is not necessarily better in multi-domain scenarios and that data selection is a useful approach
  • We showed that massive pre-trained language models are highly effective in mapping data to domains in a fully-unsupervised manner using average-pooled sentence representations and Gaussian Mixture Model-based clustering
  • We proposed new methods to harness this property for domain data selection using distance-based ranking in vector space and pretrained LM fine-tuning, requiring only a small set of in-domain data
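
    The snippet below is a minimal sketch of the clustering measurement described above. It assumes the Hugging Face transformers and scikit-learn libraries (both cited in the references); the model name, batch size, and lack of dimensionality reduction are illustrative choices, not necessarily the authors' exact setup.

```python
# Minimal sketch: unsupervised domain clustering with average-pooled
# sentence representations and a Gaussian Mixture Model.
# Assumes: pip install torch transformers scikit-learn numpy
import numpy as np
import torch
from sklearn.mixture import GaussianMixture
from transformers import AutoModel, AutoTokenizer


def embed_sentences(sentences, model_name="distilbert-base-uncased", batch_size=32):
    """Average-pool the last hidden layer over non-padding tokens: one vector per sentence."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name).eval()
    vectors = []
    with torch.no_grad():
        for i in range(0, len(sentences), batch_size):
            batch = tokenizer(sentences[i:i + batch_size], padding=True,
                              truncation=True, return_tensors="pt")
            hidden = model(**batch).last_hidden_state          # (B, T, H)
            mask = batch["attention_mask"].unsqueeze(-1)       # ignore padding tokens
            pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
            vectors.append(pooled.numpy())
    return np.vstack(vectors)


def clustering_purity(sentences, domain_labels, k=5):
    """Fit a k-component GMM to the representations and measure purity vs. gold domains."""
    X = embed_sentences(sentences)
    clusters = GaussianMixture(n_components=k, random_state=0).fit_predict(X)
    label_ids = {d: i for i, d in enumerate(sorted(set(domain_labels)))}
    y = np.array([label_ids[d] for d in domain_labels])
    # Purity: assign each cluster its majority domain and count matching sentences.
    correct = sum(np.bincount(y[clusters == c]).max()
                  for c in range(k) if np.any(clusters == c))
    return correct / len(y)
```

    With k equal to the number of domains (or larger, as in the k=10 and k=15 settings of Table 1), the returned purity corresponds to the clustering accuracy discussed in the Results.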
Methods
  • The authors propose two methods for domain data selection with pretrained language models.

    Domain-Cosine In this method the authors first compute a query vector, which is the element-wise average over the vector representations of the sentences in the small in-domain set. The general-domain sentences are then ranked by the cosine similarity of their representations to this query vector, and the top-ranked sentences are selected (see the first sketch after this list).
  • Domain-Finetune It is common knowledge that pretrained language models are especially useful when fine-tuned for the task of interest in an end-to-end manner (Ruder et al., 2019).
  • In this method the authors fine-tune the pretrained LM for binary classification, using the in-domain sentences as positive examples and randomly sampled general-domain sentences as negative examples (see the second sketch after this list).
  • This can be seen as an instance of positive-unlabeled learning for document-set expansion; see Jacovi et al. (2019) for a recent discussion and methodology for this task
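
    Below is a minimal sketch of Domain-Cosine, reusing the embed_sentences helper from the clustering sketch above; the selection size and the implementation details are assumptions for illustration rather than the authors' exact code.

```python
import numpy as np


def domain_cosine_select(in_domain_sentences, general_sentences, n_select):
    """Rank general-domain sentences by cosine similarity to the in-domain query vector."""
    # Query vector: element-wise average of the in-domain sentence representations.
    query = embed_sentences(in_domain_sentences).mean(axis=0)
    candidates = embed_sentences(general_sentences)
    # Cosine similarity of each candidate representation to the query vector.
    sims = candidates @ query / (
        np.linalg.norm(candidates, axis=1) * np.linalg.norm(query) + 1e-8)
    top = np.argsort(-sims)[:n_select]
    return [general_sentences[i] for i in top]
```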
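
    And a sketch of Domain-Finetune as binary sequence classification with transformers; the optimizer, learning rate, number of epochs, and the use of the positive-class probability as the selection score are illustrative assumptions, and the pre-ranking refinement for choosing negatives (Table 5) is omitted for brevity.

```python
import random
import torch
from torch.utils.data import DataLoader
from transformers import AutoModelForSequenceClassification, AutoTokenizer


def domain_finetune_select(in_domain, general_pool, n_select,
                           model_name="distilbert-base-uncased", epochs=1):
    """Fine-tune a pretrained LM to separate in-domain (label 1) from randomly sampled
    general-domain (label 0) sentences, then keep the highest-scoring general sentences."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

    # Positives: the small in-domain set; negatives: random general-domain sentences.
    negatives = random.sample(general_pool, k=len(in_domain))
    texts = list(in_domain) + negatives
    labels = torch.tensor([1] * len(in_domain) + [0] * len(negatives))
    enc = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    dataset = list(zip(enc["input_ids"], enc["attention_mask"], labels))
    loader = DataLoader(dataset, batch_size=16, shuffle=True)

    optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
    model.train()
    for _ in range(epochs):
        for input_ids, attention_mask, y in loader:
            out = model(input_ids=input_ids, attention_mask=attention_mask, labels=y)
            out.loss.backward()
            optimizer.step()
            optimizer.zero_grad()

    # Score every general-domain sentence with the positive-class probability.
    model.eval()
    scores = []
    with torch.no_grad():
        for i in range(0, len(general_pool), 64):
            batch = tokenizer(general_pool[i:i + 64], padding=True,
                              truncation=True, return_tensors="pt")
            probs = torch.softmax(model(**batch).logits, dim=-1)[:, 1]
            scores.extend(probs.tolist())
    ranked = sorted(range(len(general_pool)), key=lambda i: -scores[i])
    return [general_pool[i] for i in ranked[:n_select]]
```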
Results
  • Results and Discussion

    As can be seen in Table 1, pre-trained language models are highly capable of generating sentence representations that cluster by domains, resulting in up to 87.66%, 89.04% and 89.94% accuracy when using k=5, k=10 and k=15 clusters, respectively, across 10,000 sentences in 5 domains.
  • As the Subtitles cluster (Blue) is closer to the Koran cluster (Green), the highest cross-domain BLEU score on the Koran test set is from the Subtitles model
  • To further quantify this phenomenon, the authors plot and measure Pearson’s correlation between the cosine similarity of the centroids of the English BERT-based dev sentence representations for each domain pair and the cross-domain BLEU score for that domain pair (a small sketch of this analysis follows this list).
  • Note that the authors used DistilBERT in these experiments; they believe that using larger, non-distilled models may result in even better selection performance
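
    A small sketch of that correlation analysis, assuming the per-domain centroid vectors and the cross-domain SacreBLEU scores (as in Table 4) are already computed; scipy.stats.pearsonr is used here as one reasonable choice, not necessarily the authors' exact tooling.

```python
from itertools import permutations

import numpy as np
from scipy.stats import pearsonr


def centroid_bleu_correlation(centroids, bleu):
    """Correlate centroid cosine similarity with cross-domain BLEU.

    centroids: dict mapping domain name -> mean dev-set sentence vector (np.ndarray)
    bleu:      dict mapping (train_domain, test_domain) -> SacreBLEU score
    """
    cos, scores = [], []
    for a, b in permutations(centroids, 2):           # ordered domain pairs, a != b
        u, v = centroids[a], centroids[b]
        cos.append(float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v))))
        scores.append(bleu[(a, b)])
    r, p = pearsonr(cos, scores)                      # Pearson's r and its p-value
    return r, p
```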
Conclusion
  • Conclusions and Future Work

    The authors showed that massive pre-trained language models are highly effective in mapping data to domains in a fully-unsupervised manner using average-pooled sentence representations and GMM-based clustering.
  • This work just scratches the surface with what can be done on the subject; possible avenues for future work include extending this with multilingual data selection and multilingual LMs (Conneau and Lample, 2019; Conneau et al, 2019; Wu et al, 2019; Hu et al, 2020), using such selection methods with domain-curriculum training (Zhang et al, 2019; Wang et al, 2019b), applying them on noisy, web-crawled data (Junczys-Dowmunt, 2018) or for additional tasks (Gururangan et al, 2020)
  • Another interesting avenue is applying this to unsupervised NMT, which is highly sensitive to domain mismatch (Marchisio et al, 2020; Kim et al, 2020).
  • The authors hope this work will encourage more research on finding the right data for the task, towards more efficient and robust NLP
Tables
  • Table 1: Unsupervised domain clustering as measured by purity for the different models. Best results are marked in bold for each setting
  • Table 2: Sentences from one domain which were assigned to a cluster of another domain by the BERT-based clustering, k=5
  • Table 3: Number of training examples for each domain in the original split (Muller et al., 2019) and in our split
  • Table 4: SacreBLEU (Post, 2018) scores of our baseline systems on the test sets of the new data split. Each row represents the results from one model on each test set. The best result in each column is marked in bold
  • Table 5: Ablation analysis showing precision (p), recall (r) and F1 for the binary classification accuracy on a held-out set, with and without pre-ranking
  • Table 6: SacreBLEU scores for the data selection experiments. Highest scores per column are marked in bold
  • Table 7: Precision (p) and recall (r) for data selection of 500k sentences with respect to the oracle selection
  • Table 8: Details about the different data splits for the multi-domain corpus
Related work
  • Previous works used n-gram LMs for data selection (Moore and Lewis, 2010; Axelrod et al., 2011) or other count-based methods (Axelrod, 2017; Poncelas et al., 2018; Parcheta et al., 2018; Santamaría and Axelrod, 2019). While such methods work well in practice, they cannot generalize beyond the n-grams observed in the in-domain datasets, which are usually small; a sketch of the cross-entropy difference criterion behind such methods appears at the end of this section.

    Duh et al (2013) proposed to replace n-gram models with RNN-based LMs with notable improvements. However, such methods do not capture the rich sentence-level global context as in the recent self-attention-based MLMs; as we showed in the clustering experiments, autoregressive neural LMs were inferior to masked LMs in clustering the data by domain. In addition, training very large neural LMs may be prohibitive without relying on pre-training.

    Regarding domain clustering for MT, Hasler et al (2014) discovered topics using LDA instead of using domain labels. Cuong et al (2016) induced latent subdomains from the training data using a dedicated probabilistic model.
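
    For reference, a minimal sketch of the cross-entropy difference criterion underlying Moore and Lewis (2010)-style selection; the LM scoring functions are left abstract (they could come from n-gram models such as KenLM), and thresholding by a fixed selection size is an assumption for illustration.

```python
from typing import Callable, List


def cross_entropy_difference_select(
    sentences: List[str],
    in_domain_xent: Callable[[str], float],   # per-word cross-entropy under the in-domain LM
    general_xent: Callable[[str], float],     # per-word cross-entropy under the general LM
    n_select: int,
) -> List[str]:
    """Moore-Lewis style selection: prefer sentences the in-domain LM finds easy
    (low cross-entropy) relative to the general-domain LM."""
    scored = sorted(sentences, key=lambda s: in_domain_xent(s) - general_xent(s))
    return scored[:n_select]
```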
Reference
  • Roee Aharoni and Yoav Goldberg. 2018. Split and rephrase: Better evaluation and stronger baselines. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 719–724, Melbourne, Australia. Association for Computational Linguistics.
  • Mikko Aulamo and Jorg Tiedemann. 2019. The OPUS resource repository: An open package for creating parallel corpora and machine translation services. In Proceedings of the 22nd Nordic Conference on Computational Linguistics, pages 389–394, Turku, Finland. Linkoping University Electronic Press.
  • Amittai Axelrod. 2017. Cynical selection of language model training data. arXiv preprint arXiv:1709.02279.
  • Amittai Axelrod, Xiaodong He, and Jianfeng Gao. 2011. Domain adaptation via pseudo in-domain data selection. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 355–362, Edinburgh, Scotland, UK. Association for Computational Linguistics.
  • Ankur Bapna and Orhan Firat. 2019. Non-parametric adaptation for neural machine translation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 1921–1931, Minneapolis, Minnesota. Association for Computational Linguistics.
  • Loïc Barrault, Ondřej Bojar, Marta R. Costa-jussà, Christian Federmann, Mark Fishel, Yvette Graham, Barry Haddow, Matthias Huck, Philipp Koehn, Shervin Malmasi, Christof Monz, Mathias Müller, Santanu Pal, Matt Post, and Marcos Zampieri. 2019. Findings of the 2019 conference on machine translation (WMT19). In Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1), pages 1–61, Florence, Italy. Association for Computational Linguistics.
  • David M Blei, Andrew Y Ng, and Michael I Jordan. 2003. Latent dirichlet allocation. Journal of machine Learning research, 3(Jan):993–1022.
  • Chenhui Chu and Rui Wang. 2018. A survey of domain adaptation for neural machine translation. In Proceedings of the 27th International Conference on Computational Linguistics, pages 1304–1319, Santa Fe, New Mexico, USA. Association for Computational Linguistics.
  • Kevin Clark, Urvashi Khandelwal, Omer Levy, and Christopher D. Manning. 2019. What does BERT look at? An analysis of BERT’s attention. arXiv preprint arXiv:1906.04341.
  • Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzman, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Unsupervised cross-lingual representation learning at scale. arXiv preprint arXiv:1911.02116.
  • Alexis Conneau and Guillaume Lample. 2019. Crosslingual language model pretraining. In Advances in Neural Information Processing Systems, pages 7057–7067.
  • Susan M Conrad and Douglas Biber. 2005. The frequency and use of lexical bundles in conversation and academic prose. Lexicographica.
  • Hoang Cuong, Khalil Sima’an, and Ivan Titov. 2016. Adapting to all domains at once: Rewarding domain invariance in SMT. Transactions of the Association for Computational Linguistics, 4:99–112.
  • Hal Daume. 2009. K-means vs gmm, sum-product vs max-product.
  • Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.
  • Zi-Yi Dou, Junjie Hu, Antonios Anastasopoulos, and Graham Neubig. 2019a. Unsupervised domain adaptation for neural machine translation with domain-aware feature embeddings. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 1417–1422, Hong Kong, China. Association for Computational Linguistics.
  • Zi-Yi Dou, Xinyi Wang, Junjie Hu, and Graham Neubig. 2019b. Domain differential adaptation for neural machine translation. In Proceedings of the 3rd Workshop on Neural Generation and Translation, Hong Kong. Association for Computational Linguistics.
  • Kevin Duh, Graham Neubig, Katsuhito Sudoh, and Hajime Tsukada. 2013. Adaptation data selection using neural language models: Experiments in machine translation. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 678–683, Sofia, Bulgaria. Association for Computational Linguistics.
  • Mirela-Stefania Duma and Wolfgang Menzel. 2016. Data selection for IT texts using paragraph vector. In Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers, pages 428–434, Berlin, Germany. Association for Computational Linguistics.
  • Sauleh Eetemadi, William Lewis, Kristina Toutanova, and Hayder Radha. 2015. Survey of data-selection methods in statistical machine translation. Machine Translation, 29(3-4):189–223.
  • M. Amin Farajian, Marco Turchi, Matteo Negri, and Marcello Federico. 2017. Multi-domain neural machine translation through unsupervised adaptation. In Proceedings of the Second Conference on Machine Translation, pages 127–137, Copenhagen, Denmark. Association for Computational Linguistics.
  • Guillem Gasco, Martha-Alicia Rocha, German Sanchis-Trilles, Jesus Andres-Ferrer, and Francisco Casacuberta. 2012. Does more data always yield better translations? In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 152–161, Avignon, France. Association for Computational Linguistics.
  • Yoav Goldberg. 2019. Assessing bert’s syntactic abilities. arXiv preprint arXiv:1901.05287.
  • Suchin Gururangan, Ana Marasović, Swabha Swayamdipta, Kyle Lo, Iz Beltagy, Doug Downey, and Noah A. Smith. 2020. Don’t stop pretraining: Adapt language models to domains and tasks. ACL.
  • Gholamreza Haffari, Maxim Roy, and Anoop Sarkar. 2009. Active learning for statistical phrase-based machine translation. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 415–423, Boulder, Colorado. Association for Computational Linguistics.
  • Eva Hasler, Phil Blunsom, Philipp Koehn, and Barry Haddow. 2014. Dynamic topic adaptation for phrase-based MT. In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, pages 328– 337, Gothenburg, Sweden. Association for Computational Linguistics.
  • Kenneth Heafield. 2011. KenLM: Faster and smaller language model queries. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 187–197, Edinburgh, Scotland. Association for Computational Linguistics.
  • Junjie Hu, Sebastian Ruder, Aditya Siddhant, Graham Neubig, Orhan Firat, and Melvin Johnson. 2020. Xtreme: A massively multilingual multi-task benchmark for evaluating cross-lingual generalization. arXiv preprint arXiv:2003.11080.
  • Junjie Hu, Mengzhou Xia, Graham Neubig, and Jaime Carbonell. 2019. Domain adaptation of neural machine translation by lexicon induction. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy. Association for Computational Linguistics.
  • Alon Jacovi, Gang Niu, Yoav Goldberg, and Masashi Sugiyama. 2019. Scalable evaluation and improvement of document set expansion via neural positive-unlabeled learning. arXiv preprint arXiv:1910.13339.
  • Marcin Junczys-Dowmunt. 2018. Dual conditional cross-entropy filtering of noisy parallel corpora. In Proceedings of the Third Conference on Machine Translation: Shared Task Papers, pages 888–895, Belgium, Brussels. Association for Computational Linguistics.
  • Yunsu Kim, Miguel Graca, and Hermann Ney. 2020. When and why is unsupervised neural machine translation useless? arXiv preprint arXiv:2004.10581.
  • Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
  • Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics Companion Volume Proceedings of the Demo and Poster Sessions, pages 177–180, Prague, Czech Republic. Association for Computational Linguistics.
  • Philipp Koehn and Rebecca Knowles. 2017. Six challenges for neural machine translation. In Proceedings of the First Workshop on Neural Machine Translation, pages 28–39, Vancouver. Association for Computational Linguistics.
  • Katherine Lee, Orhan Firat, Ashish Agarwal, Clara Fannjiang, and David Sussillo. 2018. Hallucinations in neural machine translation.
  • Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692.
  • Xiaofei Ma, Peng Xu, Zhiguo Wang, Ramesh Nallapati, and Bing Xiang. 2019. Domain adaptation with BERT-based domain classification and data selection. In Proceedings of the 2nd Workshop on Deep Learning Approaches for Low-Resource NLP (DeepLo 2019), pages 76–83, Hong Kong, China. Association for Computational Linguistics.
  • Christopher D Manning, Prabhakar Raghavan, and Hinrich Schutze. 2008. Introduction to information retrieval. Cambridge university press.
  • Kelly Marchisio, Kevin Duh, and Philipp Koehn. 2020. When does unsupervised machine translation work? arXiv preprint arXiv:2004.05516.
  • Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pages 3111–3119.
  • Robert C. Moore and William Lewis. 2010. Intelligent selection of language model training data. In Proceedings of the ACL 2010 Conference Short Papers, pages 220–224, Uppsala, Sweden. Association for Computational Linguistics.
  • Mathias Muller, Annette Rios, and Rico Sennrich. 2019. Domain robustness in neural machine translation.
  • Nathan Ng, Kyra Yee, Alexei Baevski, Myle Ott, Michael Auli, and Sergey Edunov. 2019. Facebook FAIR’s WMT19 news translation task submission. In Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1), pages 314–319, Florence, Italy. Association for Computational Linguistics.
  • Xing Niu, Marianna Martindale, and Marine Carpuat. 2017. A study of style in machine translation: Controlling the formality of machine translation output. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2814–2819, Copenhagen, Denmark. Association for Computational Linguistics.
  • Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. fairseq: A fast, extensible toolkit for sequence modeling. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), pages 48–53, Minneapolis, Minnesota. Association for Computational Linguistics.
  • Zuzanna Parcheta, German Sanchis-Trilles, and Francisco Casacuberta. 2018. Data selection for nmt using infrequent n-gram recovery. EAMT 2018.
  • F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830.
  • Álvaro Peris, Mara Chinea-Ríos, and Francisco Casacuberta. 2017. Neural networks classifier for data selection in statistical machine translation. The Prague Bulletin of Mathematical Linguistics, 108(1):283–294.
  • Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237, New Orleans, Louisiana. Association for Computational Linguistics.
  • Alberto Poncelas, Gideon Maillette de Buy Wenniger, and Andy Way. 2018. Data selection with feature decay algorithms using an approximated target side. arXiv preprint arXiv:1811.03039.
  • Matt Post. 2018. A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 186– 191, Belgium, Brussels. Association for Computational Linguistics.
  • Ofir Press and Lior Wolf. 2017. Using the output embedding to improve language models. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 157–163, Valencia, Spain. Association for Computational Linguistics.
  • Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. OpenAI blog.
  • Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2019. Exploring the limits of transfer learning with a unified text-to-text transformer.
  • Sebastian Ruder, Matthew E. Peters, Swabha Swayamdipta, and Thomas Wolf. 2019. Transfer learning in natural language processing. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Tutorials, pages 15–18, Minneapolis, Minnesota. Association for Computational Linguistics.
  • Sebastian Ruder and Barbara Plank. 2017. Learning to select data for transfer learning with Bayesian optimization. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 372–382, Copenhagen, Denmark. Association for Computational Linguistics.
  • Hassan Sajjad, Nadir Durrani, Fahim Dalvi, Yonatan Belinkov, and Stephan Vogel. 2017. Neural machine translation training in a multi-domain scenario. arXiv preprint arXiv:1708.08712.
  • Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2019. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108.
  • Lucía Santamaría and Amittai Axelrod. 2019. Data selection with cluster-based language difference models and cynical selection. arXiv preprint arXiv:1904.04900.
  • Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725, Berlin, Germany. Association for Computational Linguistics.
  • Catarina Cruz Silva, Chao-Hong Liu, Alberto Poncelas, and Andy Way. 2018. Extracting in-domain training corpora for neural machine translation using data selection methods. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 224–231, Belgium, Brussels. Association for Computational Linguistics.
  • Ian Tenney, Dipanjan Das, and Ellie Pavlick. 2019. BERT rediscovers the classical NLP pipeline. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4593–4601, Florence, Italy. Association for Computational Linguistics.
  • Jorg Tiedemann. 2012. Parallel data, tools and interfaces in OPUS. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC-2012), pages 2214–2218, Istanbul, Turkey. European Languages Resources Association (ELRA).
  • Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 5998–6008. Curran Associates, Inc.
  • Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R Bowman. 2019a. Superglue: A stickier benchmark for general-purpose language understanding systems. arXiv preprint arXiv:1905.00537.
  • Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2018. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 353–355, Brussels, Belgium. Association for Computational Linguistics.
  • Rui Wang, Andrew Finch, Masao Utiyama, and Eiichiro Sumita. 2017. Sentence embedding for neural machine translation domain adaptation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 560–566, Vancouver, Canada. Association for Computational Linguistics.
  • Wei Wang, Isaac Caswell, and Ciprian Chelba. 2019b. Dynamically composing domain-data selection with clean-data selection by “co-curricular learning” for neural machine translation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1282–1292, Florence, Italy. Association for Computational Linguistics.
  • Marlies van der Wees. 2017. What’s in a Domain? Towards Fine-Grained Adaptation for Machine Translation. Ph.D. thesis, University of Amsterdam.
  • Marlies van der Wees, Arianna Bisazza, Wouter Weerkamp, and Christof Monz. 2015. What’s in a domain? analyzing genre and topic differences in statistical machine translation. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 560–566, Beijing, China. Association for Computational Linguistics.
  • Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. 2019. HuggingFace’s Transformers: State-of-the-art natural language processing. ArXiv, abs/1910.03771.
  • Shijie Wu, Alexis Conneau, Haoran Li, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Emerging cross-lingual structure in pretrained language models. arXiv preprint arXiv:1911.01464.
  • Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. arXiv preprint arXiv:1906.08237.
  • Xuan Zhang, Pamela Shapiro, Gaurav Kumar, Paul McNamee, Marine Carpuat, and Kevin Duh. 2019. Curriculum learning for domain adaptation in neural machine translation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 1903–1915, Minneapolis, Minnesota. Association for Computational Linguistics.