CCNet: Extracting High Quality Monolingual Datasets from Web Crawl Data

LREC, pp. 4003-4012, 2019.


Abstract:

Pre-training text representations has led to significant improvements in many areas of natural language processing. The quality of these models benefits greatly from the size of the pretraining corpora, as long as the quality of those corpora is preserved. In this paper, we describe an automatic pipeline to extract massive high-quality monolingual datas...
