CCNet: Extracting High Quality Monolingual Datasets from Web Crawl Data
LREC, pp. 4003-4012, 2019.
Abstract:
Pre-training text representations has led to significant improvements in many areas of natural language processing. The quality of these models benefits greatly from the size of the pretraining corpora, as long as the quality of the corpora is preserved. In this paper, we describe an automatic pipeline to extract massive high-quality monolingual datas...