Asynchronous Pipeline for Processing Huge Corpora on Medium to Low Resource Infrastructures

7th Workshop on the Challenges in the Management of Large Corpora, 2019.

Abstract:

Common Crawl is a considerably large, heterogeneous multilingual corpus composed of documents crawled from the internet. It surpasses 20TB of data and is distributed as a set of more than 50 thousand plain text files, each of which contains many documents written in a wide variety of languages. Even though each document has a metadata block assoc...
