Asynchronous Pipeline for Processing Huge Corpora on Medium to Low Resource Infrastructures
7th Workshop on the Challenges in the Management of Large Corpora, 2019.
Common Crawl is a very large, heterogeneous multilingual corpus composed of documents crawled from the internet. It surpasses 20TB of data and is distributed as a set of more than 50 thousand plain text files, each containing many documents written in a wide variety of languages. Even though each document has a metadata block assoc...