Pre-training and evaluation of BERT-based language models for Spanish. More information at https://huggingface.co/bertin-project/bertin-roberta-base-spanish
50 million documents in Spanish extracted from mC4 by applying perplexity sampling via mc4-sampling: "https://huggingface.co/datasets/bertin-project/mc4-sampling". Please refer to the BERTIN Project. The original dataset is the Multilingual Colossal, Cleaned version of Common Crawl's web crawl corpus (mC4), based on the Common Crawl dataset: "https://c...
A sampling-enabled version of mC4, the colossal, cleaned version of Common Crawl's web crawl corpus, based on the Common Crawl dataset: "https://commoncrawl.org". This is a version of AllenAI's processed version of Google's mC4 dataset, in which sampling methods are implemented so that sampling can be performed on the fly.
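As an illustration of the on-the-fly perplexity-sampling idea, the sketch below filters a document stream with a Gaussian keep-probability centered on a target perplexity, so documents far from the target (very low or very high perplexity) are kept less often. This is a minimal, self-contained sketch, not the project's actual implementation: the function names (`gaussian_keep_prob`, `perplexity_sample`), the `0.78` scaling factor, and the example perplexity values are all illustrative assumptions.

```python
import math
import random

def gaussian_keep_prob(perplexity, mean, std, factor=0.78):
    # Probability of keeping a document; peaks at `factor` when
    # the document's perplexity equals the target `mean`.
    # (`factor` here is an illustrative choice, not a project constant.)
    return factor * math.exp(-((perplexity - mean) ** 2) / (2 * std ** 2))

def perplexity_sample(docs, mean, std, factor=0.78, rng=None):
    # Stream-friendly filter: each doc is yielded with probability
    # given by the Gaussian weight of its precomputed perplexity,
    # so no full pass over the corpus is needed.
    rng = rng or random.Random(0)
    for doc in docs:
        if rng.random() < gaussian_keep_prob(doc["perplexity"], mean, std, factor):
            yield doc
```

In a streaming setting (e.g. iterating over mC4 with a perplexity score attached to each document), this kind of filter lets the sampled subset be drawn in a single pass.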