# bertin-project/bertin-base-gaussian

This is a RoBERTa-base model trained from scratch on Spanish text.

The training dataset is a subsample of mC4 totaling about 50 million examples. Sampling is biased towards documents with average perplexity values (using a Gaussian function), so documents with very large perplexity (poor quality) or very small perplexity (short, repetitive texts) are discarded more often.
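The sampling idea described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual sampling code: the function names, the `center`, `width`, and `factor` parameters, and the exact form of the Gaussian weight are all assumptions made for the example.

```python
import math
import random


def gaussian_keep_probability(perplexity, center, width, factor=1.0):
    """Probability of keeping a document, peaked at `center` (hypothetical helper).

    `center` and `width` are assumed to be estimated beforehand from the
    perplexity distribution of the corpus; `factor` scales how many
    documents are kept overall.
    """
    z = (perplexity - center) / width
    return min(1.0, factor * math.exp(-0.5 * z * z))


def subsample(docs, center, width, factor=1.0, seed=0):
    """Yield the (text, perplexity) pairs kept by Gaussian-weighted sampling."""
    rng = random.Random(seed)
    for text, ppl in docs:
        # Documents far from `center` (very high or very low perplexity)
        # get a keep probability close to zero and are usually dropped.
        if rng.random() < gaussian_keep_probability(ppl, center, width, factor):
            yield text, ppl
```

With this weighting, a document whose perplexity sits at `center` is always kept when `factor` is 1.0, while documents several widths away on either side are almost always discarded.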

This model was trained for 250,000 steps.

Mask token: `<mask>`
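A possible way to try the model with this mask token is the `fill-mask` pipeline from the Hugging Face `transformers` library. The sketch below assumes that `transformers` and a backend such as PyTorch are installed and that the model is published on the Hub under the id in the title; `make_prompt` and `fill_mask` are hypothetical helper names introduced only for this example.

```python
MODEL_ID = "bertin-project/bertin-base-gaussian"  # assumed Hub id, taken from the title
MASK_TOKEN = "<mask>"


def make_prompt(template):
    """Replace the `{mask}` placeholder with the model's mask token."""
    return template.format(mask=MASK_TOKEN)


def fill_mask(text, top_k=5):
    """Return the top_k predictions for the masked position in `text`.

    The import is done lazily so that the module can be loaded without
    `transformers` installed; the first call downloads the model weights.
    """
    from transformers import pipeline

    unmasker = pipeline("fill-mask", model=MODEL_ID)
    return unmasker(text, top_k=top_k)
```

For example, `fill_mask(make_prompt("Madrid es la {mask} de España."))` should return a list of dictionaries, each with `token_str` and `score` entries for one candidate word.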