language: es
license: CC-BY 4.0
tags:
  - spanish
  - roberta
pipeline_tag: fill-mask
widget:
  - text: Fui a la librería a comprar un <mask>.
  • Version 1 (beta): July 15th, 2021
  • Version 1: July 19th, 2021

BERTIN

BERTIN is a series of BERT-based models for Spanish. The current model hub points to the best of all RoBERTa-base models trained from scratch on the Spanish portion of mC4 using Flax. All code and scripts are included.

This work is part of the Flax/JAX Community Week, organised by HuggingFace, with TPU usage sponsored by Google Cloud.

Spanish mC4

mC4 is a multilingual variant of C4, the Colossal, Cleaned version of Common Crawl's web crawl corpus. While C4 was used to train the T5 text-to-text Transformer models, mC4 comprises natural text in 101 languages drawn from the public Common Crawl web scrape and was used to train mT5, the multilingual version of T5.

The Spanish portion of mC4 (mc4-es) contains about 416 million samples and 235 billion words in approximately 1TB of uncompressed data.

$ zcat c4/multilingual/c4-es*.tfrecord*.json.gz | wc -l
416057992
$ zcat c4/multilingual/c4-es*.tfrecord-*.json.gz | jq -r '.text | split(" ") | length' | paste -s -d+ - | bc
235303687795
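
The same two counts can be reproduced without jq. The following is only a rough Python equivalent, assuming each shard is a gzip-compressed file with one JSON object per line and a `text` field, as the commands above imply. The function name `count_samples_and_words` is ours, for illustration.

```python
import glob
import gzip
import json


def count_samples_and_words(pattern):
    """Count JSON lines and space-separated words across gzipped shards."""
    samples = 0
    words = 0
    for path in sorted(glob.glob(pattern)):
        with gzip.open(path, "rt", encoding="utf-8") as shard:
            for line in shard:
                if not line.strip():
                    continue  # ignore blank lines, if any
                samples += 1
                # Mirror jq's split(" "): split on single spaces only.
                words += len(json.loads(line)["text"].split(" "))
    return samples, words
```

Calling `count_samples_and_words("c4/multilingual/c4-es*.tfrecord*.json.gz")` should give figures comparable to the ones above, although a single-process pass over roughly 1TB of text will be slow.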

Perplexity sampling

Since the amount of Spanish text in mC4 is too large to train a language model on in a reasonable time, within the context of the Flax/JAX Community Event by HuggingFace we explored the possibility of creating an optimal subset of the samples, good enough to train a well-performing model with roughly one eighth of the data (~50M samples) and in approximately half the steps. The goal was to pre-train a RoBERTa-base model from scratch during the Flax/JAX Community Event, for which Google Cloud provided free TPUv3-8 machines to do the training using HuggingFace's Flax implementations of their library.

In order to build this subset of data efficiently, we decided to leverage a technique we now call perplexity sampling, whose origin can be traced to the construction of CCNet (Wenzek et al., 2020) and their work extracting high quality monolingual datasets from web crawl data. In their work, they suggest the possibility of applying fast language models trained on high quality data such as Wikipedia to filter out text that deviates too much from correct expressions of a language (see Figure 1). They also released Kneser-Ney models for 100 languages (Spanish included) as implemented in the KenLM library (Heafield, 2011) and trained on their respective Wikipedias.
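
To make the idea of perplexity scoring concrete, here is a minimal sketch. It deliberately swaps the Kneser-Ney/KenLM models for a toy add-one-smoothed unigram model so that it runs without any model file. `UnigramLM` and its methods are hypothetical names, and the values it produces are not comparable to the perplexities discussed in this document. It only shows that text resembling the clean training data scores lower than text that does not.

```python
import math
from collections import Counter


class UnigramLM:
    """Toy add-one-smoothed unigram model (a stand-in, not Kneser-Ney)."""

    def __init__(self, clean_texts):
        self.counts = Counter(w for t in clean_texts for w in t.lower().split())
        self.total = sum(self.counts.values())
        self.vocab_size = len(self.counts) + 1  # +1 slot for unseen words

    def log10_prob(self, word):
        return math.log10((self.counts[word] + 1) / (self.total + self.vocab_size))

    def perplexity(self, text):
        words = text.lower().split()
        if not words:
            return float("inf")
        log_score = sum(self.log10_prob(w) for w in words)
        # Perplexity is 10 to the negative average log10 probability per word.
        return 10.0 ** (-log_score / len(words))


# Tiny made-up "clean" corpus, for illustration only.
lm = UnigramLM(["el gato duerme en la casa", "la casa es grande"])
in_domain = lm.perplexity("la casa es grande")
off_domain = lm.perplexity("zxq qqq click aquí")
```

Here `in_domain` comes out lower than `off_domain`, which is the signal a perplexity filter relies on.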

Figure 1. Perplexity distributions by percentage of the CCNet corpus.

In this work, we tested the hypothesis that perplexity sampling might help reduce training data size and time.

Methodology

In order to test our hypothesis, we first set out to calculate the perplexity of each document in the entire mc4-es and extract its distribution and quartiles. In practice, we only extracted perplexity values for roughly a quarter of the dataset and plotted their distribution and the corresponding quartiles (see Figure 2).

Figure 2. Perplexity distributions and quartiles (red lines) of 100M samples of mc4-es.
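
Quartile boundaries like the red lines in Figure 2 can be computed from any list of per-document perplexities. A minimal sketch with the Python standard library follows. The perplexity values are made up for illustration, and the choice of the `inclusive` interpolation method is our assumption, not something the experiments specify.

```python
import statistics

# Made-up perplexity values, for illustration only.
perplexities = [120.0, 340.0, 560.0, 610.0, 700.0, 880.0, 1500.0, 9000.0]

# n=4 returns the three cut points: Q1, the median (Q2) and Q3.
q1, median, q3 = statistics.quantiles(perplexities, n=4, method="inclusive")
```

The long right tail (the 9000.0 outlier here) barely moves the quartiles, which is one reason to define the sampling functions in terms of quartiles rather than the mean.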

With the extracted perplexity percentiles, we created two functions to oversample the central quartiles, with the idea of excluding samples whose perplexity was either too low (short, repetitive texts) or too high (potentially poor quality) (see Figure 3). The first function was a stepwise function that simply oversampled the central quartiles using the quartile boundaries and a factor for how heavily these should be oversampled. The second function was a Gaussian approximation of the stepwise function, to smooth out the sharp boundaries and give a better approximation of the underlying distribution (see Figure 4). We adjusted the factor parameter of the stepwise function, and the factor and width parameters of the Gaussian function, to be able to sample roughly 50M samples from the 416M in mc4-es (see Figure 4). For comparison, we also sampled mc4-es randomly up to 50M samples. In terms of size, we went down from 1TB of data to ~200GB.

Figure 3. Expected perplexity distributions of the sampled `mc4-es` after applying `stepwise` function.

Figure 4. Expected perplexity distributions of the sampled `mc4-es` after applying `gaussian` function.
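
The two sampling functions can be sketched as keep probabilities that depend on a document's perplexity. This is a simplified illustration under our own assumptions, not the code used in the experiments: the function names, the `base_rate` parameter, the exact shape of the Gaussian, the quartile boundaries and all parameter values below are hypothetical.

```python
import math
import random


def stepwise_keep_probability(perplexity, boundaries, factor, base_rate):
    """Keep probability that is `factor` times higher between Q1 and Q3."""
    q1, _, q3 = boundaries
    rate = base_rate * factor if q1 <= perplexity <= q3 else base_rate
    return min(1.0, rate)


def gaussian_keep_probability(perplexity, boundaries, factor, width):
    """Smooth bell centred on the median; `width` controls how fast it decays."""
    median = boundaries[1]
    relative_distance = (perplexity - median) / median
    return min(1.0, factor * math.exp(-(relative_distance ** 2) / width))


def subsample(perplexities, keep_probability, seed=0, **params):
    """Keep each document independently with its perplexity-dependent probability."""
    rng = random.Random(seed)
    return [p for p in perplexities if rng.random() < keep_probability(p, **params)]


# Hypothetical quartile boundaries (Q1, median, Q3), for illustration only.
boundaries = (500.0, 650.0, 1000.0)
docs = [100.0, 550.0, 650.0, 900.0, 5000.0]
kept_stepwise = subsample(docs, stepwise_keep_probability,
                          boundaries=boundaries, factor=4.0, base_rate=0.1)
kept_gaussian = subsample(docs, gaussian_keep_probability,
                          boundaries=boundaries, factor=0.8, width=4.5)
```

In this sketch, tuning `factor` (and `width` for the Gaussian) changes the expected number of documents kept, which is how a target such as 50M samples could be approached.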

Figure 5 shows the actual perplexity distributions of the 50M subsets for each of the approximations. All subsets can be easily accessed for reproducibility purposes using the bertin-project/mc4-es-sampled dataset. Since the validation set was too small to extract 10% (5M) of the samples using perplexity sampling with the same factor and width, in our experiments we decided to sample from the training sets. In the bertin-project/mc4-es-sampled dataset, the validation set pulls its samples from the original mc4.

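from datasets import load_dataset

# Each sampling strategy is exposed as a dataset configuration.
for config in ("random", "stepwise", "gaussian"):
    mc4es = load_dataset(
        "bertin-project/mc4-es-sampled",
        config,
        split="train",
        streaming=True,  # iterate without downloading the whole subset
    ).shuffle(buffer_size=1000)
    # Print the first shuffled sample of each subset.
    for sample in mc4es:
        print(config, sample)
        break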

Figure 5. Real perplexity distributions of the sampled `mc4-es` after applying `gaussian` and `stepwise` functions.

Random sampling reproduced the perplexity distribution of the underlying data, as can be seen in Figure 6.

Figure 6. Real perplexity distributions of the sampled `mc4-es` after applying `random` sampling.

We then used the same setup as in Liu et al. (2019) but trained for only half the steps (250k) on a sequence length of 128. Then, we continued training the most promising model for 25k more steps on a sequence length of 512.

Results

Our first test, tagged beta in this repository, refers to an initial experiment using stepwise sampling on a sequence length of 128, but with a small factor to oversample everything.

During the community event, the Barcelona Supercomputing Center (BSC), in association with the National Library of Spain, released RoBERTa base and large models trained on 200M documents (570GB) of high quality data, cleaned using 100 nodes with 48 CPU cores of MareNostrum 4 for 96 hours. At the end of the process they were left with 2TB of clean data at the document level, which was further cleaned down to the final 570GB. In all our experiments and procedures, we had access to 3xTPUv3-8 for 10 days to do cleaning, sampling, training, and evaluation.

The BSC team evaluated our early release of the model (beta), and the results can be seen in Table 1. We are now waiting for the evaluation of the rest of our experiments to finish. The final models were trained for different numbers of steps and sequence lengths and achieve different masked word prediction accuracies. Some of the datasets used for evaluation are not freely available, so we are not in a position to verify the figures.

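| Dataset     | Metric   | RoBERTa-b | RoBERTa-l | BETO   | mBERT  | BERTIN |
|-------------|----------|-----------|-----------|--------|--------|--------|
| UD-POS      | F1       | 0.9907    | 0.9901    | 0.9900 | 0.9886 | 0.9904 |
| Conll-NER   | F1       | 0.8851    | 0.8772    | 0.8759 | 0.8691 | 0.8627 |
| Capitel-POS | F1       | 0.9846    | 0.9851    | 0.9836 | 0.9839 | 0.9826 |
| Capitel-NER | F1       | 0.8959    | 0.8998    | 0.8771 | 0.8810 | 0.8741 |
| STS         | Combined | 0.8423    | 0.8420    | 0.8216 | 0.8249 | 0.7822 |
| MLDoc       | Accuracy | 0.9595    | 0.9600    | 0.9650 | 0.9560 | 0.9673 |
| PAWS-X      | F1       | 0.9035    | 0.9000    | 0.8915 | 0.9020 | 0.8820 |
| XNLI        | Accuracy | 0.8016    | WiP       | 0.8130 | 0.7876 | WiP    |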
Table 1. Evaluation made by the Barcelona Supercomputing Center of their models and BERTIN (beta).
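
To read Table 1 at a glance, the scores can be loaded into a small script that reports the best model per task and how far BERTIN (beta) is from it. This is just a convenience sketch over the figures copied from the table. The helper names `best_model` and `gap_to_best` are ours, and XNLI is left out because BERTIN's result is still marked WiP.

```python
# Scores copied from Table 1 (XNLI omitted: BERTIN's result is WiP).
MODELS = ("RoBERTa-b", "RoBERTa-l", "BETO", "mBERT", "BERTIN")
SCORES = {
    "UD-POS": (0.9907, 0.9901, 0.9900, 0.9886, 0.9904),
    "Conll-NER": (0.8851, 0.8772, 0.8759, 0.8691, 0.8627),
    "Capitel-POS": (0.9846, 0.9851, 0.9836, 0.9839, 0.9826),
    "Capitel-NER": (0.8959, 0.8998, 0.8771, 0.8810, 0.8741),
    "STS": (0.8423, 0.8420, 0.8216, 0.8249, 0.7822),
    "MLDoc": (0.9595, 0.9600, 0.9650, 0.9560, 0.9673),
    "PAWS-X": (0.9035, 0.9000, 0.8915, 0.9020, 0.8820),
}


def best_model(task):
    """Name of the model with the highest score on `task`."""
    row = dict(zip(MODELS, SCORES[task]))
    return max(row, key=row.get)


def gap_to_best(task, model="BERTIN"):
    """Difference between the best score on `task` and `model`'s score."""
    row = dict(zip(MODELS, SCORES[task]))
    return round(max(row.values()) - row[model], 4)


for task in SCORES:
    print(f"{task:12s} best={best_model(task):10s} BERTIN gap={gap_to_best(task):.4f}")
```

Since the figures themselves could not be independently verified, the output is only as reliable as the table.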

Conclusions

With roughly 10 days of access to TPUs, we have achieved remarkable results, surpassing the previous state of the art in a few tasks and even improving document classification (MLDoc in Table 1) over models trained on massive supercomputers with humongous private and highly curated datasets.

The experience has been incredible, and we feel this kind of event provides an amazing opportunity for small teams on low or non-existent budgets to learn how the big players in the field pre-train their models. The trade-off between learning and experimenting, and being beta-testers of libraries (Flax/JAX) and infrastructure (TPU VMs), is a small price to pay compared to the benefits of access.

We hope our work sets the basis for more small teams to play and experiment with training language models on small subsets of data and for shorter times, since the performance of our models is on par with those trained on big machines for long times.

Team members

Useful links

References

  • Wenzek, G., Lachaux, M.-A., Conneau, A., Chaudhary, V., Guzmán, F., Joulin, A., & Grave, E. (2020). CCNet: Extracting High Quality Monolingual Datasets from Web Crawl Data. In Proceedings of the 12th Language Resources and Evaluation Conference (LREC), pp. 4003-4012.

  • Heafield, K. (2011). KenLM: faster and smaller language model queries. In Proceedings of the EMNLP2011 Sixth Workshop on Statistical Machine Translation.