metadata
language: es
license: CC-BY 4.0
tags:
  - spanish
  - roberta
pipeline_tag: fill-mask
widget:
  - text: Fui a la librería a comprar un <mask>.
  • Version 1 (beta): July 15th, 2021
  • Version 1: July 19th, 2021

BERTIN

BERTIN is a series of BERT-based models for Spanish. The current model hub points to the best of all RoBERTa-base models trained from scratch on the Spanish portion of mC4 using Flax. All code and scripts are included.

This project is part of the Flax/JAX Community Week, organized by HuggingFace with TPU usage sponsored by Google Cloud.

The aim of this project was to pre-train a RoBERTa-base model from scratch for Spanish during the Flax/JAX Community Event, in which Google Cloud provided free TPUv3-8 to do the training using HuggingFace's Flax implementations of their library.

Spanish mC4

mC4 is a multilingual variant of C4, the Colossal, Cleaned version of Common Crawl's web crawl corpus. While C4 was used to train the T5 text-to-text Transformer models, mC4 comprises natural text in 101 languages drawn from the public Common Crawl web scrape and was used to train mT5, the multilingual version of T5.

The Spanish portion of mC4 (mc4-es) contains about 416 million samples and 235 billion words in approximately 1TB of uncompressed data.

$ zcat c4/multilingual/c4-es*.tfrecord*.json.gz | wc -l
416057992
$ zcat c4/multilingual/c4-es*.tfrecord-*.json.gz | jq -r '.text | split(" ") | length' | paste -s -d+ - | bc
235303687795
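
For readers without `jq` at hand, the same two counts can be sketched in Python. This is only an illustration: the function name `count_docs_and_words` is ours, and it assumes each line of the gzipped files is a JSON object with a `text` field, as the `jq` expression above implies.

```python
import gzip
import json


def count_docs_and_words(paths):
    """Return (documents, words) over gzipped JSON-lines files with a "text" field."""
    n_docs = 0
    n_words = 0
    for path in paths:
        with gzip.open(path, "rt", encoding="utf-8") as handle:
            for line in handle:
                if not line.strip():
                    continue  # skip blank lines, if any
                record = json.loads(line)
                n_docs += 1
                # Same rule as the jq expression: split on single spaces.
                n_words += len(record["text"].split(" "))
    return n_docs, n_words
```

It could be called with something like `count_docs_and_words(sorted(pathlib.Path("c4/multilingual").glob("c4-es*.json.gz")))`, although on the full corpus a single Python process will be far slower than the shell pipeline.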

Perplexity sampling

The large amount of text in mC4-es makes training a language model within the time constraints of the Flax/JAX Community Event by HuggingFace problematic. This motivated the exploration of sampling methods, with the goal of creating a subset of the dataset that allows well-performing training with roughly one eighth of the data (~50M samples) and in approximately half the training steps.

In order to efficiently build this subset of data, we decided to leverage a technique we call perplexity sampling and whose origin can be traced to the construction of CCNet (Wenzek et al., 2020) and their work extracting high quality monolingual datasets from web-crawl data. In their work, they suggest the possibility of applying fast language-models trained on high-quality data such as Wikipedia to filter out texts that deviate too much from correct expressions of a language (see Figure 1). They also released Kneser-Ney models for 100 languages (Spanish included) as implemented in the KenLM library (Heafield, 2011) and trained on their respective Wikipedias.

Figure 1. Perplexity distributions by percentage of the CCNet corpus.

In this work, we tested the hypothesis that perplexity sampling might help reduce training-data size and training times.

Methodology

In order to test our hypothesis, we first calculated the perplexity of each document in a random subset (roughly a quarter of the data) of mC4-es and extracted their distribution and quartiles (see Figure 2).

Figure 2. Perplexity distributions and quartiles (red lines) of 100M samples of mc4-es.
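
As a rough sketch of this step, the snippet below turns log10 probabilities (the scale on which KenLM reports scores) into a document-level perplexity and then extracts quartile boundaries with NumPy. The function names and the way scores are aggregated per document are assumptions made for illustration, not necessarily the exact procedure used in the project.

```python
import numpy as np


def document_perplexity(log10_scores, token_counts):
    """Perplexity of a document from per-line log10 probabilities and token counts."""
    total_log10 = float(sum(log10_scores))
    total_tokens = int(sum(token_counts))
    if total_tokens == 0:
        raise ValueError("document has no tokens")
    # Perplexity is the inverse probability normalized by the number of tokens.
    return 10.0 ** (-total_log10 / total_tokens)


def perplexity_quartiles(perplexities):
    """Return the Q1, Q2 (median), and Q3 boundaries of a perplexity sample."""
    return np.percentile(np.asarray(perplexities, dtype=float), [25, 50, 75])
```

For example, a document whose lines sum to a log10 probability of -2.0 over 2 tokens has perplexity 10, and the quartile boundaries of `[1, 2, 3, 4, 5]` are 2, 3, and 4.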

With the extracted perplexity percentiles, we created two functions to oversample the central quartiles, with the idea of biasing against samples whose perplexity is either too low (short, repetitive texts) or too high (potentially poor quality) (see Figure 3).

The first function is a Stepwise function that simply oversamples the central quartiles, using quartile boundaries and a factor for the desired sampling frequency of each quartile, giving larger frequencies to the middle quartiles (oversampling Q2 and Q3, subsampling Q1 and Q4). The second function is a Gaussian approximation of the Stepwise function that smooths out the sharp boundaries and gives a better approximation of the underlying distribution (see Figure 4).

We adjusted the factor parameter of the Stepwise function, and the factor and width parameters of the Gaussian function, so that we could draw roughly 50M samples from the 416M in mc4-es (see Figure 4). For comparison, we also sampled mC4-es randomly up to 50M samples. In terms of size, we went down from 1TB of data to ~200GB.
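
The arithmetic behind this tuning can be sketched as follows. The document count comes from the figures reported above; the per-quartile keep probabilities in the usage note are hypothetical values chosen only to show how a setting can be checked against the 50M target.

```python
TOTAL_DOCS = 416_057_992   # document count reported for mC4-es
TARGET_DOCS = 50_000_000   # approximate size of each sampled subset


def target_keep_rate(total=TOTAL_DOCS, target=TARGET_DOCS):
    """Fraction of documents that must be kept on average to hit the target."""
    return target / total


def expected_stepwise_size(keep_probs, total=TOTAL_DOCS):
    """Expected subset size when each quartile (25% of docs) has its own keep probability."""
    if len(keep_probs) != 4:
        raise ValueError("expected one keep probability per quartile")
    # Each quartile holds a quarter of the documents by definition.
    return total * sum(keep_probs) / 4
```

The target keep rate is about 12%. A hypothetical setting such as `(0.04, 0.20, 0.20, 0.04)` averages to 0.12 and would therefore be expected to yield just under 50M documents.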

Figure 3. Expected perplexity distributions of the sample `mc4-es` after applying the `Stepwise` function.
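
A minimal sketch of a stepwise sampler is shown below. It assumes the function reduces to one keep probability per quartile, which is a simplification of the factor-based parameterization described above; the names `stepwise_keep_probability` and `stepwise_sample` are illustrative only.

```python
import random


def stepwise_keep_probability(perplexity, boundaries, keep_probs):
    """Keep probability for a document given quartile boundaries (Q1, Q2, Q3)."""
    q1, q2, q3 = boundaries
    if perplexity <= q1:
        return keep_probs[0]  # first quartile: subsampled
    if perplexity <= q2:
        return keep_probs[1]  # second quartile: oversampled
    if perplexity <= q3:
        return keep_probs[2]  # third quartile: oversampled
    return keep_probs[3]      # fourth quartile: subsampled


def stepwise_sample(perplexities, boundaries, keep_probs, seed=0):
    """Return the indices of the documents kept by the stepwise rule."""
    rng = random.Random(seed)
    return [
        i
        for i, ppl in enumerate(perplexities)
        if rng.random() < stepwise_keep_probability(ppl, boundaries, keep_probs)
    ]
```

Each document is kept or dropped independently, so the sampler can run in a single streaming pass over the corpus.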

Figure 4. Expected perplexity distributions of the sample `mc4-es` after applying the `Gaussian` function.
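
One possible shape for the Gaussian function is sketched below. The exact functional form is an assumption: a bell curve centred on the median perplexity, with the distance measured relative to the median, scaled by `factor` and spread by `width`. The function names are illustrative.

```python
import math
import random


def gaussian_keep_probability(perplexity, median, factor, width):
    """Keep probability that peaks at the median perplexity and decays smoothly."""
    z = (perplexity - median) / median  # relative distance from the median
    return min(1.0, factor * math.exp(-(z ** 2) / width))


def gaussian_sample(perplexities, median, factor, width, seed=0):
    """Return the indices of the documents kept by the Gaussian rule."""
    rng = random.Random(seed)
    return [
        i
        for i, ppl in enumerate(perplexities)
        if rng.random() < gaussian_keep_probability(ppl, median, factor, width)
    ]
```

In this form `factor` sets the keep probability at the median and `width` controls how quickly it falls off towards the tails, so the two parameters can be tuned together to reach a target subset size.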

Figure 5 shows the perplexity distributions of the 50M subsets for each of the approximations. All subsets can be easily accessed for reproducibility purposes using the bertin-project/mc4-es-sampled dataset. Since the validation set was too small to extract 10% (5M) of the samples using perplexity sampling with the same factor and width, in our experiments we decided to sample from the training sets. In the bertin-project/mc4-es-sampled dataset, the validation set pulls the samples from the original mc4.

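from datasets import load_dataset

# Each sampling method is a dataset configuration; "train" is the split.
for config in ("random", "stepwise", "gaussian"):
    mc4es = load_dataset(
        "bertin-project/mc4-es-sampled",
        config,
        split="train",
        streaming=True,
    ).shuffle(buffer_size=1000)
    # Print the first shuffled sample of each subset.
    for sample in mc4es:
        print(config, sample)
        break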

Figure 5. Experimental perplexity distributions of the sampled `mc4-es` after applying `Gaussian` and `Stepwise` functions.

Random sampling displayed the same perplexity distribution as the underlying true distribution, as can be seen in Figure 6.

Figure 6. Experimental perplexity distribution of the sampled `mc4-es` after applying `Random` sampling.

We then used the same setup as Liu et al. (2019) but trained only for half the steps (250k) on a sequence length of 128. In particular, Gaussian trained for the 250k steps, while Random was stopped at 230k and Stepwise at 180k (this was a decision based on an analysis of training performance and the computational resources available at the time).

Then, we continued training the most promising model for a few more steps (~25k) on sequence length 512. We tried two strategies for this, since it is not easy to find clear details about this change in the literature. It turns out this decision had a big impact on the final performance.

For Random sampling we trained with sequence length 512 during the last 20k steps of the 250k training steps, keeping the optimizer state intact. The results are underwhelming, as seen in Figure 7:

Figure 7. Training profile for Random sampling. Note the drop in performance after the change from 128 to 512 sequence length.

For Gaussian sampling we started a new optimizer after 230k steps with sequence length 128, using a short warmup interval. Results are much better using this procedure. We do not have a graph since training needed to be restarted several times. However, final accuracy was 0.6873, compared to 0.5907 for Random (512), a difference much larger than that of their respective 128-length models (0.6520 for Random, 0.6608 for Gaussian).
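
The learning-rate side of such a restart can be sketched in plain Python. The peak learning rate and the warmup length below are hypothetical placeholders, not the values used in training, and the sketch does not model the other effect of a fresh optimizer, namely that its moment estimates start from zero.

```python
def warmup_linear_decay(step, peak_lr, warmup_steps, total_steps):
    """Linear warmup to peak_lr, then linear decay to zero at total_steps."""
    if not 0 < warmup_steps < total_steps:
        raise ValueError("need 0 < warmup_steps < total_steps")
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / (total_steps - warmup_steps)


def restarted_schedule(global_step, restart_at=230_000, total_steps=250_000,
                       peak_lr=6e-4, short_warmup=500):
    """Learning rate for the 512-length phase, counted from the restart point."""
    if global_step < restart_at:
        raise ValueError("this schedule only covers steps after the restart")
    # The new optimizer sees its own step counter, starting again at zero.
    local_step = global_step - restart_at
    return warmup_linear_decay(local_step, peak_lr, short_warmup,
                               total_steps - restart_at)
```

With these placeholder values the learning rate starts at zero at step 230k, reaches its peak 500 steps later, and decays back to zero at step 250k.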

Results

Our first test, tagged beta in this repository, refers to an initial experiment using Stepwise on 128 sequence length and trained for 210k steps. Two nearly identical versions of this model can be found, one at bertin-roberta-base-spanish and the other at flax-community/bertin-roberta-large-spanish (do note this is not our best model!).

During the community event, the Barcelona Supercomputing Center (BSC), in association with the National Library of Spain, released RoBERTa base and large models trained on 200M documents (570GB) of high-quality data, cleaned using 100 nodes with 48 CPU cores of MareNostrum 4 for 96h. At the end of the process they were left with 2TB of clean data at the document level, which was further cleaned down to the final 570GB. This is an interesting contrast to our own resources (3xTPUv3-8 for 10 days to do cleaning, sampling, training, and evaluation) and makes for a valuable reference. The BSC team evaluated our early release of the beta model, and the results can be seen in Table 1.

Our final models were trained for a different number of steps and on different sequence lengths, and they achieve different (higher) masked-word prediction accuracies. Despite these limitations, it is interesting to see the results BSC obtained using the early version of our model. Note that some of the datasets used for evaluation by BSC are not freely available, so it is not possible to verify the figures.

| Dataset | Metric | RoBERTa-b | RoBERTa-l | BETO | mBERT | BERTIN |
|---|---|---|---|---|---|---|
| UD-POS | F1 | 0.9907 | 0.9901 | 0.9900 | 0.9886 | 0.9904 |
| Conll-NER | F1 | 0.8851 | 0.8772 | 0.8759 | 0.8691 | 0.8627 |
| Capitel-POS | F1 | 0.9846 | 0.9851 | 0.9836 | 0.9839 | 0.9826 |
| Capitel-NER | F1 | 0.8959 | 0.8998 | 0.8771 | 0.8810 | 0.8741 |
| STS | Combined | 0.8423 | 0.8420 | 0.8216 | 0.8249 | 0.7822 |
| MLDoc | Accuracy | 0.9595 | 0.9600 | 0.9650 | 0.9560 | 0.9673 |
| PAWS-X | F1 | 0.9035 | 0.9000 | 0.8915 | 0.9020 | 0.8820 |
| XNLI | Accuracy | 0.8016 | WiP | 0.8130 | 0.7876 | WiP |
Table 1. Evaluation made by the Barcelona Supercomputing Center of their models and BERTIN (beta, seq len 128).

All of our models attained masked-word prediction accuracies in the vicinity of 0.65, ranging from 0.5907 to 0.6873, as can be seen in Table 2:

| Model | Accuracy |
|---|---|
| bertin-project/bertin-roberta-base-spanish | 0.6547 |
| bertin-project/bertin-base-random | 0.6520 |
| bertin-project/bertin-base-stepwise | 0.6487 |
| bertin-project/bertin-base-gaussian | 0.6608 |
| bertin-project/bertin-base-random-exp-512seqlen | 0.5907 |
| bertin-project/bertin-base-gaussian-exp-512seqlen | 0.6873 |
Table 2. Accuracy for the different language models.
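
For reference, masked-word accuracy of this kind is typically computed only over the positions that were masked. The sketch below assumes the common convention of marking unmasked positions with the label `-100`; the function name is illustrative and this is not necessarily the exact evaluation code behind Table 2.

```python
import numpy as np

IGNORE_INDEX = -100  # label value assumed for positions that were not masked


def masked_token_accuracy(logits, labels, ignore_index=IGNORE_INDEX):
    """Fraction of masked positions whose top prediction equals the original token."""
    logits = np.asarray(logits)
    labels = np.asarray(labels)
    predictions = logits.argmax(axis=-1)  # most likely token id per position
    mask = labels != ignore_index         # score masked positions only
    if not mask.any():
        raise ValueError("no masked positions to score")
    return float((predictions[mask] == labels[mask]).mean())
```

Here `logits` has shape `(batch, sequence, vocabulary)` and `labels` has shape `(batch, sequence)`.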

We are currently in the process of applying our language models to downstream tasks.

SQUAD-es

Using sequence length 128 we have achieved exact match 50.96 and F1 68.74.

POS

All models trained with max length 512 and batch size 8, using the CoNLL 2002 dataset.

| Model | F1 | Accuracy |
|---|---|---|
| bert-base-multilingual-cased | 0.9629 | 0.9687 |
| dccuchile/bert-base-spanish-wwm-cased | 0.9642 | 0.9700 |
| BSC-TeMU/roberta-base-bne | 0.9659 | 0.9707 |
| bertin-project/bertin-roberta-base-spanish | 0.9638 | 0.9690 |
| bertin-project/bertin-base-random | 0.9656 | 0.9704 |
| bertin-project/bertin-base-stepwise | 0.9656 | 0.9707 |
| bertin-project/bertin-base-gaussian | 0.9662 | 0.9709 |
| bertin-project/bertin-base-random-exp-512seqlen | 0.9660 | 0.9707 |
| bertin-project/bertin-base-gaussian-exp-512seqlen | 0.9662 | 0.9714 |
Table 3. Results for POS.

NER

All models trained with max length 512 and batch size 8, using the CoNLL 2002 dataset.

| Model | F1 | Accuracy |
|---|---|---|
| bert-base-multilingual-cased | 0.8539 | 0.9779 |
| dccuchile/bert-base-spanish-wwm-cased | 0.8579 | 0.9783 |
| BSC-TeMU/roberta-base-bne | 0.8700 | 0.9807 |
| bertin-project/bertin-roberta-base-spanish | 0.8725 | 0.9812 |
| bertin-project/bertin-base-random | 0.8704 | 0.9807 |
| bertin-project/bertin-base-stepwise | 0.8705 | 0.9809 |
| bertin-project/bertin-base-gaussian | 0.8792 | 0.9816 |
| bertin-project/bertin-base-random-exp-512seqlen | 0.8616 | 0.9803 |
| bertin-project/bertin-base-gaussian-exp-512seqlen | 0.8764 | 0.9819 |
Table 4. Results for NER.
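
The gap between F1 and accuracy in Table 4 is easier to read with the two metrics side by side. The sketch below assumes that F1 is computed at the entity level over BIO tags and accuracy at the token level, which is the usual setup for this task but is not stated above. As a simplification, an `I-` tag that does not continue a matching entity is ignored.

```python
def extract_spans(tags):
    """Return the set of (start, end, type) entity spans in a BIO tag sequence."""
    spans = set()
    start, kind = None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes a trailing span
        inside = tag.startswith("I-") and kind == tag[2:]
        if start is not None and not inside:
            spans.add((start, i, kind))
            start, kind = None, None
        if tag.startswith("B-"):
            start, kind = i, tag[2:]
    return spans


def entity_f1(gold_sequences, predicted_sequences):
    """Micro-averaged F1 over exactly matching entity spans."""
    tp = n_gold = n_pred = 0
    for gold, pred in zip(gold_sequences, predicted_sequences):
        gold_spans, pred_spans = extract_spans(gold), extract_spans(pred)
        tp += len(gold_spans & pred_spans)
        n_gold += len(gold_spans)
        n_pred += len(pred_spans)
    if tp == 0:
        return 0.0
    precision, recall = tp / n_pred, tp / n_gold
    return 2 * precision * recall / (precision + recall)


def token_accuracy(gold_sequences, predicted_sequences):
    """Fraction of individual tags predicted correctly."""
    pairs = [
        (g, p)
        for gold, pred in zip(gold_sequences, predicted_sequences)
        for g, p in zip(gold, pred)
    ]
    return sum(g == p for g, p in pairs) / len(pairs)
```

With gold tags `B-PER I-PER O O B-LOC` and predicted tags `B-PER O O O B-LOC`, four of the five tokens are correct (accuracy 0.8), but only one of the two entities matches exactly (F1 0.5). This is why token accuracy sits well above F1 when most tokens are `O`.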

PAWS-X

All models trained with max length 512 and batch size 8. Even though these models have been run several times, some of the reported values may be incorrect due to clerical errors (particularly the repeated 0.5765 values), so a new run is ongoing.

| Model | Accuracy |
|---|---|
| bert-base-multilingual-cased | 0.5765 |
| dccuchile/bert-base-spanish-wwm-cased | 0.5765 |
| BSC-TeMU/roberta-base-bne | 0.5765 |
| bertin-project/bertin-roberta-base-spanish | 0.6550 |
| bertin-project/bertin-base-random | 0.8665 |
| bertin-project/bertin-base-stepwise | 0.8610 |
| bertin-project/bertin-base-gaussian | 0.8800 |
| bertin-project/bertin-base-random-exp-512seqlen | 0.5765 |
| bertin-project/bertin-base-gaussian-exp-512seqlen | 0.875 |
Table 5. Results for PAWS-X.

CNLI

All models trained with max length 256 and batch size 16.

| Model | Accuracy |
|---|---|
| bert-base-multilingual-cased | WIP |
| dccuchile/bert-base-spanish-wwm-cased | WIP |
| BSC-TeMU/roberta-base-bne | WIP |
| bertin-project/bertin-roberta-base-spanish | WIP |
| bertin-project/bertin-base-random | 0.7745 |
| bertin-project/bertin-base-stepwise | 0.7820 |
| bertin-project/bertin-base-gaussian | 0.7942 |
| bertin-project/bertin-base-random-exp-512seqlen | 0.7723 |
| bertin-project/bertin-base-gaussian-exp-512seqlen | 0.7878 |
Table 6. Results for CNLI.

Conclusions

With roughly 10 days' worth of access to 3xTPUv3-8, we have achieved remarkable results, surpassing the previous state of the art in a few tasks and even improving document classification over models trained on massive supercomputers with very large, private, and highly curated datasets.

The experience has been incredible, and we feel that events of this kind provide an amazing opportunity for small teams on low or non-existent budgets to learn how the big players in the field pre-train their models. The trade-off between learning and experimenting on the one hand, and being beta-testers of libraries (Flax/JAX) and infrastructure (TPU VMs) on the other, is a marginal cost to pay compared to the benefits such access has to offer.

We hope our work will lay the groundwork for more small teams to experiment with training language models on small subsets of data and with reduced training times, since the performance of our models is on par with those trained on big machines for longer times.

Team members

Useful links

References

  • Wenzek, G., Lachaux, M.-A., Conneau, A., Chaudhary, V., Guzmán, F., Joulin, A., & Grave, E. (2020, May). CCNet: Extracting High Quality Monolingual Datasets from Web Crawl Data. In Proceedings of the 12th Language Resources and Evaluation Conference (LREC), pp. 4003-4012.

  • Heafield, K. (2011). KenLM: faster and smaller language model queries. In Proceedings of the EMNLP2011 Sixth Workshop on Statistical Machine Translation.