---
language:
  - es
thumbnail: url to a thumbnail used in social sharing
license: apache-2.0
datasets:
  - oscar
---

# SELECTRA: A Spanish ELECTRA

SELECTRA is a Spanish pre-trained language model based on ELECTRA. We release a small and a medium version with the following configurations:

| Model | Layers | Embedding/Hidden Size | Params | Vocab Size | Max Sequence Length | Cased |
| --- | --- | --- | --- | --- | --- | --- |
| SELECTRA small | 12 | 256 | 22M | 50k | 512 | True |
| SELECTRA medium | 12 | 384 | 41M | 50k | 512 | True |

SELECTRA small (medium) is about 5 (3) times smaller than BETO but achieves comparable results (see the Metrics section below).

## Usage

From the original ELECTRA model card: "ELECTRA models are trained to distinguish "real" input tokens vs "fake" input tokens generated by another neural network, similar to the discriminator of a GAN." The discriminator should therefore activate the logit corresponding to the fake input token, as the following example demonstrates:

```python
from transformers import ElectraForPreTraining, ElectraTokenizerFast

discriminator = ElectraForPreTraining.from_pretrained("Recognai/selectra_small")
tokenizer = ElectraTokenizerFast.from_pretrained("Recognai/selectra_small")

sentence_with_fake_token = "Estamos desayunando pan rosa con tomate y aceite de oliva."

inputs = tokenizer.encode(sentence_with_fake_token, return_tensors="pt")
logits = discriminator(inputs).logits.tolist()[0]

print("\t".join(tokenizer.tokenize(sentence_with_fake_token)))
print("\t".join(map(lambda x: str(x)[:4], logits[1:-1])))
"""Output:
Estamos desayun ##ando  pan     rosa    con     tomate  y       aceite  de      oliva   .
-3.1    -3.6    -6.9    -3.0    0.19    -4.5    -3.3    -5.1    -5.7    -7.7    -4.4    -4.2
"""
```

However, you will most likely want to fine-tune this model on a downstream task.

- Links to our zero-shot classifiers

## Metrics

We fine-tune our models on 4 different downstream tasks: POS tagging and NER on CoNLL2002, PAWS-X, and XNLI.

For each task, we conduct 5 trials and report the mean and standard deviation of the metrics in the table below. To compare our results to other Spanish language models, we provide the same metrics taken from Table 4 of the Bertin-project model card.

| Model | CoNLL2002 - POS (acc) | CoNLL2002 - NER (f1) | PAWS-X (acc) | XNLI (acc) | Params |
| --- | --- | --- | --- | --- | --- |
| SELECTRA small | 0.9653 ± 0.0007 | 0.863 ± 0.004 | 0.896 ± 0.002 | 0.784 ± 0.002 | 22M |
| SELECTRA medium | 0.9677 ± 0.0004 | 0.870 ± 0.003 | 0.896 ± 0.002 | 0.804 ± 0.002 | 41M |
| mBERT | 0.9689 | 0.8616 | 0.8895 | 0.7606 | 178M |
| BETO | 0.9693 | 0.8596 | 0.8720 | 0.8012 | 110M |
| BSC-BNE | 0.9706 | 0.8764 | 0.8815 | 0.7771 | 125M |
| Bertin | 0.9697 | 0.8707 | 0.8965 | 0.7843 | 125M |

Some details of our fine-tuning runs:

- epochs: 5
- batch size: 32
- learning rate: 1e-4
- warmup proportion: 0.1
- linear learning rate decay
- layerwise learning rate decay
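With layerwise learning rate decay, layers closer to the embeddings receive smaller learning rates than the top layers. As a minimal sketch of the idea (the decay factor of 0.8 is an illustrative assumption, not the value used in our runs):

```python
def layerwise_lrs(base_lr: float, num_layers: int, decay: float = 0.8) -> list:
    """Return one learning rate per transformer layer.

    Layer 0 is closest to the embeddings; the top layer (num_layers - 1)
    keeps the full base learning rate, and each layer below it is scaled
    down by `decay`.
    """
    return [base_lr * decay ** (num_layers - 1 - i) for i in range(num_layers)]

# With base_lr = 1e-4 and 12 layers, the top layer trains at 1e-4
# while the bottom layer trains at 1e-4 * 0.8**11.
lrs = layerwise_lrs(1e-4, num_layers=12)
```

In practice these per-layer rates would be passed to the optimizer as separate parameter groups.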

For all the details, check out our selectra repo.

## Training

We pre-trained our SELECTRA models on the Spanish portion of the Oscar dataset, which is about 150GB in size. Each model version is trained for 300k steps, with a warm restart of the learning rate after the first 150k steps. Some details of the training:

- steps: 300k
- batch size: 128
- learning rate: 5e-4
- warmup steps: 10k
- linear learning rate decay
- TPU cores: 8 (v2-8)
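The schedule above can be sketched as follows. This is an illustration only: it assumes the warm restart simply repeats the same warmup/decay pattern in each 150k-step cycle, which is a simplification of the actual training scripts.

```python
def lr_at_step(step: int, base_lr: float = 5e-4,
               warmup: int = 10_000, cycle: int = 150_000) -> float:
    """Linear warmup to base_lr, then linear decay to 0 within each cycle.

    The learning rate is warm-restarted every `cycle` steps (assumption
    on the exact restart shape).
    """
    s = step % cycle
    if s < warmup:
        return base_lr * s / warmup      # linear warmup
    return base_lr * (cycle - s) / (cycle - warmup)  # linear decay
```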

For all details, check out our selectra repo.

**Note:** Due to a misconfiguration in the pre-training scripts, the embeddings of vocabulary tokens containing accents were not optimized. If you fine-tune this model on a downstream task, you might consider using a tokenizer that does not strip the accents:

```python
tokenizer = ElectraTokenizerFast.from_pretrained("Recognai/selectra_small", strip_accents=False)
```

## Motivation

Despite the abundance of excellent Spanish language models (BETO, BSC-BNE, Bertin, ELECTRICIDAD, etc.), we felt there was still a lack of distilled or compact Spanish language models, as well as of systematic comparisons between them and their bigger siblings.

## Acknowledgment

This research was supported by the Google TPU Research Cloud (TRC) program.

## Authors