David committed on
Commit 5697249
1 Parent(s): 7e22797

Create README.md

Files changed (1): README.md ADDED (+113, -0)

---
language:
- es
thumbnail: "url to a thumbnail used in social sharing"
license: apache-2.0
datasets:
- oscar
---

# SELECTRA: A Spanish ELECTRA

SELECTRA is a Spanish pre-trained language model based on [ELECTRA](https://github.com/google-research/electra).
We release a `small` and a `medium` version with the following configurations:

| Model | Layers | Embedding/Hidden Size | Params | Vocab Size | Max Sequence Length | Cased |
| --- | --- | --- | --- | --- | --- | --- |
| SELECTRA small | 12 | 256 | 22M | 50k | 512 | True |
| **SELECTRA medium** | **12** | **384** | **41M** | **50k** | **512** | **True** |

SELECTRA small is about 5 times smaller than BETO and SELECTRA medium about 3 times smaller, while achieving comparable results (see the Metrics section below).

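As a quick sanity check of the parameter counts in the table, you can count them directly. The `Recognai/selectra_medium` checkpoint name is assumed here to mirror the naming of the small model used throughout this card:

```python
from transformers import ElectraModel

# "Recognai/selectra_medium" is an assumed name following the small model's naming scheme.
for name in ["Recognai/selectra_small", "Recognai/selectra_medium"]:
    model = ElectraModel.from_pretrained(name)
    n_params = sum(p.numel() for p in model.parameters())
    print(f"{name}: {n_params / 1e6:.0f}M parameters")
```
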
## Usage

From the original [ELECTRA model card](https://huggingface.co/google/electra-small-discriminator): "ELECTRA models are trained to distinguish 'real' input tokens vs 'fake' input tokens generated by another neural network, similar to the discriminator of a GAN."
The discriminator should therefore output a high logit for the fake input token, as the following example demonstrates:

```python
from transformers import ElectraForPreTraining, ElectraTokenizerFast

discriminator = ElectraForPreTraining.from_pretrained("Recognai/selectra_small")
tokenizer = ElectraTokenizerFast.from_pretrained("Recognai/selectra_small")

# "rosa" is the fake token in this sentence; the discriminator should single it out.
sentence_with_fake_token = "Estamos desayunando pan rosa con tomate y aceite de oliva."

inputs = tokenizer.encode(sentence_with_fake_token, return_tensors="pt")
logits = discriminator(inputs).logits.tolist()[0]

# Print each token together with its (truncated) discriminator logit.
print("\t".join(tokenizer.tokenize(sentence_with_fake_token)))
print("\t".join(map(lambda x: str(x)[:4], logits[1:-1])))
"""Output:
Estamos  desayun  ##ando  pan   rosa  con   tomate  y     aceite  de    oliva  .
-3.1     -3.6     -6.9    -3.0  0.19  -4.5  -3.3    -5.1  -5.7    -7.7  -4.4   -4.2
"""
```

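Continuing the example above, you can turn the per-token logits into an explicit prediction of which tokens are fake by thresholding them at 0 (equivalently, a sigmoid probability above 0.5). This small addition is not part of the original snippet:

```python
# Tokens with a positive logit are the ones the discriminator flags as "fake".
scores = discriminator(inputs).logits[0][1:-1]   # drop the [CLS] and [SEP] positions
tokens = tokenizer.tokenize(sentence_with_fake_token)
print([token for token, score in zip(tokens, scores) if score > 0])
# Given the logits shown above, this should print ['rosa'].
```
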
However, you will probably want to fine-tune this model on a downstream task (a minimal sketch of what that could look like follows below).

- Links to our zero-shot-classifiers

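As an illustration of that workflow (not our exact training script; the hyperparameters we actually used are listed in the Metrics section), a minimal fine-tuning sketch with the Hugging Face `Trainer` on the Spanish XNLI task could look like this:

```python
from datasets import load_dataset
from transformers import (
    ElectraForSequenceClassification,
    ElectraTokenizerFast,
    Trainer,
    TrainingArguments,
)

# Spanish XNLI as an example downstream task (3 labels: entailment/neutral/contradiction).
dataset = load_dataset("xnli", "es")
tokenizer = ElectraTokenizerFast.from_pretrained("Recognai/selectra_small")
model = ElectraForSequenceClassification.from_pretrained("Recognai/selectra_small", num_labels=3)

def tokenize(batch):
    return tokenizer(batch["premise"], batch["hypothesis"], truncation=True, max_length=512)

dataset = dataset.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="selectra_small-xnli",  # placeholder output directory
        num_train_epochs=5,
        per_device_train_batch_size=32,
    ),
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    tokenizer=tokenizer,  # enables dynamic padding via the default data collator
)
trainer.train()
```
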
## Metrics

We fine-tune our models on 4 different downstream tasks:

- [XNLI](https://huggingface.co/datasets/xnli)
- [PAWS-X](https://huggingface.co/datasets/paws-x)
- [CoNLL2002 - POS](https://huggingface.co/datasets/conll2002)
- [CoNLL2002 - NER](https://huggingface.co/datasets/conll2002)

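All four tasks are available as datasets on the Hub; for reference, they can be loaded as follows (CoNLL2002 POS and NER use different label columns of the same dataset):

```python
from datasets import load_dataset

xnli = load_dataset("xnli", "es")
pawsx = load_dataset("paws-x", "es")
conll = load_dataset("conll2002", "es")  # provides both pos_tags and ner_tags columns
```
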
For each task, we conduct 5 trials and report the mean and standard deviation of the metrics in the table below.
To compare our results to other Spanish language models, we provide the same metrics taken from [Table 4](https://huggingface.co/bertin-project/bertin-roberta-base-spanish#results) of the Bertin-project model card.

| Model | CoNLL2002 - POS (acc) | CoNLL2002 - NER (f1) | PAWS-X (acc) | XNLI (acc) | Params |
| --- | --- | --- | --- | --- | --- |
| SELECTRA small | 0.9653 ± 0.0007 | 0.863 ± 0.004 | 0.896 ± 0.002 | 0.784 ± 0.002 | **22M** |
| SELECTRA medium | 0.9677 ± 0.0004 | 0.870 ± 0.003 | 0.896 ± 0.002 | **0.804 ± 0.002** | 41M |
| [mBERT](https://huggingface.co/bert-base-multilingual-cased) | 0.9689 | 0.8616 | 0.8895 | 0.7606 | 178M |
| [BETO](https://huggingface.co/dccuchile/bert-base-spanish-wwm-cased) | 0.9693 | 0.8596 | 0.8720 | 0.8012 | 110M |
| [BSC-BNE](https://huggingface.co/BSC-TeMU/roberta-base-bne) | **0.9706** | **0.8764** | 0.8815 | 0.7771 | 125M |
| [Bertin](https://huggingface.co/bertin-project/bertin-roberta-base-spanish/tree/v1-512) | 0.9697 | 0.8707 | **0.8965** | 0.7843 | 125M |

Some details of our fine-tuning runs (a sketch of the corresponding optimizer setup follows the list):
- epochs: 5
- batch size: 32
- learning rate: 1e-4
- warmup proportion: 0.1
- linear learning rate decay
- layerwise learning rate decay

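Layerwise learning rate decay is not exposed directly by `TrainingArguments`, so below is a hedged sketch of one common way to set it up via optimizer parameter groups. The decay factor of 0.9 and the weight decay value are assumptions for illustration, not values taken from our scripts:

```python
import torch
from transformers import ElectraForSequenceClassification

model = ElectraForSequenceClassification.from_pretrained("Recognai/selectra_small", num_labels=3)

def layerwise_param_groups(model, base_lr=1e-4, decay=0.9):
    """Shrink the learning rate by a constant factor for every layer further from the output.

    For simplicity this only groups the classifier head, the encoder layers, and the
    embeddings; decay=0.9 is an assumed factor, not the value used in our runs.
    """
    groups = [{"params": model.classifier.parameters(), "lr": base_lr}]
    lr = base_lr
    for layer in reversed(list(model.electra.encoder.layer)):  # top encoder layer first
        lr *= decay
        groups.append({"params": layer.parameters(), "lr": lr})
    groups.append({"params": model.electra.embeddings.parameters(), "lr": lr * decay})
    return groups

# weight_decay=0.01 is an assumption for illustration.
optimizer = torch.optim.AdamW(layerwise_param_groups(model), weight_decay=0.01)
```

The resulting optimizer can then be passed to the `Trainer` through its `optimizers` argument, together with a linear schedule that uses 10% warmup.
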
For all the details, check out our [selectra repo](https://github.com/recognai/selectra).

## Training

We pre-trained our SELECTRA models on the Spanish portion of the [Oscar](https://huggingface.co/datasets/oscar) dataset, which is about 150GB in size.
Each model version is trained for 300k steps, with a warm restart of the learning rate after the first 150k steps (a sketch of the resulting schedule follows the list below).
Some details of the training:
- steps: 300k
- batch size: 128
- learning rate: 5e-4
- warmup steps: 10k
- linear learning rate decay
- TPU cores: 8 (v2-8)

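For illustration, one plausible reading of this schedule (linear warmup, linear decay, restarted at step 150k) can be written as a simple function of the step number; this is our interpretation of the bullet points above, not code taken from the training scripts:

```python
def pretraining_lr(step, peak_lr=5e-4, warmup_steps=10_000, cycle_steps=150_000):
    """Linear warmup followed by linear decay, restarted after each 150k-step cycle (assumed reading)."""
    s = step % cycle_steps  # the warm restart resets the schedule at step 150k
    if s < warmup_steps:
        return peak_lr * s / warmup_steps
    return peak_lr * (cycle_steps - s) / (cycle_steps - warmup_steps)
```
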
For all details, check out our [selectra repo](https://github.com/recognai/selectra).

**Note:** Due to a misconfiguration in the pre-training scripts, the embeddings of vocabulary tokens containing an accent were not optimized. If you fine-tune this model on a downstream task, you might consider using a tokenizer that does not strip the accents:

```python
from transformers import ElectraTokenizerFast

tokenizer = ElectraTokenizerFast.from_pretrained("Recognai/selectra_small", strip_accents=False)
```

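To see what difference this makes on your own data, you can compare how the default tokenizer configuration and the one above split accented words. The exact output depends on the vocabulary and the stored tokenizer settings, so treat this purely as a way to inspect the behaviour:

```python
# Continues the snippet above: "tokenizer" is the strip_accents=False tokenizer.
default_tokenizer = ElectraTokenizerFast.from_pretrained("Recognai/selectra_small")

word = "también"
print(default_tokenizer.tokenize(word))  # default configuration (which, per the note above, may strip accents)
print(tokenizer.tokenize(word))          # accents kept (strip_accents=False)
```
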
## Motivation

Despite the abundance of excellent Spanish language models (BETO, BSC-BNE, Bertin, ELECTRICIDAD, etc.), we felt there was still a lack of distilled or compact Spanish language models, and a lack of systematic comparisons between them and their larger siblings.

## Acknowledgment

This research was supported by the Google TPU Research Cloud (TRC) program.

## Authors

- David Fidalgo ([GitHub](https://github.com/dcfidalgo))
- Javier Lopez ([GitHub](https://github.com/javispp))
- Daniel Vila ([GitHub](https://github.com/dvsrepo))
- Francisco Aranda ([GitHub](https://github.com/frascuchon))