David
committed on
Commit • bc7c5f6
Parent(s): 6864ed8
Update README.md

README.md CHANGED
@@ -2,15 +2,9 @@
 language:
 - es
 thumbnail: "url to a thumbnail used in social sharing"
-tags:
-- tag1
-- tag2
 license: apache-2.0
 datasets:
 - oscar
-metrics:
-- metric1
-- metric2
 ---
 
 # SELECTRA: A Spanish ELECTRA
@@ -27,7 +21,8 @@ Selectra small is about 5 times smaller than BETO but achieves comparable result
 
 ## Usage
 
-
+From the original [ELECTRA model card](https://huggingface.co/google/electra-small-discriminator): "ELECTRA models are trained to distinguish "real" input tokens vs "fake" input tokens generated by another neural network, similar to the discriminator of a GAN."
+The discriminator should therefore activate the logit corresponding to the fake input token, as the following example demonstrates:
 
 ```python
 from transformers import ElectraForPreTraining, ElectraTokenizerFast
@@ -48,6 +43,8 @@ Estamos desayun ##ando pan rosa con tomate y aceite de
 """
 ```
 
+However, you will probably want to fine-tune this model on a down-stream task.
+
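The body of the usage example is elided by the diff context above. As an illustrative sketch only (the tokens and logit values below are invented for demonstration and are not the model's actual output), the discriminator's per-token logits are typically mapped to real/fake decisions with a sigmoid:

```python
import math

def detect_fake_tokens(tokens, logits, threshold=0.5):
    """Label a token as fake when sigmoid(logit) exceeds the threshold."""
    results = []
    for token, logit in zip(tokens, logits):
        prob_fake = 1.0 / (1.0 + math.exp(-logit))  # sigmoid of the discriminator logit
        results.append((token, prob_fake > threshold))
    return results

# Invented logits: a large positive logit flags "rosa" as the fake token.
tokens = ["Estamos", "desayun", "##ando", "pan", "rosa", "con", "tomate"]
logits = [-4.1, -3.8, -3.5, -2.9, 5.2, -4.0, -3.7]
print(detect_fake_tokens(tokens, logits))
```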
 - Links to our zero-shot-classifiers
 
 ## Metrics
@@ -59,31 +56,59 @@ We fine-tune our models on 4 different down-stream tasks:
 - [CoNLL2002 - POS](https://huggingface.co/datasets/conll2002)
 - [CoNLL2002 - NER](https://huggingface.co/datasets/conll2002)
 
-
-
-The metrics
+For each task, we conduct 5 trials and state the mean and standard deviation of the metrics in the table below.
 
+To compare our results to other Spanish language models, we provide the same metrics taken from [Table 4](https://huggingface.co/bertin-project/bertin-roberta-base-spanish#results) of the Bertin-project model card.
 
 | Model | CoNLL2002 - POS (acc) | CoNLL2002 - NER (f1) | PAWS-X (acc) | XNLI (acc) | Params |
 | --- | --- | --- | --- | --- | --- |
-| SELECTRA small | 0.9653 +- 0.0007 | 0.863 +- 0.004 | 0.896 +- 0.002 | 0.784 +- 0.002 | 22M |
-| SELECTRA medium | 0.9677 +- 0.0004 | 0.870 +- 0.003 | 0.896 +- 0.002 | 0.804 +- 0.002 | 41M |
+| SELECTRA small | 0.9653 +- 0.0007 | 0.863 +- 0.004 | 0.896 +- 0.002 | 0.784 +- 0.002 | **22M** |
+| SELECTRA medium | 0.9677 +- 0.0004 | 0.870 +- 0.003 | 0.896 +- 0.002 | **0.804 +- 0.002** | 41M |
 | [mBERT](https://huggingface.co/bert-base-multilingual-cased) | 0.9689 | 0.8616 | 0.8895 | 0.7606 | 178M |
 | [BETO](https://huggingface.co/dccuchile/bert-base-spanish-wwm-cased) | 0.9693 | 0.8596 | 0.8720 | 0.8012 | 110M |
-| [BSC-BNE](https://huggingface.co/BSC-TeMU/roberta-base-bne) | 0.9706 | 0.8764 | 0.8815 | 0.7771 | 125M |
-| [Bertin](https://huggingface.co/bertin-project/bertin-roberta-base-spanish/tree/v1-512) | 0.9697 | 0.8707 | 0.8965 | 0.7843 | 125M |
+| [BSC-BNE](https://huggingface.co/BSC-TeMU/roberta-base-bne) | **0.9706** | **0.8764** | 0.8815 | 0.7771 | 125M |
+| [Bertin](https://huggingface.co/bertin-project/bertin-roberta-base-spanish/tree/v1-512) | 0.9697 | 0.8707 | **0.8965** | 0.7843 | 125M |
 
+Some details of our fine-tuning runs:
+- epochs: 5
+- batch-size: 32
+- learning rate: 1e-4
+- warmup proportion: 0.1
+- linear learning rate decay
+- layerwise learning rate decay
+
+For all the details, check out our [selectra repo](https://github.com/recognai/selectra).
 
 ## Training
 
-
+We pre-trained our SELECTRA models on the Spanish portion of the [Oscar](https://huggingface.co/datasets/oscar) dataset, which is about 150GB in size.
+Each model version is trained for 300k steps, with a warm restart of the learning rate after the first 150k steps.
+Some details of the training:
+- steps: 300k
+- batch-size: 128
+- learning rate: 5e-4
+- warmup steps: 10k
+- linear learning rate decay
+- TPU cores: 8 (v2-8)
+
+For all details, check out our [selectra repo](https://github.com/recognai/selectra).
+
+**Note:** Due to a misconfiguration in the pre-training scripts, the embeddings of vocabulary items containing an accent were not optimized. If you fine-tune this model on a down-stream task, you might consider using a tokenizer that does not strip the accents:
+```python
+tokenizer = ElectraTokenizerFast.from_pretrained("Recognai/selectra_small", strip_accents=False)
+```
 
 ## Motivation
 
-Despite the abundance of
+Despite the abundance of excellent Spanish language models (BETO, BSC-BNE, Bertin, ELECTRICIDAD, etc.), we felt there was still a lack of distilled or compact Spanish language models, and of systematic comparisons between them and their bigger siblings.
 
 ## Acknowledgment
 
-This research was supported by the
+This research was supported by the Google TPU Research Cloud (TRC) program.
+
+## Authors
 
-
+- David Fidalgo ([GitHub](https://github.com/dcfidalgo))
+- Javier Lopez ([GitHub](https://github.com/javispp))
+- Daniel Vila ([GitHub](https://github.com/dvsrepo))
+- Francisco Aranda ([GitHub](https://github.com/frascuchon))
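Among the fine-tuning details listed above, layerwise learning rate decay is the least self-explanatory bullet. A minimal sketch of the usual scheme follows; the decay factor 0.9 is an assumption for illustration, as the diff does not state the value used:

```python
def layerwise_learning_rates(base_lr, n_layers, decay=0.9):
    """Per-layer learning rates: the top layer gets base_lr, and each
    layer below it gets the rate of the layer above times `decay`."""
    # Index 0 is the bottom (embedding-side) layer, index n_layers - 1 the top.
    return [base_lr * decay ** (n_layers - 1 - i) for i in range(n_layers)]

# With the fine-tuning base rate of 1e-4 from the list above:
rates = layerwise_learning_rates(1e-4, n_layers=4)
print(rates)  # lower layers train with smaller steps than the top layer
```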
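The pre-training schedule described above (10k warmup steps, linear decay, and a warm restart after the first 150k of 300k steps) can be sketched as a step-to-learning-rate function. The exact restart shape is not stated in the diff, so this sketch assumes the same warmup-plus-linear-decay cycle simply repeats:

```python
def learning_rate(step, base_lr=5e-4, warmup=10_000, cycle=150_000):
    """Warmup-then-linear-decay schedule that restarts every `cycle` steps."""
    s = step % cycle  # warm restart: the schedule repeats each cycle
    if s < warmup:
        return base_lr * s / warmup  # linear warmup from 0 to base_lr
    # linear decay from base_lr down to 0 over the rest of the cycle
    return base_lr * (cycle - s) / (cycle - warmup)

print(learning_rate(10_000))   # peak rate at the end of warmup
print(learning_rate(155_000))  # after the restart, warming up again
```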
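To see why `strip_accents=False` matters for Spanish, consider what accent stripping (the default BERT-style normalization) does to Spanish words. The helper below is a plain-Python illustration of that normalization, not the tokenizer's actual code:

```python
import unicodedata

def strip_accents(text):
    """Drop combining accent marks, as BERT-style tokenizers do by default."""
    decomposed = unicodedata.normalize("NFD", text)  # split base chars from accents
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

# Accented and unaccented forms collapse to the same string, losing
# distinctions such as "él" (he) vs "el" (the):
print(strip_accents("él"))       # el
print(strip_accents("canción"))  # cancion
```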
|