Commit a1b59de by guillermoruiz (parent: c4f22f8): Update README.md

---
license: mit
language:
- es
metrics:
- accuracy
tags:
- code
- nlp
- custom
- bilma
tokenizer:
- yes
---
# BILMA (Bert In Latin aMericA)

Bilma is a BERT implementation in TensorFlow, trained on the masked language modeling (MLM) task over regionalized Spanish short texts from the Twitter (now X) platform, using the datasets described at https://sadit.github.io/regional-spanish-models-talk-2022/.

We have pretrained models for Argentina, Chile, Colombia, Spain, Mexico, the United States, Uruguay, and Venezuela.

The accuracy of the models trained on the MLM task for the different regions is shown below:

![bilma-mlm-comp](https://user-images.githubusercontent.com/392873/163045798-89bd45c5-b654-4f16-b3e2-5cf404e12ddd.png)

# Prerequisites

You will need TensorFlow 2.4 or newer.
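
A quick way to verify this before going further (a generic version check, not specific to Bilma):
```
import tensorflow as tf

# Fail fast with a clear message if the TensorFlow version is too old.
assert tuple(map(int, tf.__version__.split(".")[:2])) >= (2, 4), \
    f"TensorFlow 2.4+ required, found {tf.__version__}"
```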

# Quick guide

Install the following version of the transformers library:
```
!pip install transformers==4.30.2
```

Instantiate the tokenizer and the trained model (`trust_remote_code=True` is required because Bilma ships its own custom model code):
```
from transformers import AutoTokenizer
from transformers import TFAutoModel

# Load the tokenizer and the Mexico-region Bilma model from the Hugging Face Hub.
tok = AutoTokenizer.from_pretrained("guillermoruiz/bilma_mx")
model = TFAutoModel.from_pretrained("guillermoruiz/bilma_mx", trust_remote_code=True)
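
The checkpoints for the other regions are published the same way. A hypothetical example, assuming they follow the same `bilma_<country>` naming pattern as `bilma_mx` (verify the exact repository names on the Hub):
```
# Hypothetical repository name; confirm it exists on the Hub before using it.
model_ar = TFAutoModel.from_pretrained("guillermoruiz/bilma_ar", trust_remote_code=True)
```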

Now we need some text to pass through the tokenizer:
```
text = ["Vamos a comer [MASK].",             # "We are going to eat [MASK]."
        "Hace mucho que no voy al [MASK]."]  # "It's been a long time since I went to the [MASK]."
# Pad every input to the fixed sequence length of 280 expected by the model.
t = tok(text, padding="max_length", return_tensors="tf", max_length=280)
```
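
If you want to locate the masked slots programmatically rather than by eye, here is a small sketch (not part of the original guide) using the tokenizer's `mask_token_id`:
```
import tensorflow as tf

# (row, position) index pairs of every [MASK] token in the batch.
mask_positions = tf.where(t["input_ids"] == tok.mask_token_id)
print(mask_positions)
```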

With this, we are ready to run the model:
```
p = model(t)
```
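
The model exposes its MLM logits under the `"logits"` key, which the decoding step below relies on; a quick sanity check of their shape (the vocabulary size depends on the Bilma tokenizer):
```
# Expect (batch_size, sequence_length, vocab_size): two inputs, 280 positions.
print(p["logits"].shape)
```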

Now, we get the most likely words with:
```
import tensorflow as tf
# Argmax over the vocabulary, skipping the leading special-token position.
tok.batch_decode(tf.argmax(p["logits"], 2)[:, 1:], skip_special_tokens=True)
```

which produces the output:
```
['vamos a comer tacos.', 'hace mucho que no voy al gym.']
```
(in English: "we're going to eat tacos." and "it's been a long time since I've been to the gym.")
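
Beyond the single most likely word, you can rank several candidates per masked slot. A minimal sketch under the same assumption as the decoding step above (the logits at position i score the token at input position i); the helper name and `k` are illustrative, not part of the original guide:
```
import tensorflow as tf

def top_k_for_masks(t, p, tok, k=5):
    # (row, position) of every [MASK] token in the batch.
    mask_positions = tf.where(t["input_ids"] == tok.mask_token_id)
    results = []
    for row, pos in mask_positions.numpy():
        # Score the full vocabulary at this masked position and keep the top k.
        top = tf.math.top_k(p["logits"][row, pos], k=k)
        results.append(tok.convert_ids_to_tokens(top.indices.numpy().tolist()))
    return results

print(top_k_for_masks(t, p, tok))
```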

If you find this model useful for your research, please cite the following paper:
```
@misc{tellez2022regionalized,
      title={Regionalized models for Spanish language variations based on Twitter},
      author={Eric S. Tellez and Daniela Moctezuma and Sabino Miranda and Mario Graff and Guillermo Ruiz},
      year={2022},
      eprint={2110.06128},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```