Commit a1b59de by guillermoruiz (parent: c4f22f8): Update README.md

---
license: mit
language:
- es
metrics:
- accuracy
tags:
- code
- nlp
- custom
- bilma
tokenizer:
- yes
---
# BILMA (Bert In Latin aMericA)

Bilma is a BERT implementation in TensorFlow, trained on the masked language modeling (MLM) task over regionalized Spanish short texts from the Twitter (now X) platform, using the datasets described at https://sadit.github.io/regional-spanish-models-talk-2022/.

We have pretrained models for Argentina, Chile, Colombia, Spain, Mexico, the United States, Uruguay, and Venezuela.

The accuracy of the models trained on the MLM task for the different regions is shown below:

![bilma-mlm-comp](https://user-images.githubusercontent.com/392873/163045798-89bd45c5-b654-4f16-b3e2-5cf404e12ddd.png)

# Prerequisites

You will need TensorFlow 2.4 or newer.
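
A quick way to verify this before going further (a generic version check, not specific to Bilma):
```
import tensorflow as tf

# Fail fast with a clear message if the TensorFlow version is too old.
assert tuple(map(int, tf.__version__.split(".")[:2])) >= (2, 4), \
    f"TensorFlow 2.4+ required, found {tf.__version__}"
```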

# Quick guide

Install the following version of the transformers library:
```
!pip install transformers==4.30.2
```

Instantiate the tokenizer and the trained model (`trust_remote_code=True` is required because Bilma ships its own custom model code):
```
from transformers import AutoTokenizer
from transformers import TFAutoModel

# Load the tokenizer and the Mexico-region Bilma model from the Hugging Face Hub.
tok = AutoTokenizer.from_pretrained("guillermoruiz/bilma_mx")
model = TFAutoModel.from_pretrained("guillermoruiz/bilma_mx", trust_remote_code=True)
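
The checkpoints for the other regions are published the same way. A hypothetical example, assuming they follow the same `bilma_<country>` naming pattern as `bilma_mx` (verify the exact repository names on the Hub):
```
# Hypothetical repository name; confirm it exists on the Hub before using it.
model_ar = TFAutoModel.from_pretrained("guillermoruiz/bilma_ar", trust_remote_code=True)
```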

Now we need some text to pass through the tokenizer:
```
text = ["Vamos a comer [MASK].",             # "We are going to eat [MASK]."
        "Hace mucho que no voy al [MASK]."]  # "It's been a long time since I went to the [MASK]."
# Pad every input to the fixed sequence length of 280 expected by the model.
t = tok(text, padding="max_length", return_tensors="tf", max_length=280)
```
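
If you want to locate the masked slots programmatically rather than by eye, here is a small sketch (not part of the original guide) using the tokenizer's `mask_token_id`:
```
import tensorflow as tf

# (row, position) index pairs of every [MASK] token in the batch.
mask_positions = tf.where(t["input_ids"] == tok.mask_token_id)
print(mask_positions)
```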

With this, we are ready to run the model:
```
p = model(t)
```
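
The model exposes its MLM logits under the `"logits"` key, which the decoding step below relies on; a quick sanity check of their shape (the vocabulary size depends on the Bilma tokenizer):
```
# Expect (batch_size, sequence_length, vocab_size): two inputs, 280 positions.
print(p["logits"].shape)
```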

Now, we get the most likely words with:
```
import tensorflow as tf
# Argmax over the vocabulary, skipping the leading special-token position.
tok.batch_decode(tf.argmax(p["logits"], 2)[:, 1:], skip_special_tokens=True)
```

which produces the output:
```
['vamos a comer tacos.', 'hace mucho que no voy al gym.']
```
(in English: "we're going to eat tacos." and "it's been a long time since I've been to the gym.")
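
Beyond the single most likely word, you can rank several candidates per masked slot. A minimal sketch under the same assumption as the decoding step above (the logits at position i score the token at input position i); the helper name and `k` are illustrative, not part of the original guide:
```
import tensorflow as tf

def top_k_for_masks(t, p, tok, k=5):
    # (row, position) of every [MASK] token in the batch.
    mask_positions = tf.where(t["input_ids"] == tok.mask_token_id)
    results = []
    for row, pos in mask_positions.numpy():
        # Score the full vocabulary at this masked position and keep the top k.
        top = tf.math.top_k(p["logits"][row, pos], k=k)
        results.append(tok.convert_ids_to_tokens(top.indices.numpy().tolist()))
    return results

print(top_k_for_masks(t, p, tok))
```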

If you find this model useful for your research, please cite the following paper:
```
@misc{tellez2022regionalized,
      title={Regionalized models for Spanish language variations based on Twitter},
      author={Eric S. Tellez and Daniela Moctezuma and Sabino Miranda and Mario Graff and Guillermo Ruiz},
      year={2022},
      eprint={2110.06128},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```