---
license: apache-2.0
---
<h1 align="center">mT5 small spanish es</h1>
This is a Spanish fine-tuned version of Google's mT5-small model.
https://huggingface.co/google/mt5-small
# Datasets
The model was fine-tuned on the following datasets. Each task expects its own input prefix:

| Task | Prefix |
|---|---|
| Multinli (English) | `multi nli premise:[Text] hypo:[Text]` |
| Multinli (Spanish) | `multi nli premise:[Text] hypo:[Text]` |
| Pawx (English) | `pawx sentence1:[Text] sentence2:[Text]` |
| Pawx (Spanish) | `pawx sentence1:[Text] sentence2:[Text]` |
| Squad (English) | `question:[Text] context:[Text]` |
| Squad (Spanish) | `question:[Text] context:[Text]` |
| Translations (English-Spanish) | `translate English to Spanish:[Text]` |
| Translations (Spanish-English) | `translate Spanish to English:[Text]` |
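
Because every task is selected purely by its text prefix, it can help to build the input string programmatically. The following is a minimal sketch, not part of the model's API: the `TEMPLATES` dictionary, its keys, and the `build_prompt` helper are hypothetical names introduced here, and only the prefix strings themselves come from the list above.

```python
# Prefix templates copied from the dataset list above.
# The dictionary keys are hypothetical names chosen for this sketch.
TEMPLATES = {
    "nli": "multi nli premise:{premise} hypo:{hypothesis}",
    "paraphrase": "pawx sentence1:{sentence1} sentence2:{sentence2}",
    "qa": "question:{question} context:{context}",
    "en_to_es": "translate English to Spanish:{text}",
    "es_to_en": "translate Spanish to English:{text}",
}


def build_prompt(task: str, **fields: str) -> str:
    """Fill the prefix template for `task` with the given text fields."""
    if task not in TEMPLATES:
        raise ValueError(
            f"unknown task {task!r}; expected one of {sorted(TEMPLATES)}"
        )
    # A missing field raises KeyError, which surfaces typos early.
    return TEMPLATES[task].format(**fields)


if __name__ == "__main__":
    print(build_prompt("es_to_en", text="Esta frase es para probar el modelo"))
    print(build_prompt("nli", premise="Llueve.", hypothesis="El suelo está mojado."))
```

The resulting string is what gets passed to the tokenizer as the task input.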
# Inference
The following code can be used to perform the different model tasks.
Translations

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "HURIDOCS/mt5-small-spanish-es"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# The task prefix selects the translation direction.
task = "translate Spanish to English:Esta frase es para probar el modelo"

# Tokenize and pad/truncate the input to 512 tokens.
inputs = tokenizer(
    [task],
    return_tensors="pt",
    padding="max_length",
    truncation=True,
    max_length=512
)

# Pass the attention mask so padding tokens are ignored during generation.
output_ids = model.generate(
    input_ids=inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    max_length=84,
    no_repeat_ngram_size=2,
    num_beams=4
)[0]

result_text = tokenizer.decode(
    output_ids,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False
)

print(result_text)
```

Question answering

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "HURIDOCS/mt5-small-spanish-es"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# The question and its context go into a single prefixed input string.
task = '''question:En qué país se encuentra Normandía? context:Los normandos (normandos: Nourmann; Francés: Normandos; Normanni)
fue el pueblo que en los siglos X y XI dio su nombre a Normandía, una región de Francia.
Eran descendientes de invasores nórdicos ('normandos" viene de "Norseman") y piratas de Dinamarca, Islandia y Noruega que,
bajo su líder Rollo, acordaron jurar lealtad al rey Carlos III de Francia Occidental. A través de generaciones de asimilación
y mezcla con las poblaciones nativas francas y galas romanas, sus descendientes se fusionarían gradualmente con las culturas
carolingias de Francia Occidental. La identidad cultural y étnica distintiva de los normandos surgió inicialmente en la
primera mitad del siglo X, y continuó evolucionando durante los siglos siguientes.'''

# Tokenize and pad/truncate the input to 512 tokens.
inputs = tokenizer(
    [task],
    return_tensors="pt",
    padding="max_length",
    truncation=True,
    max_length=512
)

# Pass the attention mask so padding tokens are ignored during generation.
output_ids = model.generate(
    input_ids=inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    max_length=84,
    no_repeat_ngram_size=2,
    num_beams=4
)[0]

result_text = tokenizer.decode(
    output_ids,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False
)

print(result_text)
```

# Fine-tuning
Check out the Transformers library examples:
https://github.com/huggingface/transformers/tree/main/examples/pytorch/question-answering
# Performance
Spanish SQuAD v2, 512 tokens

| | Model | Exact match | F1 |
|---|---|---|---|
| rank 1 | mrm8488/distill-bert-base-spanish-wwm-cased | 50.43% | 71.45% |
| rank 2 | **mT5 small spanish es** | 48.35% | 62.03% |
| rank 3 | flan-t5-small | 41.44% | 56.48% |
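
For readers unfamiliar with the two metrics, the sketch below shows how exact match and token-level F1 are commonly computed for SQuAD-style answers. It is an illustration under assumptions, not the script that produced the scores above: the `normalize` function here only lowercases, strips punctuation, and collapses whitespace, whereas the official SQuAD evaluation additionally removes English articles.

```python
import string
from collections import Counter

# Assumption: ASCII punctuation plus the Spanish opening marks is stripped.
PUNCTUATION = set(string.punctuation) | {"¿", "¡"}


def normalize(text: str) -> str:
    """Lowercase, drop punctuation, and collapse runs of whitespace."""
    text = "".join(ch for ch in text.lower() if ch not in PUNCTUATION)
    return " ".join(text.split())


def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def f1_score(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    if not pred_tokens or not ref_tokens:
        # Two empty answers agree (an unanswerable question in SQuAD v2).
        return float(pred_tokens == ref_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(exact_match("Francia.", "francia"))
    print(f1_score("en Francia", "Francia"))
```

Dataset-level scores are the averages of these per-question values, which is why F1 is always at least as high as exact match.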