gabriel-p committed on
Commit
b779edc
1 Parent(s): b6e8a5f

Update README.md

  1. README.md +118 -0
README.md CHANGED
@@ -1,3 +1,121 @@
---
license: apache-2.0
---

<h3 align="center">mT5 small spanish es</h3>

This is a Spanish fine-tuned model that uses Google's mt5-small base model as its starting point.

https://huggingface.co/google/mt5-small

## Datasets

The following datasets were used for fine-tuning. Each task expects its input to start with the prefix shown:

| Task | Prefix |
|---|---|
| Multinli (English) | multi nli premise:[Text] hypo:[Text] |
| Multinli (Spanish) | multi nli premise:[Text] hypo:[Text] |
| Pawx (English) | pawx sentence1:[Text] sentence2:[Text] |
| Pawx (Spanish) | pawx sentence1:[Text] sentence2:[Text] |
| Squad (English) | question:[Text] context:[Text] |
| Squad (Spanish) | question:[Text] context:[Text] |
| Translations (English-Spanish) | translate English to Spanish:[Text] |
| Translations (Spanish-English) | translate Spanish to English:[Text] |

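The prefixes listed above can be assembled programmatically. The following is a minimal sketch, not part of the original model card: the `PROMPT_TEMPLATES` dictionary, its keys, and the `build_prompt` helper are hypothetical names introduced here for illustration, and only the template strings themselves come from the list above.

```python
# Prefix templates copied from the list above. The dictionary keys are
# hypothetical labels chosen for this sketch, not names used by the model.
PROMPT_TEMPLATES = {
    "multinli": "multi nli premise:{premise} hypo:{hypothesis}",
    "pawx": "pawx sentence1:{sentence1} sentence2:{sentence2}",
    "squad": "question:{question} context:{context}",
    "en_to_es": "translate English to Spanish:{text}",
    "es_to_en": "translate Spanish to English:{text}",
}


def build_prompt(task: str, **fields: str) -> str:
    """Fill the template for `task` with the given text fields."""
    if task not in PROMPT_TEMPLATES:
        raise ValueError(f"Unknown task: {task!r}")
    try:
        return PROMPT_TEMPLATES[task].format(**fields)
    except KeyError as missing:
        # A placeholder required by the template was not supplied
        raise ValueError(f"Missing field {missing} for task {task!r}") from None


print(build_prompt("es_to_en", text="Esta frase es para probar el modelo"))
# translate Spanish to English:Esta frase es para probar el modelo
```

The same helper covers both languages of a task, since the English and Spanish variants share one prefix.
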
## Inference

The following code can be used to perform the different model tasks.

Translations

    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    model_name = "HURIDOCS/mt5-small-spanish-es"

    # Load the fine-tuned tokenizer and model from the Hugging Face Hub
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

    # The task prefix selects the translation direction
    task = "translate Spanish to English:Esta frase es para probar el modelo"
    inputs = tokenizer(
        [task],
        return_tensors="pt",
        padding="max_length",
        truncation=True,
        max_length=512,
    )

    # Beam search; the attention mask keeps padding tokens from being attended to
    output_ids = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        max_length=84,
        no_repeat_ngram_size=2,
        num_beams=4,
    )[0]

    result_text = tokenizer.decode(
        output_ids,
        skip_special_tokens=True,
        clean_up_tokenization_spaces=False,
    )

    print(result_text)

Question answering

    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    model_name = "HURIDOCS/mt5-small-spanish-es"

    # Load the fine-tuned tokenizer and model from the Hugging Face Hub
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

    # The prompt follows the Squad prefix: "question:[Text] context:[Text]"
    task = '''question:¿En qué país se encuentra Normandía? context:Los normandos (normandos: Nourmann; Francés: Normandos; Normanni)
    fue el pueblo que en los siglos X y XI dio su nombre a Normandía, una región de Francia.
    Eran descendientes de invasores nórdicos ("normandos" viene de "Norseman") y piratas de Dinamarca, Islandia y Noruega que,
    bajo su líder Rollo, acordaron jurar lealtad al rey Carlos III de Francia Occidental. A través de generaciones de asimilación
    y mezcla con las poblaciones nativas francas y galas romanas, sus descendientes se fusionarían gradualmente con las culturas
    carolingias de Francia Occidental. La identidad cultural y étnica distintiva de los normandos surgió inicialmente en la
    primera mitad del siglo X, y continuó evolucionando durante los siglos siguientes.'''

    inputs = tokenizer(
        [task],
        return_tensors="pt",
        padding="max_length",
        truncation=True,
        max_length=512,
    )

    # Beam search; the attention mask keeps padding tokens from being attended to
    output_ids = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        max_length=84,
        no_repeat_ngram_size=2,
        num_beams=4,
    )[0]

    result_text = tokenizer.decode(
        output_ids,
        skip_special_tokens=True,
        clean_up_tokenization_spaces=False,
    )

    print(result_text)

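Both snippets above share the same tokenize, generate, decode sequence, so it can be factored into one helper that handles several prompts at once. This is a sketch rather than code from the model card: `run_prompts` and its parameter names are hypothetical, and it assumes `model` and `tokenizer` were loaded as shown above. It pads only to the longest prompt in the batch and passes the attention mask so padded positions are ignored.

```python
from typing import List


def run_prompts(
    model,
    tokenizer,
    prompts: List[str],
    max_input_length: int = 512,
    max_output_length: int = 84,
) -> List[str]:
    """Generate one output string per prompt in a single batch."""
    inputs = tokenizer(
        prompts,
        return_tensors="pt",
        padding=True,  # pad to the longest prompt in the batch
        truncation=True,
        max_length=max_input_length,
    )
    output_ids = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],  # mask out padding positions
        max_length=max_output_length,
        no_repeat_ngram_size=2,
        num_beams=4,
    )
    # batch_decode returns one decoded string per generated sequence
    return tokenizer.batch_decode(output_ids, skip_special_tokens=True)
```

A call such as `run_prompts(model, tokenizer, ["translate Spanish to English:Hola", "translate English to Spanish:Hello"])` would then return a list with one generated string per prompt.
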
## Fine-tuning

See the question-answering examples in the Transformers library:

https://github.com/huggingface/transformers/tree/main/examples/pytorch/question-answering

## Performance

Spanish SQuAD v2, 512 tokens

| Rank | Model | Exact match | F1 |
|---|---|---|---|
| 1 | mrm8488/distill-bert-base-spanish-wwm-cased | 50.43% | 71.45% |
| 2 | **mT5 small spanish es** | 48.35% | 62.03% |
| 3 | flan-t5-small | 41.44% | 56.48% |
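
For readers unfamiliar with the two metrics, here is a minimal sketch of how exact match and token-level F1 are commonly computed for SQuAD-style answers. It is an assumption that the scores above were produced this way, since the model card does not say which evaluation script was used. The normalization here (lowercasing, removing punctuation, collapsing whitespace) is a simplification, and the function names are illustrative.

```python
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, drop ASCII punctuation plus ¿ and ¡, collapse whitespace."""
    drop = set(string.punctuation + "¿¡")
    text = "".join(ch for ch in text.lower() if ch not in drop)
    return " ".join(text.split())


def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def f1_score(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    if not pred_tokens or not ref_tokens:
        # Unanswerable questions: two empty answers match, otherwise 0
        return float(pred_tokens == ref_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("Francia", "francia."))  # 1.0
print(f1_score("en Francia", "Francia"))   # precision 1/2, recall 1, F1 = 2/3
```

A dataset-level score is the mean of these per-question values, which is why F1 is always at least as high as exact match in the table.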