tiedeman committed on
Commit 3c27c4d
1 Parent(s): a36c34e

Initial commit
.gitattributes CHANGED
@@ -29,3 +29,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+*.spm filter=lfs diff=lfs merge=lfs -text
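The new `*.spm` pattern routes the SentencePiece model files in this repository (source.spm, target.spm) through Git LFS alongside the other large artifacts. As a rough sketch, the glob patterns behave like `fnmatch` on file names (a simplification: real gitattributes matching has extra path semantics):

```python
from fnmatch import fnmatch

# LFS-tracked patterns taken from the .gitattributes hunk above
lfs_patterns = ["*.zip", "*.zst", "*tfevents*", "*.spm"]

def tracked_by_lfs(name: str) -> bool:
    """True if the file name matches any LFS-tracked pattern."""
    return any(fnmatch(name, pat) for pat in lfs_patterns)

print([f for f in ["source.spm", "target.spm", "vocab.json"] if tracked_by_lfs(f)])
# → ['source.spm', 'target.spm']
```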
README.md ADDED
---
language:
- es
- fr
- he
- it
- itc
- pt

tags:
- translation
- opus-mt-tc

license: cc-by-4.0
model-index:
- name: opus-mt-tc-big-he-itc
  results:
  - task:
      name: Translation heb-cat
      type: translation
      args: heb-cat
    dataset:
      name: flores101-devtest
      type: flores_101
      args: heb cat devtest
    metrics:
    - name: BLEU
      type: bleu
      value: 30.4
    - name: chr-F
      type: chrf
      value: 0.56398
  - task:
      name: Translation heb-fra
      type: translation
      args: heb-fra
    dataset:
      name: flores101-devtest
      type: flores_101
      args: heb fra devtest
    metrics:
    - name: BLEU
      type: bleu
      value: 33.7
    - name: chr-F
      type: chrf
      value: 0.59254
  - task:
      name: Translation heb-glg
      type: translation
      args: heb-glg
    dataset:
      name: flores101-devtest
      type: flores_101
      args: heb glg devtest
    metrics:
    - name: BLEU
      type: bleu
      value: 24.5
    - name: chr-F
      type: chrf
      value: 0.51861
  - task:
      name: Translation heb-ita
      type: translation
      args: heb-ita
    dataset:
      name: flores101-devtest
      type: flores_101
      args: heb ita devtest
    metrics:
    - name: BLEU
      type: bleu
      value: 20.8
    - name: chr-F
      type: chrf
      value: 0.50540
  - task:
      name: Translation heb-por
      type: translation
      args: heb-por
    dataset:
      name: flores101-devtest
      type: flores_101
      args: heb por devtest
    metrics:
    - name: BLEU
      type: bleu
      value: 33.1
    - name: chr-F
      type: chrf
      value: 0.58818
  - task:
      name: Translation heb-ron
      type: translation
      args: heb-ron
    dataset:
      name: flores101-devtest
      type: flores_101
      args: heb ron devtest
    metrics:
    - name: BLEU
      type: bleu
      value: 22.3
    - name: chr-F
      type: chrf
      value: 0.51480
  - task:
      name: Translation heb-spa
      type: translation
      args: heb-spa
    dataset:
      name: flores101-devtest
      type: flores_101
      args: heb spa devtest
    metrics:
    - name: BLEU
      type: bleu
      value: 21.6
    - name: chr-F
      type: chrf
      value: 0.49786
  - task:
      name: Translation heb-fra
      type: translation
      args: heb-fra
    dataset:
      name: tatoeba-test-v2021-08-07
      type: tatoeba_mt
      args: heb-fra
    metrics:
    - name: BLEU
      type: bleu
      value: 47.5
    - name: chr-F
      type: chrf
      value: 0.64713
  - task:
      name: Translation heb-ita
      type: translation
      args: heb-ita
    dataset:
      name: tatoeba-test-v2021-08-07
      type: tatoeba_mt
      args: heb-ita
    metrics:
    - name: BLEU
      type: bleu
      value: 42.1
    - name: chr-F
      type: chrf
      value: 0.64836
  - task:
      name: Translation heb-por
      type: translation
      args: heb-por
    dataset:
      name: tatoeba-test-v2021-08-07
      type: tatoeba_mt
      args: heb-por
    metrics:
    - name: BLEU
      type: bleu
      value: 41.2
    - name: chr-F
      type: chrf
      value: 0.61428
  - task:
      name: Translation heb-spa
      type: translation
      args: heb-spa
    dataset:
      name: tatoeba-test-v2021-08-07
      type: tatoeba_mt
      args: heb-spa
    metrics:
    - name: BLEU
      type: bleu
      value: 51.3
    - name: chr-F
      type: chrf
      value: 0.69210
---
# opus-mt-tc-big-he-itc

## Table of Contents
- [Model Details](#model-details)
- [Uses](#uses)
- [Risks, Limitations and Biases](#risks-limitations-and-biases)
- [How to Get Started With the Model](#how-to-get-started-with-the-model)
- [Training](#training)
- [Evaluation](#evaluation)
- [Citation Information](#citation-information)
- [Acknowledgements](#acknowledgements)

## Model Details

Neural machine translation model for translating from Hebrew (he) to Italic languages (itc).

This model is part of the [OPUS-MT project](https://github.com/Helsinki-NLP/Opus-MT), an effort to make neural machine translation models widely available and accessible for many languages in the world. All models were originally trained with [Marian NMT](https://marian-nmt.github.io/), an efficient NMT implementation written in pure C++, and have been converted to PyTorch using the Hugging Face transformers library. Training data is taken from [OPUS](https://opus.nlpl.eu/) and the training pipelines follow the procedures of [OPUS-MT-train](https://github.com/Helsinki-NLP/Opus-MT-train).

**Model Description:**
- **Developed by:** Language Technology Research Group at the University of Helsinki
- **Model Type:** Translation (transformer-big)
- **Release:** 2022-07-25
- **License:** CC-BY-4.0
- **Language(s):**
  - Source Language(s): heb
  - Target Language(s): fra ita por spa
  - Valid Target Language Labels: >>fra<< >>ita<< >>por<< >>spa<<
- **Original Model:** [opusTCv20210807_transformer-big_2022-07-25.zip](https://object.pouta.csc.fi/Tatoeba-MT-models/heb-itc/opusTCv20210807_transformer-big_2022-07-25.zip)
- **Resources for more information:**
  - [OPUS-MT-train GitHub Repo](https://github.com/Helsinki-NLP/OPUS-MT-train)
  - More information about released models for this language pair: [OPUS-MT heb-itc README](https://github.com/Helsinki-NLP/Tatoeba-Challenge/tree/master/models/heb-itc/README.md)
  - [More information about MarianNMT models in the transformers library](https://huggingface.co/docs/transformers/model_doc/marian)
  - [Tatoeba Translation Challenge](https://github.com/Helsinki-NLP/Tatoeba-Challenge/)

This is a multilingual translation model with multiple target languages. A sentence-initial language token is required in the form `>>id<<` (id = valid target language ID), e.g. `>>fra<<`.
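The token convention can be sketched as a small helper (`prepare_batch` is a hypothetical name, not part of transformers; the valid-label set is the one listed in the model description above):

```python
# Target-language labels listed in this model card; the benchmark tables
# and examples below also exercise cat/glg/ron via the same token scheme.
VALID_TARGETS = {"fra", "ita", "por", "spa"}

def prepare_batch(sentences, target_lang):
    """Prefix each source sentence with the >>id<< token the model expects."""
    if target_lang not in VALID_TARGETS:
        raise ValueError(f"unsupported target language: {target_lang}")
    return [f">>{target_lang}<< {s}" for s in sentences]
```

For example, `prepare_batch(["...."], "fra")` produces inputs in the same shape as the `src_text` list in the usage example below.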
218
+
219
+ ## Uses
220
+
221
+ This model can be used for translation and text-to-text generation.
222
+
223
+ ## Risks, Limitations and Biases
224
+
225
+ **CONTENT WARNING: Readers should be aware that the model is trained on various public data sets that may contain content that is disturbing, offensive, and can propagate historical and current stereotypes.**
226
+
227
+ Significant research has explored bias and fairness issues with language models (see, e.g., [Sheng et al. (2021)](https://aclanthology.org/2021.acl-long.330.pdf) and [Bender et al. (2021)](https://dl.acm.org/doi/pdf/10.1145/3442188.3445922)).
228
+
229
+ ## How to Get Started With the Model
230
+
231
+ A short example code:
232
+
233
+ ```python
234
+ from transformers import MarianMTModel, MarianTokenizer
235
+
236
+ src_text = [
237
+ ">>cat<< מרי פמיניסטית.",
238
+ ">>spa<< תתרמו לטטואבה."
239
+ ]
240
+
241
+ model_name = "pytorch-models/opus-mt-tc-big-he-itc"
242
+ tokenizer = MarianTokenizer.from_pretrained(model_name)
243
+ model = MarianMTModel.from_pretrained(model_name)
244
+ translated = model.generate(**tokenizer(src_text, return_tensors="pt", padding=True))
245
+
246
+ for t in translated:
247
+ print( tokenizer.decode(t, skip_special_tokens=True) )
248
+
249
+ # expected output:
250
+ # Mary és feminista.
251
+ # Donen a Tatoeba.
252
+ ```
253
+
254
+ You can also use OPUS-MT models with the transformers pipelines, for example:
255
+
256
+ ```python
257
+ from transformers import pipeline
258
+ pipe = pipeline("translation", model="Helsinki-NLP/opus-mt-tc-big-he-itc")
259
+ print(pipe(">>cat<< מרי פמיניסטית."))
260
+
261
+ # expected output: Mary és feminista.
262
+ ```
263
+
264
+ ## Training
265
+
266
+ - **Data**: opusTCv20210807 ([source](https://github.com/Helsinki-NLP/Tatoeba-Challenge))
267
+ - **Pre-processing**: SentencePiece (spm32k,spm32k)
268
+ - **Model Type:** transformer-big
269
+ - **Original MarianNMT Model**: [opusTCv20210807_transformer-big_2022-07-25.zip](https://object.pouta.csc.fi/Tatoeba-MT-models/heb-itc/opusTCv20210807_transformer-big_2022-07-25.zip)
270
+ - **Training Scripts**: [GitHub Repo](https://github.com/Helsinki-NLP/OPUS-MT-train)
271
+
272
+ ## Evaluation
273
+
274
+ * test set translations: [opusTCv20210807_transformer-big_2022-07-25.test.txt](https://object.pouta.csc.fi/Tatoeba-MT-models/heb-itc/opusTCv20210807_transformer-big_2022-07-25.test.txt)
275
+ * test set scores: [opusTCv20210807_transformer-big_2022-07-25.eval.txt](https://object.pouta.csc.fi/Tatoeba-MT-models/heb-itc/opusTCv20210807_transformer-big_2022-07-25.eval.txt)
276
+ * benchmark results: [benchmark_results.txt](benchmark_results.txt)
277
+ * benchmark output: [benchmark_translations.zip](benchmark_translations.zip)
278
+
279
+ | langpair | testset | chr-F | BLEU | #sent | #words |
280
+ |----------|---------|-------|-------|-------|--------|
281
+ | heb-fra | tatoeba-test-v2021-08-07 | 0.64713 | 47.5 | 3281 | 26123 |
282
+ | heb-ita | tatoeba-test-v2021-08-07 | 0.64836 | 42.1 | 1706 | 11464 |
283
+ | heb-por | tatoeba-test-v2021-08-07 | 0.61428 | 41.2 | 719 | 5335 |
284
+ | heb-spa | tatoeba-test-v2021-08-07 | 0.69210 | 51.3 | 1849 | 14213 |
285
+ | heb-cat | flores101-devtest | 0.56398 | 30.4 | 1012 | 27304 |
286
+ | heb-fra | flores101-devtest | 0.59254 | 33.7 | 1012 | 28343 |
287
+ | heb-glg | flores101-devtest | 0.51861 | 24.5 | 1012 | 26582 |
288
+ | heb-ita | flores101-devtest | 0.50540 | 20.8 | 1012 | 27306 |
289
+ | heb-por | flores101-devtest | 0.58818 | 33.1 | 1012 | 26519 |
290
+ | heb-ron | flores101-devtest | 0.51480 | 22.3 | 1012 | 26799 |
291
+ | heb-spa | flores101-devtest | 0.49786 | 21.6 | 1012 | 29199 |
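For reference, the chr-F column is a character n-gram F-score. The official scores are computed with sacreBLEU; the sketch below is a simplified sentence-level illustration of the metric only (the real implementation also handles word n-grams, whitespace options, and corpus-level aggregation):

```python
from collections import Counter

def char_ngrams(text, n):
    """Counter of character n-grams; spaces removed, as in default chrF."""
    text = text.replace(" ", "")
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def chrf(hypothesis, reference, max_n=6, beta=2.0):
    """Simplified sentence-level chrF: average char n-gram precision and
    recall over n = 1..max_n, combined into an F-beta score (beta=2
    weights recall higher, as in the standard metric)."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        if sum(hyp.values()) == 0 or sum(ref.values()) == 0:
            continue  # n-gram order longer than the strings
        overlap = sum((hyp & ref).values())
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return (1 + beta**2) * p * r / (beta**2 * p + r)
```

An identical hypothesis and reference score 1.0, matching the 0–1 scale of the chr-F values in the tables.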
## Citation Information

* Publications: [OPUS-MT – Building open translation services for the World](https://aclanthology.org/2020.eamt-1.61/) and [The Tatoeba Translation Challenge – Realistic Data Sets for Low Resource and Multilingual MT](https://aclanthology.org/2020.wmt-1.139/) (Please cite if you use this model.)

```bibtex
@inproceedings{tiedemann-thottingal-2020-opus,
    title = "{OPUS}-{MT} {--} Building open translation services for the World",
    author = {Tiedemann, J{\"o}rg and Thottingal, Santhosh},
    booktitle = "Proceedings of the 22nd Annual Conference of the European Association for Machine Translation",
    month = nov,
    year = "2020",
    address = "Lisboa, Portugal",
    publisher = "European Association for Machine Translation",
    url = "https://aclanthology.org/2020.eamt-1.61",
    pages = "479--480",
}

@inproceedings{tiedemann-2020-tatoeba,
    title = "The Tatoeba Translation Challenge {--} Realistic Data Sets for Low Resource and Multilingual {MT}",
    author = {Tiedemann, J{\"o}rg},
    booktitle = "Proceedings of the Fifth Conference on Machine Translation",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2020.wmt-1.139",
    pages = "1174--1182",
}
```

## Acknowledgements

The work is supported by the [European Language Grid](https://www.european-language-grid.eu/) as [pilot project 2866](https://live.european-language-grid.eu/catalogue/#/resource/projects/2866), by the [FoTran project](https://www.helsinki.fi/en/researchgroups/natural-language-understanding-with-cross-lingual-grounding), funded by the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement No 771113), and by the [MeMAD project](https://memad.eu/), funded by the European Union’s Horizon 2020 Research and Innovation Programme under grant agreement No 780069. We are also grateful for the generous computational resources and IT infrastructure provided by [CSC -- IT Center for Science](https://www.csc.fi/), Finland.

## Model conversion info

* transformers version: 4.16.2
* OPUS-MT git hash: 8b9f0b0
* port time: Fri Aug 12 18:35:37 EEST 2022
* port machine: LM0-400-22516.local
benchmark_results.txt ADDED
heb-cat flores101-dev 0.56174 30.1 997 25962
heb-fra flores101-dev 0.59526 34.5 997 26706
heb-glg flores101-dev 0.51734 25.0 997 25265
heb-ita flores101-dev 0.51067 21.4 997 25840
heb-por flores101-dev 0.58422 33.0 997 25287
heb-ron flores101-dev 0.51777 23.6 997 25616
heb-spa flores101-dev 0.49378 21.6 997 27793
heb-cat flores101-devtest 0.56398 30.4 1012 27304
heb-fra flores101-devtest 0.59254 33.7 1012 28343
heb-glg flores101-devtest 0.51861 24.5 1012 26582
heb-ita flores101-devtest 0.50540 20.8 1012 27306
heb-por flores101-devtest 0.58818 33.1 1012 26519
heb-ron flores101-devtest 0.51480 22.3 1012 26799
heb-spa flores101-devtest 0.49786 21.6 1012 29199
heb-por tatoeba-test-v2020-07-28 0.61114 40.9 702 5234
heb-lad tatoeba-test-v2021-03-30 0.19024 2.0 237 1434
heb-por tatoeba-test-v2021-03-30 0.61108 41.0 735 5458
heb-fra tatoeba-test-v2021-08-07 0.64713 47.5 3281 26123
heb-ita tatoeba-test-v2021-08-07 0.64836 42.1 1706 11464
heb-lad tatoeba-test-v2021-08-07 0.18441 1.7 218 1309
heb-por tatoeba-test-v2021-08-07 0.61428 41.2 719 5335
heb-spa tatoeba-test-v2021-08-07 0.69210 51.3 1849 14213
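The file above is whitespace-separated with the same columns as the README evaluation table (langpair, testset, chr-F, BLEU, #sent, #words). A small loading sketch (the field names are assumptions for illustration):

```python
from typing import List, NamedTuple

class BenchmarkRow(NamedTuple):
    langpair: str
    testset: str
    chrf: float
    bleu: float
    n_sentences: int
    n_words: int

def parse_benchmark(text: str) -> List[BenchmarkRow]:
    """Parse whitespace-separated benchmark lines into typed rows."""
    rows = []
    for line in text.strip().splitlines():
        pair, testset, chrf, bleu, sents, words = line.split()
        rows.append(BenchmarkRow(pair, testset, float(chrf), float(bleu),
                                 int(sents), int(words)))
    return rows
```

With the contents of benchmark_results.txt this makes it easy to, e.g., filter by test set or sort by BLEU.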
benchmark_translations.zip ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:ee4e410830c35966f8bbb7877063fe3404f560536c80a3cd114535a25dca5c56
size 2690703
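Like the other large files in this repository, benchmark_translations.zip is stored as a Git LFS pointer: three key-value lines giving the spec version, the sha256 object ID, and the size in bytes. A minimal parser for that pointer format:

```python
def parse_lfs_pointer(text: str) -> dict:
    """Parse a Git LFS pointer file (one 'key value' pair per line);
    converts 'size' to int and splits the oid into algorithm and digest."""
    info = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        info[key] = value
    info["size"] = int(info["size"])
    algo, _, digest = info["oid"].partition(":")
    info["oid_algorithm"], info["oid_digest"] = algo, digest
    return info
```

After downloading the real file, the digest can be verified against `hashlib.sha256` of its contents and the size against `os.path.getsize`.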
config.json ADDED
{
  "activation_dropout": 0.0,
  "activation_function": "relu",
  "architectures": [
    "MarianMTModel"
  ],
  "attention_dropout": 0.0,
  "bad_words_ids": [
    [
      61145
    ]
  ],
  "bos_token_id": 0,
  "classifier_dropout": 0.0,
  "d_model": 1024,
  "decoder_attention_heads": 16,
  "decoder_ffn_dim": 4096,
  "decoder_layerdrop": 0.0,
  "decoder_layers": 6,
  "decoder_start_token_id": 61145,
  "decoder_vocab_size": 61146,
  "dropout": 0.1,
  "encoder_attention_heads": 16,
  "encoder_ffn_dim": 4096,
  "encoder_layerdrop": 0.0,
  "encoder_layers": 6,
  "eos_token_id": 26845,
  "forced_eos_token_id": 26845,
  "init_std": 0.02,
  "is_encoder_decoder": true,
  "max_length": 512,
  "max_position_embeddings": 1024,
  "model_type": "marian",
  "normalize_embedding": false,
  "num_beams": 4,
  "num_hidden_layers": 6,
  "pad_token_id": 61145,
  "scale_embedding": true,
  "share_encoder_decoder_embeddings": true,
  "static_position_embeddings": true,
  "torch_dtype": "float16",
  "transformers_version": "4.18.0.dev0",
  "use_cache": true,
  "vocab_size": 61146
}
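A couple of derived quantities of the transformer-big architecture can be read straight off this config with plain arithmetic (only a subset of the keys is reproduced here):

```python
import json

# Subset of the config.json above
config = json.loads("""{
  "d_model": 1024,
  "encoder_attention_heads": 16,
  "encoder_layers": 6,
  "decoder_layers": 6,
  "encoder_ffn_dim": 4096,
  "vocab_size": 61146
}""")

# Per-head dimension of multi-head attention and FFN expansion factor
head_dim = config["d_model"] // config["encoder_attention_heads"]
ffn_ratio = config["encoder_ffn_dim"] // config["d_model"]
print(head_dim, ffn_ratio)  # 64 4
```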
pytorch_model.bin ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:2338bc1a918976d2d7934111452aad477da62baaa81854f46d330407e2a2da47
size 603382723
source.spm ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:208e2b160d6b457b678223d08314c0eac4484505faa14c2731590803b61bd080
size 878069
special_tokens_map.json ADDED
{"eos_token": "</s>", "unk_token": "<unk>", "pad_token": "<pad>"}
target.spm ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:19fe555d4b6fb3f87af6cfa57da92479232459bbe60e623a512c2fbc2302ef0b
size 805079
tokenizer_config.json ADDED
{
  "source_lang": "he",
  "target_lang": "itc",
  "unk_token": "<unk>",
  "eos_token": "</s>",
  "pad_token": "<pad>",
  "model_max_length": 512,
  "sp_model_kwargs": {},
  "separate_vocabs": false,
  "special_tokens_map_file": null,
  "name_or_path": "marian-models/opusTCv20210807_transformer-big_2022-07-25/he-itc",
  "tokenizer_class": "MarianTokenizer"
}
vocab.json ADDED
The diff for this file is too large to render.