---
license: apache-2.0
language:
- eu
- en
metrics:
- BLEU
- TER
---
## HiTZ Center’s Basque-English machine translation model

## Model description

This model was trained from scratch using [Marian NMT](https://marian-nmt.github.io/) on a combination of English-Basque datasets totalling 18,067,996 sentence pairs. Of these, 9,033,998 sentence pairs are parallel data collected from the web, and the remaining 9,033,998 sentence pairs are synthetic parallel data created using the [ES-EU translator from HiTZ](https://huggingface.co/HiTZ/mt-hitz-es-eu). The model was evaluated on the Flores, TaCon and NTREX evaluation datasets.

- **Developed by:** HiTZ Research Center & IXA Research group (University of the Basque Country UPV/EHU)
- **Model type:** translation
- **Source Language:** Basque
- **Target Language:** English
- **License:** apache-2.0

## Intended uses and limitations

You can use this model for machine translation from Basque to English.

At the time of submission, no measures have been taken to estimate the bias and toxicity embedded in the model. However, we are aware that our models may be biased since the corpora have been collected using crawling techniques on multiple web sources.

## How to Get Started with the Model

Use the code below to get started with the model.

```python
from transformers import AutoModelForSeq2SeqLM, MarianTokenizer

# Basque source sentences to translate
src_text = ["hau proba bat da"]

model_name = "HiTZ/mt-hitz-eu-en"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Tokenize the batch (padded to the longest sentence) and generate translations
batch = tokenizer(src_text, return_tensors="pt", padding=True)
translated = model.generate(**batch)
print([tokenizer.decode(t, skip_special_tokens=True) for t in translated])
```
The recommended environments include the following transformers versions: 4.12.3, 4.15.0, 4.26.1.
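
When translating more than a handful of sentences, it may help to feed the model fixed-size batches rather than one large list. The following is a minimal sketch of that idea, not part of the original model card: `batched`, `translate_all` and `translate_batch` are hypothetical names, and the batch size of 16 is an arbitrary assumption. The helper takes any callable that maps a list of source sentences to a list of translations, so it can be tried without loading the model.

```python
from typing import Callable, Iterator, List


def batched(items: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def translate_all(
    sentences: List[str],
    translate_batch: Callable[[List[str]], List[str]],
    batch_size: int = 16,  # assumed value, tune to the available memory
) -> List[str]:
    """Translate `sentences` batch by batch, preserving the input order."""
    translations: List[str] = []
    for batch in batched(sentences, batch_size):
        translations.extend(translate_batch(batch))
    return translations


# Hypothetical wiring with the `tokenizer` and `model` objects from the snippet above:
# def translate_batch(batch):
#     encoded = tokenizer(batch, return_tensors="pt", padding=True)
#     generated = model.generate(**encoded)
#     return tokenizer.batch_decode(generated, skip_special_tokens=True)
```

Passing the commented-out `translate_batch` to `translate_all(src_text, translate_batch)` would then return the translations in the same order as the inputs.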

## Training Details

### Training Data

The Basque-English data collected from the web was a combination of the following datasets:


| Dataset       	| Sentences before cleaning	|
|-----------------|--------------------------:|
| CCMatrix  v1    | 7,788,871  	          |
| EhuHac          | 585,210	                  |
| Ehuskaratuak 	  | 482,259	                  |
| Elhuyar     	  | 1,176,529                 |
| HPLT        	  | 4,546,563                 |
| OpenSubtitles	  | 805,780                   |
| PaCO_2012    	  | 109,524                   |
| PaCO_2013    	  | 48,892                    |
| WikiMatrix   	  | 119,480                   |
| **Total**     	| **15,653,108**          |

The 9,033,998 sentence pairs of synthetic parallel data were created by translating a compendium of ES-EU parallel corpora into Basque using the [ES-EU translator from HiTZ](https://huggingface.co/HiTZ/mt-hitz-es-eu).


### Training Procedure

#### Preprocessing

After concatenation, all datasets are cleaned and deduplicated using [bifixer](https://github.com/bitextor/bifixer) [(Ramírez-Sánchez et al., 2020)](https://aclanthology.org/2020.eamt-1.31/), which identifies repetitions and fixes encoding problems, and LaBSE embeddings are used to filter out misaligned sentences. Any sentence pair with a LaBSE similarity score below 0.5 is removed. The filtered corpus is composed of 9,033,998 parallel sentences.
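
The similarity filter described above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' pipeline: it assumes the similarity score is the cosine similarity between the source and target sentence embeddings, and that those embeddings have already been computed (for example with a LaBSE encoder) and are passed in as plain lists of floats. The function names are hypothetical; only the 0.5 threshold comes from the text.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; returns 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def filter_pairs(
    pairs: List[Tuple[str, str]],
    src_vecs: List[Sequence[float]],
    tgt_vecs: List[Sequence[float]],
    threshold: float = 0.5,  # pairs scoring below this value are removed
) -> List[Tuple[str, str]]:
    """Keep only sentence pairs whose embedding similarity is at least `threshold`."""
    kept = []
    for pair, src_vec, tgt_vec in zip(pairs, src_vecs, tgt_vecs):
        if cosine_similarity(src_vec, tgt_vec) >= threshold:
            kept.append(pair)
    return kept
```

In practice the embeddings would be high-dimensional vectors produced in batches; the toy two-dimensional vectors used to exercise this sketch only stand in for them.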

#### Tokenization
All data is tokenized using SentencePiece, with a 32,000-token SentencePiece model learned from the combination of all filtered training data. The SentencePiece model is included in this repository.

## Evaluation
### Variables and metrics
We use the BLEU and TER scores for evaluation on the following test sets: [Flores-200](https://github.com/facebookresearch/flores/tree/main/flores200), [TaCon](https://elrc-share.eu/repository/browse/tacon-spanish-constitution-mt-test-set/84a96138b98611ec9c1a00155d02670628f3e6857b0f422abd82abc3795ec8c2/) and [NTREX](https://github.com/MicrosoftTranslator/NTREX).

### Evaluation results
Below are the evaluation results for Basque-to-English machine translation, compared to [Google Translate](https://translate.google.com/) and [NLLB 200 3.3B](https://huggingface.co/facebook/nllb-200-3.3B). The best score for each test set is shown in bold.

#### BLEU scores

| Test set             | Google Translate | NLLB 3.3B | mt-hitz-eu-en |
|----------------------|------------------|-----------|---------------|
| Flores 200 devtest   | **36.1**         | 32.2      | 28.6          |
| TaCon                | **22.8**         | 22.7      | 21.9          |
| NTREX                | **33.7**         | 28.9      | 25.8          |
| Average              | **30.9**         | 27.9      | 25.4          |

#### TER scores

| Test set             | Google Translate | NLLB 3.3B | mt-hitz-eu-en |
|----------------------|------------------|-----------|---------------|
| Flores 200 devtest   | **46.5**         | 51.2      | 53.1          |
| TaCon                | **57.0**         | 63.0      | 57.5          |
| NTREX                | **50.2**         | 55.5      | 58.2          |
| Average              | **51.2**         | 56.6      | 56.3          |
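
The "Average" rows appear to be unweighted means of the three test-set scores, rounded to one decimal place. That reading is an assumption, since the card does not state how the averages were computed, but the short check below reproduces the reported values under it. The per-system lists are copied from the tables above in the order Flores 200 devtest, TaCon, NTREX; `average_row` is a hypothetical helper name.

```python
from statistics import mean
from typing import Dict, List

# Scores per system, in the order: Flores 200 devtest, TaCon, NTREX
bleu: Dict[str, List[float]] = {
    "Google Translate": [36.1, 22.8, 33.7],
    "NLLB 3.3B": [32.2, 22.7, 28.9],
    "mt-hitz-eu-en": [28.6, 21.9, 25.8],
}
ter: Dict[str, List[float]] = {
    "Google Translate": [46.5, 57.0, 50.2],
    "NLLB 3.3B": [51.2, 63.0, 55.5],
    "mt-hitz-eu-en": [53.1, 57.5, 58.2],
}


def average_row(scores: Dict[str, List[float]]) -> Dict[str, float]:
    """Unweighted mean over test sets for each system, rounded to one decimal."""
    return {system: round(mean(values), 1) for system, values in scores.items()}


print("BLEU averages:", average_row(bleu))
print("TER averages:", average_row(ter))
```

Note that BLEU is higher-is-better while TER is lower-is-better, so the bold entries mark the highest BLEU and the lowest TER in each row.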


<!-- We have no paper for the moment. If something is done within ILENIA, this will need to be updated -->

<!--
## Citation [optional]

<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. - ->

**BibTeX:**

[More Information Needed]

**APA:**

[More Information Needed]
-->

## Additional information
### Author
HiTZ Research Center & IXA Research group (University of the Basque Country UPV/EHU)
### Contact information
For further information, send an email to <hitz@ehu.eus>.
### Licensing information
This work is licensed under the [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0).
### Funding
This work is funded by the Ministerio para la Transformación Digital y de la Función Pública - Funded by EU – NextGenerationEU within the framework of the [project ILENIA](https://proyectoilenia.es/) with references 2022/TL22/00215337, 2022/TL22/00215336, 2022/TL22/00215335 and 2022/TL22/00215334.
### Disclaimer
<details>
<summary>Click to expand</summary>
The models published in this repository are intended for a generalist purpose and are available to third parties. These models may have bias and/or any other undesirable distortions.
When third parties deploy or provide systems and/or services to other parties using any of these models (or using systems based on these models) or become users of the models, they should note that it is their responsibility to mitigate the risks arising from their use and, in any event, to comply with applicable regulations, including regulations regarding the use of Artificial Intelligence.
In no event shall the owner and creator of the models (HiTZ Research Center) be liable for any results arising from the use made by third parties of these models.
</details>