Update README.md
README.md
metrics:
- bleu
library_name: fairseq
---

## Projecte Aina's Catalan-English machine translation model

## Model description

This model was trained from scratch using the [Fairseq toolkit](https://fairseq.readthedocs.io/en/latest/) on a combination of English-Catalan datasets, which after filtering and cleaning comprised 30,023,034 sentence pairs. The model was evaluated on several public datasets comprising different domains.

## Intended uses and limitations
```python
translator = ctranslate2.Translator(model_dir)
translated = translator.translate_batch([tokenized[0]])
print(tokenizer.detokenize(translated[0][0]['tokens']))
```
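The `detokenize` call above reverses subword segmentation: SentencePiece-style tokenizers mark word starts with the `▁` (U+2581) character, so detokenization is essentially a join-and-replace. As a rough illustration only (not the card's actual tokenizer), a minimal sketch of that reversal:

```python
def detokenize(pieces):
    """Join SentencePiece-style pieces; '\u2581' marks the start of a new word."""
    text = "".join(pieces).replace("\u2581", " ")
    return text.strip()

# Toy pieces as a SentencePiece model might emit them.
pieces = ["\u2581Hello", ",", "\u2581how", "\u2581are", "\u2581you", "?"]
print(detokenize(pieces))  # Hello, how are you?
```

Real pipelines should call the tokenizer's own `detokenize`, which also handles special tokens and encoding edge cases.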
## Limitations and bias

At the time of submission, no measures have been taken to estimate the bias and toxicity embedded in the model. However, we are well aware that our models may be biased. We intend to conduct research in these areas in the future, and if completed, this model card will be updated.
The following hyperparameters were set on the Fairseq toolkit:

| Dropout         | 0.1 |
| Label smoothing | 0.1 |

The model was trained for a total of 12,500 updates. Weights were saved every 1,000 updates, and the reported results are the average of the last 6 checkpoints.
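Checkpoint averaging of the kind described above (Fairseq ships a script for this) amounts to an element-wise mean of the saved weight tensors. A minimal sketch, with plain Python lists standing in for tensors:

```python
def average_checkpoints(checkpoints):
    """Element-wise mean of parameter values across checkpoints.

    Each checkpoint is a dict mapping parameter name -> list of floats.
    """
    n = len(checkpoints)
    return {
        name: [sum(ckpt[name][i] for ckpt in checkpoints) / n
               for i in range(len(checkpoints[0][name]))]
        for name in checkpoints[0]
    }

# Twelve toy checkpoints; average the last 6, mirroring the setup above.
ckpts = [{"w": [float(step), 2.0]} for step in range(1, 13)]
print(average_checkpoints(ckpts[-6:]))  # {'w': [9.5, 2.0]}
```

Averaging the final checkpoints smooths out step-to-step noise in the weights and typically gives a small, cheap BLEU improvement over the last single checkpoint.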
## Evaluation

### Variables and metrics

We use the BLEU score for evaluation on the following test sets:
[Spanish Constitution (TaCon)](https://elrc-share.eu/repository/browse/tacon-spanish-constitution-mt-test-set/84a96138b98611ec9c1a00155d02670628f3e6857b0f422abd82abc3795ec8c2/),
[United Nations](https://zenodo.org/record/3888414#.Y33-_tLMIW0),
[AAPP](https://elrc-share.eu/repository/browse/catalan-spanish-catgov-corpus/8088130a722811ed9c1a00155d02670690607f8261a847549c8a0583cbe729da/),
[European Commission](https://elrc-share.eu/repository/browse/european-commission-corpus/8a419b1758ea11ed9c1a00155d0267069bd085cae124481589b0858e5b274327/),
[Flores-200](https://github.com/facebookresearch/flores),
[Cybersecurity](https://elrc-share.eu/repository/browse/cyber-mt-test-set/2bd93faab98c11ec9c1a00155d026706b96a490ed3e140f0a29a80a08c46e91e/),
[wmt19 biomedical test set](http://www.statmt.org/wmt19/biomedical-translation-task.html),
[wmt13 news test set](https://elrc-share.eu/repository/browse/catalan-wmt2013-machine-translation-shared-task-test-set/84a96139b98611ec9c1a00155d0267061a0aa1b62e2248e89aab4952f3c230fc/).
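BLEU combines clipped n-gram precisions with a brevity penalty. Real evaluations should use a standard implementation such as sacreBLEU, which fixes the tokenization and so makes scores comparable across systems; purely as an illustration of the metric, a minimal sentence-level sketch:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU: geometric mean of clipped n-gram
    precisions (n = 1..max_n) times a brevity penalty."""
    log_precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        overlap = sum((cand & ref).values())  # counts clipped by the reference
        if overlap == 0:
            return 0.0
        log_precisions.append(math.log(overlap / sum(cand.values())))
    # Brevity penalty: penalize candidates shorter than the reference.
    bp = min(1.0, math.exp(1 - len(reference) / len(candidate)))
    return bp * math.exp(sum(log_precisions) / max_n)

reference = "the cat is on the mat".split()
print(bleu(reference, reference))  # 1.0 for a perfect match
```

Note this toy version uses a single reference and no smoothing; corpus-level BLEU, as reported in the table below, aggregates n-gram counts over the whole test set before taking precisions.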
Below are the evaluation results on the machine translation from Catalan to English.

| Test set             | SoftCatalà | Google Translate | aina-translator-ca-en |
|----------------------|------------|------------------|-----------------------|
| Spanish Constitution | 35.8       | 39.1             | 42.8                  |
| United Nations       | 44.4       | 46.9             | 45.9                  |
| European Commission  | 52.0       | 53.7             | 54.0                  |
| Flores 200 dev       | 42.7       | 52.0             | 47.9                  |
| Flores 200 devtest   | 42.5       | 50.7             | 46.3                  |
| Cybersecurity        | 52.5       | 66.8             | 56.8                  |
| wmt 19 biomedical    | 18.3       | 24.4             | 25.2                  |
| wmt 13 news          | 37.8       | 42.5             | 39.4                  |
| **Average**          | 40.8       | 47.0             | 44.8                  |
## Additional information