projecte-aina/aina-translator-ca-en

Fairseq

Catalan

English


aleixsant committed on May 15, 2024

Commit 5d192f8 (verified) · 1 parent: 036dd02

Update README.md

Files changed (1)

README.md +15 -17

@@ -9,13 +9,12 @@ metrics:
 - bleu
 library_name: fairseq
 ---
-## Aina Project's Catalan-English  machine translation model
 ## Model description
-This model was trained from scratch using the [Fairseq toolkit](https://fairseq.readthedocs.io/en/latest/) on a combination of Catalan-English datasets,
-up to 11  million sentences. Additionally, the model is evaluated on several public datasets comprising 5 different domains (general, adminstrative, technology,
-biomedical, and news).
 ## Intended uses and limitations
@@ -44,7 +43,6 @@ translator = ctranslate2.Translator(model_dir)
 translated = translator.translate_batch([tokenized[0]])
 print(tokenizer.detokenize(translated[0][0]['tokens']))
 ```
 ## Limitations and bias
 At the time of submission, no measures have been taken to estimate the bias and toxicity embedded in the model.
 However, we are well aware that our models may be biased. We intend to conduct research in these areas in the future, and if completed, this model card will be updated.
@@ -99,18 +97,18 @@ The following hyperparamenters were set on the Fairseq toolkit:
 | Dropout                            | 0.1                               |
 | Label smoothing                    | 0.1                               |
-The model was trained for a total of 35.000 updates. Weights were saved every 1000 updates and reported results are the average of the last 16 checkpoints.
 ## Evaluation
 ### Variable and metrics
 We use the BLEU score for evaluation on test sets:
 [Spanish Constitution (TaCon)](https://elrc-share.eu/repository/browse/tacon-spanish-constitution-mt-test-set/84a96138b98611ec9c1a00155d02670628f3e6857b0f422abd82abc3795ec8c2/),
 [United Nations](https://zenodo.org/record/3888414#.Y33-_tLMIW0),
 [European Commission](https://elrc-share.eu/repository/browse/european-commission-corpus/8a419b1758ea11ed9c1a00155d0267069bd085cae124481589b0858e5b274327/),
-[Flores-101](https://github.com/facebookresearch/flores),
 [Cybersecurity](https://elrc-share.eu/repository/browse/cyber-mt-test-set/2bd93faab98c11ec9c1a00155d026706b96a490ed3e140f0a29a80a08c46e91e/),
 [wmt19 biomedical test set](http://www.statmt.org/wmt19/biomedical-translation-task.html),
 [wmt13 news test set](https://elrc-share.eu/repository/browse/catalan-wmt2013-machine-translation-shared-task-test-set/84a96139b98611ec9c1a00155d0267061a0aa1b62e2248e89aab4952f3c230fc/).
@@ -123,15 +121,15 @@ Below are the evaluation results on the machine translation from Catalan to Engl
 | Test set             | SoftCatalà | Google Translate | aina-translator-ca-en |
 |----------------------|------------|------------------|---------------|
-| Spanish Constitution | 35,8       | **43,2**         | 40,3          |
-| United Nations       | 44,4       | **47,4**         | 44,8          |
-| European Commission  | 52,0       | **53,7**         | 53,1          |
-| Flores 101 dev       | 42,7       | **47,5**         | 46,1          |
-| Flores 101 devtest   | 42,5       | **46,9**         | 45,2          |
-| Cybersecurity        | 52,5       | **58,0**         | 54,2          |
-| wmt 19 biomedical    | 18,3       | **23,4**         | 21,6          |
-| wmt 13 news          | 37,8       | **39,8**         | 39,3          |
-| **Average**          | 40,8       | **45,0**         | 43,1          |
 ## Additional information

 - bleu
 library_name: fairseq
 ---
+## Projecte Aina's English-Catalan machine translation model
 ## Model description
+This model was trained from scratch using the [Fairseq toolkit](https://fairseq.readthedocs.io/en/latest/) on a combination of English-Catalan datasets,
+which after filtering and cleaning comprised 30.023.034 sentence pairs. The model was evaluated on several public datasets comprising different domains.
 ## Intended uses and limitations
 translated = translator.translate_batch([tokenized[0]])
 print(tokenizer.detokenize(translated[0][0]['tokens']))
 ```
 ## Limitations and bias
 At the time of submission, no measures have been taken to estimate the bias and toxicity embedded in the model.
 However, we are well aware that our models may be biased. We intend to conduct research in these areas in the future, and if completed, this model card will be updated.
 | Dropout                            | 0.1                               |
 | Label smoothing                    | 0.1                               |
+The model was trained for a total of 12.500 updates. Weights were saved every 1000 updates and reported results are the average of the last 6 checkpoints.
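The change from 35.000 updates with 16 averaged checkpoints to 12.500 updates with 6 averaged checkpoints can be made concrete with a small sketch. It assumes, purely for illustration, that a checkpoint is written only at exact multiples of the save interval and that no extra end-of-training checkpoint enters the average; the README does not state this. The function names are hypothetical and are not part of Fairseq.

```python
def saved_updates(total_updates, save_interval):
    """Update counts at which a checkpoint would exist (assumed schedule)."""
    return list(range(save_interval, total_updates + 1, save_interval))


def averaged_updates(total_updates, save_interval, last_n):
    """Update counts of the last `last_n` checkpoints under that schedule."""
    return saved_updates(total_updates, save_interval)[-last_n:]


# Values from the updated README line: 12.500 updates, last 6 checkpoints.
new_window = averaged_updates(12_500, 1_000, 6)
# Values from the removed README line: 35.000 updates, last 16 checkpoints.
old_window = averaged_updates(35_000, 1_000, 16)

print("new:", new_window[0], "to", new_window[-1])
print("old:", old_window[0], "to", old_window[-1])
```

Under these assumptions the averaged window covers roughly the second half of training in both versions of the README.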
 ## Evaluation
 ### Variable and metrics
 We use the BLEU score for evaluation on test sets:
 [Spanish Constitution (TaCon)](https://elrc-share.eu/repository/browse/tacon-spanish-constitution-mt-test-set/84a96138b98611ec9c1a00155d02670628f3e6857b0f422abd82abc3795ec8c2/),
 [United Nations](https://zenodo.org/record/3888414#.Y33-_tLMIW0),
+[AAPP](https://elrc-share.eu/repository/browse/catalan-spanish-catgov-corpus/8088130a722811ed9c1a00155d02670690607f8261a847549c8a0583cbe729da/),
 [European Commission](https://elrc-share.eu/repository/browse/european-commission-corpus/8a419b1758ea11ed9c1a00155d0267069bd085cae124481589b0858e5b274327/),
+[Flores-200](https://github.com/facebookresearch/flores),
 [Cybersecurity](https://elrc-share.eu/repository/browse/cyber-mt-test-set/2bd93faab98c11ec9c1a00155d026706b96a490ed3e140f0a29a80a08c46e91e/),
 [wmt19 biomedical test set](http://www.statmt.org/wmt19/biomedical-translation-task.html),
 [wmt13 news test set](https://elrc-share.eu/repository/browse/catalan-wmt2013-machine-translation-shared-task-test-set/84a96139b98611ec9c1a00155d0267061a0aa1b62e2248e89aab4952f3c230fc/).
 | Test set             | SoftCatalà | Google Translate | aina-translator-ca-en |
 |----------------------|------------|------------------|---------------|
+| Spanish Constitution | 35,8       | 39,1         | 42,8          |
+| United Nations       | 44,4       | 46,9         | 45,9          |
+| European Commission  | 52,0       | 53,7         | 54            |
+| Flores 200 dev       | 42,7       | 52,0         | 47,9          |
+| Flores 200 devtest   | 42,5       | 50,7         | 46,3          |
+| Cybersecurity        | 52,5       | 66,8         | 56,8          |
+| wmt 19 biomedical    | 18,3       | 24,4         | 25,2          |
+| wmt 13 news          | 37,8       | 42,5         | 39,4          |
+| **Average**          | 40,8       | 47,0         | 44,8          |
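As a sanity check on the updated table, the sketch below recomputes the per-system averages from the eight rows added in this commit. The scores are copied from the table with their decimal commas; the helper names `parse_score` and `column_averages` are illustrative and do not come from the repository.

```python
# Scores copied from the rows added in this commit (decimal commas kept as-is).
SYSTEMS = ("SoftCatalà", "Google Translate", "aina-translator-ca-en")
ROWS = {
    "Spanish Constitution": ("35,8", "39,1", "42,8"),
    "United Nations": ("44,4", "46,9", "45,9"),
    "European Commission": ("52,0", "53,7", "54"),
    "Flores 200 dev": ("42,7", "52,0", "47,9"),
    "Flores 200 devtest": ("42,5", "50,7", "46,3"),
    "Cybersecurity": ("52,5", "66,8", "56,8"),
    "wmt 19 biomedical": ("18,3", "24,4", "25,2"),
    "wmt 13 news": ("37,8", "42,5", "39,4"),
}


def parse_score(text):
    """Convert a score written with a decimal comma into a float."""
    return float(text.replace(",", "."))


def column_averages(rows):
    """Return the mean score of each column over all test sets."""
    totals = [0.0] * len(SYSTEMS)
    for scores in rows.values():
        for i, score in enumerate(scores):
            totals[i] += parse_score(score)
    return [total / len(rows) for total in totals]


averages = dict(zip(SYSTEMS, column_averages(ROWS)))
for system, average in averages.items():
    print(f"{system}: {average:.2f}")
```

The recomputed means agree with the reported **Average** row once rounded to one decimal place. The AAPP test set listed under "Variable and metrics" has no row in the table, so it is not part of this check.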
 ## Additional information