Fairseq
Catalan
English
aleixsant committed 5d192f8 (verified) · 1 parent: 036dd02

Update README.md

  1. README.md +15 -17
README.md CHANGED
@@ -9,13 +9,12 @@ metrics:
9
  - bleu
10
  library_name: fairseq
11
  ---
12
- ## Aina Project's Catalan-English machine translation model
13
 
14
  ## Model description
15
 
16
- This model was trained from scratch using the [Fairseq toolkit](https://fairseq.readthedocs.io/en/latest/) on a combination of Catalan-English datasets,
17
- up to 11 million sentences. Additionally, the model is evaluated on several public datasets comprising 5 different domains (general, adminstrative, technology,
18
- biomedical, and news).
19
 
20
  ## Intended uses and limitations
21
 
@@ -44,7 +43,6 @@ translator = ctranslate2.Translator(model_dir)
44
  translated = translator.translate_batch([tokenized[0]])
45
  print(tokenizer.detokenize(translated[0][0]['tokens']))
46
  ```
47
-
48
  ## Limitations and bias
49
  At the time of submission, no measures have been taken to estimate the bias and toxicity embedded in the model.
50
  However, we are well aware that our models may be biased. We intend to conduct research in these areas in the future, and if completed, this model card will be updated.
@@ -99,18 +97,18 @@ The following hyperparamenters were set on the Fairseq toolkit:
99
  | Dropout | 0.1 |
100
  | Label smoothing | 0.1 |
101
 
102
- The model was trained for a total of 35.000 updates. Weights were saved every 1000 updates and reported results are the average of the last 16 checkpoints.
103
 
104
  ## Evaluation
105
 
106
  ### Variable and metrics
107
 
108
  We use the BLEU score for evaluation on test sets:
109
-
110
  [Spanish Constitution (TaCon)](https://elrc-share.eu/repository/browse/tacon-spanish-constitution-mt-test-set/84a96138b98611ec9c1a00155d02670628f3e6857b0f422abd82abc3795ec8c2/),
111
  [United Nations](https://zenodo.org/record/3888414#.Y33-_tLMIW0),
 
112
  [European Commission](https://elrc-share.eu/repository/browse/european-commission-corpus/8a419b1758ea11ed9c1a00155d0267069bd085cae124481589b0858e5b274327/),
113
- [Flores-101](https://github.com/facebookresearch/flores),
114
  [Cybersecurity](https://elrc-share.eu/repository/browse/cyber-mt-test-set/2bd93faab98c11ec9c1a00155d026706b96a490ed3e140f0a29a80a08c46e91e/),
115
  [wmt19 biomedical test set](http://www.statmt.org/wmt19/biomedical-translation-task.html),
116
  [wmt13 news test set](https://elrc-share.eu/repository/browse/catalan-wmt2013-machine-translation-shared-task-test-set/84a96139b98611ec9c1a00155d0267061a0aa1b62e2248e89aab4952f3c230fc/).
@@ -123,15 +121,15 @@ Below are the evaluation results on the machine translation from Catalan to Engl
123
 
124
  | Test set | SoftCatalà | Google Translate | aina-translator-ca-en |
125
  |----------------------|------------|------------------|---------------|
126
- | Spanish Constitution | 35,8 | **43,2** | 40,3 |
127
- | United Nations | 44,4 | **47,4** | 44,8 |
128
- | European Commission | 52,0 | **53,7** | 53,1 |
129
- | Flores 101 dev | 42,7 | **47,5** | 46,1 |
130
- | Flores 101 devtest | 42,5 | **46,9** | 45,2 |
131
- | Cybersecurity | 52,5 | **58,0** | 54,2 |
132
- | wmt 19 biomedical | 18,3 | **23,4** | 21,6 |
133
- | wmt 13 news | 37,8 | **39,8** | 39,3 |
134
- | **Average** | 40,8 | **45,0** | 43,1 |
135
 
136
  ## Additional information
137
 
 
9
  - bleu
10
  library_name: fairseq
11
  ---
12
+ ## Projecte Aina's English-Catalan machine translation model
13
 
14
  ## Model description
15
 
16
+ This model was trained from scratch using the [Fairseq toolkit](https://fairseq.readthedocs.io/en/latest/) on a combination of English-Catalan datasets,
17
+ which after filtering and cleaning comprised 30.023.034 sentence pairs. The model was evaluated on several public datasets comprising different domains.
 
18
 
19
  ## Intended uses and limitations
20
 
 
43
  translated = translator.translate_batch([tokenized[0]])
44
  print(tokenizer.detokenize(translated[0][0]['tokens']))
45
  ```
 
46
  ## Limitations and bias
47
  At the time of submission, no measures have been taken to estimate the bias and toxicity embedded in the model.
48
  However, we are well aware that our models may be biased. We intend to conduct research in these areas in the future, and if completed, this model card will be updated.
 
97
  | Dropout | 0.1 |
98
  | Label smoothing | 0.1 |
99
 
100
+ The model was trained for a total of 12.500 updates. Weights were saved every 1000 updates and reported results are the average of the last 6 checkpoints.
101
 
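The checkpoint averaging described in the line above can be sketched in plain Python. This is a minimal illustration, not the authors' procedure: checkpoints are modelled as dictionaries mapping a parameter name to a flat list of floats, and `last_n_update_steps` and `average_checkpoints` are hypothetical helper names. It also assumes that checkpoints are written only at exact multiples of 1000 updates, so that a 12.500-update run yields checkpoints at updates 1000 through 12000; the README does not state this.

```python
from typing import Dict, List

# Hypothetical stand-in for a model state dict: parameter name -> flat weights.
Checkpoint = Dict[str, List[float]]


def last_n_update_steps(total_updates: int, save_interval: int, n: int) -> List[int]:
    """Return the update steps of the last n checkpoints.

    Assumption: a checkpoint is saved only at exact multiples of save_interval.
    """
    steps = list(range(save_interval, total_updates + 1, save_interval))
    return steps[-n:]


def average_checkpoints(checkpoints: List[Checkpoint]) -> Checkpoint:
    """Average every parameter element-wise across the given checkpoints."""
    if not checkpoints:
        raise ValueError("need at least one checkpoint")
    averaged: Checkpoint = {}
    for name in checkpoints[0]:
        # Group the i-th value of this parameter from every checkpoint together.
        columns = zip(*(ckpt[name] for ckpt in checkpoints))
        averaged[name] = [sum(col) / len(checkpoints) for col in columns]
    return averaged


steps = last_n_update_steps(total_updates=12_500, save_interval=1_000, n=6)
print(steps)

# Toy checkpoints whose single parameter "w" depends on the update step.
toy = [{"w": [float(s), 2.0 * s]} for s in steps]
print(average_checkpoints(toy))
```

Under these assumptions the last 6 checkpoints are those at updates 7000 through 12000. For real Fairseq checkpoints, the toolkit ships its own checkpoint-averaging script, which would be the natural tool to use instead of this sketch.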
102
  ## Evaluation
103
 
104
  ### Variable and metrics
105
 
106
  We use the BLEU score for evaluation on test sets:
 
107
  [Spanish Constitution (TaCon)](https://elrc-share.eu/repository/browse/tacon-spanish-constitution-mt-test-set/84a96138b98611ec9c1a00155d02670628f3e6857b0f422abd82abc3795ec8c2/),
108
  [United Nations](https://zenodo.org/record/3888414#.Y33-_tLMIW0),
109
+ [AAPP](https://elrc-share.eu/repository/browse/catalan-spanish-catgov-corpus/8088130a722811ed9c1a00155d02670690607f8261a847549c8a0583cbe729da/),
110
  [European Commission](https://elrc-share.eu/repository/browse/european-commission-corpus/8a419b1758ea11ed9c1a00155d0267069bd085cae124481589b0858e5b274327/),
111
+ [Flores-200](https://github.com/facebookresearch/flores),
112
  [Cybersecurity](https://elrc-share.eu/repository/browse/cyber-mt-test-set/2bd93faab98c11ec9c1a00155d026706b96a490ed3e140f0a29a80a08c46e91e/),
113
  [wmt19 biomedical test set](http://www.statmt.org/wmt19/biomedical-translation-task.html),
114
  [wmt13 news test set](https://elrc-share.eu/repository/browse/catalan-wmt2013-machine-translation-shared-task-test-set/84a96139b98611ec9c1a00155d0267061a0aa1b62e2248e89aab4952f3c230fc/).
 
121
 
122
  | Test set | SoftCatalà | Google Translate | aina-translator-ca-en |
123
  |----------------------|------------|------------------|---------------|
124
+ | Spanish Constitution | 35,8 | 39,1 | 42,8 |
125
+ | United Nations | 44,4 | 46,9 | 45,9 |
126
+ | European Commission | 52,0 | 53,7 | 54 |
127
+ | Flores 200 dev | 42,7 | 52,0 | 47,9 |
128
+ | Flores 200 devtest | 42,5 | 50,7 | 46,3 |
129
+ | Cybersecurity | 52,5 | 66,8 | 56,8 |
130
+ | wmt 19 biomedical | 18,3 | 24,4 | 25,2 |
131
+ | wmt 13 news | 37,8 | 42,5 | 39,4 |
132
+ | **Average** | 40,8 | 47,0 | 44,8 |
133
 
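The **Average** row of the table above can be checked with a short script. This sketch assumes the row is the unweighted arithmetic mean of the eight test-set rows, rounded half-up to one decimal; the README does not say how it was computed. The scores are copied from the new side of the table, and `parse_score` and `column_averages` are hypothetical helper names.

```python
from decimal import Decimal, ROUND_HALF_UP
from typing import Dict, List, Tuple

# BLEU scores from the table: (SoftCatalà, Google Translate, aina-translator-ca-en).
# Values keep the decimal commas used in the README.
ROWS: Dict[str, Tuple[str, str, str]] = {
    "Spanish Constitution": ("35,8", "39,1", "42,8"),
    "United Nations": ("44,4", "46,9", "45,9"),
    "European Commission": ("52,0", "53,7", "54"),
    "Flores 200 dev": ("42,7", "52,0", "47,9"),
    "Flores 200 devtest": ("42,5", "50,7", "46,3"),
    "Cybersecurity": ("52,5", "66,8", "56,8"),
    "wmt 19 biomedical": ("18,3", "24,4", "25,2"),
    "wmt 13 news": ("37,8", "42,5", "39,4"),
}


def parse_score(text: str) -> Decimal:
    """Convert a decimal-comma score such as '35,8' to an exact Decimal."""
    return Decimal(text.replace(",", "."))


def column_averages(rows: Dict[str, Tuple[str, ...]]) -> List[Decimal]:
    """Return the mean of each column, rounded half-up to one decimal place."""
    averages = []
    for column in zip(*rows.values()):
        scores = [parse_score(cell) for cell in column]
        mean = sum(scores) / len(scores)
        averages.append(mean.quantize(Decimal("0.1"), rounding=ROUND_HALF_UP))
    return averages


print(column_averages(ROWS))
```

With these inputs the three column means come out as 40.8, 47.0 and 44.8, matching the **Average** row. `Decimal` is used instead of `float` so that a value such as 40.75 rounds deterministically.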
134
  ## Additional information
135