Fairseq
Catalan
English
carlosep93 committed
Commit d408e65
1 Parent(s): edd2fde

Update README.md

Files changed (1)
  1. README.md +53 -54
README.md CHANGED
@@ -62,30 +62,30 @@ print(tokenizer.detokenize(translated[0][0]['tokens']))
 
  The model was trained on a combination of the following datasets:
 
- | Dataset | Sentences | Tokens |
- |--------------------|----------------|-------------------|
- | Global Voices | 21.342 | 438.032 |
- | Memories Lluires | 1.173.055 | 9.452.382 |
- | Wikimatrix | 1.205.908 | 28.111.517 |
- | TED Talks | 50.979 | 770.774 |
- | Tatoeba | 5.500 | 34.872 |
- | CoVost 2 ca-en | 79.633 | 809.660 |
- | CoVost 2 en-ca | 263.891 | 2.953.096 |
- | Europarl | 1.965.734 | 50.417.289 |
- | jw300 | 97.081 | 1.809.252 |
- | Crawled Generalitat| 38.595 | 858.385 |
- | Opus Books | 4.580 | 73.416 |
- | CC Aligned | 5.787.682 | 89.606.874 |
- | COVID_Wikipedia | 1.531 | 34.836 |
- | EuroBooks | 3.746 | 82.067 |
- | Gnome | 2.183 | 30.228 |
- | KDE 4 | 144.153 | 1.450.631 |
- | OpenSubtitles | 427.913 | 2.796.350 |
- | QED | 69.823 | 1.058.003 |
- | Ubuntu | 6.781 | 33.321 |
- | Wikimedia | 208.073 | 5.761.409 |
- |--------------------|----------------|-------------------|
- | **Total** | **11.558.183** | **196.582.394** |
 
  ### Training procedure
 
@@ -103,26 +103,26 @@ The model was trained on a combination of the following datasets:
  The model is based on the Transformer-XLarge proposed by [Subramanian et al.](https://aclanthology.org/2021.wmt-1.18.pdf)
  The following hyperparameters were set in the Fairseq toolkit:
 
- | Hyperparameter | Value |
- |------------------------------------|----------------------------------|
- | Architecture | transformer_vaswani_wmt_en_de_bi |
- | Embedding size | 1024 |
- | Feedforward size | 4096 |
- | Number of heads | 16 |
- | Encoder layers | 24 |
- | Decoder layers | 6 |
- | Normalize before attention | True |
- | --share-decoder-input-output-embed | True |
- | --share-all-embeddings | True |
- | Effective batch size | 96.000 |
- | Optimizer | adam |
- | Adam betas | (0.9, 0.980) |
- | Clip norm | 0.0 |
- | Learning rate | 1e-3 |
- | LR scheduler | inverse sqrt |
- | Warmup updates | 4000 |
- | Dropout | 0.1 |
- | Label smoothing | 0.1 |
 
  The model was trained for a total of 35.000 updates. Weights were saved every 1000 updates, and the reported results are the average of the last 16 checkpoints.
 
@@ -139,17 +139,16 @@ Below are the evaluation results on the machine translation from Catalan to English
 
  | Test set | SoftCatalà | Google Translate | mt-aina-ca-en |
  |----------------------|------------|------------------|---------------|
- | Spanish Constitution | | 43,2 | 40,3 |
- | United Nations | | 47,4 | 44,8 |
- | aina_aapp | | 53 | 51,5 |
- | aina_eu_comission | | | |
- | Flores 101 dev | | 47,5 | 46,1 |
- | Flores 101 devtest | | 46,9 | 45,2 |
- | Cybersecurity | | 58 | 54,2 |
- | wmt 19 biomedical | | 23,4 | 21,6 |
- | wmt 13 news | | 39,8 | 39,3 |
  |----------------------|------------|------------------|---------------|
- | Average | | | |
 
 
  ## Additional information
 
 
  The model was trained on a combination of the following datasets:
 
+ | Dataset | Sentences |
+ |--------------------|----------------|
+ | Global Voices | 21.342 |
+ | Memories Lluires | 1.173.055 |
+ | Wikimatrix | 1.205.908 |
+ | TED Talks | 50.979 |
+ | Tatoeba | 5.500 |
+ | CoVost 2 ca-en | 79.633 |
+ | CoVost 2 en-ca | 263.891 |
+ | Europarl | 1.965.734 |
+ | jw300 | 97.081 |
+ | Crawled Generalitat| 38.595 |
+ | Opus Books | 4.580 |
+ | CC Aligned | 5.787.682 |
+ | COVID_Wikipedia | 1.531 |
+ | EuroBooks | 3.746 |
+ | Gnome | 2.183 |
+ | KDE 4 | 144.153 |
+ | OpenSubtitles | 427.913 |
+ | QED | 69.823 |
+ | Ubuntu | 6.781 |
+ | Wikimedia | 208.073 |
+ |--------------------|----------------|
+ | **Total** | **11.558.183** |
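
As a quick arithmetic check on the table above, the per-dataset sentence counts (dots are thousands separators) do add up to the stated total. A minimal sketch in Python, with the counts copied from the table:

```python
# Sanity check: the per-dataset sentence counts from the table above
# (dots are thousands separators) should sum to the stated total.
counts = [
    "21.342", "1.173.055", "1.205.908", "50.979", "5.500",  # Global Voices .. Tatoeba
    "79.633", "263.891", "1.965.734", "97.081", "38.595",   # CoVost 2 ca-en .. Crawled Generalitat
    "4.580", "5.787.682", "1.531", "3.746", "2.183",        # Opus Books .. Gnome
    "144.153", "427.913", "69.823", "6.781", "208.073",     # KDE 4 .. Wikimedia
]

total = sum(int(c.replace(".", "")) for c in counts)
assert total == 11_558_183
print(f"Total sentences: {total:,}")  # Total sentences: 11,558,183
```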
 
  ### Training procedure
 
 
  The model is based on the Transformer-XLarge proposed by [Subramanian et al.](https://aclanthology.org/2021.wmt-1.18.pdf)
  The following hyperparameters were set in the Fairseq toolkit:
 
+ | Hyperparameter | Value |
+ |------------------------------------|-----------------------------------|
+ | Architecture | transformer_vaswani_wmt_en_de_big |
+ | Embedding size | 1024 |
+ | Feedforward size | 4096 |
+ | Number of heads | 16 |
+ | Encoder layers | 24 |
+ | Decoder layers | 6 |
+ | Normalize before attention | True |
+ | --share-decoder-input-output-embed | True |
+ | --share-all-embeddings | True |
+ | Effective batch size | 96.000 |
+ | Optimizer | adam |
+ | Adam betas | (0.9, 0.980) |
+ | Clip norm | 0.0 |
+ | Learning rate | 1e-3 |
+ | LR scheduler | inverse sqrt |
+ | Warmup updates | 4000 |
+ | Dropout | 0.1 |
+ | Label smoothing | 0.1 |
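
How the learning rate, the inverse sqrt scheduler and the warmup updates interact is not obvious from the table alone. Below is a minimal sketch of an inverse-square-root schedule with linear warmup, in the style of Fairseq's `inverse_sqrt` scheduler; the warmup initial learning rate is an assumption, since the card does not state it:

```python
import math

def inverse_sqrt_lr(step: int,
                    peak_lr: float = 1e-3,       # "Learning rate" in the table
                    warmup_updates: int = 4000,  # "Warmup updates" in the table
                    warmup_init_lr: float = 0.0  # assumption: not stated in the card
                    ) -> float:
    """Inverse-square-root LR schedule with linear warmup (sketch)."""
    if step < warmup_updates:
        # Linear warmup from warmup_init_lr up to the peak learning rate.
        return warmup_init_lr + step * (peak_lr - warmup_init_lr) / warmup_updates
    # After warmup the LR decays proportionally to 1/sqrt(step); the factor is
    # chosen so the curve is continuous at step == warmup_updates.
    return peak_lr * math.sqrt(warmup_updates / step)

for step in (1_000, 4_000, 16_000, 35_000):
    print(f"update {step:>6}: lr = {inverse_sqrt_lr(step):.2e}")
```

Under these settings the learning rate peaks at 1e-3 after the 4000 warmup updates and then decays as 1/sqrt(step), ending at roughly 3.4e-4 by the final update 35.000.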
 
  The model was trained for a total of 35.000 updates. Weights were saved every 1000 updates, and the reported results are the average of the last 16 checkpoints.
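
Checkpoint averaging takes the element-wise mean of the saved parameter tensors; Fairseq ships `scripts/average_checkpoints.py` for this. A generic sketch of the same idea with PyTorch, where the checkpoint file names are hypothetical:

```python
import torch

def average_checkpoints(paths):
    """Element-wise average of the model weights stored in several checkpoints."""
    avg = None
    for path in paths:
        # Fairseq checkpoints keep the weights under the "model" key; a plain
        # torch.save'd state dict could be used directly instead.
        state = torch.load(path, map_location="cpu")["model"]
        if avg is None:
            avg = {name: tensor.clone().float() for name, tensor in state.items()}
        else:
            for name, tensor in state.items():
                avg[name] += tensor.float()
    return {name: tensor / len(paths) for name, tensor in avg.items()}

# Hypothetical file names: the last 16 checkpoints, one per 1000 updates.
paths = [f"checkpoints/checkpoint_{step}.pt" for step in range(20_000, 36_000, 1_000)]
# averaged = average_checkpoints(paths)
# torch.save({"model": averaged}, "checkpoints/checkpoint.avg16.pt")
```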
 
 
  | Test set | SoftCatalà | Google Translate | mt-aina-ca-en |
  |----------------------|------------|------------------|---------------|
+ | Spanish Constitution | 35,8 | 43,2 | 40,3 |
+ | United Nations | 44,4 | 47,4 | 44,8 |
+ | aina_aapp | 48,8 | 53 | 51,5 |
+ | Flores 101 dev | 42,7 | 47,5 | 46,1 |
+ | Flores 101 devtest | 42,5 | 46,9 | 45,2 |
+ | Cybersecurity | 52,5 | 58 | 54,2 |
+ | wmt 19 biomedical | 18,3 | 23,4 | 21,6 |
+ | wmt 13 news | 37,8 | 39,8 | 39,3 |
  |----------------------|------------|------------------|---------------|
+ | Average | 39,2 | 45,0 | 41,6 |
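
The card does not name the metric in this section, but MT test-set scores in this range are typically BLEU. Assuming that, below is a minimal sketch of how a system output is usually scored with sacreBLEU; the file names are hypothetical, with detokenized text and one sentence per line:

```python
import sacrebleu

# Hypothetical file names: detokenized system output and reference translations.
with open("test.hyp.en", encoding="utf-8") as f:
    hypotheses = [line.rstrip("\n") for line in f]
with open("test.ref.en", encoding="utf-8") as f:
    references = [line.rstrip("\n") for line in f]

# corpus_bleu takes the hypotheses and a list of reference streams.
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU = {bleu.score:.1f}")
```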
 
 
  ## Additional information