Pablo committed on
Commit 77d58c4
2 Parent(s): dd7ee73 3bf5e63

Merge branch 'main' of https://huggingface.co/bertin-project/bertin-roberta-base-spanish into main

Files changed (1)
  1. README.md +100 -20

README.md CHANGED

We then used the same setup as Liu et al. (2019) but trained only for half the steps (250k) on a sequence length of 128. In particular, `Gaussian` trained for the full 250k steps, while `Random` was stopped at 230k and `Stepwise` at 180k (a decision based on an analysis of training performance and the computational resources available at the time).

Then, we continued training the most promising models for ~25k more steps on sequence length 512. We tried two strategies for this, since it is not easy to find clear details about this change in the literature. It turns out this decision had a big impact on the final performance.

For `Random` sampling we trained with sequence length 512 during the last 20k of the 250k training steps, keeping the optimizer state intact. Results for this are underwhelming, as seen in Figure 7:

<caption>Figure 7. Training profile for Random sampling. Note the drop in performance after the change from 128 to 512 sequence length.</caption>
</figure>

For `Gaussian` sampling we started a new optimizer after 230k steps with sequence length 128, using a short warmup interval. Results are much better using this procedure. We do not have a graph since training needed to be restarted several times; however, the final accuracy was 0.6873, compared to 0.5907 for `Random` (512), a difference much larger than that between their respective 128-sequence-length models (0.6520 for `Random`, 0.6608 for `Gaussian`).
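The "new optimizer with a short warmup interval" amounts to resetting the learning-rate schedule before the 512-sequence-length phase. A minimal sketch of such a schedule as a plain function (the actual training used a Flax/JAX setup; the peak rate and step counts below are illustrative assumptions, not the values used):

```python
# Hypothetical linear warmup + linear decay schedule for the restarted
# optimizer. All numbers here are illustrative defaults, not the real config.
def learning_rate(step, peak_lr=6e-4, warmup_steps=500, total_steps=20_000):
    """Learning rate at a given step: linear warmup to peak_lr, then linear decay to 0."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = total_steps - warmup_steps
    return peak_lr * max(0.0, (total_steps - step) / remaining)
```

A short warmup like this lets the fresh optimizer state adapt gently to the longer sequences instead of taking full-size steps from the start.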
 
## Results

Our first test, tagged `beta` in this repository, refers to an initial experiment using `Stepwise` on sequence length 128, trained for 210k steps. Two nearly identical versions of this model can be found, one at **bertin-roberta-base-spanish** and the other at **flax-community/bertin-roberta-large-spanish** (do note this is **not our best model**!). During the community event, the Barcelona Supercomputing Center (BSC), in association with the National Library of Spain, released RoBERTa base and large models trained on 200M documents (570GB) of high-quality data, cleaned using 100 nodes with 48 CPU cores of MareNostrum 4 for 96 hours. At the end of the process they were left with 2TB of clean data at the document level, which was further cleaned up to the final 570GB. This is an interesting contrast to our own resources (3x TPUv3-8 for 10 days to do cleaning, sampling, training, and evaluation) and makes for a valuable reference. The BSC team evaluated our early release of the model `beta`, and the results can be seen in Table 1.

Our final models were trained on a different number of steps and sequence lengths and achieve different (higher) masked-word prediction accuracies. Despite these limitations it is interesting to see the results they obtained using the early version of our model. Note that some of the datasets used for evaluation by BSC are not freely available, so it is not possible to verify the figures.

| XNLI | Accuracy | 0.8016 | WiP | 0.8130 | 0.7876 | WiP |


<caption>Table 1. Evaluation made by the Barcelona Supercomputing Center of their models and BERTIN (beta, seq len 128).</caption>
</figure>

All of our models attained good accuracy values, in the range of 0.65, as can be seen in Table 2:

<figure>

| Model                                             | Accuracy   |
|---------------------------------------------------|------------|
| bertin-project/bertin-roberta-base-spanish        | 0.6547     |
| bertin-project/bertin-base-random                 | 0.6520     |
| bertin-project/bertin-base-stepwise               | 0.6487     |
| bertin-project/bertin-base-gaussian               | 0.6608     |
| bertin-project/bertin-base-random-exp-512seqlen   | 0.5907     |
| bertin-project/bertin-base-gaussian-exp-512seqlen | **0.6873** |

<caption>Table 2. Masked-word prediction accuracy for the different language models.</caption>
</figure>
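The accuracies in Table 2 measure how often the model's top prediction for a masked position recovers the original token. A minimal sketch of that bookkeeping (a hypothetical helper, not the project's evaluation code; the model-inference step that produces the predictions is omitted):

```python
# Hypothetical helper: fraction of masked positions whose top prediction
# equals the original token. Inputs are the per-position top predictions
# and the gold tokens that were masked out.
def masked_accuracy(predicted_tokens, original_tokens):
    assert len(predicted_tokens) == len(original_tokens)
    hits = sum(p == o for p, o in zip(predicted_tokens, original_tokens))
    return hits / len(original_tokens)
```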

We are currently in the process of applying our language models to downstream tasks.

**SQUAD-es**
Using sequence length 128 we have achieved an exact match of 50.96 and an F1 of 68.74.
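Exact match and F1 here are the standard SQuAD-style metrics: EM requires the normalized prediction to equal the normalized gold answer, while F1 measures token overlap between the two. A hypothetical re-implementation for illustration (the reported numbers come from the usual evaluation script, which also handles details like article stripping):

```python
import string
from collections import Counter

def normalize(text):
    """Lowercase, strip punctuation, and collapse whitespace."""
    text = "".join(ch for ch in text.lower() if ch not in string.punctuation)
    return " ".join(text.split())

def exact_match(prediction, gold):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(gold))

def token_f1(prediction, gold):
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

Both metrics are averaged over all questions to produce the dataset-level scores quoted above.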
 
**POS**
All models were trained with max length 512 and batch size 8, using the CoNLL 2002 dataset.
 

<figure>

| Model                                             | F1         | Accuracy   |
|---------------------------------------------------|------------|------------|
| bert-base-multilingual-cased                      | 0.9629     | 0.9687     |
| dccuchile/bert-base-spanish-wwm-cased             | 0.9642     | 0.9700     |
| BSC-TeMU/roberta-base-bne                         | 0.9659     | 0.9707     |
| bertin-project/bertin-roberta-base-spanish        | 0.9638     | 0.9690     |
| bertin-project/bertin-base-random                 | 0.9656     | 0.9704     |
| bertin-project/bertin-base-stepwise               | 0.9656     | 0.9707     |
| bertin-project/bertin-base-gaussian               | **0.9662** | 0.9709     |
| bertin-project/bertin-base-random-exp-512seqlen   | 0.9660     | 0.9707     |
| bertin-project/bertin-base-gaussian-exp-512seqlen | **0.9662** | **0.9714** |

<caption>Table 3. Results for POS.</caption>
</figure>
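Fine-tuning for POS (and NER below) is token classification, where each word's label must be aligned with its subword pieces. A minimal sketch of that alignment, assuming a `word_ids` mapping like the one fast Hugging Face tokenizers return, where special tokens map to `None` (the helper name and details are illustrative):

```python
# Hypothetical helper: one label id per subword. The first subword of each
# word gets that word's label; remaining subwords and special tokens get
# ignore_index, which the cross-entropy loss skips.
def align_labels(word_ids, word_labels, ignore_index=-100):
    aligned = []
    previous = None
    for wid in word_ids:
        if wid is None or wid == previous:
            aligned.append(ignore_index)
        else:
            aligned.append(word_labels[wid])
        previous = wid
    return aligned
```

For example, a two-word sentence whose second word splits into two pieces yields one real label per word and `-100` everywhere else.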

**NER**
All models were trained with max length 512 and batch size 8, using the CoNLL 2002 dataset.

<figure>

| Model                                             | F1         | Accuracy   |
|---------------------------------------------------|------------|------------|
| bert-base-multilingual-cased                      | 0.8539     | 0.9779     |
| dccuchile/bert-base-spanish-wwm-cased             | 0.8579     | 0.9783     |
| BSC-TeMU/roberta-base-bne                         | 0.8700     | 0.9807     |
| bertin-project/bertin-roberta-base-spanish        | 0.8725     | 0.9812     |
| bertin-project/bertin-base-random                 | 0.8704     | 0.9807     |
| bertin-project/bertin-base-stepwise               | 0.8705     | 0.9809     |
| bertin-project/bertin-base-gaussian               | **0.8792** | **0.9816** |
| bertin-project/bertin-base-random-exp-512seqlen   | 0.8616     | 0.9803     |
| bertin-project/bertin-base-gaussian-exp-512seqlen | **0.8764** | **0.9819** |

<caption>Table 4. Results for NER.</caption>
</figure>
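For NER, F1 is conventionally computed at the entity level (in the style of seqeval): entities are spans decoded from BIO tags, and precision/recall compare the exact span sets. A hypothetical re-implementation for illustration:

```python
# Hypothetical seqeval-style entity F1. Entities are (type, start, end)
# spans decoded from BIO tags; a span counts as correct only if its type
# and both boundaries match exactly.
def bio_spans(tags):
    """Decode (type, start, end_exclusive) entity spans from BIO tags."""
    spans, start, etype = [], None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel flushes the last entity
        boundary = tag == "O" or tag.startswith("B-") or (
            tag.startswith("I-") and tag[2:] != etype)
        if boundary and start is not None:
            spans.append((etype, start, i))
            start, etype = None, None
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            start, etype = i, tag[2:]  # tolerate I- without a preceding B-
    return spans

def entity_f1(pred_tags, gold_tags):
    pred, gold = set(bio_spans(pred_tags)), set(bio_spans(gold_tags))
    if not pred or not gold:
        return 0.0
    tp = len(pred & gold)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(gold)
    return 2 * precision * recall / (precision + recall)
```

Token-level accuracy, by contrast, rewards every correct tag, which is why the accuracy column sits so much higher than F1 in Table 4.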

**PAWS-X**
All models were trained with max length 512 and batch size 8. The accuracy values in this case are a bit surprising (some models score below 0.60 while others are close to 0.90), so these runs were repeated 3 times, with very similar results (the metrics shown are from the last run).

<figure>

| Model                                             | Accuracy   |
|---------------------------------------------------|------------|
| bert-base-multilingual-cased                      | 0.5765     |
| dccuchile/bert-base-spanish-wwm-cased             | 0.5765     |
| BSC-TeMU/roberta-base-bne                         | 0.5765     |
| bertin-project/bertin-roberta-base-spanish        | 0.6550     |
| bertin-project/bertin-base-random                 | 0.8665     |
| bertin-project/bertin-base-stepwise               | 0.8610     |
| bertin-project/bertin-base-gaussian               | **0.8800** |
| bertin-project/bertin-base-random-exp-512seqlen   | 0.5765     |
| bertin-project/bertin-base-gaussian-exp-512seqlen | **0.8750** |

<caption>Table 5. Results for PAWS-X.</caption>
</figure>
 
**CNLI**
All models were trained with max length 256 and batch size 16.

<figure>

| Model                                             | Accuracy   |
|---------------------------------------------------|------------|
| bert-base-multilingual-cased                      | WiP        |
| dccuchile/bert-base-spanish-wwm-cased             | WiP        |
| BSC-TeMU/roberta-base-bne                         | WiP        |
| bertin-project/bertin-roberta-base-spanish        | WiP        |
| bertin-project/bertin-base-random                 | 0.7745     |
| bertin-project/bertin-base-stepwise               | 0.7820     |
| bertin-project/bertin-base-gaussian               | **0.7942** |
| bertin-project/bertin-base-random-exp-512seqlen   | 0.7723     |
| bertin-project/bertin-base-gaussian-exp-512seqlen | 0.7878     |

<caption>Table 6. Results for CNLI.</caption>
</figure>

# Conclusions