This model was trained on a cleaned version of the Dutch part of mC4.
See the clean directory for the clean script.
- Documents that contained words from a selection of the Dutch and English List of Dirty, Naughty, Obscene, and Otherwise Bad Words are removed
- Sentences with fewer than 3 words are removed
- Sentences with a word of more than 1000 characters are removed
- Documents with fewer than 5 sentences are removed
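The filters above can be sketched roughly as follows. This is not the actual clean script: the sentence splitter, the `BAD_WORDS` placeholder set, the function names, and the choice to count sentences after sentence-level filtering are all assumptions made for illustration.

```python
import re

# Hypothetical placeholder; the real selection of bad words is not reproduced here.
BAD_WORDS = {"badword"}

MIN_WORDS_PER_SENTENCE = 3
MAX_WORD_LENGTH = 1000
MIN_SENTENCES_PER_DOC = 5


def split_sentences(text):
    # Naive splitter on ., ! or ? followed by whitespace (an assumption).
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def clean_document(text):
    """Return the cleaned text, or None if the document should be dropped."""
    # Drop the whole document if it contains any listed word.
    words = set(re.findall(r"\w+", text.lower()))
    if words & BAD_WORDS:
        return None

    kept = []
    for sentence in split_sentences(text):
        tokens = sentence.split()
        if len(tokens) < MIN_WORDS_PER_SENTENCE:
            continue  # sentence too short
        if any(len(token) > MAX_WORD_LENGTH for token in tokens):
            continue  # sentence contains an implausibly long word
        kept.append(sentence)

    # Assumption: the sentence count is checked after sentence-level filtering.
    if len(kept) < MIN_SENTENCES_PER_DOC:
        return None
    return " ".join(kept)
```

A document that ends up with fewer than five usable sentences is discarded entirely, so short or fragmentary pages do not reach the training set.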
Training of the model was resumed from an earlier checkpoint several times, as can be seen in the training metrics tab (switch to wall time for a better view).
After several hours of training, an error would be raised that we have not been able to identify or solve. As a workaround, the first few resumes started again at step 0 with a differently seeded reshuffling of the data. In the last two resumes the random seed was fixed and training continued from the previous step, since a try/except around the failing example allowed training to proceed past errors caused by a single example.
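The workaround used in the last two resumes might look like the sketch below. It is an illustration only, not the actual training script: `shuffled_indices`, `run_training`, and the `train_step` callback are hypothetical names, and the real run operated on batches inside a full training loop.

```python
import random


def shuffled_indices(num_examples, seed):
    # A fixed seed gives the same data order on every resume.
    rng = random.Random(seed)
    indices = list(range(num_examples))
    rng.shuffle(indices)
    return indices


def run_training(examples, train_step, seed, start_step=0):
    """Run train_step over the shuffled data, skipping examples that raise."""
    order = shuffled_indices(len(examples), seed)
    skipped = []
    for step in range(start_step, len(order)):
        example = examples[order[step]]
        try:
            train_step(example)
        except Exception as err:
            # Record the failure and move on instead of aborting the run.
            skipped.append((step, repr(err)))
    return skipped
```

Because the shuffle is deterministic for a given seed, passing the last completed step as `start_step` continues with exactly the examples that had not yet been seen.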
The final model was trained for 63000 steps with a batch size of 128, ending with an evaluation loss of 1.79 and an accuracy of 0.64. A triangular learning rate schedule was used, with a peak learning rate of 0.01 for the first few runs and 0.001 for the last two runs.
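For reference, a triangular schedule can be sketched as a linear warmup to the peak followed by a linear decay to zero. The warmup length is not stated here, so the `warmup_steps` value in the usage line is purely an assumption, as is the function name.

```python
def triangular_lr(step, total_steps, peak_lr, warmup_steps):
    """Linear warmup from 0 to peak_lr, then linear decay back to 0."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / (total_steps - warmup_steps)


# Example with the peak of the last two runs; warmup_steps=6300 is hypothetical.
lr_at_peak = triangular_lr(6300, total_steps=63000, peak_lr=0.001, warmup_steps=6300)
```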