Training scripts?

#1
by danielschnell - opened

Hi,

We have been using this model to train a classifier for Icelandic homographs, and the results were quite good. See https://github.com/grammatek/IceHoc.
I'd be interested in the training scripts for this LM, especially the parts that handle dataset preparation and cleaning. Would you be willing to share them?

Regards,
Daniel.

Hi Daniel,

Happy to hear that the model performed so well on homograph classification. When pre-training the model, I followed Stefan Schweter's instructions:

https://github.com/stefan-it/turkish-bert/blob/master/convbert/CHEATSHEET.md
https://github.com/stefan-it/turkish-bert/blob/master/electra/CHEATSHEET.md

I used the pre-training script from the ConvBERT repository. Since the pre-training corpus (i.e., the Icelandic Gigaword Corpus) doesn't contain any web-crawled or noisy documents, I didn't perform any filtering or cleaning beforehand.

Best regards,
Jón
