Pretraining details/code?
Hi. I came across your models — you seem to be the only one on Hugging Face who has pretrained a DeBERTa-v3 in a language other than English and successfully converted/uploaded the model to HF format.
I'm curious: Was this model pretrained following the instructions in Microsoft's DeBERTa repository?
I'm also interested to know whether you ran into any trouble converting the model to Hugging Face after training it. Was it straightforward or difficult? Did you use a custom tokenizer for Portuguese?
Hi.
Yes, I trained with the code from their repository using RTD, although I had to make some changes, mostly to the dataloader.
The trained weights should be almost completely compatible with HF apart from the projection layer. The code from Microsoft's DeBERTa creates an embedding layer that has two weight tensors (if you look at the layer names, some end in '.weights' and some in '._weights').
To avoid losing information when loading in Hugging Face while keeping the math equivalent, you just need to preprocess the checkpoint so that '.weights' = '.weights' + '._weights'. The model I uploaded already has this applied.
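For reference, a minimal sketch of that preprocessing step (the file names are placeholders and the exact key suffixes may differ in your checkpoint, so check your own state dict first):

```python
import torch

# Load the checkpoint produced by Microsoft's DeBERTa code
# (path is illustrative; point this at your own pretraining output).
state_dict = torch.load("pytorch_model.bin", map_location="cpu")

merged = {}
for name, tensor in state_dict.items():
    if name.endswith("._weights"):
        continue  # folded into its '.weights' counterpart below
    if name.endswith(".weights"):
        delta_name = name[: -len(".weights")] + "._weights"
        if delta_name in state_dict:
            # '.weights' = '.weights' + '._weights' keeps the math equivalent
            merged[name] = tensor + state_dict[delta_name]
            continue
    merged[name] = tensor

torch.save(merged, "pytorch_model_hf.bin")
```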
That's all that's needed for the conversion. Most of the code Hugging Face uses for the model actually comes from Microsoft's repository, with some adjustments.
One caveat: the final block (the replaced token detection layers) is not implemented in Hugging Face, so it won't load. If you need RTD, it's possible to write code that extends the DeBERTa model with an RTD head (ELECTRA has similar code that just needs some modifications).
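Here is a rough sketch of what such an extension could look like, adapted from ELECTRA's discriminator head. The class name and head layout (dense → GELU → one logit per token) are my own assumptions based on ELECTRA, not Microsoft's exact RTD implementation:

```python
import torch.nn as nn
from transformers import DebertaV2Model, DebertaV2PreTrainedModel

class DebertaV3ForRTD(DebertaV2PreTrainedModel):
    """ELECTRA-style replaced-token-detection head on top of DeBERTa (sketch)."""

    def __init__(self, config):
        super().__init__(config)
        self.deberta = DebertaV2Model(config)
        self.dense = nn.Linear(config.hidden_size, config.hidden_size)
        self.activation = nn.GELU()
        self.classifier = nn.Linear(config.hidden_size, 1)
        self.post_init()

    def forward(self, input_ids, attention_mask=None, labels=None):
        hidden = self.deberta(input_ids, attention_mask=attention_mask)[0]
        # One logit per token: "was this token replaced by the generator?"
        logits = self.classifier(self.activation(self.dense(hidden))).squeeze(-1)

        loss = None
        if labels is not None:
            # labels: 1 where the token was replaced, 0 otherwise
            loss_fct = nn.BCEWithLogitsLoss()
            if attention_mask is not None:
                active = attention_mask.view(-1) == 1
                loss = loss_fct(logits.view(-1)[active],
                                labels.view(-1).float()[active])
            else:
                loss = loss_fct(logits.view(-1), labels.view(-1).float())
        return loss, logits
```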
I trained this large model from scratch, so I did train a custom tokenizer in Portuguese, trying to replicate the tokenizer from DeBERTa.
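A rough sketch of how one might replicate that setup (the corpus file, vocabulary size, and training options below are my assumptions; DeBERTa-v3 ships a SentencePiece model, so the goal is just a comparable SentencePiece tokenizer trained on Portuguese text):

```python
import sentencepiece as spm

# Hypothetical corpus file: one sentence/document per line.
spm.SentencePieceTrainer.train(
    input="corpus_pt.txt",
    model_prefix="spm_pt",
    vocab_size=128000,          # roughly the ~128k vocabulary DeBERTa-v3 uses
    model_type="unigram",
    character_coverage=0.9995,
)
```

If I recall correctly, the resulting `spm_pt.model` file can then be passed as the `vocab_file` when building a `DebertaV2Tokenizer` on the Hugging Face side.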
The trouble I actually had was the model diverging during training, whether because of Hugging Face's code itself or because the model is very numerically sensitive. That happened A LOT. So you might have low loss and good accuracy on DeBERTa's side, but when you fine-tune through HF it won't work.
This large model was trained entirely from scratch, since initializing with the original English model would just diverge soon after. Also, I trained with only 55% of the whole dataset (half of C4 in PT), because it diverged and I would have needed a lot of resources (time and money) just to recover from that.
So what I did was save checkpoints of the model and test them via fine-tuning to find out when it stopped working.
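As a rough illustration of that checkpoint probing (the checkpoint paths and the toy probe data are made up; in practice you would fine-tune on a real Portuguese downstream task):

```python
from datasets import Dataset
from transformers import (AutoTokenizer, DebertaV2ForSequenceClassification,
                          Trainer, TrainingArguments)

# Toy labelled examples just to exercise the loop.
texts = ["bom produto", "péssimo atendimento"] * 8
labels = [1, 0] * 8

checkpoints = ["ckpt_100k", "ckpt_200k"]  # hypothetical converted checkpoints

for ckpt in checkpoints:
    tokenizer = AutoTokenizer.from_pretrained(ckpt)
    ds = Dataset.from_dict({"text": texts, "label": labels}).map(
        lambda ex: tokenizer(ex["text"], truncation=True,
                             padding="max_length", max_length=32)
    )
    model = DebertaV2ForSequenceClassification.from_pretrained(ckpt, num_labels=2)
    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir=f"probe_{ckpt}", num_train_epochs=1,
                               per_device_train_batch_size=4, report_to="none"),
        train_dataset=ds,
        eval_dataset=ds,
    )
    trainer.train()
    # Watch for checkpoints where the eval loss blows up after conversion.
    print(ckpt, trainer.evaluate())
```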
The base-sized model I trained was initialized with mDeBERTa weights, which seems to be much more stable, and everything worked fine.
Thanks for responding and sharing your experience. It's always reassuring when someone has gotten it to work before one ventures out and tries it oneself. The only worrying part is the divergence issue!