Warn about config mismatch for pre-training

#2
by nthngdy - opened

The model card did not warn users that the config is not scaled properly for pre-training alongside the corresponding google/electra-small-discriminator.

@lysandre please let me know what you think of this when you have some time!

Google org

I took another look at the original code and checkpoints: the ELECTRA codebase offers several checkpoints to download, each containing both the generator and the discriminator. These are the checkpoints we isolated here.

Here's the table from the original codebase:

| Model | Layers | Hidden Size | Params | GLUE score | Download |
|---|---|---|---|---|---|
| ELECTRA-Small | 12 | 256 | 14M | 77.4 | link |
| ELECTRA-Base | 12 | 768 | 110M | 82.7 | link |
| ELECTRA-Large | 24 | 1024 | 335M | 85.2 | link |

Okay, I see! My main concern (and what got me stuck for a while) is that loading `ElectraForMaskedLM(ElectraConfig.from_pretrained("google/electra-small-generator"))` gives an architecture that cannot be pre-trained together with `ElectraForPreTraining(ElectraConfig.from_pretrained("google/electra-small-discriminator"))`, because training becomes unstable. Most of the generator's hyperparameters are supposed to be 1/4 of the discriminator's, but currently they are all equal.
I don't really understand why the generator checkpoint uses the wrong parameters in the first place, but I think it would be helpful to have a warning somewhere that points this out.
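For reference, a minimal sketch of how to observe the mismatch described above; it only assumes the standard `ElectraConfig` attributes and the two Hub checkpoints named in this thread:

```python
# Minimal sketch: compare the published generator and discriminator configs.
# If the mismatch described in this thread still holds, the two models print
# equal values instead of the ~1/4 ratio the paper recommends.
from transformers import ElectraConfig

gen = ElectraConfig.from_pretrained("google/electra-small-generator")
disc = ElectraConfig.from_pretrained("google/electra-small-discriminator")

for attr in ("hidden_size", "intermediate_size", "num_attention_heads"):
    print(f"{attr}: generator={getattr(gen, attr)}, discriminator={getattr(disc, attr)}")
```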

Google org

Yes, I definitely understand where you're coming from, and I think a warning is warranted; let's try to make it as helpful as possible. Where did you get the notion that the generator's hyperparameters are supposed to be 1/4 of the discriminator's? Is that in the paper? Thanks!

Yes, you can see that in the paper! The hidden size, FFN size, and number of attention heads all need to be divided by some factor for the model to converge (that goes for every model size). The intuition is that if the generator is too strong, it will fool the discriminator too easily and make the learning process impossible.
[screenshot from the ELECTRA paper]
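As an illustration, a hedged sketch of building a 1/4-scale generator config from the discriminator config, along the lines the paper recommends; the 1/4 factor and the choice to leave `embedding_size` untouched (so token embeddings can be tied, as the paper does) are assumptions drawn from this discussion, not from the model card:

```python
# Hypothetical sketch: derive a 1/4-scale generator config from the
# discriminator config, per the paper's recommendation for this model size.
from transformers import ElectraConfig, ElectraForMaskedLM

gen_config = ElectraConfig.from_pretrained("google/electra-small-discriminator")
gen_config.hidden_size //= 4          # e.g. 256 -> 64
gen_config.intermediate_size //= 4    # e.g. 1024 -> 256
gen_config.num_attention_heads //= 4  # e.g. 4 -> 1
# embedding_size is deliberately left unchanged so the generator's token
# embeddings can be tied with the discriminator's, as in the paper.

generator = ElectraForMaskedLM(gen_config)  # randomly initialized, 1/4-scale
```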

Google org

I see, indeed! In that case, would you be willing to edit your warning to mention that:

  • This is the official generator checkpoint, as released in the original ELECTRA codebase
  • However, the paper recommends a 1/4 size multiplier between the discriminator and the generator for this model, so using it off the shelf will likely result in training instabilities

Would this work for you?

I tried to make the message clearer based on your comments. Please let me know if I can still improve it!

Google org

Yes, that's great! Thanks a lot, @nthngdy.

lysandre changed pull request status to merged
