Learning rate scheduler

#7
by dg-kalle - opened

Hi,

I'm currently looking into adding a new learning rate scheduler to Transformers, which I call "staggered linear LR" (https://github.com/huggingface/transformers/pull/31742). It keeps the learning rate constant throughout each epoch and then adjusts it linearly at each new epoch, so every part of the dataset sees an equal learning rate during training while still allowing the LR to drop over the course of the run. The only caveat is that you need to train for more than 1 epoch.
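Roughly, the idea can be expressed as a hand-rolled PyTorch LambdaLR like the one below. This is just a sketch of the concept, not the code in the PR, and the epoch count, steps per epoch, and base LR are made-up numbers for illustration:

```python
import torch
from torch.optim.lr_scheduler import LambdaLR

def staggered_linear_lambda(num_epochs, steps_per_epoch):
    """LR multiplier that stays constant within an epoch and drops linearly at each epoch boundary."""
    def lr_lambda(current_step):
        epoch = min(current_step // steps_per_epoch, num_epochs - 1)
        # epoch 0 -> 1.0, epoch 1 -> 1 - 1/num_epochs, ..., last epoch -> 1/num_epochs
        return 1.0 - epoch / num_epochs
    return lr_lambda

# Illustrative setup: 3 epochs of 1000 optimizer steps each, placeholder base LR
model = torch.nn.Linear(16, 16)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scheduler = LambdaLR(optimizer, staggered_linear_lambda(num_epochs=3, steps_per_epoch=1000))
```

Every sample in a given epoch is trained at the same LR, and the per-epoch values fall on a straight line, which is the "staggered linear" behaviour described above.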

Two questions:

  1. What learning rate/scheduler do you usually use? Does it differ depending on the model or dataset? (E.g. different LR for big vs small models, etc)
  2. Do you ever train more than 1 epoch?

Thanks.

Owner

I use cosine_with_min_lr with a max of 0.0004 and a min of 0.00004, and around that range for the different Stheno / Euryale variants.

Generally I'd decrease the LR for smaller datasets, based mainly on intuition and on monitoring the loss curves.

With LoRA rank 64. It's... pretty damn aggressive. Does its job.
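For reference, those settings map onto something like the following in plain Transformers + PEFT terms. This is a rough sketch rather than the actual training config, and it assumes the `min_lr` key in `lr_scheduler_kwargs` is what the `cosine_with_min_lr` schedule expects:

```python
from transformers import TrainingArguments
from peft import LoraConfig

# Sketch of the described setup: cosine decaying from 4e-4 down to a 4e-5 floor, LoRA rank 64
args = TrainingArguments(
    output_dir="out",                       # placeholder path
    learning_rate=4e-4,                     # max LR
    lr_scheduler_type="cosine_with_min_lr",
    lr_scheduler_kwargs={"min_lr": 4e-5},   # min LR floor
)

lora_config = LoraConfig(r=64, lora_alpha=64)  # rank 64; the alpha value here is a guess
```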

I usually train for 2 epochs?

Got it, thank you! With 2 epochs you could almost do what I'm proposing manually, but still, would you be interested in trying it if it were available?

(Also, in case I decide to start pestering other training frameworks about implementing this, which software do you train with? Axolotl?)

Owner

Hmm, sounds interesting, why not.

And yeah, I use Axolotl.
