HuggingFaceTB/SmolLM-135M · Trapezoidal scheduler with cooldown phase

Jul 17

Hi. Thanks for yet another insightful contribution. I am interested in extending this work with a couple of variations that I have in mind.

Can you say a bit more about the trapezoidal LR scheduling? In particular, how is it different from OneCycleLR? Secondly, is the cooldown phase the same as using the 'three_phase' option of OneCycleLR? And lastly, what is the warmup percentage/steps?

Would it be possible to open-source the training pipeline as well? Training from scratch at these sizes (135M/360M) is within the reach of many practitioners/researchers, and having access to the complete pipeline would help reduce confounding factors.

Thanks!

maveriq

Jul 17

edited Jul 17

For anyone with the same questions, I found most of the answers in this paper, except for the warmup percentage/steps.

Here is a quick implementation of TrapezoidLRScheduler

eliebak

Hugging Face TB Research org Aug 21

edited Aug 21

Hey! For the warmup we set it to 5000 steps. To be honest, we didn't do much ablation on it; I think it doesn't have that much impact for very long training runs (I might be wrong). As for the training code, we will post it on GitHub this week! We also have an implementation of WSD (warmup-stable-decay) in nanotron, see LrSchedulerArgs.

pietrolesci

Aug 29

Just landed on this discussion as I had the same question regarding the LR schedule. I found the original implementation useful: https://github.com/epfml/schedules-and-scaling/blob/6e8b7f952420c928cc09a0e4bda9678e2bf42e5f/src/optim/utils.py#L55
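
For readers who want the shape of the schedule at a glance, here is a minimal sketch of a trapezoidal (warmup, constant plateau, cooldown) learning-rate multiplier. It is not the SmolLM training code or the linked implementation. The function name `trapezoid_lr_factor`, the linear cooldown to zero, and the `total_steps` / `cooldown_steps` values in the example are assumptions for illustration; only the 5000 warmup steps come from the reply above.

```python
def trapezoid_lr_factor(step, total_steps, warmup_steps=5000, cooldown_steps=0):
    """Return a multiplier in [0, 1] to apply to the peak learning rate.

    Hypothetical helper: linear warmup, constant plateau, linear cooldown to 0.
    """
    if warmup_steps + cooldown_steps > total_steps:
        raise ValueError("warmup_steps + cooldown_steps must not exceed total_steps")

    # Clamp so out-of-range steps map to the schedule's endpoints.
    step = min(max(step, 0), total_steps)
    cooldown_start = total_steps - cooldown_steps

    if step < warmup_steps:
        # Phase 1: ramp linearly from 0 up to the peak.
        return step / warmup_steps
    if step < cooldown_start or cooldown_steps == 0:
        # Phase 2: hold the peak learning rate.
        return 1.0
    # Phase 3: decay linearly from the peak down to 0 at total_steps.
    return (total_steps - step) / cooldown_steps


if __name__ == "__main__":
    # Assumed run length and cooldown length, chosen only for the demo.
    for s in (0, 2_500, 5_000, 50_000, 90_000, 100_000):
        print(s, trapezoid_lr_factor(s, total_steps=100_000, cooldown_steps=20_000))
```

With PyTorch, a function like this can be passed as `lr_lambda` to `torch.optim.lr_scheduler.LambdaLR` (for example through `functools.partial` to fix the step counts), since that scheduler multiplies the optimizer's initial learning rate by the returned factor. Unlike a one-cycle schedule, the plateau length here is independent of the warmup and cooldown lengths, so the cooldown can be started from any checkpoint on the plateau.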