At this point you should not lower hyperparameters like the learning rate or warmup; instead, scale them back up to pretraining levels

#2
by daner - opened

"At this point you should not lower hyperparameters like the learning rate or warmup; instead, scale them back up to pretraining levels." What does this sentence mean? I can't make sense of it.

In conventional model training, if gradient explosions keep occurring, the usual practice is to cut the learning rate and increase the warmup ratio so that the model trains in a gentler, more stable way.

However, in my hands-on experience with Llama2, this conventional approach did not stabilize training. On the contrary, increasing the learning rate and decreasing the warmup ratio turned out to be the key to a stable training process.
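To make the contrast concrete, here is a minimal sketch using Hugging Face `TrainingArguments`. The specific numbers (learning rates, warmup ratios) and output directories are illustrative assumptions, not values taken from this thread.

```python
# Sketch only: contrasting the "conventional" reaction to gradient explosion
# with the pretrain-scale setting described above. Numbers are illustrative.
from transformers import TrainingArguments

# Conventional reaction: shrink the learning rate, stretch the warmup.
conservative_args = TrainingArguments(
    output_dir="out-conservative",
    learning_rate=5e-6,        # reduced learning rate
    warmup_ratio=0.1,          # long warmup
    lr_scheduler_type="cosine",
)

# What the discussion suggests instead: push hyperparameters back toward
# pretraining scale with a much shorter warmup.
pretrain_scale_args = TrainingArguments(
    output_dir="out-pretrain-scale",
    learning_rate=3e-4,        # pretrain-scale learning rate (assumed value)
    warmup_ratio=0.01,         # much shorter warmup
    lr_scheduler_type="cosine",
    max_grad_norm=1.0,         # gradient clipping still applied
)
```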

I see. In that case, wouldn't you need to set the learning rate in stages: first let the embedding layer absorb the knowledge, then lower the learning rate and tune the whole network?

That is certainly a viable option as well. However, for the sake of simplicity and uniformity in training, I generally prefer a single-stage, "one-path" style training scheme.
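For reference, a rough sketch of the staged alternative described in the question, in plain PyTorch. The checkpoint name, learning rates, and the elided training loops are placeholder assumptions, not the author's actual setup.

```python
# Hedged sketch of a two-stage schedule: embeddings first at a large LR,
# then the full network at a smaller LR. All concrete values are assumptions.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")  # placeholder checkpoint

# Stage 1: train only the embedding layers with a large learning rate.
for p in model.parameters():
    p.requires_grad = False
for p in model.get_input_embeddings().parameters():
    p.requires_grad = True
for p in model.get_output_embeddings().parameters():
    p.requires_grad = True

stage1_optim = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad], lr=3e-4
)
# ... run the stage-1 training loop here ...

# Stage 2: unfreeze everything and continue with a smaller learning rate.
for p in model.parameters():
    p.requires_grad = True
stage2_optim = torch.optim.AdamW(model.parameters(), lr=2e-5)
# ... run the stage-2 training loop here ...

# The "one-path" scheme the author prefers would instead be a single run
# over all parameters with one learning-rate schedule.
```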
